PUBLISHED: Nov 6, 2013

How a Search Engine Indexes a Page

There are roughly 250 million registered domain names on the internet today, along with billions of subdomains and trillions of distinct web pages. Search engines collect the text of these pages using programs known as search engine spiders, which run across millions of specialized computers. Spiders download every web page they can reach, parse the content of those pages, and store it in large databases located all across the world. The search engine must then use that stored content to rank, in order of relevance, the web pages that match any keyword or phrase a user might enter.

Parsing the Contents of a Web Page

A search engine tries to look at a web page from the perspective of a human user, but it must infer which words or phrases on the page matter most when determining what the page is about. Web pages contain HTML markup, and terms may be given more weight during indexing based on factors such as font size, placement on the page, and font readability. A web page may also specify the language of its content, but most search engines can now perform language recognition to determine the language automatically. Additionally, terms that appear on the page may be subjected to a process called stemming, which takes terms like “fighting”, “fights”, and “fighter” and reduces them to their stem word, “fight”.
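Stemming can be illustrated with a deliberately small sketch. The suffix list and the `stem` function below are hypothetical and only cover the example above; production stemmers use much fuller rule sets and may treat some of these words differently.

```python
# Toy suffix-stripping stemmer. The suffix list is an illustrative
# assumption, not the rule set of any real search engine.
SUFFIXES = ("ing", "ers", "er", "s")  # checked in this order


def stem(term: str) -> str:
    """Lowercase a term and strip the first matching suffix, if any."""
    word = term.lower()
    for suffix in SUFFIXES:
        # Leave at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([stem(t) for t in ("fighting", "fights", "fighter")])
# ['fight', 'fight', 'fight']
```

After stemming, all three terms can be indexed under the single entry “fight”.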

Creating an Inverted Index

Most search engines use an inverted index to store the content of web pages. A good way to picture an inverted index is to think of the index in the back of a textbook. A book’s index lists the words used in the book and the pages they appear on. For example, a biology book’s index might contain “Osmosis: 65, 573-578, 654” to let you know the word “osmosis” is discussed on those pages. If you were to make a list of all unique terms that appear on all web pages, that list would be much shorter than the combined content of all web pages, because most terms appear on many pages.

For example, consider three short documents:

(1)   New York City Public Schools

(2)   Bars in New York City

(3)   Events in City Schools

The index for these three documents looks like:

Term       Documents
bars       2
city       1, 2, 3
events     3
in         2, 3
new        1, 2
public     1
schools    1, 3
york       1, 2
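As a rough sketch, an inverted index for the three example documents could be built as follows. The function names `build_inverted_index` and `search` are illustrative, and the tokenizer simply lowercases and splits on whitespace, which is far simpler than what a real search engine does.

```python
from collections import defaultdict

# The three example documents, keyed by document number.
documents = {
    1: "New York City Public Schools",
    2: "Bars in New York City",
    3: "Events in City Schools",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased term to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search(index: dict[str, list[int]], query: str) -> list[int]:
    """Return IDs of documents that contain every term in the query."""
    term_sets = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []


index = build_inverted_index(documents)
for term, doc_ids in index.items():
    print(f"{term}: {doc_ids}")

print(search(index, "city schools"))  # [1, 3]
```

A query is answered by looking up each term's list and intersecting the lists, so the full text of the documents never has to be scanned.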


Storing Billions of Keywords and Phrases

Searching a list of all unique terms that appear on the internet is much faster than searching the full content of all web pages (which amounts to petabytes of data), but that list is still too large to match websites to the keywords that appear on them in real time. One solution is to store the contents of a web page as n-grams, which are substrings of length n, and most search engines likely use trigrams to do this. A trigram representation of a document is that document broken up into all of its overlapping 3-character sequences, for example:

“Sweater” = {swe, wea, eat, ate, ter}
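A minimal sketch of that decomposition, assuming the term is lowercased first (the function name `trigrams` is illustrative):

```python
def trigrams(term: str) -> list[str]:
    """Return every overlapping 3-character substring of the lowercased term."""
    text = term.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(trigrams("Sweater"))  # ['swe', 'wea', 'eat', 'ate', 'ter']
```

A term of length k yields k - 2 trigrams, and a term shorter than three characters yields none.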

Since a term can contain any combination of 26 letters, 10 digits, and roughly 10 symbols, the total number of unique trigrams that can exist is (26+10+10)^3 = 46^3 = 97,336. That is a significantly smaller list to search in real time than a list of all unique terms on the internet, which would number in the hundreds of billions.
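The arithmetic, and the way a trigram list can narrow a lookup, can be sketched as follows. The particular ten symbols, the sample terms, and the helper names are assumptions made for illustration; the text does not describe how any specific search engine implements this.

```python
import string
from collections import defaultdict

# Alphabet assumed by the text: 26 letters, 10 digits, about 10 symbols.
# Which symbols are counted is an illustrative assumption.
SYMBOLS = "-_.,'&+#@/"
ALPHABET = string.ascii_lowercase + string.digits + SYMBOLS
print(len(ALPHABET), len(ALPHABET) ** 3)  # 46 97336


def trigrams(term: str) -> list[str]:
    """Return every overlapping 3-character substring of the lowercased term."""
    text = term.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_trigram_index(terms: list[str]) -> dict[str, set[str]]:
    """Map each trigram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in trigrams(term):
            index[gram].add(term)
    return index


def candidates(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the terms that contain every trigram of the query."""
    grams = trigrams(query)
    if not grams:
        return set()
    return set.intersection(*(index.get(gram, set()) for gram in grams))


# Hypothetical vocabulary used only for this example.
trigram_index = build_trigram_index(["sweater", "sweat", "weather", "eaten"])
print(sorted(candidates(trigram_index, "sweat")))  # ['sweat', 'sweater']
```

Because the number of possible trigram keys is bounded by the alphabet size, the lookup table stays small no matter how many distinct terms are indexed.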

Optimizing Content for Important Keywords

To ensure a web page shows up in the results returned for a given keyword, the single most important thing you can do is place that keyword on the page in a visible position. It also helps to use the keyword in page headers, meta information, the title of the page, and body text in relevant context. Other factors that might affect how a search engine determines the relevance of a term on a page are term frequency (how often the term appears on the page), inverse document frequency (rare terms carry more weight than common terms), and length normalization (for example, a term that appears once in a 100-word document carries more weight than one that appears once in a 1,000-word document, because it makes up 1% of the text rather than 0.1%).
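These three factors are commonly combined in a score known as TF-IDF. The sketch below uses the textbook formula on three short sample documents; it is an illustration of the idea rather than the ranking function of any actual search engine, which would weigh many more signals. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter

# Three short sample documents used as a tiny corpus.
corpus = [
    "New York City Public Schools",
    "Bars in New York City",
    "Events in City Schools",
]


def tf_idf(term: str, doc: str, docs: list[str]) -> float:
    """Score a term in one document using length-normalized TF times IDF."""
    term = term.lower()
    words = doc.lower().split()
    if not words:
        return 0.0
    # Term frequency divided by document length (length normalization).
    tf = Counter(words)[term] / len(words)
    # Inverse document frequency: terms found in fewer documents score higher.
    containing = sum(1 for d in docs if term in d.lower().split())
    idf = math.log(len(docs) / containing) if containing else 0.0
    return tf * idf


for term in ("city", "public", "schools"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

Here “city” scores zero because it appears in every document, while “public”, which appears in only one, scores highest. The term “schools” scores higher in the four-word third document than in the five-word first one, which is length normalization at work.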
