~ Web Technologies


Search engine indexing

The following description is quoted from Google's own documentation.

>>Before you search, web crawlers gather information from across hundreds of billions of webpages and organize it in the Search index.

The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As our crawlers visit these websites, they use links on those sites to discover other pages. The software pays special attention to new sites, changes to existing sites and dead links. Computer programs determine which sites to crawl, how often and how many pages to fetch from each site.
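The crawl loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Google's implementation: the `TOY_WEB` dictionary stands in for the real web (so no network access is needed), and the names `crawl`, `TOY_WEB` and `max_pages` are hypothetical. It starts from a seed list, follows links to discover new pages, and records links that lead nowhere.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A URL that is linked to but missing from this dict plays the role of a dead link.
TOY_WEB = {
    "https://a.example/": ["https://a.example/about", "https://b.example/"],
    "https://a.example/about": ["https://a.example/"],
    "https://b.example/": ["https://c.example/", "https://a.example/missing"],
    "https://c.example/": [],
}


def crawl(seeds, web, max_pages=100):
    """Breadth-first crawl from the seed URLs.

    Returns (visited, dead): pages fetched in order, and links that led nowhere.
    """
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, so nothing is fetched twice
    visited, dead = [], []
    while frontier and len(visited) + len(dead) < max_pages:
        url = frontier.popleft()
        links = web.get(url)  # stands in for an HTTP fetch plus link extraction
        if links is None:
            dead.append(url)
            continue
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited, dead


visited, dead = crawl(["https://a.example/"], TOY_WEB)
print(visited)
print(dead)
```

A real crawler would also decide how often to revisit each site and how many pages to fetch from it; here the single `max_pages` cap is a simplified stand-in for that scheduling.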

We offer webmaster tools to give site owners granular choices about how Google crawls their site: they can provide detailed instructions about how to process pages on their sites, can request a recrawl or can opt out of crawling altogether using a file called “robots.txt”. Google never accepts payment to crawl a site more frequently — we provide the same tools to all websites to ensure the best possible results for our users.
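As a rough illustration of how a crawler might honour "robots.txt", the sketch below uses `urllib.robotparser` from the Python standard library. The file contents, the `ExampleBot` user agent and the `example.com` URLs are made up for the example, and the standard-library parser does not necessarily match every rule the way Google's crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from a string instead of fetching


def allowed(url, agent="ExampleBot"):
    """Return True if the rules above let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/report.html"))
```

A site owner who wants to opt out of crawling altogether would use `Disallow: /` instead, which blocks every path on the site.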

Finding information by crawling

The web is like an ever-growing library with billions of books and no central filing system. We use software known as web crawlers to discover publicly available webpages. Crawlers look at webpages and follow links on those pages, much like you would if you were browsing content on the web. They go from link to link and bring data about those webpages back to Google’s servers.

From: https://www.google.com/search/howsearchworks/crawling-indexing/
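To make "following links" concrete, here is a minimal sketch of how a program could pull the links out of one page, using `html.parser` and `urllib.parse.urljoin` from the Python standard library. The `LinkExtractor` class, the `extract_links` helper and the sample page are hypothetical, and a production crawler would handle many more cases (for example `<base>` tags, redirects and links that are created by scripts).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the target of every <a href="..."> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/about" against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


PAGE = '<p><a href="/about">About</a> <a href="https://other.example/x">X</a></p>'
print(extract_links(PAGE, "https://site.example/index.html"))
```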

 

Organizing information by indexing

When crawlers find a webpage, our systems render the content of the page, just as a browser does. We take note of key signals — from keywords to website freshness — and we keep track of it all in the Search index. The Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size. It’s like the index in the back of a book — with an entry for every word seen on every web page we index. When we index a web page, we add it to the entries for all of the words it contains.
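The "index in the back of a book" idea is usually called an inverted index: a mapping from each word to the pages that contain it. The sketch below is a toy version under simple assumptions (lower-cased words, three made-up pages, no ranking signals); the names `tokenize`, `build_index`, `search` and `PAGES` are hypothetical and are not taken from Google's systems.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)  # add the page to the entry for this word
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up pages for illustration only.
PAGES = {
    "https://a.example/": "Web crawlers follow links",
    "https://b.example/": "An index lists every word",
    "https://c.example/": "Crawlers build the index",
}

index = build_index(PAGES)
print(sorted(index["index"]))
print(sorted(search(index, "crawlers index")))
```

Looking a word up in the index is a single dictionary access, which is why a search does not need to re-read every page each time a query arrives.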


Text is available under the Creative Commons Attribution-ShareAlike License https://en.wikibooks.org/wiki/A-level_Computing