
The Difference Between Crawling and Indexing

August 08, 2008

Erik Dafforn

I talk a lot about crawling and indexing (to the point that we have a dedicated category), but I think it's worthwhile to back up and describe some of what's going on.

The terms crawling and indexing (and indexing's cousin, caching) are frequently used together, but you should not consider them synonyms.

Exact definitions probably differ from person to person, but here is how I explain the processes:

Crawling is the process of an engine requesting -- and successfully downloading -- a unique URL. Obstacles to crawling include a lack of links to a URL, server downtime, robots exclusion, and links (such as some JavaScript links) from which bots cannot extract a valid URL.
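Of these obstacles, robots exclusion is the easiest to check yourself. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `is_crawl_allowed` helper are all made up for illustration; a real check would read the site's own robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/public/page.html",
        "http://example.com/private/page.html",
    ):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

A passing check only rules out one obstacle. The URL still needs a crawlable link pointing at it and a server that responds.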

Indexing is the result of successful crawling. I consider a URL to be indexed (by Google) when an info: or cache: query produces a result, signifying the URL's presence in the Google index. Obstacles to indexing can include duplication (the engine might decide to index only one version of content for which it finds many nearly identical URLs), unreliable server delivery (the engine may decide not to index a page that it can access on only one-third of its attempts), and so on.

What's the difference between crawling and indexing, in terms of time? Here's an example. I recently watched a newly introduced URL to see when it would be indexed. I ran a text cache query for the URL every four hours, starting when the URL went live on July 2. (The URL was one of several linked from a new site map.)

On July 17, the text cache query finally returned results instead of "Your search - cache:[URL] - did not match any documents." What was interesting is that the cached file showed the URL "as retrieved on 8 Jul 08." So make special note: the URL was crawled and cached over a week before it appeared in the index.
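To put numbers on that timeline, here is the date arithmetic as a short sketch. The three dates come from the example above (the year is 2008); the variable names are only illustrative. One caveat: the "retrieved" date on the cached copy is the crawl that produced that copy, which may not have been the first crawl of the URL.

```python
from datetime import date

went_live = date(2008, 7, 2)        # URL first published
cached_copy = date(2008, 7, 8)      # "as retrieved on 8 Jul 08"
cache_visible = date(2008, 7, 17)   # first day the cache query returned a result

days_live_to_crawl = (cached_copy - went_live).days        # 6
days_crawl_to_index = (cache_visible - cached_copy).days   # 9
days_live_to_index = (cache_visible - went_live).days      # 15

print(days_live_to_crawl, days_crawl_to_index, days_live_to_index)
```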

A better, more comprehensive test would be to watch the server logs and see how many times the file was requested, and how frequently, between the original request date and the date on which the cache query showed results. Additional testing would look for ways to shorten that time, such as increasing the number (and prominence) of incoming links.
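A log-based version of that test might look something like the sketch below. It assumes the server writes Apache-style "combined" log lines and that the bot identifies itself with "Googlebot" in the user-agent string. The sample lines, IP addresses, and the `/new-page.html` path are invented for illustration, and because a user-agent string can be spoofed, the counts are an approximation.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the Apache "combined" log format (an assumption about the server).
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_requests_per_day(lines, path, bot_token="Googlebot"):
    """Count requests for `path` per calendar day where the agent contains bot_token."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if match.group("path") != path or bot_token not in match.group("agent"):
            continue
        # Month abbreviations such as "Jul" assume an English/C locale.
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.date()] += 1
    return counts


# Hypothetical log lines, not real data.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'66.249.66.1 - - [08/Jul/2008:10:15:32 -0400] "GET /new-page.html HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [08/Jul/2008:22:41:07 -0400] "GET /new-page.html HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    '10.0.0.7 - - [09/Jul/2008:09:02:11 -0400] "GET /new-page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1)"',
    f'66.249.66.1 - - [12/Jul/2008:03:30:00 -0400] "GET /new-page.html HTTP/1.1" 304 - "-" "{GOOGLEBOT_UA}"',
]

if __name__ == "__main__":
    for day, hits in sorted(bot_requests_per_day(SAMPLE_LOG, "/new-page.html").items()):
        print(day, hits)
```

Comparing the first and last days in that tally with the date the cache query starts returning results gives the crawl-to-index gap directly, rather than inferring it from the date on the cached copy.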

posted by Erik Dafforn at August 8, 2008 12:04 PM

Comments

I think both crawling and indexing are important. If the homepage or a main section changes, the crawler can check it and see whether anything is different; if not, it's not worth indexing the site.

Posted by: chakali at August 13, 2008 08:42 AM

Important information to know, thanks

Posted by: Michael Fairbairn Cordova at August 15, 2008 04:25 AM

I would like to know how often crawling and indexing happen in a month. Is there any fixed time interval between crawls and indexing? It's natural that dynamic content sites get more bot visits, but how often? This is something I would like to know.

Posted by: hosting plans 4 all at August 19, 2008 03:26 PM

Well, but I am still confused: does Google start ranking the webpage once it has crawled it, or only after properly indexing and caching it?

Posted by: software at September 11, 2008 01:34 AM

Important information to know, thanks

Posted by: sohbet at October 3, 2008 04:43 PM


Copyright 2005-2008 Intrapromote, LLC