Google Indexes Its Own Toolbar Content(?)
November 23, 2009
I don't think this is a particularly big deal, but I am fascinated by crawler behavior and the wheres and whys of crawlers not honoring sites' specific robots directives.
And it makes it even more interesting when the robot and the site belong to the same company.
A few weeks ago, I was trying to find out exactly when Google overtook Yahoo in the race for search engine market share. (It's not important why, but it will help you understand why I was searching for such an odd phrase.)
I ended up searching for this query:
["google passes yahoo" "search market share" 2004]
And the results page looked like this:

If you click over, you can clearly see that we're in the /archivesearch portion of the toolbar.google.com site:
If you go to the Google Toolbar site's robots.txt file, however, you'll see that this portion is supposed to be off-limits to Googlebot:

(Note: This robots.txt file also has certain "allow" commands, but none that should pertain to this particular page.)
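To make the robots.txt check concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `EXAMPLE_ROBOTS_TXT` are a hypothetical stand-in, not the actual contents of the Toolbar site's file, and `is_crawl_allowed` is a helper name made up for this illustration. Note also that Python's parser applies the first matching rule in file order, which may differ from how Googlebot resolves overlapping allow and disallow lines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; this is NOT the real
# toolbar.google.com robots.txt, just a file with one "allow" line
# and a disallow covering the /archivesearch section.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /intl/
Disallow: /archivesearch
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/archivesearch?q=test", "/intl/en/", "/"):
        url = "http://toolbar.google.com" + path
        print(path, is_crawl_allowed(EXAMPLE_ROBOTS_TXT, "Googlebot", url))
```

Under these assumed rules, the /archivesearch URL is reported as blocked while the other two paths are allowed, which is the situation the post describes: the "allow" line exists but does not cover the page in question.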
But wait. Couldn't this just be an "uncrawled reference" -- that rare-but-easily-recreated instance where Google indexes pages based on incoming links, but doesn't actually crawl the page, so therefore still honors the robots.txt exclusion protocol?
No, I don't think so, at least in this case. Uncrawled references generally don't have snippets attached to them, and if you look at the SERP above, you'll see a snippet pulled from deep within the actual page:

I'm not claiming to know each subtle nuance of uncrawled references, but I study robots exclusion pretty closely, and this is the first instance I've seen of a section from within an excluded page being used as its snippet.
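The reasoning above can be written down as a rough heuristic. This is only a sketch of the argument, not a description of how Google works: the `SearchResult` type, the `classify_result` function, and the three labels are all invented for illustration, and the "blocked" flag is assumed to come from a separate robots.txt check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    """A hypothetical, simplified view of one listing on a results page."""
    url: str
    title: str
    snippet: Optional[str]  # None or empty when the listing shows no snippet


def classify_result(result: SearchResult, blocked_by_robots: bool) -> str:
    """Label a listing according to the heuristic described in the post."""
    if not blocked_by_robots:
        # Nothing unusual: the crawler was allowed to fetch the page.
        return "crawlable"
    if not result.snippet or not result.snippet.strip():
        # Blocked and snippet-less: consistent with an uncrawled reference,
        # i.e. a URL indexed from incoming links without being fetched.
        return "uncrawled-reference"
    # Blocked, yet the listing shows text from the page. The page content
    # reached the index some other way and deserves a closer look.
    return "blocked-with-snippet"


if __name__ == "__main__":
    listing = SearchResult(
        url="http://toolbar.google.com/archivesearch?q=example",
        title="Example listing",
        snippet="Some text pulled from within the page",
    )
    print(classify_result(listing, blocked_by_robots=True))
```

The third label is the case this post is about: it does not prove the crawler ignored the exclusion, only that the listing doesn't look like an ordinary uncrawled reference.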
I'm certainly willing to concede that Google just happened to find this information somewhere else and attributed it to this page, but before I make that concession, I'd like someone to show that it actually happened. I'm not tied to any particular outcome; I'd just like to learn more about why this happens.
posted by Erik Dafforn at November 23, 2009 05:22 PM
Comments
A couple colleagues at Google checked into this, and we crawled the page before the robots.txt directive was put in place. When we try to recrawl that page we'll see that it's blocked in robots.txt and won't crawl it after that.
Posted by: Matt Cutts at November 24, 2009 11:12 AM
Great article. This is some interesting stuff; I would like to find out more about this and would be interested in people's comments.
Posted by: Kenneth at November 24, 2009 12:19 PM
Thanks for the information about Google Indexes Its Own Toolbar Content... I learn lots of things
Posted by: Wax Fajardo at November 25, 2009 02:29 AM

