Crawling and Indexing Articles by SEO Speedwagon
November 23, 2009
Google Indexes Its Own Toolbar Content(?) 
I don't think this is a particularly big deal, but I am fascinated by crawler behavior and the wheres and whys of crawlers not honoring sites' specific robots directives.
And it makes it even more interesting when the robot and the site belong to the same company.
A few weeks ago, I was trying to find out exactly when Google overtook Yahoo in the race for search engine market share. (It's not important why, but it will help you understand why I was searching for such an odd phrase.)
I ended up searching for this query:
["google passes yahoo" "search market share" 2004]
And the results page looked like this:

If you click over, you can clearly see that we're in the /archivesearch portion of the toolbar.google.com site:
If you go to the Google Toolbar site's robots.txt file, however, you'll see that this portion is supposed to be off-limits to Googlebot:

(Note: This robots.txt file also has certain "allow" commands, but none that should pertain to this particular page.)
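For readers who want to run this kind of check themselves, here is a minimal sketch using Python's standard-library robots.txt parser. The rules below are an assumption made up for illustration (the real file is only shown in the screenshot), and the standard-library parser is only an approximation of how Googlebot itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: hypothetical rules for illustration only. The actual
# toolbar.google.com robots.txt is not reproduced in this post.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /archivesearch
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


archive_url = "http://toolbar.google.com/archivesearch"
home_url = "http://toolbar.google.com/"

print(is_allowed(ASSUMED_ROBOTS_TXT, "Googlebot", archive_url))
print(is_allowed(ASSUMED_ROBOTS_TXT, "Googlebot", home_url))
```

Under these assumed rules, the archive URL is reported as blocked and the home page as fetchable, which is the situation the post describes.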
But wait. Couldn't this just be an "uncrawled reference" -- that rare-but-easily-recreated instance where Google indexes pages based on incoming links, but doesn't actually crawl the page, so therefore still honors the robots.txt exclusion protocol?
No, I don't think so, at least in this case. Uncrawled references generally don't have snippets attached to them, and if you look at the SERP above, you'll see a snippet pulled from deep within the actual page:

I'm not claiming to know each subtle nuance of uncrawled references, but I study robots exclusion pretty closely, and this is the first instance I've seen of a section from within an excluded page being used as its snippet.
I'm certainly willing to concede that Google just happened to find this information somewhere else and attributed it to this page, but before I make that concession, I'd want someone to show that it actually happened. I'm not tied to any particular outcome; I'd just like to learn more about why this happens.
Posted by erik at 05:22 PM
November 06, 2009
Social Media URL Duplication and the Canonical Link Element 
When most people discuss the canonical link element, they describe its usage in the context of duplicate content mitigation, such as www vs non-www content, print-friendly pages, and so on. This is entirely appropriate. But the ways that we're all creating duplicate content are constantly growing and changing, which means that even if you think you don't need to canonicalize your pages, you might be wrong.
This post discusses how using the canonical link element might help you even if you don't think you need it.
Quick question: Should you use the canonical tag on your pages even if you're not sending out multiple versions of them?
Absolutely.
Why?
Because someone else might be creating versions of your pages that you don't even know about.
Here's an example: When I share something in my Google Reader, here's what happens:
- Twitterfeed grabs my Google Reader "public" RSS feed, which is how my shared items are dispersed.
- Twitterfeed takes the URL I'm sharing and appends two UTM tags to it -- "source" and "campaign".
- Via Twitterfeed, Bit.ly shortens the long URL (including UTM tags) that I'm sharing.
- Twitterfeed shoots the title of the post and shortened URL out over the @intrapromote Twitter stream.
In other words, I might read this URL:
http://searchengineland.com/blocking-and-tackling-10-fundamentals-of-local-seo-29115
But when I share and tweet it, it ends up looking like this:
http://searchengineland.com/blocking-and-tackling-10-fundamentals-of-local-seo-29115?utm_campaign=ipshare&utm_source=reader
Basically, I've created a duplicate URL for Search Engine Land, which they didn't ask for and probably don't know about. But the crew over there has anticipated this, because when you look at the source code for the page I created, you see this code:
<link rel="canonical" href="http://searchengineland.com/blocking-and-tackling-10-fundamentals-of-local-seo-29115" />
This tag tells engines that no matter what tags I (or anyone else, including SEL) put on those pages, this one is the authority.
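To make the duplication concrete, here is a small sketch that strips UTM-style tracking parameters from a URL, which is roughly the mapping the canonical element declares for this page. The function name and the `utm_` prefix rule are my own assumptions for illustration; this is not how any engine actually processes the tag.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url: str, prefix: str = "utm_") -> str:
    """Return url with every query parameter whose name starts with prefix removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(prefix)
    ]
    # urlunsplit omits the "?" entirely when the remaining query is empty.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


shared = (
    "http://searchengineland.com/blocking-and-tackling-10-fundamentals-"
    "of-local-seo-29115?utm_campaign=ipshare&utm_source=reader"
)
print(strip_tracking(shared))
```

Run against the tweeted URL above, this yields the same address that the page names in its canonical element.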
UTM tags, of course, are primarily for measuring the effectiveness of your own social media endeavors on your own content, but the idea of someone appending tags to your content isn't far-fetched. Don't rule out people wanting to measure everything -- including their effect on other sites' traffic. Agencies use it to measure their efforts to a variety of client sites, and ad-selling sites use it for case study purposes to illustrate their reach.
Search Engine Land likely uses the canonical tags to consolidate authority because of their own tracking tags. But in this case, I've shown how someone on the outside can splinter your authority. It's pretty easy to add this tag to your pages (despite the obvious fact that I haven't done it on this blog yet), and the more ways you distribute your content, the more sense it makes to find the time to do it.
Posted by erik at 07:33 AM
May 27, 2009
Dafforn First to Discover Google Changes Profile Hop from 302 to 301 
I am happy to report I am not the only one so oddly obsessed; money quote:
About three weeks ago, [John Lustina] noted that Google numerical-based URLs were redirecting to custom profiles, but they were using a 302 instead of a preferable 301. Today, however, I'm happy to note that's changed. As of this writing, the 302 has changed to 301. Mark the time, SEO Friends; Google is listening to our Social World.
And with the step toward doing what they tell us to do, my Google Profile hops another step up, to 5:

Is this why they wanted to 302 Hop[e], originally?
Posted by john at 08:19 PM
May 18, 2009
Google Profiles Now Above the Fold? 
In spite of that odd numerical URL that persists and 302 hops, my Google Profile has proven to indeed be a climber, for the first time breaking above the fold for the vanity search I have been vainly keeping my own eye on from day one:

Now, as is normally the case with a non-temporary 302--THE problem with a non-temporary 302 you might conclude--I don't know whether to link to http://www.google.com/profiles/John.Lustina or http://www.google.com/profiles/116187582762783426547 when I am referring to it.
Google?
Posted by john at 10:07 PM
May 04, 2009
Google Profiles Doing the 302 Hop? 
Continuing in my vainglory of days past, I was surprised today while exploring below the fold to see my Google Profile suddenly appear as a tenth result along with the standard extra bottom result (with smiling picture) for a vanity search:

Even more surprising, if you look at the yellow highlighted rectangle, is that Google chose to show www.google.com/profiles/116187582762783426547 as the URL for the result, rather than the www.google.com/profiles/John.Lustina vanity URL that I selected as my preference, when offered, in the initial setup of my Profile.
Now, I was happy to see that it redirected to http://www.google.com/profiles/John.Lustina when clicked, yet wondered why the numerical URL would yet list if the redirect were a 301. My SEO senses tingling, I went to Rex The Answer Man to find this:

Why a 302 temporary redirect? Why not just keep them numerical rather than vain if they are not going to implement a proper 301?
Isn't that what they would have us counsel our clients in a similar scenario?
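If you don't have a header checker handy, the same check can be sketched with Python's standard library. `http.client` does not follow redirects, so the status code of the first hop is visible directly. The function names here are my own, and the sketch assumes the server answers HEAD requests the same way it answers GET requests.

```python
import http.client
from urllib.parse import urlsplit


def first_hop(url: str, timeout: float = 10.0):
    """Send one HEAD request and return (status, Location header) without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(status: int) -> str:
    """Label the two redirect types discussed in this post."""
    if status == 301:
        return "301: permanent redirect"
    if status == 302:
        return "302: temporary redirect"
    return f"{status}: not a 301 or 302"
```

Calling `first_hop()` on the numerical profile URL and passing the returned status to `describe()` would show which kind of hop is in place at the time you run it.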
Posted by john at 09:10 PM
April 25, 2009
Bonfire of the Vanity Search 
Although unveiled in the innocuous last position--always, mind you--on the first page for your name, it seems more likely that ever-prescient Google has a larger share in mind than the 10th result on a page; namely, a cover page for Socially schizophrenic above-the-fold situations like the following:

Are they actually after, rather, One Profile to rule them all?
Posted by john at 10:23 AM
October 20, 2008
Error in Google's robots.txt Docs 
Update: This was fixed rapidly; see Riona's comment.
I don't want to get too deep into the complexities of robots.txt parsing (if you want that, try this, this or this), but I found something odd at the bottom of this page, one of Google Webmaster Help's many pages on robots.txt.
The page says:
URLs are case-sensitive. For instance, Disallow: /private_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file1.asp.
Here's a picture just so you trust me:

This is wrong in a lot of different ways. Let's look at them piece by piece, with my comments following each quoted fragment.
"URLs are case-sensitive." So far, so good.
"For instance, Disallow: /private_file.asp would block http://www.example.com/junk_file.asp" It would? How?
"..., but would allow http://www.example.com/Junk_file1.asp."
I suppose Disallow: /private_file.asp would allow /Junk_file1.asp, but not because of capitalization style. It's because /Junk_file1.asp has nothing to do with the excluded file, /private_file.asp.
So what did they mean? If they're anything like me, this was a paragraph started, edited a few times, and never really finished. It appears to try to cover a variety of the issues covered on the page, including cap style, pattern matching, and wildcard characters. Here are a couple alternatives I'd suggest:
URLs are case-sensitive. For instance, Disallow: /private_file.asp would block http://www.example.com/private_file.asp, but would allow http://www.example.com/Private_file.asp.
or, to continue along the pattern-matching theme also discussed on the page, this would work:
URLs are case-sensitive. For instance, Disallow: /private_file*.asp would block http://www.example.com/private_file.asp, and would also block http://www.example.com/private_file1.asp. It would not, however, block /Private_file1.asp.
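Both suggested rewrites can be checked with a short sketch. The matcher below is my own approximation of the pattern rules Google documents (prefix matching, `*` for any run of characters, a trailing `$` to anchor the end of the URL); it is not Google's parser, and Python's built-in `urllib.robotparser` does not handle these wildcards.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Case-sensitive check of a Disallow pattern against a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and let every "*" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without "$" the rule is a prefix match; with "$" it must match the whole path.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


print(rule_matches("/private_file.asp", "/private_file.asp"))
print(rule_matches("/private_file.asp", "/Private_file.asp"))
print(rule_matches("/private_file*.asp", "/private_file1.asp"))
print(rule_matches("/private_file*.asp", "/Private_file1.asp"))
```

The first and third calls report a match; the second and fourth do not, purely because of the capital P.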
This is a pretty minor detail at the bottom of an esoteric page, but if you're looking for specific information on cap style and robots.txt, it could cause some head-scratching.
Posted by erik at 07:30 AM
August 08, 2008
The Difference Between Crawling and Indexing 
I talk a lot about crawling and indexing (to the point that we have a dedicated category), but I think it's worthwhile to back up and describe some of what's going on.
The terms crawling and indexing (and indexing's cousin, caching) are frequently used together, but you should not consider them synonyms.
Exact definitions probably differ from person to person, but following is how I explain the processes:
Crawling is the process of an engine requesting -- and successfully downloading -- a unique URL. Obstacles to crawling include no links to a URL, server downtime, robots exclusion, or using links (such as some JavaScript links) from which bots cannot find a valid URL.
Indexing is the result of successful crawling. I consider a URL to be indexed (by Google) when an info: or cache: query produces a result, signifying the URL's presence in the Google index. Obstacles to indexing can include duplication (the engine might decide to index only one version of content for which it finds many nearly identical URLs), unreliable server delivery (the engine may decide to not index a page that it can access during only one-third of its attempts), and so on.
What's the difference between crawling and indexing, in terms of time? Here's a recent example. I recently watched a newly introduced URL to see when it would be indexed. I monitored the text cache query of the URL every four hours starting when the URL went live on July 2. (This URL was one of a number of URLs linked to on a new site map.)
On July 17, the text cache showed results and finally stopped saying "Your search - cache:[URL] - did not match any documents." But what was interesting is that the cached file showed the results of the URL "as retrieved on 8 Jul 08." So make special note that the URL was crawled and cached over a week before it appeared in the index.
A better, more comprehensive test would be to watch server logs and see how many times the file was requested, and with what frequency, between the original request date and date at which the cache query showed results. Additional testing would try to detect ways to shorten that time by increasing the number (and prominence) of incoming links and so on.
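A rough sketch of that server-log test follows. It assumes logs in the common "combined" format and identifies Googlebot by user-agent string alone, which can be spoofed. The sample lines, IP addresses, and the /new-page.html path are invented for illustration.

```python
import re
from collections import Counter

# Matches the date, request path, and user agent of a combined-format log line.
LOG_LINE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]+\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, path):
    """Count requests per day for path whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("path") == path and "Googlebot" in match.group("agent"):
            hits[match.group("day")] += 1
    return hits


# Hypothetical sample data, not taken from a real server.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = [
    f'66.249.66.1 - - [08/Jul/2008:04:12:01 -0400] "GET /new-page.html HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [08/Jul/2008:04:12:09 -0400] "GET /other.html HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    '10.0.0.5 - - [09/Jul/2008:10:30:00 -0400] "GET /new-page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows)"',
    f'66.249.66.1 - - [12/Jul/2008:22:01:44 -0400] "GET /new-page.html HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
]

for day, count in googlebot_hits(SAMPLE, "/new-page.html").items():
    print(day, count)
```

Comparing the first day in that output with the date the cache query starts returning results gives the crawl-to-index gap described above.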
Posted by erik at 12:04 PM
July 01, 2008
Google to Index Flash Content ... Again 
In a post last night entitled "Improved Flash Indexing," the Google Webmaster Tools blog reports that
We've improved our ability to index textual content in SWF files of all kinds. This includes Flash "gadgets" such as buttons or menus, self-contained Flash websites, and everything in between. ... In addition to finding and indexing the textual content in Flash files, we're also discovering URLs that appear in Flash files, and feeding them into our crawling pipeline—just like we do with URLs that appear in non-Flash webpages. For example, if your Flash application contains links to pages inside your website, Google may now be better able to discover and crawl more of your website.
This brings up several satellite issues:
- Since it's been so difficult to index Flash content, a virtual cottage industry sprang up with ways to circumvent that disability, including methods like SWFObject, sIFR, user-agent-based delivery of plain text vs. Flash content, and so on. With these techniques becoming more sophisticated and easy to implement, is it likely that sites will abandon them soon?
- It appears that for now, Flash files spawned when users fail a JavaScript test will still be uncrawlable, since engines too typically fail a JS sniffer.
- If you have a SWF file embedded as only a part of a larger HTML page, trust me that you do NOT want only that SWF file being returned in search results. It typically looks awful, lacking both the size requirements you implemented, as well as the critical navigation that resides in your HTML. The Webmaster Central post didn't say that SWF files would be returned in SERPs, so I'm not saying that's what will happen. But I've tested client sites by searching for strings of text that only appear in Flash files, and I've seen it happen. So test with your own site and cross your fingers.
I chose a somewhat sarcastic post title because ever since search engines and Flash have butted heads, the ability for engines to index text embedded in Flash files has been "just around the corner." In 2002, for example, hearts were briefly aflutter about the Macromedia Flash Search Engine SDK, which was going to be the end of engines' inability to index Flash content. Hear that? The end. 2002.
So I enter into this new era with guarded optimism. Optimistic because Google never releases anything "new" until it's been tested in the wild for months or years. Guarded because the "right" recommendation for clients is never quite as black and white as people think it will be.
Posted by erik at 09:18 AM
June 04, 2008
A Guide to Robots Exclusion Protocol 
Google's Prashanth Koppula wrote a ready-to-bookmark post over at the official Webmaster Tools blog, showing tons of different robots-exclusion protocol (REP) directives that can be implemented in various ways. Following is a listing of directives discussed and the methods of implementation:
Directives for the robots.txt file:
- Disallow
- Allow
- $ Wildcard
- * Wildcard
- Sitemaps location
Meta tags for insertion into HTML:
- NOINDEX
- NOFOLLOW
- NOSNIPPET
- NOARCHIVE
- NOODP
Of special note are the two different wildcard uses; the post links to usage models for each. One additional funny bit is in the explanation of NOARCHIVE, in which the post describes the tag's usage as "Do not make available to users a copy of the page from the Search Engine cache." Contrast this with "Do not cache the page," which I believe is most people's idea of the tag's effect. I love little semantic hooks like that.
The post notes that the directives above are observed by Google, Yahoo, and MSN/Live, which is a nice bonus. In addition, the post discusses some directives that only Google honors, such as UNAVAILABLE_AFTER (which I discussed about a year ago), NOIMAGEINDEX, and NOTRANSLATE.
I appreciate what engines are doing with the REP advancements. It's the equivalent of the basic Robotstxt.org protocol being the vehicle, but the engines have become after-market accessory specialists, showing you how to get additional mileage, power, and stunts out of your car.
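For auditing which of the meta-tag directives a page actually carries, a minimal sketch with Python's standard-library HTML parser might look like the following. The class and function names are my own, and the sample page is invented; it only reads the generic "robots" meta tag, not engine-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # Directives are comma-separated and case-insensitive.
        for token in (attributes.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the lowercased robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


SAMPLE_PAGE = (
    '<html><head><meta name="ROBOTS" content="NOINDEX, NOARCHIVE"></head>'
    "<body>Monthly archive</body></html>"
)
print(sorted(robots_directives(SAMPLE_PAGE)))
```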
Posted by erik at 08:07 AM
May 23, 2008
Client Hits Home Run With: URLs Matter! 

On this Friday before the long Memorial Day weekend, I thought I would share something very smart that a client recently said to me and a group of his colleagues.
In a nutshell, the discussion was about potentially rewriting dynamic URLs.
He said:
"One thing we can't forget is that URLs are marketing assets....they matter....they need to be friendly to both users and search engines."
For a moment, I felt like I was at Progressive Field, about to rise out of my chair and high five complete strangers around me after a Grady Sizemore home run.
I couldn't have said that better myself. Bravo!!!
Posted by doug at 02:01 PM
April 30, 2008
Real and Imagined Errors in Google Sitemap Feeds 
When you upload your XML sitemap feed to your server -- especially if it's GZipped -- don't expect it to look pretty. I got a nervous call from a client because when he called the XML feed URL in his browser, he saw this:

While it looks like an error, it's really not. Not in the traditional sense, at least. The error here is that your browser (in this case, Firefox) isn't able to view the file without a little help -- specifically, a stylesheet that tells it how it should look to human viewers.
The bottom line is that this message doesn't mean that engines can't read your XML feed -- only that you can't see it. To see whether Google can process it, for example, check the Sitemap Summary report. For some reason, this report isn't in the main GWT left nav. To find it, you need to click the "Details" link at the far right of the Sitemap Overview report. When you click that link, here's what you see:

Real sitemap errors do exist, even in the example I used above. In this case, I've inadvertently included in the sitemap a URL that I also excluded via robots.txt. So I'm sending Google a mixed message there. Fortunately, the robots.txt file overrides the URL's inclusion in the sitemap, so it ends up being more of a gentle nudge than a true, crippling error. If the error doesn't specifically say that the sitemap is invalid and unreadable, then it's probably not.
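This kind of mixed message can be caught before uploading the feed. The sketch below lists sitemap URLs that the site's own robots.txt excludes; the sitemap, the rules, and the example.com URLs are invented for illustration, and the standard-library robots parser only approximates how Google reads the file.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(sitemap_xml: str, robots_txt: str, user_agent: str = "Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not robots.can_fetch(user_agent, url)]


# Hypothetical inputs for illustration only.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/private/login.html</loc></url>
</urlset>
"""
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(blocked_sitemap_urls(SAMPLE_SITEMAP, SAMPLE_ROBOTS))
```

Any URL this returns is one to drop from the feed or to stop excluding, depending on which message you actually meant to send.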
Posted by erik at 08:52 AM
April 23, 2008
Update on Google Showing Excluded URLs as Sitelinks 
A little over a month ago, I wrote about Google showing robots-excluded URLs as Sitelinks. Here's a shot of what Google showed for the query [seo speedwagon] in mid-March:

The ip login link was (and is) excluded via robots.txt. A month prior (in February), a link to one of our monthly archives -- a page with the robots "noindex" meta tag -- appeared as a Sitelink also.
Since then, the SERP has been cleaned up. I use the passive voice because I don't exactly know who to thank. Either the algo picked it up on its own, or someone hand-washed it. Either way, it looks better now:

I'm not sure if we're an isolated case, so if you have any examples of excluded URLs still showing up in Sitelinks, please let us know in the comments.
Posted by erik at 08:12 AM
March 04, 2008
NYT Traffic Doubles, Revenue Grows Since Killing Subscriptions 
John wrote a few times last fall about the NY Times tearing down its paid subscription wall and allowing spiders in.
Now, in an interview at The Deal, Google's David Eun (on p. 5) confirms that it was a good idea:
We have some partners that have made very bold steps, such as The New York Times, which went from a pay model to a free model. After they went free, the traffic they got from us alone doubled. Their math says they make more money by offering content free to consumers, but stimulating demand and making it work with advertising. The Financial Times did the same thing, and at least early on in the process they experienced at least a 100% growth in traffic.
Don't hold your breath waiting for further breakdown of the math, especially for the NYT example. Note that while Eun says traffic doubled, he was less specific about the money, saying only that "they make more" under the current scenario.
It should be no surprise that it's Google -- not the Times -- telling us the good news about expanded indexation. After all, Google has more to gain from all of us knowing about it, because it now gets a slice of the pie:

Thanks to BeetTV via SearchCap.
Posted by erik at 10:51 AM
January 23, 2008
Duplicate Content - Thinking Inside the 'Big Box' Stores 
SEOs (myself included) love to preach the Organic Gospel as if knowledge is the barrier, not implementation. "If people only knew this or that," we say, "they'd be saved."
But a lot of times, knowing the right stuff only leads to the next barrier, which is, "How the heck do we DO it?" Various facets of site maintenance -- user experience and tracking, to name only two -- frequently compete with SEO techniques for front-burner attention.
As an example, I want to look at Circuit City's site and show a couple examples of things they're doing that aren't optimal from an organic SEO perspective, but that are probably necessary for other reasons. (I should note that I LOVE the CC site from a user's perspective. They -- and a lot of the Big Box stores, to be fair -- have a great way to narrow and expand the choices to help you find exactly the product(s) you're looking for.)
The first example, however, is one I'm fairly critical of, because I feel like whatever benefit they're gaining probably isn't worth it. The CC home page has two links to the main "TV & Home Entertainment" page -- one in the top nav, and one in the lower set of links. You can see the links in the following screen shot, and I've listed them afterward:

Here are the respective links. The dynamic strings diverge after N=20012866:
http://www.circuitcity.com/ccd/categorySpecial.do?catOid=-12866&N=20012866&c=1
http://www.circuitcity.com/ccd/categorySpecial.do?catOid=-12866&N=20012866&SESSIONLINK&cm_re=011308%20HOME%20PAGE%20A-_-navboxes%20TV-_-TV%20and%20Home%20Entertainment
The content of the pages is identical, so you know where this is going. They're diluting the potential power of each by having a similarly named mirror.
Let's look at a more complicated example. Consider two more URLs on the Circuit City site, and I'll contrast the query strings and the page contents. I've stripped out the "http://www.circuitcity.com" portion of the URL for brevity, but I've linked to them so you can see for yourself if you want. In addition, after each URL, I've shown the breadcrumb navigation from each page so you can see the subtle difference.
Page 1: TVs that cost $500-$999 that are Sony:
URL: /ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price|0||accm_grs_mgn_dllr|1/link/ref/N/20012866+20012867+312867003+40000229/link/ref/rpem/ccd/categorylist.do
Breadcrumb trail:

Page 2: TVs that are Sony that cost $500-$999:
URL: /ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price|0||accm_grs_mgn_dllr|1/link/ref/N/20012866+20012867+40000229+312867003/link/ref/rpem/ccd/categorylist.do
Breadcrumb trail:

The on-page content from these two URLs is identical. After all, no matter in what order you query the database, it should theoretically produce the same products (in this case, three specific TVs). But notice how the order of two parameters (312867003 and 40000229) is swapped in the two URLs, in effect causing a duplication. The content (and breadcrumb navigation) is generated based on the order in which the user selects search criteria. This makes a fantastic user experience -- no doubt about it. But it's hurting them subtly because engines either crawl too many pages and dilute each one's unique potential for ranking well, or, more likely, the bots hit a nav scheme like this, turn a few corners and crawl a handful of pages, then bail because they can recognize what a sinkhole it is.
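One way to avoid this kind of duplication is to emit filter IDs in a fixed order regardless of the order in which the user clicked. The sketch below is only an illustration: it assumes, from the shape of the URLs, that the path segment after /N/ holds the "+"-separated filter IDs, and the function name is my own. It says nothing about how Circuit City's platform actually builds these paths.

```python
def normalize_facets(path: str, key: str = "N") -> str:
    """Sort the '+'-separated IDs in the path segment that follows the key segment."""
    segments = path.split("/")
    for index in range(len(segments) - 1):
        if segments[index] == key:
            ids = segments[index + 1].split("+")
            segments[index + 1] = "+".join(sorted(ids))
    return "/".join(segments)


page_1 = (
    "/ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price|0||accm_grs_mgn_dllr|1"
    "/link/ref/N/20012866+20012867+312867003+40000229/link/ref/rpem/ccd/categorylist.do"
)
page_2 = (
    "/ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price|0||accm_grs_mgn_dllr|1"
    "/link/ref/N/20012866+20012867+40000229+312867003/link/ref/rpem/ccd/categorylist.do"
)

print(normalize_facets(page_1) == normalize_facets(page_2))
```

With a rule like this applied when links are generated, both click paths would lead to a single URL, while the breadcrumb could still reflect the order the user chose.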
About a month ago, a WebmasterWorld thread ($upport required) discussed a topic similar to this. Member PageOneResults discussed a client's site, which offers multiple paths and entry points to specific product URLs, with the final product URL varying based on the entry point used and the path taken to that product. Following is a response to his original post, followed by his reaction:
>>If the URL depends on the route taken through the site, then you have a major problem to figure out and fix.
Yes, we have a major challenge ahead of us in regards to the one example provided where there were 10 access points for one product. That takes into consideration 5 under www and 5 under non www which is what is happening.
I'm all for as many access points as can be possibly provided as to not hamper the visitor experience. And, as pointed out, as long as that product leads to the same URI from all access points, life is sweet. But, that is not the case...
One important note is that PageOneResults is aka Edward Lewis, who runs SEO Consultants (of which Intrapromote is a proud member). Edward has probably forgotten more about SEO in the last 12 hours than I have ever known, so when he asks for input, it's not due to lack of knowledge. The bottom line is, this stuff can get extremely complicated regardless of your SEO knowledge level.
Posted by erik at 11:35 AM
December 19, 2007
Supplemental Index, We Hardly Knew Ye 
The Google Webmaster Central Blog put another nail (the final one?) in the coffin of Google's infamous Supplemental Index just now by declaring it fully immersed into the main index:
We improved the crawl frequency and decoupled it from which index a document was stored in, and once these "supplementalization effects" were gone, the "supplemental result" tag itself -- which only served to suggest that otherwise good documents were somehow suspect -- was eliminated a few months ago. Now we're coming to the next major milestone in the elimination of the artificial difference between indices: rather than searching some part of our index in more depth for obscure queries, we're now searching the whole index for every query.
This is, in my opinion, much more significant than the prior act of simply removing the "Supplemental Index" label. The main problem has never been the label applied (or not applied) to URLs, but the fact (or at least the fear) that SI pages were being given short shrift in their efforts to contend for queries. So what's the intended result?
From a user perspective, this means that you'll be seeing more relevant documents and a much deeper slice of the web, especially for non-English queries. For webmasters, this means that good-quality pages that were less visible in our index are more likely to come up for queries.
Of course the onus doesn't fall entirely on Google here. SI pages were SI for a reason. If you think they're worth ranking for, the old rules still apply. Make sure you remove any obstacles to crawling and indexing that may remain, and try to get some additional links -- internal and external -- pointing to them.
Posted by erik at 12:29 PM
November 28, 2007
Are You A Canonical Fascist? Stand Tall! 
We are sticklers with our clients when it comes to issues of content duplication, sometimes to the point, I think, of being viewed as Canonical Fascists. This can be annoying, much like fascism mostly can be annoying, so it is gratifying to see Mr. Google himself lay out just why such annoyance is worthwhile advocacy, even approaching the subject of PageRank Splitting in the process:
When I did a wget from the Googleplex, I eventually got a 301 from the seomoz.com url to the seomoz.org url. But look at the timestamps: " --09:28:33-- " was the initial fetch and "--09:32:41--" was when the 301 came over the wire. Assuming that I'm reading right, that means almost a four minute delay on getting the 301 from seomoz.com to seomoz.org. Googlebot will wait around for several seconds for a page, but it won't wait four minutes. Instead, the connection will time out and we'll treat those urls as separate (and think that we couldn't fetch the seomoz.com url). So if a bunch of people are linking to your article, and some link to seomoz.org and some link to seomoz.com, that PageRank is getting split between two urls, and the long delay on the 301 response can cause Google to believe that the urls are separate and therefore cause dupe issues.
Hat tip to Randfish for calling forth such manna in his heavily commented comments area.
Posted by john at 03:45 PM
November 15, 2007
Tagging The Site Organic 
We spend a great deal of time on site structural issues with our clients, and one of the first things we usually do with a new client is try and transition them from thinking of SEO as a page-level concern to more of a holistic, organic discipline, one where we must try and understand the site architecture in its interdependent relationship between the whole and its parts. After all, organic is ultimately the moniker that won the day.
Almost invariably the large sites that we recognize are not living up to their potential are what we call top-heavy architecturally, in that the TLD so dominates all things search that even the main folder levels are all but invisible, let alone deeper, longer-tail-rich pages. As we explain the phenomenon we often find ourselves referring to blog structure, and how we might borrow some of the structural characteristics of a blog in discovering how to flatten out the top-heavy site. There are reasons blogs are so eminently crawlable.
One of those reasons is tagging, and I was pleased this morning to find a fellow tag-appreciator in Stephan Spencer, explaining his tag appreciation more eloquently than I have yet seen:
Tagging isn't just a tool for usability (even though it's typically mostly thought of in those terms), it's also a powerful weapon for search engine optimization. That's because tagging allows you to rejig your internal hierarchical linking structure, flowing the link juice more strategically throughout your site. And because those links are textual and keyword-rich, a tag cloud is far superior in terms of SEO to the traditional graphical navigation bar.
Bravo, Stephan. Long live tag conjunction!
Tagging The Site Organic
Posted by john at 07:41 PM
October 22, 2007
HTML Form Selections Being Indexed 
I had a client write in and ask a question:
"How do you prevent an engine from indexing a huge list of form selections?"
Initially, I started to put together my own list of different ways to block, move or get rid of huge amounts of worthless text, but then I dug a little deeper and realized this particular client's situation wasn't a problem at all.
For one thing, the pages in question are buried deep in the SERPs and only show in the index for this query because they support one instance of the “keyphrase”.
My reply…
The form selections are being indexed in these SERPs because the form HTML is the only place the "keyphrase" is mentioned on those pages. Google is taking the only text around that “keyphrase” and using it in the SERP. If the "keyphrase" weren't in that form selection (which Google reads as text), that page would never even come up for that query.
1. Create content on the page that supports the “keyphrase” and Google will use it instead.
2. Move the form selections into a script that isn't crawled, such as JavaScript, so you can hold the text in a separate file. This option will decrease your indexed page count for that query, because the one instance of the word that Google was using for the SERP will be gone (unless by chance there is some obscure link out there pointing to the page with the “keyphrase” inserted).
3. Not a huge concern. We are more worried about what comes up in the first few pages of the SERPs… not the last few.
Yahoo has code to stop the indexing of text snippets but Google still does not.
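To confirm that a page really does mention the "keyphrase" only inside its form selections, something like the following sketch can help. It is a rough illustration using Python's standard html.parser module; the class name, the sample markup and the choice to treat only `<select>` lists as form selections are all assumptions made for the example.

```python
from html.parser import HTMLParser


class KeyphraseLocator(HTMLParser):
    """Count keyphrase occurrences inside and outside <select> lists."""

    def __init__(self, keyphrase: str):
        super().__init__()
        self.keyphrase = keyphrase.lower()
        self.select_depth = 0   # > 0 while the parser is inside a <select>
        self.in_form_list = 0   # matches found in <option> text
        self.in_body_copy = 0   # matches found anywhere else

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            self.select_depth += 1

    def handle_endtag(self, tag):
        if tag == "select" and self.select_depth > 0:
            self.select_depth -= 1

    def handle_data(self, data):
        hits = data.lower().count(self.keyphrase)
        if self.select_depth > 0:
            self.in_form_list += hits
        else:
            self.in_body_copy += hits


def locate_keyphrase(html: str, keyphrase: str) -> dict:
    """Report how often the keyphrase appears in form lists vs. other text."""
    parser = KeyphraseLocator(keyphrase)
    parser.feed(html)
    parser.close()
    return {"form_list": parser.in_form_list, "body_copy": parser.in_body_copy}


# Made-up sample page: the phrase appears only in an <option>.
SAMPLE_PAGE = """
<html><body>
<p>Choose a model below.</p>
<form><select name="model">
  <option>Red widget</option>
  <option>Blue widget deluxe</option>
</select></form>
</body></html>
"""

print(locate_keyphrase(SAMPLE_PAGE, "blue widget"))
# {'form_list': 1, 'body_copy': 0}
```

A result with zero body-copy matches points to option 1 above: add supporting copy to the page and the form text stops being the only candidate for the snippet.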
As complicated as stuff can get sometimes, it’s nice to see the simple things come across my desk every once in a while.
HTML Form Selections Being Indexed
Posted by james at 07:32 PM
October 12, 2007
Site Down Time Causing Drops in the Google Index 
It's been a very busy week in Campaign Director land and now I'm ready to enjoy the weekend with my family. I've had to deal with many issues over the past few days, but one particular problem was a client's site down time and the effect this can have in the Google index.
Recently this client had been going through server changes and routine maintenance. Occasionally the site would be down for anywhere from a few minutes to a few hours at a time. Over the course of a few weeks they would lose and then regain positions (and indexed pages) in Google. The whole cycle (crawl, server-down error, pages starting to be removed from the Google index, reappearance in the Google index) took only a few days.
This topic was discussed earlier this year (Site Down Time Can Remove You From Google Index) - (Site Downtime Can Delist You From Google) and it was understood then that:
1. Google can remove your pages from the index if they are down even temporarily.
2. "Googlebot will try a few times before the pages drop from the index." ~ Vanessa Fox
3. "As for how long it takes for a page to get back in once the site is back up, that really depends on a number of factors, such as how often the site is crawled in general." ~ Vanessa Fox
4. "Sometimes temporarily down pages turn into truly-gone-forever pages, so we have to drop those pages at some point. But it’s also true that we go back and revisit those pages pretty often and try to recrawl them in case the site comes back up." ~ Matt Cutts
I know this isn't anything new but it's an issue that I currently deal with and will probably continue dealing with in the future. Site down time has been a factor when trying to diagnose index problems with client sites and will stay in my checklist. It took my client a few days to reappear in the Google index and in those few days they lost major positions and traffic. For some Ecommerce sites, this "little" problem can turn into a major situation in a hurry.
Monitor your uptime, try to keep the downtime to a minimum, and make sure your server returns the proper HTTP status (error) code while it is down.
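As a sketch of what a "proper" status might look like during planned maintenance: HTTP defines 503 Service Unavailable for temporary outages, optionally with a Retry-After header. The post itself does not name a code, so reading it as 503 is my assumption, as are the one-hour retry window, the port and every name in the example, which uses only Python's standard library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 3600  # assumed maintenance window; adjust as needed


def maintenance_response():
    """Build the status, headers and body for a planned-downtime reply."""
    body = b"Down for maintenance. Please try again shortly.\n"
    headers = {
        "Retry-After": str(RETRY_AFTER_SECONDS),
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    return 503, headers, body


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answer every GET with 503 rather than a 200 error page or a 404."""

    def do_GET(self):
        status, headers, body = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080):
    """Run the placeholder server; call this by hand during maintenance."""
    HTTPServer(("", port), MaintenanceHandler).serve_forever()
```

The point of the design is that a crawler hitting the site mid-maintenance sees "temporarily unavailable" instead of "gone" or "here is a page that says sorry".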
See ya Monday, enjoy!
Site Down Time Causing Drops in the Google Index
Posted by james at 03:24 PM
October 01, 2007
Google Search Results Already Finding Columnist Articles 
Frank and Maureen and Thomas, oh my!
The chipped cement still has yet to be cleaned up fully from the wall being torn down at that historical error known as TimesSelect, and already we are seeing NY Times columnists able to commune with readers freely at point of search, at least at the Frank and Maureen level:


As internet titan Alan Meckler noted in his posting of the Times e-mail to subscribers, search results like these were the driving force:
Since we launched TimesSelect, the Web has evolved into an increasingly open environment. Readers find more news in a greater number of places and interact with it in more meaningful ways. This decision enhances the free flow of New York Times reporting and analysis around the world. It will enable everyone, everywhere to read our news and opinion - as well as to share it, link to it and comment on it.
Sharing it, linking to it, and commenting on it are the currency of being able to find it in search, and that might be important to a newspaper if, as the latest surveys indicate, 91% of adults use a search engine to find information and 72% get news therefrom.
Ya think?
LATE UPDATE: We just noticed that, similar to 1989, another Eastern Bloc Web site is about to topple...
Google Search Results Already Finding Columnist Articles
Posted by john at 04:15 PM
September 24, 2007
Topix TLD Migration -- Six Months Later 
I'm a big fan of Rich Skrenta, co-founder of NewHoo (née Gnuhoo, which eventually took the more recognizable name DMOZ), co-founder of Topix.net, and -- what may be the coolest of all -- author of one of the first known computer viruses, one of the few to be written before the actual term "computer virus" was even coined.
So that's all very cool, but the search-related part of all this was how, six months ago, Topix finally purchased the .com version of its domain and decided to make the move away from .net. The Wall Street Journal, in a mainstream SEO article that actually managed to hit most of the salient points pretty accurately, highlighted Skrenta's anxiety at the global domain change:
Such a simple change, Mr. Skrenta has discovered, could have disastrous short-term results. About 50% of visits to his news site come through a search engine -- and about 90% of the time, that is Google. Some companies say their sites have disappeared from top search results for weeks or months after making address switches, due to quirky rules Google and other search engines have adopted. So the same user who typed "Anna Nicole Smith news" into Google last week and saw Topix.net as a top result might not see it at all after the change to Topix.com.
Like a lot of SEOs, at the time I wondered what was so "wrong" with the time-tested (at least in my experience) method of full-on 301 redirects from the old site to new -- especially since the code would be short and sweet, with each old .net URL going directly to its .com counterpart.
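To show how short that one-to-one mapping can be, here is a minimal sketch. It is not Topix's code or any particular server's configuration; the hostnames with the www prefix, the sample path and the function name are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "www.topix.net"  # assumed hostnames, for illustration only
NEW_HOST = "www.topix.com"


def permanent_redirect(url: str, old_host: str = OLD_HOST,
                       new_host: str = NEW_HOST):
    """Return (301, new_url) for URLs on old_host, or None for anything else."""
    parts = urlsplit(url)
    if parts.hostname != old_host:
        return None
    # Swap only the host; path, query string and fragment carry over as-is.
    new_url = urlunsplit(
        (parts.scheme, new_host, parts.path, parts.query, parts.fragment)
    )
    return 301, new_url


# The path below is made up for the example.
print(permanent_redirect("http://www.topix.net/city/detroit-mi?page=2"))
# (301, 'http://www.topix.com/city/detroit-mi?page=2')
```

Swapping the two host arguments gives the reverse direction, from the .com back to the .net.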
The Topix crew had apparently heard too many domain migration horror stories. On his own blog, Skrenta noted,
...there've been a whole bunch of the seo posts saying essentially "hey, it's easy to move a domain, you just 301 it." Of course I know about 301 and 302 redirects. The problem is that half of these people follow up and say "you'll only be out of the index for a few months". They also ignore the problems that big sites have. A redirect for a small site may work great, but if you have hundreds of thousands of pages or more, there are lots of cases where this caused some form of not-in-the-index-anymore doom. The number of seo consultants who claim to know how to move a 100k+ page site is much smaller than the number who have actually done it.
That last point is a good one. Conventional wisdom in SEO is frequently spawned by 5% research and 95% extrapolation, which is often the best you can do. The other dirty little secret of SEO is that when you start seeing the same sort of anecdote often enough, it's tempting to put it in the "research" column.
Still, I'd not seen or heard of the type of monumental tragedies that Skrenta was talking about (at least within the last few years), and neither had Danny Sullivan:
I still remain surprised that the 301 is that much of a problem for even a big site. I just haven't heard of that trouble, of half the people saying you'll be out or whatever. If that's what you had been hearing, I can understand your concern. But it seems a pretty straight-forward change, and it shouldn't even be a burden on the server in that you're not actually talking about 100,000 of physical redirects that have to be created and check but a change of one domain to the other.
Ultimately, Topix listened to its heart (or maybe its board) and surprised me by opting for all-out duplication, running identical content on the .net and .com sites, avoiding any sort of redirection plan. So to call it a TLD migration isn't quite accurate. It's more like Topix.net bought a summer house and called it Topix.com. Again, from the WSJ article:
Concerned about that [redirection] strategy, Topix has run its site at both Topix.net and Topix.com for awhile. One danger with that approach is that it is unpredictable; Google will see two versions of the same page and could choose to show the Topix.net page most prominently.
This course would appear to run contrary to the advice even Matt Cutts gave in the article:
Google's Mr. Cutts says the search engine should ultimately understand what is going on when a site changes its Web address. He says the best strategy is to move one section of the site to the new address and see what happens before switching the whole thing.
Skrenta has since left Topix, but the duplicated domain strategy hasn't. Go to any page on the Topix.net site, and you'll see that the exact same page exists on the .com version of the site. Currently, Google shows about 2 million Topix pages indexed on the .net TLD and about 1.2 million on the .com TLD.
So six months ago, had you asked any SEO "expert" about what to do, almost no one would have suggested the present course. But has it hurt anything? Maybe, maybe not. Topix.com has a PR6 home page with about 1.2 million inbound links, while Topix.net has a PR8 home page with almost 7.4 million inbound links. So at this point, in terms of raw accumulated power, consolidating those domains would create one very powerful site.
But would that be better than the current situation? Perhaps not. What about that "unpredictability" the WSJ (and zillions of SEOs) talk about with dupe domains? That even in the best case, Google will pick one page or the other at its own discretion -- and it might not be the one you want? Consider the SERP for [detroit local news]:

A page from Topix.net shows up in spot 8, and a page from Topix.com -- an exact mirror of the .net page -- comes in at spot 9. Not exactly Google choosing one over the other.
But if the pages are identical, why do they have different titles on the SERP? Because Topix.com is heavily linked to from DMOZ, and the .com page shown in this screen shot shows the title used on the Detroit News and Media category of DMOZ. This leads to a question: When Google pulls one page's title and/or meta description from DMOZ, does that override the duplication filter?
Detroit's hardly alone here. Run some searches with a city name and "news" or "local news" and see for yourself.
There's really no moral to this story, other than every time something like this happens, the "guidelines" from engines lose more and more bite. I'm happy that (at least according to my superficial research) Topix is doing well; it's a great idea, smartly executed. But these SERPs are yet another frustrating case of mixed signals from engines.
In my opinion, in a case like this, once Topix owned the .com version, that was all that needed to happen. I don't think the .com even needed to have any content for the problem to be solved. Looking back, my advice would have been to keep all content on Topix.net and immediately set up a 301 from Topix.com to Topix.net -- the exact opposite direction that most people recommended. That way, the authority of Topix.net content would never have been in jeopardy, and any links that mistakenly found their way to Topix.com would immediately transfer link popularity (as well as the user) back to the main (.net) site. They could have even promoted the site as "Topix.com" with no major headaches. Almost no one would care (or even notice) if they were redirected, whether it was type-in traffic or someone that clicked over from a news story.
Topix TLD Migration -- Six Months Later
Posted by erik at 03:14 AM
September 18, 2007
Search Tearing Down Walls Like It's 1989 
We knew it was coming, and we tried to bake a cake for Maureen Dowd more than a month ago, yet we are still surprised at how search-friendly The New York Times is being in its explanation today:
What changed, The Times said, was that many more readers started coming to the site from search engines and links on other sites instead of coming directly to NYTimes.com. These indirect readers, unable to get access to articles behind the pay wall and less likely to pay subscription fees than the more loyal direct users, were seen as opportunities for more page views and increased advertising revenue.
If you have any doubt that this is the SEO equivalent of 1989, scroll a bit further down the page for this money quote:
The Wall Street Journal, published by Dow Jones & Company, is the only major newspaper in the country to charge for access to most of its Web site, which it began doing in 1996. The Journal has nearly one million paying online readers, generating about $65 million in revenue. Dow Jones and the company that is about to take it over, the News Corporation, are discussing whether to continue that practice, according to people briefed on those talks. Rupert Murdoch, the News Corporation chairman, has talked of the possibility of making access to The Journal free online.
Mr. Murdoch, tear down that wall!
Search Tearing Down Walls Like It's 1989
Posted by john at 03:53 PM
September 12, 2007
302 Redirects: A Guide to Search-Friendly Usage 
Here are a few of the "absolutes" in the SEO field:
- Never use Flash
- Never use cloaking
- Never use 302 redirects
These are, of course, wild oversimplifications. Equating Flash or cloaking with bad SEO is like classifying Hank Aaron as a poor hitter because he batted cross-handed.
So when is a 302 redirect better than a 301 redirect? Here are a few examples. Keep in mind that in these scenarios, the 302 is not the only possibility to achieve the desired results. Plenty of CMS options likely exist that will do the same thing. It all depends on your setup. But in each of these cases, the 302 is certainly better than a 301. Let me say that one more time. Each of these scenarios likely has several non-redirect options that perform the same task. But the point of this post is that the 301 is not always the redirect you want, and sometimes it can hurt you if you buy into the "301 is the only good redirect" argument.
Example 1: New Products, Fresh Content. You run a cell phone information site, and one of your targeted phrases is [newest cell phones]. You have a URL called /newest-cell-phones.php, which is your users' go-to page for the latest in cellular technology.
For the last few days, the /newest-cell-phones.php page has redirected (via 302) to /lg-vx8350.php, which is the latest phone you've torn apart and reviewed. At the same time, you also have a static link in the LG portion of your site directly to /lg-vx8350.php, because you want to get that content crawled on its own as well. You're not particularly worried about dupe content issues, because tomorrow, you'll be done with your /nokia-2610.php page, and it will then become the target URL for the /newest-cell-phones.php redirect.
Example 2: Restaurant Content, Lazy-Susan Style. You run a popular restaurant with a large user base who check in each morning to see the day's menu. Because you buy local and fresh, you often have only a few days' notice about what you'll serve on any given day. You've done some user testing and found that your customers HATE the dreaded "giant PDF menu download" they find at most restaurant sites. Consequently, you have a link on your home page directly to a URL called /todays-menu.htm. In addition, you have seven other pages:
- /sunday-menu.htm
- /monday-menu.htm
- /tuesday-menu.htm
- /wednesday-menu.htm
- /thursday-menu.htm
- /friday-menu.htm
- /saturday-menu.htm
On Monday, for example, /todays-menu.htm uses a 302 redirect to /monday-menu.htm. The next day, it redirects to /tuesday-menu.htm, and so on.
Would a 301 work here too? Absolutely not. You want /todays-menu.htm to remain in the indexes and rank for terms like [restaurant name menu]. You do NOT want a URL like /wednesday-menu.htm to become associated with [restaurant name menu], because it's counterintuitive for such a URL to appear in SERPs (at least six-sevenths of the time!), and you have no control over what URL the engines will choose to replace /todays-menu.htm in the index, since you don't control what day they crawl it.
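The restaurant scenario can be sketched in a few lines. This is only an illustration of the logic, not a recommendation for any particular server setup; the function name is made up, and the page names come from the list above.

```python
import datetime

# Index matches date.weekday(): Monday is 0, Sunday is 6.
DAY_PAGES = [
    "/monday-menu.htm", "/tuesday-menu.htm", "/wednesday-menu.htm",
    "/thursday-menu.htm", "/friday-menu.htm", "/saturday-menu.htm",
    "/sunday-menu.htm",
]


def todays_menu_redirect(today: datetime.date):
    """Return the (status, Location) pair for a request to /todays-menu.htm."""
    # 302 rather than 301: the target changes every day, and the URL that
    # should stay in the index is /todays-menu.htm itself.
    return 302, DAY_PAGES[today.weekday()]


print(todays_menu_redirect(datetime.date(2007, 9, 10)))  # (302, '/monday-menu.htm')
print(todays_menu_redirect(datetime.date(2007, 9, 12)))  # (302, '/wednesday-menu.htm')
```

The same shape fits Example 1: replace the weekday lookup with whatever names the newest review page.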
So what do these examples have in common? Here are some guidelines when deciding whether 302 is the way you want to go:
You could consider using a 302 when, for the following redirect:
URL A ---> URL B
- ...it is important that URL A be indexed and remain indexed.
- ...it's not critical that URL B be indexed, but its content is very helpful for users.
- ...you have several different URLs that might fit logically in the URL B spot above.
- ...you have put some resources into strong internal and directory linking for URL A.
As I said before, a 302 is not the only way to achieve this. Plenty of dev environments and independent scripts will enable you to achieve this also, but depending on your setup, a 302 might be the easiest way.
A little background on the often-misunderstood 302: If you spend any time slogging through HTTP header documentation (and who doesn't?), you've probably noticed that the 307 -- not the 302 -- is the true "temporary redirect." But older clients apparently don't always know how to handle a 307 properly, and the 302 does more or less the same thing.
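If you'd rather not slog through the header documentation, Python's standard library carries the official reason phrases, which makes the naming point above easy to see. The helper name is just for this sketch.

```python
from http import HTTPStatus


def redirect_phrases(codes=(301, 302, 307)) -> dict:
    """Map each redirect status code to its standard reason phrase."""
    return {code: HTTPStatus(code).phrase for code in codes}


for code, phrase in redirect_phrases().items():
    print(code, phrase)
# 301 Moved Permanently
# 302 Found
# 307 Temporary Redirect
```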
Plenty of development environments use 302 redirects to direct users to the "appropriate" version of the home page when the root is called. For example, if you go to www.adidas.com, you'll be redirected (twice, actually) to the appropriate language and country version of the home page -- in my case, /us/shared/home.asp. This, in my opinion, is not the best environment for a 302 redirect. (See Bill Slawski's excellent analysis of 302s at domain roots from earlier this year.)
The downside to the whole approach I've laid out here is that it's sometimes hard to get good external links to "URL A," because by the time users click it, they've been redirected, and if they copy the URL from their browser, it's "URL B." Some good directories will allow you to submit a URL that redirects, as long as it redirects to a page on the same domain. But to get people to link to URL A, you'll need to make a conscious effort to give them the correct URL.
302 Redirects: A Guide to Search-Friendly Usage
Posted by erik at 04:55 PM
September 05, 2007
Boost Visibility With XML Sitemap Submission to Ask.com 
We have been keeping a close eye on Ask.com and are seeing traffic increases from Ask.com for some of our clients. In fact, one e-commerce client over this summer has seen transactions more than double from Ask.com referrers. A 100% increase in transactions - now that really gets our attention!
We've told Wagon readers about submitting an XML sitemap to Google and to Yahoo, and updated readers on the potential of submitting to MSN this fall. I thought I would remind readers that you can also submit your XML sitemap to Ask.com, and provide some simple how-tos.
There are two ways to submit your XML sitemap to Ask.com:
1. Use the auto-discovery directive in your robots.txt file:
SITEMAP: http://www.yoursitemapurl.xml
2. Submit your sitemap via Ask.com's ping URL:
http://submissions.ask.com/ping?sitemap=http://www.yoursitemapurl.xml
Lastly, make sure you're using the accepted sitemaps protocol.
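Both methods come down to building one string each, as in the sketch below. It only constructs the strings and sends nothing. The example.com address is a placeholder, the function names are made up, and percent-encoding the sitemap URL inside the ping address is a cautious assumption on my part (the unencoded form is what the example above shows).

```python
from urllib.parse import quote

# Ping endpoint as given in the post.
ASK_PING_ENDPOINT = "http://submissions.ask.com/ping?sitemap="


def robots_txt_line(sitemap_url: str) -> str:
    """Method 1: the auto-discovery line to add to robots.txt."""
    return "SITEMAP: " + sitemap_url


def ask_ping_url(sitemap_url: str) -> str:
    """Method 2: the ping URL, with the sitemap address percent-encoded."""
    return ASK_PING_ENDPOINT + quote(sitemap_url, safe="")


sitemap = "http://www.example.com/sitemap.xml"  # placeholder address
print(robots_txt_line(sitemap))
# SITEMAP: http://www.example.com/sitemap.xml
print(ask_ping_url(sitemap))
# http://submissions.ask.com/ping?sitemap=http%3A%2F%2Fwww.example.com%2Fsitemap.xml
```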
Boost Visibility With XML Sitemap Submission to Ask.com
Posted by doug at 03:44 PM
August 30, 2007
New MSN Tools Coming Includes Sitemaps 
Back in April, I updated Wagon readers on the latest in Sitemap and Sitemap Protocol news and now it's time for a quick update.
Last week on MSN Live Search's blog, MSN trumpeted their new Webmaster Portal that will allow sitemap creation and submission. A beta program has been started and the Wagon has applied for a test drive.
Along with these sitemap features, MSN announced that the Webmaster Portal will also include crawling and indexing tools as well as statistics about web sites. As we've said in the past, these statistics can be very helpful.
Stay tuned for news on the beta program. Official launch of the Webmaster Portal is expected in early Q4.
New MSN Tools Coming Includes Sitemaps
Posted by doug at 09:15 AM
August 15, 2007
NY Times Select(s) Death over Charade 
As you probably know, the NY Times has been the most prominent experiment in the paid content-behind-a-firewall-yet-at-least-partially-indexable model, and they are indeed now, finally, announcing via trial ballooning they are no longer going to put their most popular columnists behind that magic curtain one has to pay to sweep aside. After the magic show ends and the same fingers which initially drew the curtain are finished being pointed this way and that, this failed experiment will have had much to do with the principles of Link Building.

A party-goer cloaks her content as Maureen Dowd. Found on Flickr. Copyright 485i
First a great quote that helps explain the decision's relevance to our industry:
But the truth of the matter is that you get far more eyeballs when you're not locking away your content from the general public. The reality of Web 2.0 news is that a rising tide raises all the ships. If you've got good content, and the Times does, people will link to it. When people read a technology blog like Engadget or a political blog like Daily Kos and find links to articles at the New York Times, everybody wins. Keeping your archives, op-eds, and other content locked up means that blogs and news sites won't link to you, won't give you credit for finding a story first, and won't drive up your traffic.
This lack of inbound links to the content-behind-the-firewall damaged traffic to the site not only through a paucity of visitors being able to click on these links to the columns themselves...:
...the share of traffic that the NY Times sends to NY Times Select has been decreasing over the past year – down by 16% year-on-year in July. With NY Times Select receiving more than two thirds (67%) of its US traffic from NYTimes.com, the decline had an impact with US visits to NY Select down 22% in the past year.
...in having to rely far too heavily on the parent site rather than third party links for traffic, but also in the residual effect such had in these columns' search engine visibility. With few third party inbound links accumulating with each new column, in fact from a deliberate online community decision not to link to content-behind-a-firewall, it is also very difficult for each new column to be judged more relevant than similarly themed columns emerging on the same topic that immediately acquire inbound links in the form of the same online community recommending them. It's no wonder the Times Select had to rely so heavily on clicks from the parent site for visits, as a great many of those visits were likely already subscribers. In that situation it is difficult to grow at the rate of the internet. Try these two simple searches for Frank and Maureen alone: nary a column to be found. Haven't they written quite a few?
I think everyone likely to read this blog knew this would happen. But to say we knew it would happen ultimately is not to say we are not happy to see even giants felled by an algorithm rejected, not select(ed).
NY Times Select(s) Death over Charade
Posted by john at 03:35 PM
August 08, 2007
Download all query stats for this site (including subfolders) 
I get the feeling that most people, even in our industry, using Google Webmaster Tools for themselves or a client aren't scrolling far enough on the Query Stats page to reach this link:
![]()
What you get if you click is rather unwieldy, sure, especially if you are dealing with a very large site, but the payoff is large to the same degree. We are beginning to view it more and more here as a kind of matrix for how Google views your site architecturally, especially in light of GSI now having been moved to an undisclosed location. Actually, now that I've said it I'm a bit afraid it, too, will be taken away...
Download all query stats for this site (including subfolders)
Posted by john at 02:59 PM
August 01, 2007
Google Supplemental Index Label Formally Dropped 
Yesterday's announcement from Google's Official Webmaster Central Blog represents a formal declaration of what Matt Cutts hinted at in a recent SEOMoz post: The "Supplemental Result" label that used to appear for pages in Google's "junior varsity" index will no longer appear.
Do not misinterpret this as "Supplemental Result pages no longer exist." They still do, if you read the Google post with subtlety. The gist of it is that Google crawlers are now able to cruise through these pages with more frequency and reliability. This apparently negates the need for a special label: while these two indexes are certainly not treated the same, the differences appear to be waning.
Personally, I don't really mind that this is disappearing, because even in the best cases, it wasn't always easy to determine if pages were "really" in it, and people seldom agreed on what caused pages to be there (despite several Googlers saying exactly what got you there). Still, the presence of Supplemental pages forced webmasters to figure out some better ways to organize their sites, which is the silver lining.
With a good analytics program, you have the capability of seeing how many pages on your site are performing well or not performing at all. Seeing them labeled as "Supplemental" was a shortcut to diagnosis, but its absence is hardly cause for panic.
Google Supplemental Index Label Formally Dropped
Posted by erik at 07:06 AM
July 06, 2007
Google Weighs in on Image Replacement (sIFR) 
On the rare occasion when an engine expresses an actual opinion on a real technique, it's a welcome, welcome sight. So imagine my glee when I read Google Webmaster Central Blog's take on dealing with Flash.
While John lamented the mixed signals just last month, we've been asking the question internally for years: From a best practices standpoint, does image replacement stand safely in the DMZ of glorified CSS, or does it boldly encroach on the characteristic of "showing engines one thing and users another"?
And that's just the beginning. The real problem, when you're using image replacement, is not the insertion of stylized copy, but instead, what you do with the HTML text you're replacing. Some systems simply let it lie underneath the script-spawned Flash layer, while some use "hidden" status in CSS, while still others pull it off the visible screen and hard-code it somewhere in the -5000px range -- each of which is detectable and grounds for a good spanking if your motives are anything but pure. Traditionally, that left us with the worries of trusting the algorithm to detect our motives.
But forget all that, because today we know, and we know it based on the way all such things are Known -- because it's mentioned in an official Google blog, midstream in a list of "practical suggestions" about how to deal with Flash:
sIFR: Some websites use Flash to force the browser to display headers, pull quotes, or other textual elements in a font that the user may not have installed on their computer. A technique like sIFR still lets non-Flash readers read a page, since the content/navigation is actually in the HTML -- it's just displayed by an embedded Flash object.
This proclamation, coming on the heels of Independence Day, is fitting, because no longer are we bound by the tyranny of not knowing on whose side of the fight sIFR truly sits.
Google Weighs in on Image Replacement (sIFR)
Posted by erik at 07:20 AM
June 26, 2007
Checking Supplemental Index Status for URLs in Large Sites 
For sites with fewer than 1000 pages, it's possible (if monotonous) to see which URLs are in Google's Supplemental Index. Simply run a site: command for your domain (example) and scroll through the results pages until you start to see "Supplemental Result" next to some of the URLs.
But what if your site has 50,000 pages and the supplemental results don't start until the final 10,000? Even the fairly common site:domain.com *** -view query isn't totally accurate, and it's still subject to the 1000 URL display limit.
Depending on which case you find yourself in, it can be either tedious or impossible to detect whether a specific URL is Supplemental.
Using our blog site as an example, suppose I suspect -- but can't confirm -- that an old post about Yahoo Sitemaps is in the SI. A simple info: query doesn't tell you whether the URL is supplemental or not. For example, the following shot came from the query:
[info:http://seoblog.intrapromote.com/2006/11/an_update_on_ya.html]

Instead, a quick way to check Supplemental status is to pull a unique string from the URL in question (such as a folder or filename) and tack it into an inurl:-filtered site: query. In other words, the following shot came from this query, in which I added the filename (minus extension) into the inurl: command:
[inurl:an_update_on_ya site:seoblog.intrapromote.com]

In this result, note the Supplemental Index status.
The bottom line is to find an inurl: string that will quickly filter down the site: query results so that your specific URL shows up quickly.
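If you check URLs this way often, the query can be built mechanically. The sketch below simply takes the filename (minus extension) as the unique string, as in the example above; the function name is made up, and a URL that ends in a slash would need a folder name instead.

```python
import posixpath
from urllib.parse import urlsplit


def supplemental_check_query(url: str) -> str:
    """Build an inurl:-filtered site: query from a full URL."""
    parts = urlsplit(url)
    filename = posixpath.basename(parts.path)          # an_update_on_ya.html
    unique_string, _extension = posixpath.splitext(filename)
    return f"inurl:{unique_string} site:{parts.hostname}"


print(supplemental_check_query(
    "http://seoblog.intrapromote.com/2006/11/an_update_on_ya.html"
))
# inurl:an_update_on_ya site:seoblog.intrapromote.com
```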
Checking Supplemental Index Status for URLs in Large Sites
Posted by erik at 08:50 AM
June 14, 2007
The Google Supplemental Index Inbound Link(s) Threshold 
Wagon Rider Pat Fusco penned a great primer this month on the issue of Google’s Supplemental Results as they are tied to duplicate content, if you’d like to orient yourself first. Our own Wagoneer Doug offers some points on Getting Out of Hell Free (that is, at least, without requiring direct payments to Google), the last of which involves examining backlinks and the fact that a great number of the GSI pages we have examined across the massive sites we spend time with each day seem to share the commonality of zero inbound links.
We are finding, increasingly, that the distance from zero to one in terms of inbound links to a page seems to be much more of a threshold for exiting the Google Supplemental Index than, say, 2 to 100. This is not to say that 1 gets you out, bada bing, but that there is a great deal more love granted from Google on that single giant step from nil to 1 than there seems to be on the next link steps a page takes out of infancy.
A baby’s first steps are much more exciting and remarkable than the subsequent toddling around the room that follows, and it may be helpful to think of Google watching a page with no links in the same manner.
Posted by john at 08:49 PM
June 12, 2007
A Quick Route to the Google Text Cache 
I am constantly looking at cached copies of web pages through Google's cache. It's a wonderful way to double-check that your content is showing up the way you want it to and quickly tell if Googlebot has noticed any recent changes you've made. It's also a great way to sniff out some rather "brave" techniques on behalf of your competition.
As a primer, you can view the cached version of any web page by typing cache:URL in a Google search box, where URL is any full URL string. For example,
cache:www.united.com
will show a cached copy of the home page of United.com:

But beyond the normal cached version, I prefer to look at the text-only cache. The pink rectangle above shows the link to the text-only cached version. Clicking this shows you a version of the page much more like what a robot really sees.
But sometimes because of the site layout, the box at the top of the cached page doesn't appear. Consequently, the link to the "text only" version of the cached copy is hidden -- as in this sample shot from the BMW USA home page:

When this happens, the text-only cached version is still not hard to find. Simply append
&strip=1
to the URL of the regular cached version, and you'll see the text-only cache.
Now to take this to the next level (if you're a Firefox user ... and you are, right?), here's how to use Firefox Quicksearch Bookmarks to find the cache of a page so fast you'll amaze your friends and stun your competition.
(If the concept of Quicksearch Bookmarks is new to you, get some background here and here. It's a way to search any site from the Firefox address field. Trust me: If you search a lot, you will LOVE this.)
When you create a new Quicksearch bookmark, here's the data to use:
Name: Google Text Cache
Location: http://72.14.209.104/search?q=cache%3A%s&strip=1
Keyword: tc
So typing this in Firefox's Address field:
tc www.yahoo.com
will show you this. Pretty cool, huh?
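For anyone who would rather script this than bookmark it, here is a minimal sketch of the same substitution the Quicksearch bookmark performs. The template is copied from the Location field above; the function name is invented for this example, and nothing here checks whether that address actually responds.

```python
from urllib.parse import quote

# Same pattern as the bookmark's Location field, with %s swapped for {page}.
CACHE_TEMPLATE = "http://72.14.209.104/search?q=cache%3A{page}&strip=1"


def text_cache_url(page):
    """Return the text-only cache URL for a page (hypothetical helper)."""
    # Percent-encode the page so it is safe inside the query string.
    return CACHE_TEMPLATE.format(page=quote(page, safe=""))


print(text_cache_url("www.yahoo.com"))
```

The important part is the trailing &strip=1, which is what switches the regular cached copy to the text-only version.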
Posted by erik at 02:24 PM
June 05, 2007
SEO Case Study: Press Release Archives 
We recently worked on the press release archive for a pretty large company and I wanted to show an informal case study about what happened.
Here's the initial structure:
Each of the smallest document icons represents a specific press release. The issues with this architecture are fairly obvious when displayed graphically. It's a linear linking format, with the main press page showing the most current releases. You need to click a "next" button to hit a page linking to releases 11-20, 21-30, etc. etc. So the oldest releases were literally dozens of clicks away from the home page.
With this setup, here are the baseline stats. I can't use real numbers, but I'm hopeful that your algebra knowledge is current enough that variables will suffice:
- Total releases indexed: X (representing about 6% of total possible pages)
- Search traffic (monthly visits from search engines where the landing page is a press release): ~Y visits / month
Here's the modified structure:
The main press page now has child pages devoted to releases from specific years, as well as pages devoted to releases based on their subject area. Consequently, each release has links from at least two internal pages, and no release is further away from the home page than 3 clicks.
Stats 45 days after implementation:
- Total releases indexed: 16.8X (representing >98% of total possible press releases)
- Search traffic (monthly visits from search engines where the landing page is a press release): ~2.4Y visits / month
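To make the click-depth difference between the two structures concrete, here is a rough sketch that models each one as a link graph and measures the fewest clicks from the home page to every release. Everything in it is hypothetical: the page names, the 300-release archive, the ten-per-page pagination, and the round-robin assignment of releases to years and subjects are stand-ins, not the client's real data.

```python
from collections import deque


def click_depths(links, start="home"):
    """Breadth-first search: fewest clicks from `start` to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def linear_archive(n_releases, per_page=10):
    """Initial structure: paginated press pages chained by a "next" link."""
    n_pages = (n_releases + per_page - 1) // per_page
    links = {"home": ["press-1"]}
    for p in range(1, n_pages + 1):
        first = (p - 1) * per_page + 1
        last = min(p * per_page, n_releases)
        targets = [f"release-{i}" for i in range(first, last + 1)]
        if p < n_pages:
            targets.append(f"press-{p + 1}")  # the "next" button
        links[f"press-{p}"] = targets
    return links


def hub_archive(n_releases, years, topics):
    """Modified structure: the press page links to year and subject hub pages."""
    hubs = [f"year-{y}" for y in years] + [f"topic-{t}" for t in topics]
    links = {"home": ["press"], "press": hubs}
    for hub in hubs:
        links[hub] = []
    for i in range(1, n_releases + 1):
        release = f"release-{i}"
        # Each release is linked from one year page and one subject page.
        links[f"year-{years[i % len(years)]}"].append(release)
        links[f"topic-{topics[i % len(topics)]}"].append(release)
    return links


def max_release_depth(links):
    """Depth of the release that is furthest from the home page."""
    depths = click_depths(links)
    return max(d for page, d in depths.items() if page.startswith("release-"))


print(max_release_depth(linear_archive(300)))
print(max_release_depth(hub_archive(300, [2005, 2006, 2007], ["products", "finance"])))
```

With these made-up numbers, the deepest release in the linear layout sits 31 clicks from the home page, while the hub layout caps every release at 3.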
A couple important notes:
Don't fool yourself. This is content that people actually search for in pretty significant numbers. If you write press releases just to write press releases -- and your content doesn't have the pent-up demand to justify it -- don't expect results like these. Jill Whalen wrote a smart article about this very topic at Search Engine Land.
Sitemaps aren't a back door. For 6+ months prior to the change implementation, all press release files were included in a Google XML sitemap file. So don't expect a sitemap feed to dramatically increase your indexing if your site's architecture doesn't back it up.
Posted by erik at 03:50 PM
June 01, 2007
Flash, Javascript, CSS, Ajax, sIFR, and Textual Image Replacement... Oh My! 
Not just because I am somewhat easily confused, but as our title today suggests, the overlap, literal and figurative, among all of these web elements can be, and often is, the nexus of confusion in advocating Best Practices SEO to any client development team, whether in-house or, um... out.
Comes now a Flash Engineer at Google (on the YouTube side) with the most elegant writing to date on the lines of demarcation in what he terms modern web development philosophy:
First off, you need to embrace web standards. Semantic markup and separating content from style and behavior is the only way you should be building your sites. Many web standardistas have been recommending this method of web development for years, and rightly so. However, this post isn’t the place to go into the whys of this type of development, so I’ll skip that part and just say this about how it’s done: There are three areas of front-end web development: Content, Style, and Behavior. You should always keep these three things separated as much as possible.
Content, Style, and Behavior as three separate things. Makes it all much easier to put in place and figure out where one stops and the other begins. The money quote helps even further:
Progressive enhancement is a method of web development that goes hand in hand with Web Standards. You start with your HTML (your content), then add CSS (your look and feel), then add in additional behavior (Javascript, Ajax, Flash, any other interactivity that isn’t handled automatically by the browser).
Content. Style. Behavior. Trot that out next time everyone is looking at each other confused.
Posted by john at 11:11 AM
May 22, 2007
Google Hell: Is That Google Poo On Your Shoe? 
One of the hottest topics these days in SEO forums, message boards, and blogs is Google's Supplemental Index. We've had some hearty discussions here at Intrapromote about GSI!
If you have a lot of web pages in Google's Supplemental Index, then you may refer to it as "Google Hell". A kinder, gentler reference I've seen is "The Google Poo".
Hell or poo, both stink and no one wants to have anything to do with them.
Perhaps most interesting: I have a feeling that hundreds of thousands of web sites have stepped in the poo unknowingly.
Heaven or Hell
Google's Supplemental Index (GSI), as defined by Google:
Supplemental sites are part of Google’s auxiliary index. We’re able to place fewer restraints on sites that we crawl for this supplemental index than we do on sites that are crawled for our main index. For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index. The index in which a site is included is completely automated; there’s no way for you to select or change the index in which your site appears.
Did you get through that without falling asleep?
If you want to see if you have web pages in the GSI, simply replace "www.abc.com" below with your web address and search at Google for:
site:www.abc.com *** -view
Welcome To Hades. Have A Nice Day.
Here's what you need to know if you find you have pages in the GSI. Searches conducted at Google, with rare exception, will return pages in Google's Main Index. So, your pages in the GSI will have little opportunity to ever be found by potential site visitors. Pages in the GSI simply do not rank well.
However, when it comes to the GSI, you'll be glad to know that the wages of sin are not always death.
Google Reconciliation
Here's what to do if you want to smell better to Google. Take a close look at your web pages in the GSI:
First, check the quality of your web page content. Especially look to see if your pages contain duplicate content from other web pages in Google's Main or Supplemental index. The majority of the pages we've seen in the GSI have been banished there because they contain duplicate content. So, one way out of Google's lake of fire is to make sure your web pages have unique content.
Second, check the quantity of your web page content. We've seen a lot of pages in the GSI that have little or no content. Again, make your web pages as unique as possible.
Third, check the backlinks to your pages. We've seen a lot of GSI pages that do not have any links to them. If your page(s) are linkless, securing quality internal and external links from pages not in the GSI may prove to be your online salvation.
Lastly, once you have addressed the items above, resubmit your HTML or XML sitemap to Google.
Posted by doug at 02:29 PM
May 15, 2007
Rosie Now Powerful Enough to Mock SEO Speedwagon in The SERPs 
My oh my. We have to admit we were a tad worried here at The Wagon of backlash when we exposed Rosie O'Donnell's exact same URL appearing as result #'s 1 and 2:

What we didn't know, though, was that her media influence extended to being able to poke us in the eye with a self-conscious rejoinder of a description in her now magically changed #2 result:

Never mind that she doth seem to protest too much in her description. The sheer SEO power of this woman is breathtaking.
Posted by john at 05:25 PM
May 08, 2007
SEO Best Practices - International or Region-Specific Sites and Domain Issues 
Suggested Best Practices:
A good first question to ask is “Who exactly are we targeting?” If you are targeting a specific country, targeting a specific language-speaking audience, or your web site copy is specifically for a country or language-specific audience, use a ccTLD (country code top level domain) that relates to your target country rather than a general .com domain. For example, a ccTLD would look like www.domain.fr, www.domain.ca, www.domain.jp, or www.domain.co.uk. Always use ccTLDs for each language of your site.
Avoid having multiple language sites on the same domain, e.g., www.domain.com for English language content and www.domain.com/fr/ for French language content.
Make sure that there is not any duplicate content on your .com and any other sites.
Make sure your pages identify what language they are in, e.g., meta http-equiv="Content-Language" content="ja" for Japanese (the language code is "ja"; "jp" is the country code).
If you cannot use a ccTLD, use a subdomain, e.g., fr.domain.com. Google views a subdomain as a separate site.
Benefits Of Best Practices:
A ccTLD communicates to search engines the focus of your site.
A ccTLD is the quickest and most accurate way to communicate regionality to the search engines.
A ccTLD assigns more weight for local search. It allows your site to be more easily included in Google Canada, Google Mexico, etc.
Search engines tend to have higher confidence and often give a ranking boost to a ccTLD site for local searches. For example, Google France may give a more favorable ranking to a France-specific (.fr) site.
FAQ:
Q: What about using subdirectories such as www.domain.com/fr/? Can we do a 301 redirect from a subdirectory to a ccTLD, e.g., from www.domain.com/fr/ to www.domain.fr?
A: From a search engine perspective, it is always best to use a ccTLD. If a ccTLD is not possible, then consider using a subdomain. We do not recommend using subdirectories for international sites or language-specific sites.
Q: What does Google say about the use of TLDs, ccTLDs, subdomains, etc.?
A: “Use TLDs. To help us serve the most appropriate version of a document, use top level domains whenever possible to handle country-specific content. We're more likely to know that www.domain.de indicates Germany-focused content, for instance, than www.domain.com/de/.” (Source: Google Blog)
Posted by doug at 11:26 AM
May 02, 2007
More Sitelinks Hijinks with Google Duality : A Tail of Two 301's 
It's not just Rosie enjoying the new Google Sitelinks Value Meal. FON, that guerrilla Wi-Fi startup knocking at Starbucks doors via their neighbors, is also now seated at the table:

But note that the two highlighted URLs above, which are indeed the same page, do a bit of a pas de deux with how they serve the language, in this case English. The Sitelinks serving places the language as served from the en subdomain, yet that very URL redirects in this manner:
So that subdomain 301s from its English language subset that the subdomain indicates back to the non-language specific setting at the WWW level, whence it makes another direct turn back toward the language specific:

The whole trail of 301s serving to have moved the language specification in the URL from subdomain to folder level, with a waving pass through nothing. Quelle bonne idée !
Perhaps it is if it's another way to order a Google Sitelinks Value Meal.
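A trail of 301s like this one is easy to lose track of, so here is a small sketch of following a redirect chain hop by hop. The hostnames are placeholders rather than FON's real URLs, and the lookup is a plain dictionary standing in for real HTTP requests, so treat it as an illustration of the idea only.

```python
def trace_redirects(start_url, lookup, max_hops=10):
    """Follow 301/302 responses from start_url and return every URL visited.

    `lookup` is a hypothetical callable returning (status, location) for a
    URL; a real version would issue an HTTP request instead.
    """
    chain = [start_url]
    url = start_url
    for _ in range(max_hops):
        status, location = lookup(url)
        if status not in (301, 302) or location is None:
            break  # reached a page that does not redirect
        chain.append(location)
        url = location
    return chain


# Placeholder responses mimicking the subdomain -> www -> folder trail.
FAKE_RESPONSES = {
    "http://en.example.com/": (301, "http://www.example.com/"),
    "http://www.example.com/": (301, "http://www.example.com/en/"),
    "http://www.example.com/en/": (200, None),
}

chain = trace_redirects("http://en.example.com/", FAKE_RESPONSES.get)
print(" -> ".join(chain))
print(f"{len(chain) - 1} redirect hops")
```

Two hops to move a language marker from the subdomain to a folder is exactly the kind of thing a single direct 301 would handle.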
Posted by john at 11:49 AM
April 26, 2007
Rosie O'Donnell's Google Sitelinks Value Meal 
An interesting thing happened on Rosie's way out of the door of The View; her sudden, earth-shattering departure caused a quake of an anomaly in the Google result for her name -- namely, the exact same URL appearing twice as a result, both #1 and #2:

What gives? SEO purists might argue that as result #1 is in the Sitelinks formation and Rosie’s site itself links out in the main navigation to her blog, the URL highlighted above exists as both a shortcut that will save users time, per Google’s explanation of the criterion for URLs selected for the formation--
Our systems analyze the link structure of your site to find shortcuts that will save users time and allow them to quickly find the information they're looking for.
--and also exists on its own, as a blog, and thus merits a listing apart from one tied to Rosie's site, ergo the Value Meal Result.
But surely this rare achievement cannot be helped by the fact that Rosie's site itself argues against that very justification with its Title Tag. Aren't those supposed to be quite important, and importantly unique?
Does Rosie's massive influence extend even into the algorithmic sphere?
Posted by john at 03:37 PM
April 19, 2007
Friends Don't Let Friends 302 
Our 300 Series expert James sent out an important warning earlier in the year about 302's that aren't really temporary coming back to bite hard, and here at The Wagon we're starting to believe this may be a new Google theme for Spring.
The gist is there is a ticking clock on temporary, in that, we surmise, Google can tell when a 302 started, and it can certainly tell if it has yet to end. This makes sense. The unknown is what period between is given Google's blessing as truly "temporary" in temporal terms, and what then falls outside that window.
In 2007 so far, though, we are definitely seeing instances of the window slamming shut, loudly. And these are not spammers, no -- just, as can often be the case with a 302, used in a pinch with all intentions to return and fix, then forgotten. A promise written in the sand.
Please make sure any 302 you are using does indeed end, ultimately. If not, it likely will be ended for you.
Posted by john at 04:34 PM
April 13, 2007
Sitemaps Protocol News / SEO Humor 
If you've had a backstage pass to the SEO Speedwagon for the last few months, you should know by now that we are official groupies for the Sitemaps Protocol. In fact, if we ever take SEO Speedwagon on tour, Sitemaps Protocol would open up for us.
There actually was some interesting news this week re: the Sitemap Protocol.
1. Ask.com has drunk the Kool-Aid, so you can now share your sitemap with them.
2. Although MSN isn't "...ready to consume sitemaps just yet", all three major engines announced the sitemap protocol will now include Autodiscovery.
Autodiscovery allows site owners to add a link to their sitemap within their robots.txt file. Here is what it should look like:
Sitemap: [sitemap URL here]
We highly recommend that you add this line to your robots.txt, especially since you will not have to resubmit your sitemap file when it is updated (which should be often if your site content is dynamic).
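If you want to confirm that the Sitemap line is actually present and readable, a quick check might look like the sketch below. The robots.txt content and the example.com URL are invented for illustration, and the parser is deliberately minimal; it is not how any particular engine reads the file.

```python
def sitemap_urls(robots_txt):
    """Return the URLs declared on Sitemap: lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt with an autodiscovery line.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Sitemap: http://www.example.com/sitemap.xml
"""

print(sitemap_urls(EXAMPLE_ROBOTS))
```

Note that the value is a full URL, not a relative path, which is why the line splits on the first colon only.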
If you are a fan of the statistics, etc. provided by Google Webmaster Tools, then also be sure to submit your sitemap there. Along with statistics, you will also be able to see if there are any errors in your sitemap which can be very important, especially for large web sites (trust me.....been there).
SEO Humor:
An SEO guy walks into a bar and asks the bartender, "Can you submit a sitemap to MSN?" The bartender looks at him, scratches his head, and asks, "Why?"
Posted by doug at 09:10 AM
March 27, 2007
Did You Know WWW and Non-WWW are Two Different Sites? 
If you're an SEO you certainly do lest you are malpracticing. And if you're a Cutlett, you've likely concurred here just a little while ago but more likely immediately.
Yet in spite of immediate pick-ups of everything Matt posts and the fact that this is a day later, Good God, I want to highlight his explanation of why, if only to be able to link to this portion of it when I am asked why and do a poor job explaining why:
Some people ask “Why don’t you just assume www.example.com and example.com are the same?” The answer is that they don’t have to be, and for some websites they are different. For example, http://phpicalendar.net/ is a different page than http://www.phpicalendar.net/. This happens more often than you might think; FindWhat has different www vs. non-www pages, for example.
Best and simplest it's ever been put.
Am I now a Cutlett, too?
Posted by john at 12:03 PM
March 19, 2007
Google's Supplemental Index: Questions and Inconsistencies 
What I want to discuss in this post is Google's unreliable method of showing which pages are and are not in the Supplemental index.
What I don't want to get into with this post, beyond a very superficial level:
- Whether pages being in the Supplemental index is bad (or merely not good)
- How pages end up in the Supplemental index
Let's just say that all things being equal, I'd rather have 10,000 pages in Google's main index as opposed to its Supplemental index.
Using the method (that is fairly commonly accepted, in my opinion) of determining which pages from a site are in Google's Supplemental index (site:yoursite.com *** -view -- first read at SEOBook -- see References), I decided to pick a random URL -- in this case, our blog's article archives in the "Crawling and Indexing" category:

So clearly -- at least according to Google (who should be the authority) -- this page does sit in Google's Supplemental index. Right?
But let's refine the query to be a mere listing of the site contents -- site:yourdomain.com. To find the URL I discussed before, I needed to scroll to page 37 of the results:

This time, it doesn't show the "Supplemental" label. But is this a contradiction? Maybe in a straight-up site: query, the Supplemental label doesn't appear.
No, that's not it either, because if you click over to one more page of results -- page 38 in this case -- Supplemental results DO start to show up with the Supplemental label. Here (some on page 38, and all on page 39 and beyond), most of our pages are labeled as Supplemental:

So there's an irrefutable contradiction. Some Google results call this page Supplemental, and some don't. Why is that? Is it a data-center thing? Have some machines in the server farm not yet received the proper memo?
But let's cut to performance. For the query [crawling and indexing], the page ranks #1 at Google (even with no account sign-in). It doesn't bring a ton of traffic, but it does bring some:

So some would say, "There's your answer. It doesn't matter, because it shows up in SERPs and brings traffic. Quit overthinking it."
That's true -- IF the page is truly a Supplemental page. But because we have mixed signals, it's not so clear. Is this a Supplemental page that performs well despite its Supplemental status, or a Main index page that performs well and happens to be sometimes mislabeled as "Supplemental"? That's an important question I can't answer right now.
To follow up this post, I'll try to hit another angle: When a URL that is consistently labeled as Supplemental does not show up in search results, despite the fact that it's the most relevant post for that query.
Resources, Notes:
- Barry Schwartz first (to my knowledge) cracks the issue with this post last September. The technique worked briefly and sporadically.
- Adam Lasnik clarifies some Supplemental concepts at Google Webmaster Help (Google Groups).
- SE Roundtable brings up more questions.
- Aaron Wall brings it up again, with the added refinement of subtracting a nonsense string, which is more or less its current form.
- In pre-emptive response to one of my favorite audience niches (the beside-the-point nitpickers), I have adapted our robots.txt file in response to the third screen shot above. It now excludes search results and stray comment previews.
Posted by erik at 03:20 PM
February 14, 2007
Live from MSN: Canonicalization High Comedy 
A very smart lady makes smart fun of the undead at MSN Live:
MSN Live Search just isn't as bright when it comes to algorithmic adjustments. It indexes nearly 1,300 pages of "site:brewers.mlb.com" and six pages of "site:www.milwaukeebrewers.com". Its algorithms credit "link:www.milwaukeebrewers.com" with nearly 14,000 inbound links and "link:brewers.mlb.com" with over 14,000. MSN Live Search duplicates its own results by including the non-canonical URL in the results. Getting any bright ideas about MSN Live Search, subdomains, and temporary redirects? Small wonder MSN Live Search has its filters set to "high" to stop spamming itself and present any semblance of canonicalization.
Apart from the merrymaking, Pat steps through the canonicalization labyrinth in a quite deft and accessible manner for the uninitiated, perhaps 99.995% of the population. If you're outside that .005 it's well worth understanding these foundational indexation parameters for your site, especially if you are responsible for setting them but don't realize it -- a state in which we actually find most new clients to be.
Posted by john at 11:42 AM
December 19, 2006
It's Key To Not Remove Yahoo Authentication File 
Now that Yahoo has agreed to accept the Sitemap protocol, I've been going through the process of getting client sites authenticated via Yahoo's Site Explorer so we can submit sitemap files.
To be knighted as an authenticated site by Yahoo, you have to create an authentication key file then upload it to the root of your site. Once uploaded, you can request authentication.
We're finding it takes about 24 hours to get the thumbs up or down from Yahoo.
Just a tip for any of you doing the same.....
Once you upload the authentication key file and your site is authenticated, don't remove the authentication key. Yahoo will check periodically for the presence of the key and if it's removed, your site will be unauthenticated.
Posted by doug at 05:07 PM
November 27, 2006
Evidence of Yahoo Crawling Google Sitemaps 
Given Yahoo's recent promise that it would begin to support the Google Sitemaps protocol, it's a bit anti-climactic to document evidence now, but I promised a follow-up.
Back in October, before the "big 3" officially admitted that they would read the same type of sitemap files as a benefit to site owners, I had my suspicions and ran a test to see if and when Yahoo would actually pull a URL from a Google sitemap and add it to the Yahoo index.
I created this orphan page and put it on the blog server. I added the URL to our Google Sitemap file and told Yahoo about the file via the YSE interface. Over Thanksgiving, using a text string query, I noticed that the file had been crawled by Slurp and was now appearing in the main Yahoo index:

Having been too busy to keep a close eye on it that week, I scurried over to YSE to check further and noticed that the file did indeed appear in the list of pages on our blog:

Note that the crawl date for the file - November 16 - is only a day after Yahoo announced its support for the protocol. That's impressive. I submitted the current sitemap file on November 7, and it was processed on the 8th. It's possible that my test file was crawled even before the 16th, since that's only the last crawled date - and I wasn't paying much attention to it during that week.
Regardless, hats off to Yahoo for making good on their promise - and quickly.
Posted by erik at 11:13 AM
November 16, 2006
Sitemaps Protocol Redemption 
Nobody took Erik Dafforn seriously when he twice discussed Yahoo's ability to index URLs found in a Google-style sitemaps file. In fact, his incredibly kind and sweet mother even told him he was crazy. That will make this Thanksgiving especially sweet for Erik, as he returns home with news of collaboration on the common Sitemaps Protocol by Google, MSN and Yahoo!
Now Erik isn't known for gloating, so he may reconsider his plan for a public apology from his mother at a Thanksgiving event to which he has invited Three Musketeers Soundtrackers Bryan Adams, Rod Stewart, and Sting to play a rendition of "All for Love" rewritten to commemorate the new sitemaps collaboration.

That seems a bit much, doesn't it?
Posted by tom at 10:41 AM
November 06, 2006
An Update on Yahoo Sitemaps Optimization 
Reaction to my recent post on optimizing Yahoo sitemaps has been mixed, ranging from "that's amazing" to "you're crazy and your testing methods are shoddy - that will never work!" (thanks for writing, Mom). So I'm trying to take an honest look at the actual probability that Yahoo is able to pull (and subsequently index) URLs found in a Google-style sitemaps file.
The original site I referred to in the post is still showing signs of increased indexing from Yahoo, and the sitemap file I've told Yahoo to use is the same sitemap.xml that I created for Google.
This alone, obviously, does not prove that Yahoo is pulling URLs from the sitemap.xml file. In addition, there are a few other reasons to be skeptical:
- About a month ago, in the YSE forum, the Yahoo rep ("Mr. Slurp") said flat-out, "we currently do not support Google's sitemaps protocol."
But does that mean that Yahoo can't even open the file, or merely that it doesn't recognize and work with the various tags within the file, such as <lastmod>, <changefreq>, <priority>, etc.?
- Following on that point, on the feed submission page, Yahoo says "For any URL (directly submitted or obtained from a feed) our crawler will extract links and find pages we have not discovered already. We will automatically detect updates on pages and remove dead links on an ongoing basis."
So should this statement not apply to URLs such as www.site.com/sitemap.xml?
- When I submitted the sitemap file to Yahoo, it was "processed" within an hour of uploading and gave no indication of error or incompatibility.
But why should I expect such an error message? Sometimes all you get is an error if the page throws a 404, but little more.
I am currently running some tests that should prove definitively whether Yahoo can (and will) extract URLs from an xml sitemap. It could take a few weeks, but I'll certainly share my results here.
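For context on how little work "extracting URLs" really is, here is a rough sketch that pulls the <loc> values out of a sitemap file while ignoring the other tags. The sample file, its URLs, and its namespace are made up for illustration, and this says nothing about what Yahoo's crawler actually does with the file.

```python
import xml.etree.ElementTree as ET


def sitemap_locs(xml_text):
    """Return every <loc> value in a sitemap, whatever namespace it uses."""
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Tags arrive as "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


# Hypothetical sitemap with two entries.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2006-11-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/press/</loc>
  </url>
</urlset>
"""

print(sitemap_locs(SAMPLE_SITEMAP))
```

A crawler that reads only <loc> and skips <lastmod>, <changefreq>, and <priority> could reasonably be described as not supporting the protocol while still discovering every URL in the file.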
Posted by erik at 11:49 PM
October 31, 2006
Optimizing Sitemaps Feeds for Yahoo 
If you're submitting sitemap feeds to Yahoo, consider using the exact same file you use for your Google feeds (often sitemap.xml or sitemap.xml.gz by default).
Until recently, I'd been using another of Yahoo's recommended formats, urllist.txt (due to its minimal file size), but I hadn't been watching the output code as closely as I should have. I'd been exporting the sitemap.xml file directly to urllist.txt.
As it turns out, this can create bloat even in a text file, because (depending on the program you use to create it), your Google sitemap.xml file contains many URLs you might not actually want to be crawled.
To clarify, I create many Google sitemap.xml files and tell Google to "check" but not "crawl" the incidental graphics files (used in design, nav, and so on). But upon export to urllist.txt, my program was simply listing these graphics files in the list to be crawled, just like all html files. That more or less tripled the size of the file, with two-thirds of the content being URLs I didn't even care about.
As a result, I deleted the reference to urllist.txt in Yahoo Site Explorer, and instead told it to fetch sitemap.xml, and within a week, the index count at Yahoo tripled. (Note that we've been working on a few other things for this site too, so I'm not necessarily claiming a 1:1 relationship here. But I know my change didn't hurt.)
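Had I wanted to keep urllist.txt, the bloat could also have been trimmed at export time. A minimal sketch of that filter follows; the extension list and the sample URLs are assumptions for illustration, not the output of the program I actually use.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed set of incidental graphics extensions; adjust to taste.
GRAPHIC_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".ico"}


def crawlable_urls(urls):
    """Drop URLs whose path ends in a graphics extension."""
    keep = []
    for url in urls:
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix not in GRAPHIC_EXTENSIONS:
            keep.append(url)
    return keep


def urllist_text(urls):
    """Render the filtered list in urllist.txt form: one URL per line."""
    return "\n".join(crawlable_urls(urls)) + "\n"


# Hypothetical export containing one page and two design graphics.
EXPORTED = [
    "http://www.example.com/products.html",
    "http://www.example.com/images/nav-left.gif",
    "http://www.example.com/images/logo.PNG",
]

print(urllist_text(EXPORTED), end="")
```

With two graphics for every page, as in this sample, the filtered file comes out at roughly a third of the original size, which matches the tripling described above.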
Also follow this thread at YSE forums, where later, "Mr. Slurp" offers a user some keen insight into how Yahoo interprets typical "home" pages such as default.htm, etc. I guess the moral of the story is, canonicalization is in the eye of the beholder - never exclude when you can redirect.
Posted by erik at 11:59 PM
October 17, 2006
Google Earth Gets Duped - by Google Earth 
Conventional SEO wisdom has generally arrived at the point that claims duplicated content won't necessarily hurt you, but it won't really help you either. If you present Page X on two unique URLs, so goes the lore, engines don't know which version to pick, so they'll probably just pick one, although you don't necessarily have control over which one they'll decide to include.
That is, unless the engine owns Page X.
I was doing a little research on Google Earth, so I typed what I figured would be the correct URL: www.google.com/earth. Google is usually really good about guessing what people will type, and if that person is wrong, redirecting him to the proper page. But I wasn't wrong, because Google Earth did resolve at that address.
But I clicked around for a while, and wouldn't you know it, before long I was on the earth.google.com subdomain, and I was pretty sure I hadn't been redirected.
So www.google.com/earth/ and earth.google.com are identical. But, no big deal, right? After all, won't Google simply decide which of its pages to show in a query for [google earth]?
Not necessarily. Following is the results page for that query (notice the listings in red boxes):
![The first and seventh results for [google earth] go to the same page](http://seoblog.intrapromote.com/google-earth-serp2.jpg)
The bottom line is, despite the fact that Google says "Don't create multiple pages, subdomains, or domains with substantially duplicate content," Google gives itself double the exposure on the results page because it has the same content sitting on two different URLs. Kids, if you try this at home, don't expect the same results. (And wear a helmet.)
And as a neat parlor trick, the two different entries for the page (again, boxed in red) even have faux differences. The top listing shows the DMOZ description for Google Earth. The lower listing shows copy pulled from the page body.
Oh, and notice that boxed in yellow, the exact same thing happens with Wikipedia. The only difference is an internal redirect on the Wikipedia site between /Google_Earth and /Google_earth. Get it? When you're Wikipedia, a character in lowercase is enough to get you a dual listing - to the exact same content.
Posted by erik at 11:31 PM
October 12, 2006
What Google Sitemaps Can and Can't Do 
On the heels of my complaint that Google is sending mixed messages about how it interprets the "nofollow" link attribute, I need to give credit where it's due. Earlier in the week, Matt Cutts asked his readers to report genuine SERP bugs. Not the type of "bug" characterized by your competitor's site ranking higher than yours, but queries that return truly crazy results.
The post itself was fairly unremarkable because its purpose was merely to narrow his definition of "buggy." But in the comments section, when asked by a reader why the reader's site showed only two pages indexed at Google "when the site has several more pages that are search engine friendly and a Google Sitemap," Cutts dropped a nugget of gold:
The fact is that if you want Google to crawl you deeply (more than the 1-2 urls), you do need to have some links. Submitting a sitemap to Google lets us know those urls exist, but sitemaps are also not a back door; if no one at all in the whole web links to your domain at all, Google won’t crawl you as deeply.
Sounds like a great excuse to plug a link building service, but that would be crass. Instead, I'll just mention that a Google Sitemap is great for truncating the crawling time required for a site, but it's not a shortcut to ranking well, and as this quote states, it's not even a shortcut to getting indexed.
Per his request, I did leave a comment about the [therapy products] query that Sean caught in June. He said he'd pass it along.
Posted by erik at 11:25 PM
October 02, 2006
Google Sitemap Feeds, Regular and Images, And Overlap, Or Lack Of 
A Googler has given us an important clarification on a point that to date had been a bit murky, especially if you are creating both regular Google Sitemaps and Google Images Sitemaps (or feeds) for clients. At issue:
Can a site use both a Google Image Sitemap Feed and Google Sitemap, or does one supersede the other? And if there's no overlap between the two feeds, if there are pages in the images feed you'd also want to be in the main index, should you have those pages in both feeds? Is that ok?
The answer, Wagon Friends, is no longer blowing in the wind:
There's no overlap between the feeds and indexes here. The stuff in the image sitemaps goes only into Images, stuff in the other sitemaps goes into the main index. So no superceding, no overlap... and therefore it makes sense to use both sitemaps. No need to put pages in both feeds. As I understand it, the main sitemaps file won't really take image files into account anyway.
So, images go only in Google Image Sitemaps, and pages (not images) go in regular Google Sitemaps, which makes perfect logical sense, thankfully.
And to boot, the Googler it comes to us from has proven to be a Trusted Feed.
Posted by john at 05:41 PM
September 27, 2006
Revisiting the Many Faces of Nofollow 
About a month ago, in my post about Del.icio.us cloaking its robots meta tags, I got into a great discussion about the relative functions of the "nofollow" robots meta tag vs. the "nofollow" link attribute.
Jason Dettbarn, in the comments, said
I know one is global and one is used for individual links. But end result, what is the difference? Both tell robots not to follow the links. The reason I bring this up, is because del.icio.us still has the "nofollow" link attribute on the individual links, regardless if your user agent is "normal" or Googlebot or whatever. So the link juice still doesn't pass, regardless of the meta tag.
to which I replied,
As for the nofollow meta tag and the nofollow link attribute having the same effect, that's not my understanding. As I understand it, the nofollow meta tag tells the bot to literally not crawl the target page, while the nofollow link attribute does NOT instruct the bot to avoid crawling the link, but instead, tells it to merely not pass link popularity (or PR, or however you want to think about it). So having a nofollow meta tag on a page does (or should) put a stop to indexing pages linked from that page, while having the nofollow link attribute enables indexing of links on the page but does not allow them to pass popularity.
Jason set me straight by pointing me to an earlier Cutts post that described the two as being similar in functionality. With that, I felt a little foolish, although it didn't negate the main point of my post, which remains that Del.icio.us is misleading its users.
Fast forward. I feel vindicated today, because while I still wasn't right (at least in terms of what Matt Cutts says, which I'll consider authoritative), I certainly wasn't the only one who believed that the two "nofollow" attributes have different purposes.
In an interview with John Battelle yesterday, Matt Cutts once again equated the attributes. This morning, Danny Sullivan asked for clarification, saying
Let's back up. You can put a meta robots tag on your pages with the value of "nofollow," as described here. This tag, about 10 years old now, long predates any concerns about link selling skewing search results or the nofollow attribute. It is supposed to tell a search engine not to follow any links on a page, for purposes of indexing those links....
Now on to the nofollow attribute. Created in January 2005, it was a way to flag particular links to search engines as those a site owner doesn't explicitly approve of. It was never defined as a means to telling search engines not to actually "follow" the link. It was more a way to say that you don't endorse the link. In fact, to my knowledge, Yahoo and perhaps others will still "click on" or follow links even if they make use of the nofollow attribute.
I doubt that Matt Cutts misunderstands Google's methodology in dealing with the two "nofollow" attributes, so I'm officially changing my beliefs on their usage. But I do feel better knowing that I'm not crazy, and that others (of significant influence and industry knowledge, no less) were lured into believing as I did.
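Whatever the engines end up doing with them, the two "nofollow" signals live in different places in the markup, and it helps to see them side by side. Here is a minimal sketch, using Python's standard HTML parser and a made-up sample page, that reports each one separately. It only detects the signals; it makes no claim about how any engine treats them.

```python
from html.parser import HTMLParser

class NofollowFinder(HTMLParser):
    """Collect page-level and link-level nofollow signals from one HTML page."""

    def __init__(self):
        super().__init__()
        self.meta_nofollow = False   # robots meta tag lists "nofollow"
        self.nofollow_links = []     # hrefs of links carrying rel="nofollow"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = [d.strip().lower() for d in content.split(",")]
            if "nofollow" in directives:
                self.meta_nofollow = True
        elif tag == "a":
            rel_values = (attrs.get("rel") or "").lower().split()
            if "nofollow" in rel_values:
                self.nofollow_links.append(attrs.get("href"))

# Hypothetical page carrying both the page-wide tag and a per-link attribute.
page = """
<html><head>
<meta name="robots" content="noindex, nofollow">
</head><body>
<a href="http://example.com/a" rel="nofollow">unendorsed</a>
<a href="http://example.com/b">endorsed</a>
</body></html>
"""
finder = NofollowFinder()
finder.feed(page)
print(finder.meta_nofollow, finder.nofollow_links)
```

The meta tag is a single page-wide flag, while the link attribute has to be checked link by link, which is the distinction Danny Sullivan draws above.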
Posted by erik at 10:05 AM
September 13, 2006
Site Verification Headaches with Yahoo and Google Sitemaps 
In the spirit of Doug's most recent post about Google Webmaster Central, I wanted to add a few notes about both it and its Sunnyvale counterpart, Yahoo Site Explorer's recently updated webmaster area.
Yahoo Site Verification. I spent about an hour this morning preparing to verify about 20 sites for a very large client. One thing that's REALLY annoying about Yahoo's site verification process is that each site requires a unique text file - complete with unique filename and unique 16-character text string within the file - uploaded to the root.

Now, of course you can't create all 20 verification files, dump them into an email message, and send them to the client for uploading, because the client won't know what file goes with what site. So I created a folder for each file and zipped all the folders into one Zip archive.
I also added an Excel sheet with columns for the site, filename, and character string, because more than once, I've sent Yahoo authentication files over and over, only to have the recipient complain that the attachment didn't make it through. Apparently, many zealous mail clients look askance at curiously named, 16-byte file attachments. With the Excel file, I had a failsafe record of each verification file's contents in case they needed to be recreated by the client.
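If you do this often, the folder-per-site packaging is worth scripting. The sketch below is one way to do it, with two loud assumptions: the site names, filenames, and 16-character strings are invented placeholders, and a CSV manifest stands in for my Excel sheet.

```python
import csv
import io
import zipfile

def build_verification_bundle(target, sites):
    """Write one folder per site plus a manifest into a Zip archive.

    `sites` maps a site name to a (filename, verification string) pair.
    `target` is a path or a writable binary file object.
    """
    manifest = io.StringIO()
    writer = csv.writer(manifest)
    writer.writerow(["site", "filename", "string"])
    with zipfile.ZipFile(target, "w") as bundle:
        for site, (filename, text) in sorted(sites.items()):
            # One folder per site so the client knows which file goes where.
            bundle.writestr(f"{site}/{filename}", text)
            writer.writerow([site, filename, text])
        # Failsafe record in case a mail client strips the tiny attachments.
        bundle.writestr("manifest.csv", manifest.getvalue())

# Hypothetical sites and verification values, for illustration only.
buffer = io.BytesIO()
build_verification_bundle(buffer, {
    "www.example-one.com": ("y_key_1111.txt", "abcdef0123456789"),
    "www.example-two.com": ("y_key_2222.txt", "fedcba9876543210"),
})
archive = zipfile.ZipFile(buffer)
names = archive.namelist()
print(names)
```

Pass a real file path instead of the in-memory buffer to get an archive you can attach to an email.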
I'm sure Yahoo has a reason for giving each user a different authentication filename AND character string for EACH site that needs to be authenticated. I'm just not sure what the reason is.
Contrast this with Google Verification. First, I have to be honest and admit that I'd verified about a half dozen sites through Google before I realized that each time the server spat out an authentication file, it was the exact same file. Few people understand that with Google, your unique verification file (tied to your personal Google account) is your backstage pass to any concert you want. You can view the stats for any site that hosts your verification file in its root, and a site can host verification files for as many people as need access to the stats.
So verify one site, then keep that verification file in a place you'll remember. From then on, you don't need to go through the process of having Google spit out the same info again and again, each time you want to verify a new site. Just upload your file to the root and Verify.
Like Yahoo's verification files, Google's also suffer from Napoleon Complexes - in fact, with no recommended content at all (just unique filenames), email clients are even more suspicious of them, because at 0 bytes, they're infinitely smaller than Yahoo's 16-byte files. While Google doesn't specifically demand that your file contain text, it doesn't discriminate against files that do. So here's a tip: Add some nonsense text to your Google verification file, and I think you'll find it more easily passable through email.
Posted by erik at 11:34 PM
September 12, 2006
Google Webmaster Tools Uncovers Missed Site Opportunities 
We have many clients whom we’ve helped create and submit sitemaps to Google through Google Webmaster Tools.
Simply put, a Google sitemap is a special file that resides on your server that enables you to tell Google what pages are present on your site. Once this is done, you can login to Google’s Webmaster Tools console and manage your sitemap as well as view statistics and error information about your site.
Some of the most valuable data provided by Google is under the Statistics and Query Stats tab.
Here you’ll find:
Top Search Queries – this data shows the top search queries for your site within Google’s placement results. In other words, the most popular queries where you have some presence at Google. These are highly searched-for keywords and phrases where your site shows up in Google’s natural search results. Think of this as a VISIBILITY indicator.
Top Search Query Clicks – this data shows the top search queries that sent traffic to your site. In other words, these are the most popular queries for which people actually clicked over to your site. Your site is getting clicks from these specific keywords and phrases. Think of this as a TRAFFIC indicator.
Along with the data above, Google also provides the Average Top Position of your site which is the highest position any page from your site ranked for that particular query.
While this data took us a while to digest and analyze, over the last few months we’ve been able to create some very helpful reports for clients. The secret is not necessarily in either of the two groups above, but rather in comparing them.
For example, if a search query appears in both groups, this means the search query is both highly searched and found at Google (visibility of your site is good) AND the query is also getting clicked on (traffic is flowing from Google to your site). You may find that these queries are very important queries to your site, while others may not be. A few of our clients have been surprised by some unexpected search queries that their site is highly visible for and is also getting traffic from! The ideal situation, and a good indicator of SEO performance, in this example is to find some or all of your major keywords and phrases in this group. For us at Intrapromote, this would allow us to meet our first and second goals in an SEO and Link Building campaign: Placements (visibility) and Clickthroughs (traffic).
Perhaps the most “SEO-affecting” comparison of the two data sets is where search queries are highly visible (they are a Top Search Query) but they are not getting clicks (they are not a Top Search Query Click). We see these as potential missed opportunities IF the search query is highly relevant to your site.
For search queries where this occurs, you should ask yourself a few questions:
A. Is the keyword/phrase/query not ranking high enough on page #1 at Google to get clicks?
B. Why is the page returned by Google not getting clicks?
1. Is the page title and/or description on Google unappealing?
2. Are there on-page factors blocking higher placement?
3. What else can be done to push the page higher on page #1?
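The comparison of the two reports described above boils down to a couple of set operations. Here is a minimal sketch, assuming you have copied each report into a plain list of queries; the sample queries are invented, and the bucket names are my own labels rather than anything Google provides.

```python
def compare_query_reports(top_queries, top_clicks):
    """Bucket queries by whether they show up in each of the two reports."""
    visible = set(top_queries)
    clicked = set(top_clicks)
    return {
        # Visibility AND traffic: the ideal group.
        "visible_and_clicked": sorted(visible & clicked),
        # Visibility without traffic: potential missed opportunities.
        "missed_opportunities": sorted(visible - clicked),
        # Traffic from queries that are not among the top visible ones.
        "clicked_only": sorted(clicked - visible),
    }

# Hypothetical report data.
top_search_queries = ["blue widgets", "widget repair", "cheap widgets"]
top_search_query_clicks = ["blue widgets", "widget store ohio"]

report = compare_query_reports(top_search_queries, top_search_query_clicks)
for bucket, queries in report.items():
    print(bucket, queries)
```

The "missed_opportunities" bucket is the list to walk through with questions A and B, keeping only the queries that are actually relevant to the site.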
Posted by doug at 03:43 PM
August 17, 2006
Google Says Index This! 
Recently, the Wagon has noticed a new option at play over at Google Sitemaps. Next time you're in the area, click on Preferred Domain.
Google is giving you the opportunity to choose whether you would like your urls listed with or without the www (http://www.yoursite.com or http://yoursite.com). Google follows this opportunity with the following note:
Once you specify your preference here, it may take some time for changes to be reflected in our index. While Google doesn't guarantee that we'll show your URLs in the form that you prefer, we will use your choice as a suggestion to improve our indexing.
We think this is important for a few simple reasons. Google is trying to improve its index of your site, and it is allowing you to help. Further, Google is acknowledging that the www vs. non-www indexation issue is causing duplicate pages to appear in its index. And finally, and most importantly, Google is telling us that sites can improve their indexation by clearing up the www vs. non-www issue.
Click over there to learn how to 301 redirect non-www pages to their www equivalent, or vice versa.
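The logic behind such a redirect is simple enough to sketch. The function below is an illustration only, not a server configuration: it decides which URL a request should be 301 redirected to, and it assumes the only question is bare domain versus www (it would happily prepend www to any other subdomain, and it ignores any username or password in the URL).

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_redirect(url, prefer_www=True):
    """Return the URL to 301 to, or None if the host is already in the preferred form."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    has_www = host.startswith("www.")
    if prefer_www and not has_www:
        new_host = "www." + host
    elif not prefer_www and has_www:
        new_host = host[len("www."):]
    else:
        return None  # already canonical, no redirect needed
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(canonical_redirect("http://yoursite.com/page.html?x=1"))
print(canonical_redirect("http://www.yoursite.com/"))
print(canonical_redirect("http://www.yoursite.com/a", prefer_www=False))
```

Whichever form you pick, pick it once and use the same one in your Preferred Domain setting, so the redirect and the stated preference agree.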
Posted by tom at 04:56 PM
August 01, 2006
NOODP: Should Google Have Kept Greasing Wheezer? 
One of my favorite Little Rascals episodes is called Bear Shooters, where Spud can’t join the gang on their hunting trip because he has to stay home and grease his little brother Wheezer. It’s rumored that the band Weezer took its name from this classic episode.
"I can't come out," whines Spud. "I've got to stay home and grease Wheezer!"
It all works out for Spud in the end. But, I’m wondering if Google should have stayed inside and greased Wheezer a while longer before adding support for the NOODP tag. Did we celebrate prematurely?
We’ve read several site owner reports that allege their rankings have dropped significantly since implementing the tag on their sites. A few of these also report that they removed the tag and their placements returned to their former positions.
The only Intrapromote client so far that has implemented the tag has also experienced a drop in placement for their most important and competitive search phrase at Google. About the same time, they’ve also seen a major drop in the number of pages indexed by Google.
Coincidence?
Poor Wheezer was better off with the croup?
Hmmmmm . . . . . . . . . .
Our client plans to yank the NOODP tag to see if normalcy resumes. Stay tuned. I’ll let you know what happens.
Posted by doug at 05:15 PM
June 22, 2006
Anatomy of a Blog/Newsletter Archive 
Archive, Archive, Archive.
The following transcript is loosely based on many client conversations. For our purpose, we will join this conversation in progress, and leave it in much the same fashion.
Consultant: Oh, so you have a newsletter. Do you archive it?
Client: No. Should we?
Consultant: Absolutely!
Client: Well, we’ll see
. . . And Scene.
The recommendation repeatedly . . . dare I say ritualistically . . . receives reluctance. Why, man, Why? If you go to the trouble of creating fresh, useful content for a targeted audience, why not show it to the search engines? How different is that from a blog? Not too many people argue that blogs do not have search engine value, including newsletter pushers not yet ready to be called bloggers. What they do not realize is that the blog structure is flexible enough for the newsletter pusher to plug in newsletter content postfactum, thus protecting his or her newsletter status.
Enter Jill Whalen. You would be hard pressed to find someone that balances search engine value and user focus better than Jill. When she makes a move, you can guarantee that both parties are justly considered. Jill Whalen has always archived her newsletter, but she did not always archive it like this.
In adopting the blog structure, Jill Whalen has made it easier for humans to peruse her archive, for search engines to index her archive, and for me to advocate such archives. Jill's linking structure is perfect. She drives traffic and correspondence to the archive by linking to the archived location from that article in the newsletter. The article will have a temporary home on the archive main page, as well as permanent homes on its own page, within the monthly and other relevant categories.

With the release of each new issue, the archive will show search engines new pages of fresh, relevant content. And Google loves new content, but Google also loves old content when there is proof of new content, so the older articles will continue to gain steam as new issues are added.
And look at those links, a completely different navigational structure than the main site. Just one link on the left. Jill Whalen is completely emphasizing conversion. And on the other side, she points out the previous issues, beneficial to human and search engine alike. Then she links (with keywords, of course) to her most important sections of the main site. As for the body, every article links to itself in the title, then all categories under which it could be found at the end of the article.
More pages, more fresh content, and more keyword-rich links equate to more relevance, more importance, and better indexability. If you will not do it for me, please do it for Jill Whalen.
Posted by tom at 01:16 PM
April 11, 2006
Why Does Google Hide its Zeitgeist? 
In preparation for a post even less informative than this one, I was looking for the home page of Google Zeitgeist. While I'm usually pretty good at guessing where Google keeps its content, it often keeps me wondering about whether it's a subdomain of the main site (such as Google Base) or a folder off the TLD (such as Analytics).
So when I finally searched for [google zeitgeist], I was disappointed at the options. I clicked over on what I quickly figured was the page I wanted, only to find the UK version. Then there is Canada's. There's also a link to a services.google.com page so cleverly coded that its title doesn't fall within a <head> tag.
I decided to do a site-specific search for it, hoping I'd find the right page. I didn't, but I found a page that knew the page ("friend of a friend" sort of thing).
From there, I hit "Zeitgeist home" and finally found it. By that time, I didn't even want it anymore, but I did wonder why I had such a hard time finding it. And there it was, in the source:
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS">
Why on earth would this page be set with those attributes? Is Google afraid too many people will see the rising popularity of queries for [masters], [katie couric], or the [gospel of judas]?
There's a joke there - somewhere - involving betrayal, Matt Lauer, and a 5-iron, but I spent so much time finding the page itself I don't have time to work it out.
Posted by erik at 11:26 PM
February 01, 2006
Dynamic URL Rewriting Done Right 
A recent press release announced Biblio.com selecting Intrapromote as their SEO partner.
Biblio.com touts 30 million used books, rare books & out-of-print books. You can imagine the mass quantity of database driven dynamic URLs on a site the size of Biblio.com.
It's no secret that some search engines still stub their indexing toe on dynamic URLs, especially wildly dynamic URLs. Originally written in 1996, Mod Rewrite (Apache's mod_rewrite module) is perhaps the most popular URL rewriting tool available today. We've recommended it to hundreds of site owners since our inception in 1999.
Mod Rewrite uses a rule-based rewriting engine to rewrite dynamic URLs on the fly. That may sound fairly simple, but it's not for the technically faint of heart. The good news, however, is the end result, which can be the transformation of a dynamic URL into a static URL such as:
http://www.biblio.com/books/64717327.html
Sure beats:
And search engines gobble up that beautifully rewritten URL (insert Pac Man "whacka whacka" sound)! Well done Biblio.com!
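To give a feel for what a rewrite rule does, here is a rough Python imitation of the idea: match the static-looking path with a pattern, then map it to the dynamic address the application really answers on. This is a sketch of the concept, not Biblio.com's actual rule. The script name and the book_id parameter are invented for illustration, and the real work would be done in the Apache configuration rather than in Python.

```python
import re

# Hypothetical rule: /books/<digits>.html maps to an internal dynamic URL.
# The target script and parameter name are made up for this example.
RULE = re.compile(r"^/books/(\d+)\.html$")

def rewrite(path):
    """Return the internal dynamic path for a static-looking request path."""
    match = RULE.match(path)
    if match is None:
        return path  # no rule applies; pass the path through unchanged
    return f"/hypothetical-script.php?book_id={match.group(1)}"

print(rewrite("/books/64717327.html"))
print(rewrite("/about.html"))
```

Visitors and crawlers only ever see the clean /books/ address, while the application keeps receiving the numeric ID it needs.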
For more on Mod Rewrite, we recommend:
Beginner's Guide To Mod Rewrite
Mod Rewrite Forum
Mod Rewrite Original Apache Documentation
Mod Rewrite Tips and Tricks
Posted by doug at 01:24 AM
January 17, 2006
Thank You For Asking About Our Product, Please Visit Our Competitor 
I called a business today to inquire about their new product. A voice mail answered and said,
"Thank you for calling...
Blah blah blah..."
Then...drumroll...
"If you're calling about [new product name], please call back."
I am not kidding. ... please call back (echo, echo, echo)
Later in the day, I was talking with a business about an indexing issue with their site where several top performing pages at Google were clicking through to 404 error pages (the search equivalent of Windows' blue screen of death and the business equivalent of "please go away"). Well, at least they aren't the only ones.
A few of these pages were major product pages.
See where I'm going?
Has your site gone through recent changes creating indexing issues at search engines? If so, you could actually be directing them to your competitor.
... directing them to your competitor (echo, echo, echo)
As search engines improve relevancy, chances are if your prospective customer can't find your pages, they are just two clicks away from visiting the next site on the search results page. Work with your internal or external SEO team to clean up any current indexing issues and insert regular checks and balances to avoid these down the road.
Posted by doug at 03:52 PM
January 10, 2006
The Case for Long-term SEO 
A few weeks ago, Doug posted about long term SEO and the benefits to companies that are committed to it.
Colin Christofferson of Optimize the Enterprise responded:
One thing that we struggle with here is trying to understand if we are staying "ahead of the curve." Essentially, if we did nothing else, search traffic to our site would grow simply because the web and search are growing. We want to know if we are outpacing that, and it is often difficult to report on.
He followed up with his own post that furthered the question:
I want to be able to measure my impact relative to the web as a whole. Search referral growth of 30% year-over-year seems great, unless the growth of search usage has jumped by 70% web-wide! It is important that we actually gain new mind and market share through search.
All very good points, worthy of some follow-up. Honestly addressing Colin's comments forces us to ask a few important questions:
- What is the growth rate of the internet overall, and of search engine use in particular?
- If I optimized a site and did absolutely nothing afterward, wouldn't I achieve natural growth on pace with the growth of the internet and search engine usage?
- How do we tell if our growth is outpacing our industry's?
To answer the questions, we need to look at some statistics. Wherever possible, I used the range of 2000-2005 as the comparison points.
- From 2000-2005, worldwide internet "usage" grew about 170%. (As you can imagine, the numbers vary wildly by continent, but I want to work with averages.)
- In 2000, search engines were responsible for about 7% of site referrals.
- Today, the average percentage of search-based traffic is quite difficult to determine. Our clients run the gamut, from about 15% to over 80%, depending on their vertical market and additional on- and offline marketing efforts. (Note: If 100% of your traffic comes from search engines month after month, you have a problem, and you ought to know why.)
So Colin's thoughts are panning out. Year over year, we have a steady increase in the percentage of each country's population that uses the internet. On top of that, a greater and greater percentage of that user base turns to search each year to find what they're looking for.
So even if you stop optimizing, should your search traffic grow at a rate similar to the pace of the Internet, and more accurately, at the rate of search engine use? In a vacuum, perhaps. But there's one more important number we need to look at:
- Based on domain registrations (.com, .net, .org, .biz, .info, and .edu), the number of actual web sites from 2000-2005 increased by about 364% (roughly 10M to 46M domains). So while many more users are out there, and many more of them are using search to find sites, the sheer number of sites competing for those eyes has increased greatly too.
(The standard disclaimers apply: Many of those are probably parked. Many are garbage sites. Etc. But you see where I'm going.)
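To put rough numbers on "outpacing," it helps to divide one growth rate by another rather than eyeball them. The sketch below uses only the figures quoted in this post (Colin's 30% vs. 70% example, and the 170% usage and 364% site growth above). The function name is mine, and treating usage per site as a simple ratio is a deliberate oversimplification.

```python
def relative_share_change(own_growth, market_growth):
    """Fractional change in your share when you grow by own_growth
    while the thing you are measured against grows by market_growth."""
    return (1 + own_growth) / (1 + market_growth) - 1

# Colin's example: search referrals up 30% while search usage is up 70%.
colin = relative_share_change(0.30, 0.70)
print(f"{colin:+.1%}")           # -23.5%

# Internet usage up 170% while the number of sites is up 364%.
users_per_site = relative_share_change(1.70, 3.64)
print(f"{users_per_site:+.1%}")  # -41.8%
```

In other words, 30% growth in a market that grew 70% is a loss of nearly a quarter of your share, and on these averages each site in 2005 had roughly 40% fewer users to go around than in 2000.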
Sites like Apple will never have trouble with traffic growth for queries like [ipod]. But few of us, including what we consider some big, big names, have a computational grip on our markets (and our resellers) like Apple does. So that leaves the "rest of us." How do we know if we're outperforming our industry? Here are some possible ways.
- Watch your list of referring keywords. The total number of referring phrases should grow each month. But don't expect this to happen unless you focus on creating new content regularly, and you ensure that it's easily crawled and indexed. If your traffic grows with the "same old" set of keywords each month, that's probably a sign of simply rising with the tide of new users.
- Watch the sites within your industry. Are your competitors showing up on more and more results pages? Are there "new kids on the block" that seem to have come out of nowhere who are taking some of your traffic, yet you still see growth? Chances are you need to work harder.
- I've never been totally convinced that the Alexa toolbar is accurate for monitoring one site's traffic, but I think it can be fairly useful for watching the relative growth of a group of sites. How have you fared against your top five competitors over the last 6 months?
To test this global thesis fairly, we'd need to compare two identical sites - one that stays static, and one that regularly adds content, pursues relevant links, and breaks down crawling obstacles. Anything you do to your site ideally knocks you out of the first group, so Colin is right - it's very difficult to report on.
To all this, I should add that in some markets, you shouldn't even expect industry-average growth to continue if you stop optimizing. We see plenty of sites that were kings and queens of their respective counties after optimizing a few years ago, who have seen raw search numbers steadily decline since. They're not even staying flat. And that's happening in more and more verticals all the time.
So stay on top of your SEO game. People are still coming online and searching in numbers too great to ignore, and it's up to you to make sure your growth outpaces theirs.
And thanks, Colin, for thought-provoking questions.
Posted by erik at 11:04 PM
December 21, 2005
Are You the Master of Your Domain? 
Just a few quick notes about domain management. These are things we see over and over and over again while we do initial site evaluations.
- If you scooped up multiple domains (for example, nike.com, nikeshoes.com, nikeapparel.com, and so on), that's fine. But be very sure you know the exact status of each one.
Merely "pointing" them to the main site and trying to make sure that no one links to the secondary domains is not good enough. In a recent site eval, a reverse IP lookup showed that a client had three total domain names sitting on one IP address. "But only the main domain is active," he said.
He was wrong. The Big Three engines had indexed portions of all three sites to varying degrees because he had not ensured that the two "inactive" domains redirected to the main site properly.
- The same concept holds true for subdomains. Make sure that you've solved any "canonical domain" issues. Typically, this means that http://domain.com should redirect (via 301) to http://www.domain.com. Otherwise, engines can crawl "both" sites (even though there's really only one) and think you're posting a mirror site. (Google is getting better at "getting" this, but they're not 100% there yet.)
This became critical for another site evaluation client whose Apache server was set up incorrectly. For some reason, the server was incorrectly utilizing "wildcard subdomains." In this particular case, any subdomain you entered would resolve to the main site. In other words, spam.domain.com, lies.domain.com, and crazy.domain.com were all mirror images of www.domain.com.
This became a real headache when an actual fan of his site linked to him - but mistyped the URL. His fan linked to ww.domain.com instead of www.domain.com. See where this is going?
To top it off, this client had set up most of his internal navigation links as relative instead of absolute. As a result, dozens of pages on the ww subdomain had been crawled and indexed, which created a real mess that's still not entirely resolved. (Thanks to Yahoo Site Explorer for playing detective. I really, really love that tool.)
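Why relative links made things worse is easy to demonstrate. A crawler resolves every relative link against the address of the page it is currently on, so once it lands on the mistyped subdomain, it stays there. Here is a small sketch using Python's standard URL joiner; the page and link names are placeholders.

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Resolve each href the way a crawler would, relative to the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]

# Two relative links and one absolute link, found on the mistyped subdomain.
links = ["products.html", "/about/", "http://www.domain.com/contact.html"]
crawled = resolve_links("http://ww.domain.com/index.html", links)
print(crawled)
```

Only the absolute link leads the crawler back to www.domain.com. The two relative ones resolve to more pages on the ww subdomain, which is how dozens of them ended up indexed.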
Domain issues often get lost in the shuffle of everyday web dev and SEO. But make sure to check them out periodically, because they can cause significant problems if left untethered.
Posted by erik at 11:45 PM
December 13, 2005
Watching MSN's Algorithm at Work 
I was all ready to bash MSN's search algorithm a week ago, and I'm glad I waited. Instead, I've grown more impressed with its self-governance and, I daresay, its audacity.
During a routine vanity search for [seo speedwagon] about two weeks ago, I was shocked at the top result - an actual results page from Google Blog Search. The following screen shot was snapped on 12/4 and shows the MSN results page for the query [seo speedwagon].
![12/4 shot of the top MSN SERP for [seo speedwagon] showing a Google Blog Search results page in top position.](http://seoblog.intrapromote.com/msn-spdwgn.jpg)
As you can see, the top result is a dynamic results page from Google's blog search engine. And that's my fault, I guess. Back in September I wrote about Google's new blog engine. In that post, I talked about (and linked to) various test searches I had run, including a search for our blog name. No big deal.
When you think about a search results page like this, it has several factors that, in a vacuum, should make it rank well:
- A succinct title with relevant keywords
- Loads of on-page instances of keywords
- At least 20 outbound links to related sites (Google Blog Search results pages link to both the specific post and to the blog's home page for each of the 10 positions.)
- (In this case) four incoming links (albeit from the same domain) with targeted anchor text. Because of the way the Speedwagon is written and archived through Movable Type, each post tends to show up multiple times. MSN sees this specific post in four locations:
- The post's individual archive
- Our archive of blogging articles
- Our monthly archives for September
- Our archive of Google articles
So ironically, I created the very page that supplanted our own site in a vanity search. How poetic.
Now here it gets even more interesting. We can debate the merits of the Google Blog Search results page all day, but in truth, MSN never should have crawled and indexed it in the first place, because Google Blog Search results pages are specifically disallowed in the robots.txt file for that subdomain:

So let's sum up:
- MSN crawled and indexed a page from Google's blog search engine against Google's robots.txt exclusion.
- Its algorithm deemed that page worthy of ranking for a relatively uncompetitive term.
- A week after I noticed MSN ranking the Google page at position 1, it had dropped to position 2. Today, it doesn't rank anywhere.
- The Google Blog Search results page has dropped out of the MSN index.
I'd say that just about balances out right.
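If you want to check for yourself whether a URL falls under a robots.txt exclusion, Python's standard library has a parser for the format. The sketch below is only that: the rules and the blogsearch.example.com host are hypothetical stand-ins written for this example, not the contents of Google's actual file, and the helper name is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only. This is NOT a copy of any
# real site's robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /blogsearch
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


results_page = "http://blogsearch.example.com/blogsearch?q=seo+speedwagon"
other_page = "http://blogsearch.example.com/about.html"

print(is_allowed(EXAMPLE_ROBOTS_TXT, "msnbot", results_page))  # False
print(is_allowed(EXAMPLE_ROBOTS_TXT, "msnbot", other_page))    # True
```

Keep in mind that this only tells you what a well-behaved crawler should do. As the post shows, what a crawler actually does is a separate question.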
Watching MSN's Algorithm at Work
Posted by erik at 04:00 PM
December 09, 2005
More On SEO Considerations When Moving To .NET 
In my last post about the SEO implications of converting your site to the .NET platform, I warned about search engine indexing problems if you create long, dynamic URLs with more than 4 dynamic characters.
A few more SEO things to think about when making the conversion:
* Avoid anything after the "?" that looks like it might be a user ID or session ID (search engines will likely stumble on these).
* Before the new .NET pages go live, make sure to use a 301 redirect on the old URLs.
* Assuming the pages you are converting are optimized, before the new .NET pages go live, make sure to include the optimized title tags and meta descriptions on the new pages. (Believe me, your SEO company will really appreciate it.)
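Before the new pages go live, a quick script can flag URLs that are likely to give the engines trouble. This is a rough sketch under two assumptions of mine: that "dynamic characters" can be read as query-string parameters, and that the parameter names listed below are typical of user or session IDs. Neither the list nor the threshold comes from any search engine's documentation.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption: parameter names that commonly carry user or session IDs.
SESSION_LIKE_NAMES = {
    "sid", "sessionid", "session_id", "userid", "user_id",
    "phpsessid", "jsessionid",
}
# Assumption: the "more than 4" threshold from the post, applied to parameters.
MAX_PARAMS = 4


def url_warnings(url, max_params=MAX_PARAMS):
    """Return a list of human-readable warnings for one URL."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    warnings = []
    if len(params) > max_params:
        warnings.append(f"{len(params)} query parameters (more than {max_params})")
    for name, _value in params:
        if name.lower() in SESSION_LIKE_NAMES:
            warnings.append(f"parameter '{name}' looks like a user or session ID")
    return warnings


for candidate in [
    "http://www.example.com/products.aspx?cat=5",
    "http://www.example.com/products.aspx?cat=5&SessionID=8f3a2c",
    "http://www.example.com/p.aspx?a=1&b=2&c=3&d=4&e=5",
]:
    print(candidate, url_warnings(candidate))
```

Run it against a sample of the new URLs your developers give you; an empty list for every URL is what you are hoping to see.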
More On SEO Considerations When Moving To .NET
Posted by doug at 04:38 PM
December 06, 2005
To Tag or Not To Tag Your Vegetables 
Coming on the heels of the much ballyhooed Tivo announcement that ads will now be searchable, it was titillating to follow David Berkowitz's prognostication of where this would all spill.
In the second-to-penultimate paragraph of this erotic thriller we reach what has to be a climax for all in our industry:
Searching within a map, a PDF, and even a PC desktop was much more cumbersome only a few years back. A former iCrossing colleague, Sara Holoubek, often illustrated the imminent pervasiveness of the Internet by noting how computers will one day be commonly built into refrigerators. By that example, searching the contents of your kitchen from a refrigerator-based console is hardly far-fetched (and given the difficulty I had finding ingredients when baking a kugel last weekend, it's a development I'd welcome).
The cold water splashed on this rock and roll search fantasy? I suspect spam will be a problem.
To Tag or Not To Tag Your Vegetables
Posted by john at 05:25 PM
November 30, 2005
Diagnosing Potential SEO Problems - Without Guilt 
From a psychological perspective, it turns out that asking a new or potential client, "Have you ever engaged in any spammy or controversial SEO activity?" is quite a loaded question - somewhere between "Have you stopped beating your spouse?" and "Are you really going to wear that tonight?"
While we ask this question for diagnostic reasons only, it elicits immediate defense from most marketing managers. "No - of course not!" they interject, before we can even finish the question.
Too bad. I actually enjoy unthreading the spaghetti-like mess that some clients bring to us. I have fun looking at things like doorway pages, scripts that create cloaked pages, and complex link networks to see exactly how they're sculpted, and to compare the intended result of the technique versus the actual results attained. What I don't enjoy is finding out about these techniques six months into a campaign.
So recently, I recognized the error of our ways when we discovered a client, who had previously promised that all was on the level, had several additional domains, all on the same IP block, using a meta refresh to point to the main site. This was the likely cause of rejection from a few human-edited directories.
So why was it our error? To the client, it was on the level. Long ago, they had purchased domains that they either intended to use, or that they didn't want their competition to buy, and they simply pointed them to the main domain the only way they knew how. Fast-forward to the last few years, when the specific technique you use in redirection is a critical component of SEO strategy.
So now I phrase it differently. "Tell me about your site's history," I say. "What have you done in the past, and how did it work?" It's a little bit clinical and sterile, but we end up finding out much more helpful information.
As a post-script, to phrase it in terms free of the slightest hint of judgment, here are some site activities that, while we're sure they were done with ONLY THE BEST OF INTENTIONS, might cause you some eventual grief in the indexing and ranking processes:
- Exchanging links with a site or group of sites whose sole criterion for offering a link is verifying your willingness to offer a link in return.
- 5000 (or any-thousand) doorway pages with 75% keyword density that redirect to your normal site.
- Any activity whose description was punctuated with, "Don't worry, no human will ever see this anyway."
(Humans, we keep finding, are one of the most annoying factors in web usability.)
Diagnosing Potential SEO Problems - Without Guilt
Posted by erik at 02:14 PM
November 23, 2005
No Fearing the Nofollow Link Attribute 
While I believe that sites that are ultimately successful do have high quality, user-focused content, many people incorrectly infer the converse of that statement: that if they have high quality, user-focused content, they will be ultimately successful. I don't necessarily believe that, because the demand is simply not great enough for all the high quality, user-focused sites out there. Many will do great, but not everyone is going to get rich.
But my pessimistic outlook is still no reason to avoid creating the best content you can - both at your own site and elsewhere. Think about the "nofollow" link attribute and its recent influence in SEO.
The mass adoption of the "nofollow" attribute about a year ago meant that site owners now had a way to delegitimize comment and trackback spam, even if they couldn't control its spread. If a link on your site has the "nofollow" attribute, engines know that you don't "vouch" for the authenticity of the link. Consequently, the engines won't reward the linked sites with any rankings benefits.
This effectively squashes a spammer's chances of benefiting from your site's status. He can add all the comments he wants to your PR4 post, but the links back to his site don't get a vote of confidence like they would if you made the exact same link in your post.
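In markup terms, the difference is a single attribute on the anchor tag: rel="nofollow". As a small illustration, the Python sketch below lists the links on a page and notes which ones carry the attribute. The sample HTML and its URLs are invented for the example, and the class and function names are my own.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, is_nofollow) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel can hold several space-separated tokens, e.g. rel="external nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def classify_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Invented sample page: one editorial link in the post, one link in a comment.
SAMPLE = """
<p>Post body with an <a href="http://www.example.com/editorial">editorial link</a>.</p>
<p>Comment: nice post! <a href="http://spam.example.net/" rel="nofollow">cheap widgets</a></p>
"""

for href, is_nofollow in classify_links(SAMPLE):
    print(href, "nofollow" if is_nofollow else "followed")
```

The link in the post body is one the site owner vouches for; the link in the comment is not.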
Unfortunately, however, many people promote their sites via forum and blog comments that are well reasoned, helpful to the discussion, and quite informative. And many of these people now feel that their sites no longer benefit from the links in comments. A mass exodus from SEO Chat forums is a recent example. While the introduction of the "nofollow" attribute was only a small reason for the members leaving, it was certainly a factor.
But if you think back to the days before link popularity was a religion, it's no different now. If you have something intelligent to say, if you're contributing to the discussion, and if you're offering a fresh perspective - and you do this long enough - you'll earn links the old fashioned way: By earning them. People will click through to your site via your profile, and eventually, if you impress them long enough, they'll want you to be a part of their community.
You've always had to impress potential customers with your intelligence and perspective. Be glad that "nofollow" is here, because all it has done is allow the specter of easily gained popularity to find peaceful rest.
No Fearing the Nofollow Link Attribute
Posted by erik at 11:45 AM
November 18, 2005
Google Sitemaps Tests Your 404 Error Codes 
Today, Google unwittingly unleashed an entirely new service. It motivated thousands of webmasters to validate their error page code.
Back in September I posted about the importance of making sure that your 404 pages deliver a true 404 error code; many sites unknowingly (or knowingly) pass a 302 or 200 code instead. At the time, I wrote:
In the health checklist of SEO, this issue isn't equivalent to a broken leg or clogged arteries. It's more like an old football injury that flares up at inconvenient times. The most likely results of a deceptive header code on your error pages are search query results that return error copy in the SERP description, and possibly an artificially inflated index count. Neither one will kill you, but it's something you shouldn't ignore.
Today, something strange happened. Sites that both fail to give the correct 404 error code AND that have Google Sitemaps files in place became vulnerable to a security breach. Anyone else with a Sitemaps account could look at the clickthrough stats of your Sitemaps-based pages. It's not akin to dumping your customer credit card database into the street, but it's certainly data you don't want just anyone to know about.
The moral of this story is not my prescience (I was afraid to buy Google at $85, after all). Instead, the take-home lesson is that today's C-list site dev priorities can become tomorrow's raging infernos - through no fault of anyone at your organization. No one predicted that Google would use a security measure that relied on your site spitting out a certain error code - why would they?
References: Although this story was buzzing at all the usual haunts today, I first saw it at SERoundtable, and as usual, Danny has an authoritative summary.
Google Sitemaps Tests Your 404 Error Codes
Posted by erik at 11:35 PM
November 15, 2005
An SEO Checklist for Site Redesign 
When you're about to unleash a site redesign, your to-do list probably appears endless. And once you've taken care of the design itself and the graphics, page files, and remaining assets are accounted for and tested, you still have to consider SEO.
A more comprehensive SEO checklist for site redesign surely exists somewhere, but my goal is to show things that site owners typically overlook and that, when added up, can have significant impact on a site's performance over time. Note: When I refer to "redesign" in this context, I'm not talking about changing domains. Instead, I'm referring to changing the entire file, folder, and page structure of the existing site, and perhaps (but not necessarily) a migration to a new code type, such as going from PHP to ASP (or back).
- Account for your top URLs. Analyze the last few months of web stats and determine your highest performing pages, both in terms of entry pages and strict page views. Brochure-type sites will have a smaller number than database-driven sites. Determine whether these top pages will have counterparts on the new site. If so, map the old pages to the new ones using a 301 (permanent) redirect. (James discusses 301 redirects here.)
As for your non-performers, that's a judgment call. Even if certain pages don't get too many views or don't have new-site counterparts, you need to account for them when you roll over and make sure that if users land on them, they end up somewhere relevant. You can map those remaining pages to the home page (again, using a 301), or, you can avoid a lot of work and let a custom 404 page handle it (see below). Using a 301 will transfer lots of little fragmented bits of link pop and PR to the root, but a well built custom 404 page will likely be more help to your users and save them a click or two in getting to their final destination.
Along these lines, do a thorough backlink check to see who's linking deep into your site. (Yahoo Site Explorer is currently one of my favorite backlink checkers.) Even if these links don't send much traffic, chances are the traffic is pretty qualified, so you should take special care of the visitors that come from reputable industry sites and make sure that the page they land on isn't a black hole.
I often read articles that recommend you contact everyone linking to you and request that they update their links. In my experience, this hasn't been necessary. When a 301 is properly implemented from old_page.htm to new_page.asp, I've always seen links to old_page.htm eventually show up in a backlink check of new_page.asp.
- Prepare your custom 404 page. A custom 404 page is like a trapeze artist's safety net. Even if a user calls for a page that doesn't exist, she'll still end up on your site instead of seeing the ugly (and near-useless) default 404 page. A well made 404 page is often a variation of the site map, along with an introductory note that apologizes for the lost page, and an earnest hope that the user will find what she needs by following one of the links on the page. Most important, make sure that the links on your 404 page correspond to your new site content, not the old content. Otherwise, your users might just end up swallowing themselves in an endless 404 loop. (One more thing: If you're on a Microsoft platform, make sure that your 404 pages give a true 404 error code.)
- Update your robots.txt file. Many people forget to update their robots.txt exclusions when their folder structures change. If your private data, images, or testing area has changed locations, make sure to add a line in the file to account for it. Now this is important, and it needs emphasis: Don't replace old exclusions with new ones. Instead, add new exclusions and retain old ones. From the engines' perspectives, those old folder structures will still be around for a while, so if you delete old exclusions, you could see some funky things happen for several months.
- Update your analytics and conversion definitions. Here's another item that often goes overlooked until the monthly reports come out. If your analytical conversions are based on views of certain pages, specific click paths, or hits on certain files from the old site, make sure to update your conversion definitions to account for the new site. You don't need the accounting mess created when the revenue and the analytics program disagree about conversions and ROI.
- Remember your outgoing links. When everything else is done and you have time for a bit of altruism, remember that your outgoing links are likely benefiting someone else - either another site you own or a site you have recommended. Any old articles or resource pages that link to those sites will no longer benefit them if they're deleted. So if you still think those sites are worthy of your link, make sure to recreate them somehow.
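For the first item on the list, it helps to keep the old-to-new URL map in one place and audit it after launch. Here is a minimal sketch of that audit in Python. The URLs reuse the old_page.htm and new_page.asp placeholders from above on a made-up example.com host, and the observed responses are typed in by hand; in practice you would collect them with whatever header checker you prefer.

```python
from urllib.parse import urljoin


def audit_redirects(redirect_map, observed):
    """Compare planned redirects against what the server actually returned.

    redirect_map: old_url -> new_url you intend to 301 to
    observed:     old_url -> (status_code, Location header or None)
    Returns a list of (old_url, problem description) tuples.
    """
    problems = []
    for old_url, new_url in redirect_map.items():
        status, location = observed.get(old_url, (None, None))
        if status != 301:
            problems.append((old_url, f"expected 301, got {status}"))
            continue
        # A Location header may be relative; resolve it against the old URL.
        target = urljoin(old_url, location or "")
        if target != new_url:
            problems.append((old_url, f"301 points to {target}, expected {new_url}"))
    return problems


# Hypothetical plan and hand-entered results, for illustration only.
plan = {
    "http://www.example.com/old_page.htm": "http://www.example.com/new_page.asp",
    "http://www.example.com/about.htm": "http://www.example.com/company/about.asp",
}
seen = {
    "http://www.example.com/old_page.htm": (301, "/new_page.asp"),
    "http://www.example.com/about.htm": (302, "http://www.example.com/company/about.asp"),
}

for old_url, problem in audit_redirects(plan, seen):
    print(old_url, "->", problem)
# http://www.example.com/about.htm -> expected 301, got 302
```

The second entry is the kind of thing that slips through: the redirect works for users, but it is temporary rather than permanent.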
I hope this list brings up some issues you hadn't thought of. If you have other ideas, drop a note in the comments and I'll post them later on.
An SEO Checklist for Site Redesign
Posted by erik at 11:10 PM
November 04, 2005
Content Management Systems and SEO 
Are you considering a Content Management System (CMS) for your site?
I've seen many Content Management Systems in my day. The main purpose of a CMS is to let site owners change the content of their web pages without having to ask their Webmaster to program the changes. You log in, find the page you want to change, make the changes to text on your page, log out, and your changes are live.
One important thing to understand: There's a reason it's called a "content" management system. Its focus is on the content of your site, not the code.
If you're considering a CMS for your site, from an SEO perspective, there are two important questions to ask:
1. How can I change my title tags and meta tags (code)?
If a client is considering a move to CMS, we've learned to ask right away how title tags and meta tags will be changed on the site. And we've heard more than once, "Oh, you can't change those." Ouch.
Make sure while gaining flexibility to change text on your pages that you're not giving up the ability to make changes to your titles and metas.
2. What will my URLs look like?
A CMS can be paired with wildly dynamic URLs that search engines have difficulty crawling and indexing. Make sure that the CMS produces search-engine-friendly URLs.
Know of an SEO-friendly CMS? I welcome your comments below.
Content Management Systems and SEO
Posted by doug at 02:45 PM
October 12, 2005
A Universal Truth from Yahoo Site Explorer 
In preparation for my earlier post describing Yahoo Site Explorer, I was playing around with some random domains that were sure to be well indexed by Yahoo, and I came across something quite interesting. It's far too late in the day to try to interpret it, but here it is:
Needless to say, that was one URL worth exploring. Some observations and questions:
- That specific URL is the only one in the YSE index for the whitehouse.gov root page. If you explore the URL without the anchor, the SERP reverts back to the anchor-containing address.
- Obviously, that anchor doesn't appear in the current or cached version of the White House home page.
- What is the intent here - some sort of post-modern engine bomb? While I found it via Yahoo, that doesn't mean it was intended to affect Yahoo SERPs.
- The inlink page for that URL shows some interesting variety: search results pages, blogs, wiki scrapers...
- Could this be spread via some sort of PPC scraping? No funky anchor text appears to accompany the inlinks; they often use the title of the whitehouse.gov root page, "Welcome to the White House."
- What is the desired effect? If its purpose is to distort SERPs, it doesn't appear to be working, at least not yet.
A Universal Truth from Yahoo Site Explorer
Posted by erik at 11:37 PM
Diagnosing Crawling and Indexing Issues with Yahoo Site Explorer 
Yahoo recently released the Yahoo Site Explorer, a very helpful way to diagnose issues that may be hampering your performance in Yahoo natural search.
Enter a URL in the Search box, and Yahoo returns two values: First, the number of indexed pages for a given site, and as a second option, the number of incoming links pointing at that site. The following image shows the location of the key data points.
Drilling down, you can select specific URLs from the results page, and "explore" those pages in depth - finding, for example, the number of pages from a specific section of your site that have been indexed, or the incoming links pointing to a specific page of your site. To find an index count for a specific site section, enter or click a URL such as http://www.site.com/press/. This returns indexing and linking results for this specific URL, as well as any pages in the /press/ directory.
Unlike Google, which purposely returns only a percentage of a site's incoming links, Yahoo Site Explorer claims to show all incoming links that it knows about. This can come in very handy when performing a competitive link analysis for sites in your industry.
One of Site Explorer's largest drawbacks is that you can download only the first 50 results into TSV format, for import into programs like Excel. It would be wonderful to have an entire site's worth of data to sort and play with in a spreadsheet program, but it's unlikely that Yahoo is too eager to spend processing time creating TSV files with tens of thousands of rows. A resourceful programmer named John Mueller has used the Yahoo Search API to create a custom version of Yahoo Site Explorer that overcomes some of these common obstacles; it's worth a look.
If you find large blocks of URLs from your site that Yahoo has not indexed, it offers a submission system similar to that of Google Sitemaps. You can simply fill a text file (such as pages.txt) with the URLs you want Yahoo to index, separated by a hard return. Upload the text file to your web server, then submit the entire URL of the text file (such as http://www.site.com/pages.txt) to the Yahoo Free Submit page.
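If the list of unindexed URLs is long, it is easier to generate the text file than to type it. Here is a minimal sketch in Python; the file name pages.txt comes from the example above, while the function names and the www.site.com URLs are placeholders.

```python
def build_url_list(urls):
    """Return the file body: one URL per line, blanks and duplicates dropped."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "".join(line + "\n" for line in lines)


def write_url_list(path, urls):
    """Write the list to disk, ready to upload to the web server."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(build_url_list(urls))


# Placeholder URLs for illustration.
unindexed = [
    "http://www.site.com/press/release-1.html",
    "http://www.site.com/press/release-2.html",
    "http://www.site.com/press/release-1.html",  # duplicate, will be dropped
]
print(build_url_list(unindexed), end="")
# write_url_list("pages.txt", unindexed)  # uncomment to create the file
```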
We don't yet have significant data on the crawl rate for URLs submitted via this method, but we'll be sure to publish any information we collect.
Diagnosing Crawling and Indexing Issues with Yahoo Site Explorer
Posted by erik at 06:39 PM
September 30, 2005
Is .NET A Search Engine Ranking Killer? 
A lot of companies are moving their web site architecture over to Microsoft's .NET technology.
Peer pressure? Perhaps, but I believe in many of .NET's benefits. However, from an SEO perspective, just a word of caution. Before you dive head first into migrating your site, be sure you have a good idea of how your URLs will change (and they will).
If you're going from static URLs or dynamic URLs with 1 or 2 dynamic characters to extremely long dynamic URLs with 4+ dynamic characters, you will likely have serious indexing issues with search engines.
It's still difficult to predict what Google will do with a dynamic URL with 4+ dynamic characters, but you should still use caution when considering anything that may make your site less search-engine-friendly and negatively affect your hard-earned search engine rankings.
Have your web developer or team show you some examples of the new URLs. Then compare to your current URLs and make sure you're not about to dramatically decrease your web site's visibility while taking the .NET plunge.
That way, this isn't your next technology solution.
Is .NET A Search Engine Ranking Killer?
Posted by doug at 04:42 PM
September 27, 2005
Custom 404 Error Pages and Deceptive Error Codes 
Last week, a client asked me to look into a strange search result on Yahoo. For this particular search phrase, the page listed in the Yahoo SERP had long ago been deleted. In fact, the Yahoo descriptive snippet for this page gave the typical "error" description from the client's custom 404 page:
"The page you requested may have moved or no longer exists...," etc. etc.
404 error pages sticking around in the engines' indexes is nothing new. Many clients wait 3-6 months or more for their 404 pages to drop out of Google and Yahoo. But this was different. I ran the client's URL through a header checker, and sure enough, the problem was not a stale index. The problem was a deceptive http header code.
Many webmasters believe that creating a custom 404 error page ensures that the server automatically passes the 404 error code along with it. This isn't always the case. Here are a few examples.
Adidas error pages give the http 200 code. The 200 code means that the intended page was found, which is certainly not the case in a URL like www.adidas.com/zzzzzzzz. The error page does a meta refresh after 10 seconds back to the Adidas home page, so the user eventually arrives at navigable content. But it really should send the 404 code instead to ensure that the legacy URL gets flushed from the engine's index.
The British National Space Centre uses a 302 redirect to take the user to a page containing error copy, but that page gives the 200 status code. (Try this fictitious page to see what I mean.) So for any BNSC pages lingering in search engine indexes, the content of the first page is used to create the SERP description, and the second page is the user's ultimate destination. The problem is that engines frequently believe that both pages still belong in the index, when in reality neither does.
In the health checklist of SEO, this issue isn't equivalent to a broken leg or clogged arteries. It's more like an old football injury that flares up at inconvenient times. The most likely results of a deceptive header code on your error pages are search query results that return error copy in the SERP description, and possibly an artificially inflated index count. Neither one will kill you, but it's something you shouldn't ignore.
I've found this phenomenon to be most common on Microsoft IIS platforms, usually with .asp or .aspx pages. This doesn't mean that other platforms don't also cause the error - only that I haven't seen any non-MS instances.
Make sure that your error pages produce the proper 404 header code. To do this, create a nonsense url, such as www.mycompany.com/fake-page-name.zzz - in other words, a page that you know does not exist. Insert that URL into a header checker, such as the one at SEO Consultants or Rex Swain's HTTP Viewer. (Rex's site has been on my bookmarks list for several years now, and I use it about 10 times a day.)
If your fake URL gives any header code except a 404, it's time to roll up your sleeves and fix it. Here's a long thread about possible fixes on IIS, and WebmasterWorld has a little information about the same thing happening on an Apache server.
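If you would rather run the check yourself than rely on an online header checker, the test described above takes only a few lines of Python. Treat this as a sketch: the function names are my own, the fake URL is the placeholder from this post, and the wording of the diagnoses is just my summary of the cases discussed here. Python's http.client does not follow redirects, which is what we want, because the point is to see the first status code the server sends.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Request url once, without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def diagnose(status):
    """Interpret the status code returned for a URL that should not exist."""
    if status == 404:
        return "OK: true 404"
    if status == 200:
        return "Problem: error page served with a 200 code"
    if status in (301, 302):
        return f"Problem: missing page answers with a {status} redirect"
    return f"Check manually: unexpected status {status}"


# Example (requires network access; replace with your own domain):
# status, location = fetch_status("http://www.mycompany.com/fake-page-name.zzz")
# print(status, location, diagnose(status))
```

If the first answer is a redirect, run the same check on the Location it points to; that is how the two-step 302-then-200 pattern described above shows up.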
Custom 404 Error Pages and Deceptive Error Codes
Posted by erik at 04:30 PM
September 16, 2005
Just Do It: Make Your Site Search Engine Friendly 
Every day, senior level staff perform highly relevant queries at Google to see if the company’s site appears in the top few results. Upon seeing that the site isn’t on page #1 at Google, a good share of these folks become irritated and fire off an email, demanding an answer from the subordinate responsible for search engine marketing.
It may go something like this:
Linda,
We still are not in the top three at Google for “x”. Please advise ASAP!
Steve
...or...
Linda,
We are getting killed at Google by our main competitor. Please advise ASAP!
Steve
I’m convinced this happens hundreds of times each week in businessland.
As an agency that specializes in SEO, we’ve often seen companies practice selective listening when it comes to recommendations that require them to make critical changes to their site to improve its search engine friendliness.
Are you one of those frustrated senior level people?
Have your people or your SEO firm recommended changes to make your site more spiderable or search engine friendly, only for those changes to go unimplemented?
Are you sticking with the same site architecture and continuing to add pages to your site that handicap its search engine performance?
Should your next email be...?
Hi Linda,
Let’s meet today about making those changes to make our site friendlier to the search engines.
Steve
Just Do It: Make Your Site Search Engine Friendly
Posted by doug at 12:20 PM
September 06, 2005
Hurricane Katrina and Fast-Acting Search Results 
When I added the Red Cross ad to the blog recently (please give to some type of rescue organization if you're able!), I was curious about the flurry of activity surrounding Katrina over the last week, and to what extent it had affected organic results pages. Unlike specific news engines (where I typically get my news), the organic algos are generally a little more sluggish (or stable, depending on your worldview). Still, when comparing results for [katrina] at large engines, I noticed several interesting results.
First, an analysis of the first page of Google results for the query [katrina]:
1. Red Cross root page (www.redcross.org). The #1 ranking for this site is due in large part to their massive viral banner campaign (which we found out about thanks to Threadwatch). Also at play here, in my opinion, is the phrase "katrina" in close proximity to the link to redcross.org in blogs and news stories. For example, "To help victims of Hurricane Katrina, please give to the Red Cross." This "proximity credit" is an ages-old part of the algo that accounts for web coders who use "click here" as anchor text instead of the nearby money phrase.
2. National Weather Service's National Hurricane Center and Tropical Depression Predictor (nhc.noaa.gov). In case you're curious, the NOAA stands for National Oceanic and Atmospheric Administration. If you're familiar with American bureaucracy, it should be no surprise that the NOAA is part of the US Department of (wait for it...) Commerce. The NOAA site, an XML/RSS smorgasbord, is to serious weather chasers what Slashdot is to 30-year-old gadget freaks living in their parents' basements. Plenty of legacy link pop here, probably using link text like Hurricane Katrina, since weather enthusiasts are keen to note (and anchor) the difference between tropical storm, tropical depression, hurricane, and so on. Also note that this site ranks in the top spot for [hurricane katrina] and [tropical storm katrina], so it's been building momentum for a while.
3. Wikipedia entry for Hurricane Katrina (en.wikipedia.org/wiki/Hurricane_Katrina). Yahoo shows over 24,000 links to this page, and at least right now, Wiki can do very little wrong in Google's eyes.
4. Home page of an on-the-ball web developer whose name is actually Katrina (katrina.com). She's had the domain for the better part of a decade, and probably had the top rank until about a week or so ago. Kudos to her - she's taken advantage of her rank to show tons of missing persons and disaster relief information. (Question to self: What's the moral and semantic opposite of a scraper called?)
5. Weather.com homepage (weather.com). Certainly helped by current radar images, slide shows, and the accompanying incoming links to those features. And it doesn't hurt that a "synonym search" at Google for [~hurricane] shows "weather" as a bolded phrase - meaning that Google's algo considers them very similar terms. LSI, anyone?
6. A Red Cross credit card contribution page tied to a specific site. Hard to say why Google picked this specific iteration of the "Contribute" page without further investigation. The link redirects from www to give.redcross.org, so the second subdomain (or maybe the https protocol) is probably the reason it's not an indented entry under the first Red Cross result at #1.
7. Official site of Katrina and the Waves (katw.com), '80s pop group behind such hits as "Walking on Sunshine" and ... um ...
8. See #7, but this time (katrinasweb.com) it's the personal site of the band's lead singer. Entries 7 and 8 are like the Moe Greene of [katrina] - making their bones when the Red Cross and National Weather Service were going out with cheerleaders.
9. FEMA (fema.gov), likely helped along by links discussing its handling of the hurricane aftermath.
10. Local New Orleans TV station (wwltv.com) offering a prolific blog about events in the city as they happen. Yahoo shows about 16,000 links directly to this page, and over 72,000 to the domain.
So what's the moral here? Google loves links - all shapes, sizes, and locations. Linking during national crises is a hyperbolized version of natural linking. It happens more quickly and in greater numbers. Still, there appears to be no temporary link devaluation (TLD) or backlink over-optimization devaluation (BLOOD) going on here. Is the onslaught of new links spread out over enough sites to appear natural (which it is), even with the accelerated timeframe? Apparently so, assuming there's no manual intervention going on.
In my next post, I'll tear apart Yahoo's SERP for the same query.
Hurricane Katrina and Fast-Acting Search Results
Posted by erik at 10:32 PM
August 17, 2005
Yahoo Gives Coke a Second Chance 
I recently discussed how Google gives Coca-Cola a pass by offering the top four SERP slots to the exact same content, for the simple query [coke].
To prove I'm not picking on Google, I wanted to show how the other big engines treat the same query, and it's Yahoo's turn. To be honest, I expected similar results from Yahoo, but they surprised me with an interesting twist.
Through the first three results of a search for [coke], things are going fine. coca-cola.com, then the Coke Music site, followed by Diet Coke. No real surprises. But then we hit the fourth result. Notice anything familiar about it?
You should. It's exactly the same as the first result. Exactly the same. Same title, same description (taken from the Yahoo Directory, which misspells the company name, of all things), same Yahoo Directory category, and the same destination site.
This is surely a glitch, one I haven't seen before (with little or big brands, for that matter). I won't speculate on the cause without further investigation, but I'm quite eager to try to reproduce this effect with other queries, and I'm very curious about how it affects clickthrough.
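For anyone who wants to check other queries for the same effect, here is a minimal sketch of how a repeated listing could be spotted once the results are in hand. It assumes each result has already been collected by hand into a dictionary with `title`, `description`, and `url` keys; the function name and the sample data are hypothetical placeholders, not actual Yahoo output.

```python
from collections import defaultdict


def find_duplicate_results(results):
    """Return lists of 1-based ranks that share the same title, description, and URL."""
    positions = defaultdict(list)
    for rank, result in enumerate(results, start=1):
        # Two listings count as duplicates only if all three fields match.
        key = (
            result["title"].strip(),
            result["description"].strip(),
            result["url"].strip().lower(),
        )
        positions[key].append(rank)
    return [ranks for ranks in positions.values() if len(ranks) > 1]


# Placeholder data only: the fourth entry repeats the first, as in the [coke] SERP.
sample = [
    {"title": "Result A", "description": "Desc A", "url": "http://a.example/"},
    {"title": "Result B", "description": "Desc B", "url": "http://b.example/"},
    {"title": "Result C", "description": "Desc C", "url": "http://c.example/"},
    {"title": "Result A", "description": "Desc A", "url": "http://a.example/"},
]
duplicates = find_duplicate_results(sample)
```

Matching on all three fields keeps the check strict, so two different pages that happen to share a title would not be flagged.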
Coke, if you want to share some clickthrough data with me, I'm all ears. And I promise not to refer to you as Coco-Cola.
Posted by erik at 06:25 PM
August 16, 2005
Coca-Cola and Breaks for Big Brands 
While the relevancy of engines like Google and Yahoo goes more or less unnoticed by most people, gripes often surface among hardcore webmasters and SEO types.
Often, they say, big brands get some breaks that the little guys don't. Whether it's coincidence or an algorithm in action, only a few people know for certain. But here's an example.
A Google query for [coke] offers three different domains in the top four results: www.coca-cola.com, www.cocacola.com, and www.coke.com. Now it's not surprising that several domains from the same organization come up. The same happens with all types of big brands, such as [ford], [honda], [microsoft], ... you name it. That's the benefit of a large, global footprint.
What differentiates the [coke] result is that all three domains have mirrored content. Clicking any of these results takes you to the same Flash file. You won't find too many small-ish brands that can pull that off - not for long, anyway.
The first thing that most SEOs would tell a company like Coke is to consolidate their URLs into one, probably using a 301 to point the two lesser sites into the domain with the largest index and IBL (inbound link) count. But with results like this, why should Coke listen?
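The consolidation advice above can be sketched in a few lines. This is only an illustration of the redirect logic, not how Coke's servers are or should be configured: picking www.coca-cola.com as the canonical host is an assumption made for the example (the post doesn't say which domain has the largest index and IBL count), and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for illustration: treat www.coca-cola.com as the canonical host.
CANONICAL_HOST = "www.coca-cola.com"
MIRROR_HOSTS = {"www.cocacola.com", "www.coke.com"}


def redirect_for(url):
    """Return (301, canonical_url) for a mirror URL, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in MIRROR_HOSTS:
        return None
    # Keep the path and query string so deep links land on the matching page.
    target = urlunsplit(
        (parts.scheme or "http", CANONICAL_HOST, parts.path or "/", parts.query, "")
    )
    return 301, target
```

A permanent (301) redirect, as opposed to a temporary 302, is the status code that signals the move is meant to last, which is why it's the usual choice when folding several domains into one.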
And don't blame Coke or label them "spammers." They're simply covering their online bases. Registered no later than 1997, each of these three domains has been around far longer than Google, so owning the above-the-fold area for a brand search isn't necessarily their motivation.
In my next post, I'll discuss how Yahoo treats the same query. Foreshadow alert: It surprises even me.
Posted by erik at 04:08 PM

