Crawling and Indexing Articles by SEO Speedwagon

July 01, 2008

Google to Index Flash Content ... Again erik

In a post last night entitled "Improved Flash Indexing," the Google Webmaster Tools blog reports that

We've improved our ability to index textual content in SWF files of all kinds. This includes Flash "gadgets" such as buttons or menus, self-contained Flash websites, and everything in between. ... In addition to finding and indexing the textual content in Flash files, we're also discovering URLs that appear in Flash files, and feeding them into our crawling pipeline—just like we do with URLs that appear in non-Flash webpages. For example, if your Flash application contains links to pages inside your website, Google may now be better able to discover and crawl more of your website.

This brings up several satellite issues:

  • Since it's been so difficult to index Flash content, a virtual cottage industry sprang up with ways to circumvent that disability, including methods like SWFObject, sIFR, user-agent-based delivery of plain text vs. Flash content, and so on. With these techniques becoming more sophisticated and easy to implement, is it likely that sites will abandon them soon?
  • It appears that for now, Flash files spawned when users fail a JavaScript test will still be uncrawlable, since engines, too, typically fail a JS sniffer.
  • If you have a SWF file embedded as only a part of a larger HTML page, trust me that you do NOT want only that SWF file being returned in search results. It typically looks awful, lacking both the size requirements you implemented and the critical navigation that resides in your HTML. The Webmaster Central post didn't say that SWF files would be returned in SERPs, so I'm not saying that's what will happen. But I've tested client sites by searching for strings of text that only appear in Flash files, and I've seen it happen. So test with your own site and cross your fingers.

I chose a somewhat sarcastic post title because ever since search engines and Flash have butted heads, the ability for engines to index text embedded in Flash files has been "just around the corner." In 2002, for example, hearts were briefly aflutter about the Macromedia Flash Search Engine SDK, which was going to be the end of engines' inability to index Flash content. Hear that? The end. 2002.

So I enter into this new era with guarded optimism. Optimistic because Google never releases anything "new" until it's been tested in the wild for months or years. Guarded because the "right" recommendation for clients is never quite as black and white as people think it will be.

Posted by erik at 09:18 AM | Comments (2) | TrackBacks (0)

June 04, 2008

A Guide to Robots Exclusion Protocol erik

Google's Prashanth Koppula wrote a ready-to-bookmark post over at the official Webmaster Tools blog, showing tons of different robots-exclusion protocol (REP) directives that can be implemented in various ways. Following is a listing of directives discussed and the methods of implementation:

Directives for the robots.txt file:

  • Disallow
  • Allow
  • $ Wildcard
  • * Wildcard
  • Sitemaps location

Meta tags for insertion into HTML:

  • NOINDEX
  • NOFOLLOW
  • NOSNIPPET
  • NOARCHIVE
  • NOODP

Of special note are the two different wildcard uses; the post links to usage models for each. One additional funny bit is in the explanation of NOARCHIVE, in which the post describes the tag's usage as "Do not make available to users a copy of the page from the Search Engine cache." Contrast this with "Do not cache the page," which I believe is most people's idea of the tag's effect. I love little semantic hooks like that.
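
To make the two wildcards concrete, here is a minimal Python sketch of how a crawler might test a URL path against a robots.txt pattern: `*` stands for any run of characters, and a trailing `$` pins the match to the end of the URL. The function name `rule_matches` and the sample patterns are made up for illustration; this is not any engine's actual matcher.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path.

    Illustrative only: '*' matches any run of characters, and a trailing
    '$' requires the match to end exactly where the URL ends.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' become '.*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, mirroring prefix matching of paths.
    return re.match(regex, path) is not None


# Hypothetical patterns and paths.
print(rule_matches("/private/", "/private/page.html"))  # plain prefix rule
print(rule_matches("/*.pdf$", "/docs/report.pdf"))      # ends in .pdf
print(rule_matches("/*.pdf$", "/docs/report.pdf?x=1"))  # does not end in .pdf
```

Without the `$`, the last pattern would also match the URL with the query string, which is exactly the distinction the two wildcards exist to express.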

The post notes that the directives above are observed by Google, Yahoo, and MSN/Live, which is a nice bonus. In addition, the post discusses some directives that only Google honors, such as UNAVAILABLE_AFTER (which I discussed about a year ago), NOIMAGEINDEX, and NOTRANSLATE.
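
For the five shared meta directives, the tag itself is simple enough to generate. The sketch below is a hypothetical helper (`robots_meta` is my own name, not something from the Google post) that builds a robots meta tag and rejects anything outside the shared list above.

```python
# The five meta directives the post lists as honored by Google, Yahoo, and MSN/Live.
SHARED_DIRECTIVES = {"noindex", "nofollow", "nosnippet", "noarchive", "noodp"}


def robots_meta(directives):
    """Build a robots meta tag from a list of directive names (hypothetical helper)."""
    cleaned = [d.strip().lower() for d in directives]
    unknown = [d for d in cleaned if d not in SHARED_DIRECTIVES]
    if unknown:
        raise ValueError(f"not in the shared directive list: {unknown}")
    return '<meta name="robots" content="{}">'.format(", ".join(cleaned))


print(robots_meta(["NOINDEX", "NOFOLLOW"]))
```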

I appreciate what engines are doing with the REP advancements. It's the equivalent of the basic Robotstxt.org protocol being the vehicle, but the engines have become after-market accessory specialists, showing you how to get additional mileage, power, and stunts out of your car.

Posted by erik at 08:07 AM | Comments (0) | TrackBacks (0)

May 23, 2008

Client Hits Home Run With: URLs Matter! doug

On this Friday before the long Memorial Day weekend, I thought I would share something very smart that a client recently said to me and a group of his colleagues.

In a nutshell, the discussion was about potentially rewriting dynamic URLs.

He said:

"One thing we can't forget is that URLs are marketing assets....they matter....they need to be friendly to both users and search engines."

For a moment, I felt like I was at Progressive Field, about to rise out of my chair and high five complete strangers around me after a Grady Sizemore home run.

I couldn't have said that better myself. Bravo!!!

Posted by doug at 02:01 PM | Comments (1) | TrackBacks (0)

April 30, 2008

Real and Imagined Errors in Google Sitemap Feeds erik

When you upload your XML sitemap feed to your server -- especially if it's GZipped -- don't expect it to look pretty. I got a nervous call from a client because when he called the XML feed URL in his browser, he saw this:

g-sitemap-error.jpg

While it looks like an error, it's really not. Not in the traditional sense, at least. The error here is that your browser (in this case, Firefox) isn't able to view the file without a little help -- specifically, a stylesheet that tells it how it should look to human viewers.

The bottom line is that this message doesn't mean that engines can't read your XML feed -- only that you can't see it. To see whether Google can process it, for example, check the Sitemap Summary report. For some reason, this report isn't in the main GWT left nav. To find it, you need to click the "Details" link at the far right of the Sitemap Overview report. When you click that link, here's what you see:

g-sitemap-error02.jpg

Real sitemap errors do exist, even in the example I used above. In this case, I've inadvertently included in the sitemap a URL that I also excluded via robots.txt. So I'm sending Google a mixed message there. Fortunately, the robots.txt file overrides the URL's inclusion in the sitemap, so it ends up being more of a gentle nudge than a true, crippling error. If the error doesn't specifically say that the sitemap is invalid and unreadable, then it's probably not.
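
You can catch this kind of mixed message before Google does. The following is a rough Python sketch, using only the standard library, that lists sitemap URLs which robots.txt disallows. The function name, the example.com URLs, and the sample robots.txt are all hypothetical, and the standard-library parser does not understand the `*` and `$` wildcard extensions, so treat it as a first-pass check rather than a mirror of Googlebot's behavior.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def blocked_sitemap_urls(sitemap_xml, robots_txt, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical inputs for illustration.
robots_txt = """User-agent: *
Disallow: /login/
"""

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/login/</loc></url>
</urlset>"""

print(blocked_sitemap_urls(sitemap_xml, robots_txt))
```

For a GZipped feed, the file would need to be decompressed (for example with the `gzip` module) before being passed in.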

Posted by erik at 08:52 AM | Comments (0) | TrackBacks (0)

April 23, 2008

Update on Google Showing Excluded URLs as Sitelinks erik

A little over a month ago, I wrote about Google showing robots-excluded URLs as Sitelinks. Here's a shot of what Google showed for the query [seo speedwagon] in mid-March:

Google showing an excluded URL as a Sitelink

The ip login link was (and is) excluded via robots.txt. A month prior (in February), a link to one of our monthly archives -- a page with the robots "noindex" meta tag -- appeared as a Sitelink also.

Since then, the SERP has been cleaned up. I use the passive voice because I don't exactly know who to thank. Either the algo picked it up on its own, or someone hand-washed it. Either way, it looks better now:

Sitelinks are all 'allowed' URLs now

I'm not sure if we're an isolated case, so if you have any examples of excluded URLs still showing up in Sitelinks, please let us know in the comments.

Posted by erik at 08:12 AM | Comments (4) | TrackBacks (0)

March 04, 2008

NYT Traffic Doubles, Revenue Grows Since Killing Subscriptions erik

John wrote a few times last fall about the NY Times tearing down its paid subscription wall and allowing spiders in.

Now, in an interview at The Deal, Google's David Eun (on p. 5) confirms that it was a good idea:

We have some partners that have made very bold steps, such as The New York Times, which went from a pay model to a free model. After they went free, the traffic they got from us alone doubled. Their math says they make more money by offering content free to consumers, but stimulating demand and making it work with advertising. The Financial Times did the same thing, and at least early on in the process they experienced at least a 100% growth in traffic.

Don't hold your breath waiting for further breakdown of the math, especially for the NYT example. Note that while Eun says traffic doubled, he was less specific about the money, saying only that "they make more" under the current scenario.

It should be no surprise that it's Google -- not the Times -- telling us the good news about expanded indexation. After all, Google has more to gain from all of us knowing about it, because it now gets a slice of the pie:

NYT Adwords premium ad

Thanks to BeetTV via SearchCap.

Posted by erik at 10:51 AM | Comments (0) | TrackBacks (0)

January 23, 2008

Duplicate Content - Thinking Inside the 'Big Box' Stores erik

SEOs (myself included) love to preach the Organic Gospel as if knowledge is the barrier, not implementation. "If people only knew this or that," we say, "they'd be saved."

But a lot of times, knowing the right stuff only leads to the next barrier, which is, "How the heck do we DO it?" Various facets of site maintenance -- user experience and tracking, to name only two -- frequently compete with SEO techniques for front-burner attention.

As an example, I want to look at Circuit City's site and show a couple examples of things they're doing that aren't optimal from an organic SEO perspective, but that are probably necessary for other reasons. (I should note that I LOVE the CC site from a user's perspective. They -- and a lot of the Big Box stores, to be fair -- have a great way to narrow and expand the choices to help you find exactly the product(s) you're looking for.)

The first example, however, is one I'm fairly critical of, because I feel like whatever benefit they're gaining probably isn't worth it. The CC home page has two links to the main "TV & Home Entertainment" page -- one in the top nav, and one in the lower set of links. You can see the links in the following screen shot, and I've listed them afterward:

Two links to the same content, but two different URLs

Here are the respective links. The dynamic strings diverge right after N=20012866:

http://www.circuitcity.com/ccd/categorySpecial.do?catOid=-12866&N=20012866&c=1

http://www.circuitcity.com/ccd/categorySpecial.do?catOid=-12866&N=20012866&SESSIONLINK&cm_re=011308%20HOME%20PAGE%20A-_-navboxes%20TV-_-TV%20and%20Home%20Entertainment

The content of the pages is identical, so you know where this is going. They're diluting the potential power of each by having a similarly named mirror.
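
One way to think about the fix is as a canonicalization step: strip the parameters that don't change the content and put the rest in a fixed order, so both links collapse to a single URL. Here is a minimal Python sketch of that idea. Which parameters are safe to ignore is my assumption based on the two URLs above (`c`, `SESSIONLINK`, and `cm_re` look like navigation or tracking state); only the site's developers would know for sure.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters carry tracking/session state, not content.
IGNORED_PARAMS = {"c", "SESSIONLINK", "cm_re"}


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop ignorable query parameters and sort the remainder."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((key, value) for key, value in pairs if key not in ignored)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


top_nav = "http://www.circuitcity.com/ccd/categorySpecial.do?catOid=-12866&N=20012866&c=1"
lower_nav = (
    "http://www.circuitcity.com/ccd/categorySpecial.do?catOid=-12866&N=20012866"
    "&SESSIONLINK&cm_re=011308%20HOME%20PAGE%20A-_-navboxes%20"
    "TV-_-TV%20and%20Home%20Entertainment"
)

print(canonical_url(top_nav))
print(canonical_url(top_nav) == canonical_url(lower_nav))
```

The canonical form could then be used as the single link target, or as the destination of a 301 from the variants.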

Let's look at a more complicated example. Consider two more URLs on the Circuit City site, and I'll contrast the query strings and the page contents. I've stripped out the "http://www.circuitcity.com" portion of the URL for brevity, but I've linked to them so you can see for yourself if you want. In addition, after each URL, I've shown the breadcrumb navigation from each page so you can see the subtle difference.

Page 1: TVs that cost $500-$999 that are Sony:
URL: /ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price|0||accm_grs_mgn_dllr|1/link/ref/N/20012866+20012867+312867003+40000229/link/ref/rpem/ccd/categorylist.do

Breadcrumb trail:
circuit-city-breadcrumb-1.jpg

Page 2: TVs that are Sony that cost $500-$999:
URL: /ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price|0||accm_grs_mgn_dllr|1/link/ref/N/20012866+20012867+40000229+312867003/link/ref/rpem/ccd/categorylist.do

Breadcrumb trail:
circuit-city-breadcrumb-2.jpg

The on-page content from these two URLs is identical. After all, no matter in what order you query the database, it should theoretically produce the same products (in this case, three specific TVs). But notice how the last two values in the N segment (312867003 and 40000229) are swapped between the two URLs, in effect causing a duplication. The content (and breadcrumb navigation) is generated based on the order in which the user selects search criteria. This makes a fantastic user experience -- no doubt about it. But it's hurting them subtly because engines either crawl too many pages and dilute each one's unique potential for ranking well, or, more likely, the bots hit a nav scheme like this, turn a few corners and crawl a handful of pages, then bail because they can recognize what a sinkhole it is.
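
If the breadcrumb order could be tracked somewhere other than the URL (in a session, say), the URL itself could always list the refinements in one fixed order. The Python sketch below shows the idea on the two paths above, treating the wrapped lines as a single path. The assumption that the segment following `/N/` is a `+`-separated list of refinement IDs is mine, inferred from the URLs, and the function name is hypothetical.

```python
def normalize_refinements(path, marker="N"):
    """Sort the '+'-separated IDs in the segment that follows /N/.

    Assumes the segment after the marker is a list of refinement IDs
    whose order has no effect on which products are returned.
    """
    segments = path.split("/")
    for i, segment in enumerate(segments[:-1]):
        if segment == marker:
            ids = segments[i + 1].split("+")
            segments[i + 1] = "+".join(sorted(ids))
            break
    return "/".join(segments)


page_1 = (
    "/ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price"
    "|0||accm_grs_mgn_dllr|1/link/ref/N/20012866+20012867+312867003+40000229"
    "/link/ref/rpem/ccd/categorylist.do"
)
page_2 = (
    "/ssm/Televisions/sem/rpsm/c/1/catOid/-12867/Ns/net_price"
    "|0||accm_grs_mgn_dllr|1/link/ref/N/20012866+20012867+40000229+312867003"
    "/link/ref/rpem/ccd/categorylist.do"
)

print(normalize_refinements(page_1) == normalize_refinements(page_2))
```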

About a month ago, a WebmasterWorld thread ($upport required) discussed a topic similar to this. Member PageOneResults discussed a client's site, which offers multiple paths and entry points to specific product URLs, with the final product URL varying based on the entry point used and the path taken to that product. Following is a response to his original post, followed by his reaction:

>>If the URL depends on the route taken through the site, then you have a major problem to figure out and fix.

Yes, we have a major challenge ahead of us in regards to the one example provided where there were 10 access points for one product. That takes into consideration 5 under www and 5 under non www which is what is happening.

I'm all for as many access points as can be possibly provided as to not hamper the visitor experience. And, as pointed out, as long as that product leads to the same URI from all access points, life is sweet. But, that is not the case...

One important note is that PageOneResults is aka Edward Lewis, who runs SEO Consultants (of which Intrapromote is a proud member). Edward has probably forgotten more about SEO in the last 12 hours than I have ever known, so when he asks for input, it's not due to lack of knowledge. The bottom line is, this stuff can get extremely complicated regardless of your SEO knowledge level.

Posted by erik at 11:35 AM | Comments (1) | TrackBacks (0)

December 19, 2007

Supplemental Index, We Hardly Knew Ye erik

The Google Webmaster Central Blog put another nail (the final one?) in the coffin of Google's infamous Supplemental Index just now by declaring it fully immersed into the main index:

We improved the crawl frequency and decoupled it from which index a document was stored in, and once these "supplementalization effects" were gone, the "supplemental result" tag itself -- which only served to suggest that otherwise good documents were somehow suspect -- was eliminated a few months ago. Now we're coming to the next major milestone in the elimination of the artificial difference between indices: rather than searching some part of our index in more depth for obscure queries, we're now searching the whole index for every query.

This is, in my opinion, much more significant than the prior act of simply removing the "Supplemental Index" label. The main problem has never been the label applied (or not applied) to URLs, but the fact (or at least the fear) that SI pages were being given short shrift in their efforts to contend for queries. So what's the intended result?

From a user perspective, this means that you'll be seeing more relevant documents and a much deeper slice of the web, especially for non-English queries. For webmasters, this means that good-quality pages that were less visible in our index are more likely to come up for queries.

Of course the onus doesn't fall entirely on Google here. SI pages were SI for a reason. If you think they're worth ranking for, the old rules still apply. Make sure you remove any obstacles to crawling and indexing that may remain, and try to get some additional links -- internal and external -- pointing to them.

Posted by erik at 12:29 PM | Comments (2) | TrackBacks (0)

November 28, 2007

Are You A Canonical Fascist? Stand Tall! john

We are sticklers with our clients when it comes to issues of content duplication, sometimes to the point, I think, of being viewed as Canonical Fascists. This can be annoying, much like fascism mostly can be annoying, so it is gratifying to see Mr. Google himself lay out just why such annoyance is worthwhile advocacy, even approaching the subject of PageRank Splitting in the process:

When I did a wget from the Googleplex, I eventually got a 301 from the seomoz.com url to the seomoz.org url. But look at the timestamps: " --09:28:33-- " was the initial fetch and "--09:32:41--" was when the 301 came over the wire. Assuming that I'm reading right, that means almost a four minute delay on getting the 301 from seomoz.com to seomoz.org. Googlebot will wait around for several seconds for a page, but it won't wait four minutes. Instead, the connection will time out and we'll treat those urls as separate (and think that we couldn't fetch the seomoz.com url). So if a bunch of people are linking to your article, and some link to seomoz.org and some link to seomoz.com, that PageRank is getting split between two urls, and the long delay on the 301 response can cause Google to believe that the urls are separate and therefore cause dupe issues.

Hat tip to Randfish for calling forth such manna in his heavily commented comments area.

Posted by john at 03:45 PM | Comments (0) | TrackBacks (0)

November 15, 2007

Tagging The Site Organic john

We spend a great deal of time on site structural issues with our clients, and one of the first things we usually do with a new client is try and transition them from thinking of SEO as a page-level concern to more of a holistic, organic discipline, one where we must try and understand the site architecture in its interdependent relationship between the whole and its parts. After all, organic is ultimately the moniker that won the day.

Almost invariably the large sites that we recognize are not living up to their potential are what we call top-heavy architecturally, in that the TLD so dominates all things search that even the main folder levels are all but invisible, let alone deeper, longer-tail-rich pages. As we explain the phenomenon we often find ourselves referring to blog structure, and how we might borrow some of the structural characteristics of a blog in discovering how to flatten out the top-heavy site. There are reasons blogs are so eminently crawlable.

One of those reasons is tagging, and I was pleased this morning to find a fellow tag-appreciator in Stephan Spencer, explaining his tag appreciation more eloquently than I have yet seen done to date:

Tagging isn't just a tool for usability (even though it's typically mostly thought of in those terms), it's also a powerful weapon for search engine optimization. That's because tagging allows you to rejig your internal hierarchical linking structure, flowing the link juice more strategically throughout your site. And because those links are textual and keyword-rich, a tag cloud is far superior in terms of SEO to the traditional graphical navigation bar.

Bravo, Stephan. Long live tag conjunction!

Posted by john at 07:41 PM | Comments (0) | TrackBacks (0)

October 22, 2007

HTML Form Selections Being Indexed james

I had a client write in and ask a question:

"How do you prevent an engine from indexing a huge list of form selections?"

Initially I started to put together my own list of different ways to block, move or get rid of huge amounts of worthless text but then I dug a little deeper and realized this particular client's situation wasn't a problem at all.

For one thing, the pages in question are buried deep in the SERPs and only show in the index for this query because they support one instance of the “keyphrase”.

My reply…

The reason the form selection(s) are being indexed in these SERPs is that the form HTML is the only place the "keyphrase" is mentioned on those pages. Google is taking the only text around that “keyphrase” and using it in the SERP. If the "keyphrase" wasn't in that form selection (which is read by Google as text), that page would never even come up for that query.

1. Create content on the page that supports the “keyphrase” and Google will use it instead.

2. Change the form to a non-crawled script such as JavaScript. Then you can hold the text in a separate file. This option will decrease your indexed page count for that query because the one instance of the word that Google was using for the SERP will now be gone, unless by chance there is some obscure link out there pointing to the page with “keyphrase” inserted.

3. Not a huge concern. We are more worried about what comes up in the first few pages of the SERPs… not the last few.

Yahoo has code to stop the indexing of text snippets but Google still does not.


As complicated as stuff can get sometimes, it’s nice to see the simple things come across my desk every once in a while.

Posted by james at 07:32 PM | Comments (0) | TrackBacks (0)

October 12, 2007

Site Down Time Causing Drops in the Google Index james

It's been a very busy week in Campaign Director land and now I'm ready to enjoy the weekend with my family. I've had to deal with many issues over the past few days, but one particular problem was a client's site down time and the effect this can have in the Google index.

Recently this client had been going through server changes and routine maintenance. Occasionally the site would be down for anywhere from a few minutes to a few hours at a time. Over the course of a few weeks they would lose and then regain positions (and indexed pages) in Google. The whole cycle (crawl, server-down error, the start of removal from the Google index, and reappearance in the index) took only a few days.

This topic was discussed earlier this year (Site Down Time Can Remove You From Google Index) - (Site Downtime Can Delist You From Google) and it was understood then that:

1. Google can remove your pages from the index if they are down even temporarily.

2. "Googlebot will try a few times before the pages drop from the index." ~ Vanessa Fox

3. "As for how long it takes for a page to get back in once the site is back up, that really depends on a number of factors, such as how often the site is crawled in general." ~ Vanessa Fox

4. "Sometimes temporarily down pages turn into truly-gone-forever pages, so we have to drop those pages at some point. But it’s also true that we go back and revisit those pages pretty often and try to recrawl them in case the site comes back up." ~ Matt Cutts

I know this isn't anything new but it's an issue that I currently deal with and will probably continue dealing with in the future. Site down time has been a factor when trying to diagnose index problems with client sites and will stay in my checklist. It took my client a few days to reappear in the Google index and in those few days they lost major positions and traffic. For some Ecommerce sites, this "little" problem can turn into a major situation in a hurry.

Monitor your uptime, try to keep the downtime to a minimum, and make sure your server returns the proper "server status" error.
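
As a rough illustration of that last point, here is a minimal Python WSGI sketch of a maintenance page. I'm assuming that the "proper" status for planned downtime is HTTP 503 (Service Unavailable) with a Retry-After header, which is what the HTTP specification provides for temporary outages; the quotes above don't name a specific code, and the one-hour retry value is an arbitrary example.

```python
def maintenance_app(environ, start_response):
    """WSGI app that answers every request with a temporary-outage status."""
    body = b"Down for maintenance. Please try again shortly.\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Seconds until clients should retry; 3600 is an arbitrary example.
        ("Retry-After", "3600"),
    ]
    # 503 signals a temporary condition, unlike a 404 or a 200 error page.
    start_response("503 Service Unavailable", headers)
    return [body]
```

To try it locally, the app can be served with the standard library, e.g. `wsgiref.simple_server.make_server("", 8000, maintenance_app).serve_forever()`. The thing to avoid is a "we're down" page served with a 200 status, or a server that simply refuses connections.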

See ya Monday, enjoy!

Posted by james at 03:24 PM | Comments (0) | TrackBacks (0)

October 01, 2007

Google Search Results Already Finding Columnist Articles john

Frank and Maureen and Thomas, oh my!

The chipped cement still has yet to be cleaned up fully from the wall being torn down at that historical error known as TimesSelect, and already we are seeing NY Times columnists able to commune with readers freely at point of search, at least at the Frank and Maureen level:
frank.jpg
maureen.jpg
As internet titan Alan Meckler noted in his posting of the Times e-mail to subscribers, search results like these were the driving force:

Since we launched TimesSelect, the Web has evolved into an increasingly open environment. Readers find more news in a greater number of places and interact with it in more meaningful ways. This decision enhances the free flow of New York Times reporting and analysis around the world. It will enable everyone, everywhere to read our news and opinion - as well as to share it, link to it and comment on it.

Sharing it, linking to it, and commenting on it are the currency of being able to find it in search, and that might be important to a newspaper if, as the latest surveys indicate, 91% of adults use a search engine to find information and 72% get news therefrom.

Ya think?

LATE UPDATE: We just noticed that, similar to 1989, another Eastern Bloc Web Site is about to topple...

Posted by john at 04:15 PM | Comments (0) | TrackBacks (0)

September 24, 2007

Topix TLD Migration -- Six Months Later erik

I'm a big fan of Rich Skrenta, co-founder of NewHoo (née Gnuhoo, which eventually took the more recognizable name DMOZ), co-founder of Topix.net, and -- what may be the coolest of all -- author of one of the first known computer viruses, one of the few to be written before the actual term "computer virus" was even coined.

So that's all very cool, but the search-related part of all this was how, six months ago, Topix finally purchased the .com version of its domain and decided to make the move away from .net. The Wall Street Journal, in a mainstream SEO article that actually managed to hit most of the salient points pretty accurately, highlighted Skrenta's anxiety at the global domain change:

Such a simple change, Mr. Skrenta has discovered, could have disastrous short-term results. About 50% of visits to his news site come through a search engine -- and about 90% of the time, that is Google. Some companies say their sites have disappeared from top search results for weeks or months after making address switches, due to quirky rules Google and other search engines have adopted. So the same user who typed "Anna Nicole Smith news" into Google last week and saw Topix.net as a top result might not see it at all after the change to Topix.com.

Like a lot of SEOs, at the time I wondered what was so "wrong" with the time-tested (at least in my experience) method of full-on 301 redirects from the old site to new -- especially since the code would be short and sweet, with each old .net URL going directly to its .com counterpart.
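
To show what I mean by "short and sweet," here is a sketch of the mapping logic in Python. The hostnames with the `www.` prefix and the sample path are assumptions for illustration, and in practice this would be a rewrite rule in the web server configuration rather than application code. The point is that a single rule covers every URL on the old domain.

```python
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url, old_host="www.topix.net", new_host="www.topix.com"):
    """Return (301, new_url) for URLs on the old host, or None otherwise.

    The path and query string are carried over unchanged, so each old URL
    goes directly to its counterpart on the new domain.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() != old_host:
        return None
    return 301, urlunsplit((parts.scheme, new_host, parts.path, parts.query, ""))


# Hypothetical path for illustration.
print(redirect_target("http://www.topix.net/city/detroit?page=2"))
```

Swapping the two host arguments gives the opposite direction (.com to .net) with no other changes.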

The Topix crew had apparently heard too many domain migration horror stories. On his own blog, Skrenta noted,

...there've been a whole bunch of the seo posts saying essentially "hey, it's easy to move a domain, you just 301 it." Of course I know about 301 and 302 redirects. The problem is that half of these people follow up and say "you'll only be out of the index for a few months". They also ignore the problems that big sites have. A redirect for a small site may work great, but if you have hundreds of thousands of pages or more, there are lots of cases where this caused some form of not-in-the-index-anymore doom.

The number of seo consultants who claim to know how to move a 100k+ page site is much smaller than the number who have actually done it.

That last point is a good one. Conventional wisdom in SEO is frequently spawned by 5% research and 95% extrapolation, which is often the best you can do. The other dirty little secret of SEO is that when you start seeing the same sort of anecdote often enough, it's tempting to put it in the "research" column.

Still, I'd not seen or heard of the type of monumental tragedies that Skrenta was talking about (at least within the last few years), and neither had Danny Sullivan:

I still remain surprised that the 301 is that much of a problem for even a big site. I just haven't heard of that trouble, of half the people saying you'll be out or whatever. If that's what you had been hearing, I can understand your concern. But it seems a pretty straight-forward change, and it shouldn't even be a burden on the server in that you're not actually talking about 100,000 of physical redirects that have to be created and check but a change of one domain to the other.

Ultimately, Topix listened to its heart (or maybe its board) and surprised me by opting for all-out duplication, running identical content on the .net and .com sites, avoiding any sort of redirection plan. So to call it a TLD migration isn't quite accurate. It's more like Topix.net bought a summer house and called it Topix.com. Again, from the WSJ article:

Concerned about that [redirection] strategy, Topix has run its site at both Topix.net and Topix.com for awhile. One danger with that approach is that it is unpredictable; Google will see two versions of the same page and could choose to show the Topix.net page most prominently.

This course would appear to run contrary to the advice even Matt Cutts gave in the article:

Google's Mr. Cutts says the search engine should ultimately understand what is going on when a site changes its Web address. He says the best strategy is to move one section of the site to the new address and see what happens before switching the whole thing.

Skrenta has since left Topix, but the duplicated domain strategy hasn't. Go to any page on the Topix.net site, and you'll see that the exact same page exists on the .com version of the site. Currently, Google shows about 2 million Topix pages indexed on the .net TLD and about 1.2 million on the .com TLD.

So six months ago, had you asked any SEO "expert" about what to do, almost no one would have suggested the present course. But has it hurt anything? Maybe, maybe not. Topix.com has a PR6 home page with about 1.2 million inbound links, while Topix.net has a PR8 home page with almost 7.4 million inbound links. So at this point, in terms of raw accumulated power, consolidating those domains would create one very powerful site.

But would that be better than the current situation? Perhaps not. What about that "unpredictability" the WSJ (and zillions of SEOs) talk about with dupe domains? That even in the best case, Google will pick one page or the other at its own discretion -- and it might not be the one you want? Consider the SERP for [detroit local news]:

In Motown, Topix has its cake and eats it too.

A page from Topix.net shows up in spot 8, and a page from Topix.com -- an exact mirror of the .net page -- comes in at spot 9. Not exactly Google choosing one over the other.

But if the pages are identical, why do they have different titles on the SERP? Because Topix.com is heavily linked to from DMOZ, and the .com page shown in this screen shot shows the title used on the Detroit News and Media category of DMOZ. This leads to a question: When Google pulls one page's title and/or meta description from DMOZ, does that override the duplication filter?

Detroit's hardly alone here. Do some searches yourself with a city name and "news" or "local news" and see for yourself.

There's really no moral to this story, other than every time something like this happens, the "guidelines" from engines lose more and more bite. I'm happy that (at least according to my superficial research) Topix is doing well; it's a great idea, smartly executed. But these SERPs are yet another frustrating case of mixed signals from engines.

In my opinion, in a case like this, once Topix owned the .com version, that was all that needed to happen. I don't think the .com even needed to have any content for the problem to be solved. Looking back, my advice would have been to keep all content on Topix.net and immediately set up a 301 from Topix.com to Topix.net -- the exact opposite direction that most people recommended. That way, the authority of Topix.net content would never have been in jeopardy, and any links that mistakenly found their way to Topix.com would immediately transfer link popularity (as well as the user) back to the main (.net) site. They could have even promoted the site as "Topix.com" with no major headaches. Almost no one would care (or even notice) if they were redirected, whether it was type-in traffic or someone that clicked over from a news story.

Topix TLD Migration -- Six Months Later
Posted by erik at 03:14 AM | Comments (0) | TrackBacks (0)

September 18, 2007

Search Tearing Down Walls Like It's 1989 john

We knew it was coming and we tried to bake a cake for Maureen Dowd more than a month ago, yet we are still surprised at how search-friendly they are being in their explanation today:

What changed, The Times said, was that many more readers started coming to the site from search engines and links on other sites instead of coming directly to NYTimes.com. These indirect readers, unable to get access to articles behind the pay wall and less likely to pay subscription fees than the more loyal direct users, were seen as opportunities for more page views and increased advertising revenue.

If you have any doubt that this is the SEO equivalent of 1989, scroll a bit further down the page for this money quote:

The Wall Street Journal, published by Dow Jones & Company, is the only major newspaper in the country to charge for access to most of its Web site, which it began doing in 1996. The Journal has nearly one million paying online readers, generating about $65 million in revenue.

Dow Jones and the company that is about to take it over, the News Corporation, are discussing whether to continue that practice, according to people briefed on those talks. Rupert Murdoch, the News Corporation chairman, has talked of the possibility of making access to The Journal free online.

Mr. Murdoch, tear down that wall!

Search Tearing Down Walls Like It's 1989
Posted by john at 03:53 PM | Comments (0) | TrackBacks (0)

September 12, 2007

302 Redirects: A Guide to Search-Friendly Usage erik

Here are a few of the "absolutes" in the SEO field:

  • Never use Flash
  • Never use cloaking
  • Never use 302 redirects

These are, of course, wild oversimplifications. Equating Flash or cloaking with bad SEO is like classifying Hank Aaron as a poor hitter because he batted cross-handed.

So when is a 302 redirect better than a 301 redirect? Here are a few examples. Keep in mind that in these scenarios, the 302 is not the only possibility to achieve the desired results. Plenty of CMS options likely exist that will do the same thing. It all depends on your setup. But in each of these cases, the 302 is certainly better than a 301. Let me say that one more time. Each of these scenarios likely has several non-redirect options that perform the same task. But the point of this post is that the 301 is not always the redirect you want, and sometimes it can hurt you if you buy into the "301 is the only good redirect" argument.

Example 1: New Products, Fresh Content. You run a cell phone information site, and one of your targeted phrases is [newest cell phones]. You have a URL called /newest-cell-phones.php, which is your users' go-to page for the latest in cellular technology.

For the last few days, the /newest-cell-phones.php page has redirected (via 302) to /lg-vx8350.php, which is the latest phone you've torn apart and reviewed. At the same time, you also have a static link in the LG portion of your site directly to /lg-vx8350.php, because you want to get that content crawled on its own as well. You're not particularly worried about dupe content issues, because tomorrow, you'll be done with your /nokia-2610.php page, and it will then become the target URL for the /newest-cell-phones.php redirect.

Example 2: Restaurant Content, Lazy-Susan Style. You run a popular restaurant with a large user base who check in each morning to see the day's menu. Because you buy local and fresh, you often have only a few days' notice about what you'll serve on any given day. You've done some user testing and found that your customers HATE the dreaded "giant PDF menu download" they find at most restaurant sites. Consequently, you have a link on your home page directly to a URL called /todays-menu.htm. In addition, you have seven other pages:

  • /sunday-menu.htm
  • /monday-menu.htm
  • /tuesday-menu.htm
  • /wednesday-menu.htm
  • /thursday-menu.htm
  • /friday-menu.htm
  • /saturday-menu.htm

On Monday, for example, /todays-menu.htm uses a 302 redirect to /monday-menu.htm. The next day, it redirects to /tuesday-menu.htm, and so on.

Would a 301 work here too? Absolutely not. You want /todays-menu.htm to remain in the indexes and rank for terms like [restaurant name menu]. You do NOT want a URL like /wednesday-menu.htm to become associated with [restaurant name menu], because it's counterintuitive for such a URL to appear in SERPs (at least six-sevenths of the time!), and you have no control over what URL the engines will choose to replace /todays-menu.htm in the index, since you don't control what day they crawl it.
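Here is one way the rotating menu redirect could be sketched in Python. The respond() function and its return shape are hypothetical, invented for this illustration; the same logic could sit in any server or framework that lets you set a status code and a Location header.

```python
from datetime import date
from http import HTTPStatus

# Daily menu pages from the example, indexed by date.weekday() (Monday == 0).
DAILY_MENUS = [
    "/monday-menu.htm",
    "/tuesday-menu.htm",
    "/wednesday-menu.htm",
    "/thursday-menu.htm",
    "/friday-menu.htm",
    "/saturday-menu.htm",
    "/sunday-menu.htm",
]


def respond(path, today):
    """Return (status, headers) for a request path on a given date."""
    if path == "/todays-menu.htm":
        # 302: the target changes daily, so URL A should stay the indexed URL.
        return HTTPStatus.FOUND, {"Location": DAILY_MENUS[today.weekday()]}
    # Every other path is served normally in this sketch.
    return HTTPStatus.OK, {}


status, headers = respond("/todays-menu.htm", date(2007, 9, 10))  # a Monday
print(int(status), headers["Location"])
```

Passing the date in as an argument, rather than reading the clock inside the function, makes the rotation easy to check for any day of the week.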

So what do these examples have in common? Here are some guidelines when deciding whether 302 is the way you want to go:

You could consider using a 302 when, for the following redirect:

URL A ---> URL B

  • ...it is important that URL A be indexed and remain indexed.
  • ...it's not critical that URL B be indexed, but its content is very helpful for users.
  • ...you have several different URLs that might fit logically in the URL B spot above.
  • ...you have put some resources into strong internal and directory linking for URL A.

As I said before, a 302 is not the only way to achieve this. Plenty of dev environments and independent scripts will enable you to achieve this also, but depending on your setup, a 302 might be the easiest way.

A little background on the often-misunderstood 302: If you spend any time slogging through HTTP header documentation (and who doesn't?), you've probably noticed that the 307 -- not the 302 -- is the true "temporary redirect." But older clients apparently don't always know how to handle a 307 properly, and the 302 does more or less the same thing.
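For reference, Python's standard library carries the official reason phrases for these status codes, which makes the naming point easy to see. This small sketch only prints what the http module defines; it says nothing about how any particular client or crawler treats each code.

```python
from http import HTTPStatus

REDIRECT_CODES = (
    HTTPStatus.MOVED_PERMANENTLY,   # 301
    HTTPStatus.FOUND,               # 302
    HTTPStatus.TEMPORARY_REDIRECT,  # 307
)


def describe(status):
    """Format a status as '<code> <reason phrase>'."""
    return f"{status.value} {status.phrase}"


for status in REDIRECT_CODES:
    print(describe(status))
```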

Plenty of development environments use 302 redirects to direct users to the "appropriate" version of the home page when the root is called. For example, if you go to www.adidas.com, you'll be redirected (twice, actually) to the appropriate language and country version of the home page -- in my case, /us/shared/home.asp. This, in my opinion, is not the best environment for a 302 redirect. (See Bill Slawski's excellent analysis of 302s at domain roots from earlier this year.)

The downside to the whole approach I've laid out here is that it's sometimes hard to get good external links to "URL A," because by the time users click it, they've been redirected, and if they copy the URL from their browser, it's "URL B." Some good directories will allow you to submit a URL that redirects, as long as it redirects to a page on the same domain. But to get people to link to URL A, you'll need to make a conscious effort to give them the correct URL.

302 Redirects: A Guide to Search-Friendly Usage
Posted by erik at 04:55 PM | Comments (0) | TrackBacks (0)

September 05, 2007

Boost Visibility With XML Sitemap Submission to Ask.com doug

We have been keeping a close eye on Ask.com and are seeing traffic increases from Ask.com for some of our clients. In fact, one e-commerce client over this summer has seen transactions more than double from Ask.com referrers. An increase of more than 100% in transactions - now that really gets our attention!

We've shared with Wagon readers about submitting an XML sitemap to Google and to Yahoo, and updated readers about the potential of submitting to MSN this fall. I thought I would remind readers that you can also submit your XML sitemap to Ask.com and provide some simple how-tos.

There are two ways to submit your XML sitemap to Ask.com:

1. Use the auto-discovery directive in your robots.txt file:

SITEMAP: http://www.yoursitemapurl.xml

2. Submit your sitemap via Ask.com's ping URL:

http://submissions.ask.com/ping?sitemap=http://www.yoursitemapurl.xml
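Both submission methods boil down to string building, which a short Python sketch can show. The example.com sitemap URL is a placeholder, the directive is written with the capitalization the sitemaps.org documentation uses, and percent-encoding the sitemap URL in the ping is an assumption on my part (the plain example above does not encode it). The sketch only builds the strings; it does not contact Ask.com.

```python
from urllib.parse import urlencode

# Ping endpoint as given in the post.
ASK_PING_ENDPOINT = "http://submissions.ask.com/ping"


def robots_directive(sitemap_url):
    """Line to add to robots.txt for sitemap auto-discovery."""
    return f"Sitemap: {sitemap_url}"


def ask_ping_url(sitemap_url):
    """Ping URL with the sitemap location as a percent-encoded parameter."""
    return f"{ASK_PING_ENDPOINT}?{urlencode({'sitemap': sitemap_url})}"


sitemap = "http://www.example.com/sitemap.xml"  # placeholder URL
print(robots_directive(sitemap))
print(ask_ping_url(sitemap))
```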

Lastly, make sure you're using the accepted sitemaps protocol.
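If you generate the file yourself, a bare-bones sitemap in the sitemaps.org 0.9 format can be produced with Python's standard library. This is a minimal sketch: the URLs are placeholders, optional tags such as lastmod are left out, and the XML declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal <urlset> document with one <url><loc> per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap(["http://www.example.com/", "http://www.example.com/about.htm"]))
```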

Boost Visibility With XML Sitemap Submission to Ask.com
Posted by doug at 03:44 PM | Comments (4) | TrackBacks (0)

August 30, 2007

New MSN Tools Coming Includes Sitemaps doug

Back in April, I updated Wagon readers on the latest in Sitemap and Sitemap Protocol news and now it's time for a quick update.

Last week on MSN Live Search's blog, MSN trumpeted their new Webmaster Portal that will allow sitemap creation and submission. A beta program has been started and the Wagon has applied for a test drive.

Along with these sitemap features, MSN announced that the Webmaster Portal will also include crawling and indexing tools as well as statistics about web sites. As we've said in the past, these statistics can be very helpful.

Stay tuned for news on the beta program. Official launch of the Webmaster Portal is expected in early Q4.

New MSN Tools Coming Includes Sitemaps
Posted by doug at 09:15 AM | Comments (0) | TrackBacks (0)

August 15, 2007

NY Times Select(s) Death over Charade john

As you probably know, the NY Times has been the most prominent experiment in the paid content-behind-a-firewall-yet-at-least-partially-indexable model, and they are indeed now, finally, announcing via trial ballooning they are no longer going to put their most popular columnists behind that magic curtain one has to pay to sweep aside. After the magic show ends and the same fingers which initially drew the curtain are finished being pointed this way and that, this failed experiment will have had much to do with the principles of Link Building.


A party-goer cloaks her content as Maureen Dowd. Found on Flickr. Copyright 485i

First a great quote that helps explain the decision's relevance to our industry:

But the truth of the matter is that you get far more eyeballs when you're not locking away your content from the general public. The reality of Web 2.0 news is that people a rising tide raises all the ships. If you've got good content, and the Times does, people will link to it. When people read a technology blog like Engadget or a political blog like Daily Kos and find links to articles at the New York Times, everybody wins. Keeping your archives, op-eds, and other content locked up means that blogs and news sites won't link to you, won't give you credit for finding a story first, and won't drive up your traffic.

This lack of inbound links to the content-behind-the-firewall damaged traffic to the site not only through a paucity of visitors being able to click on these links to the columns themselves...:

...the share of traffic that the NY Times sends to NY Times Select has been decreasing over the past year – down by 16% year-on-year in July. With NY Times Select receiving more than two thirds (67%) of its US traffic from NYTimes.com, the decline had an impact with US visits to NY Select down 22% in the past year.

...in having to rely far too heavily on the parent site rather than third party links for traffic, but also in the residual effect such had in these columns' search engine visibility. With few third party inbound links accumulating with each new column, in fact from a deliberate online community decision not to link to content-behind-a-firewall, it is also very difficult for each new column to be judged more relevant than similarly themed columns emerging on the same topic that immediately acquire inbound links in the form of the same online community recommending them. It's no wonder the Times Select had to rely so heavily on clicks from the parent site for visits, as a great many of those visits were likely already subscribers. In that situation it is difficult to grow at the rate of the internet. Try these two simple searches for Frank and Maureen alone: nary a column to be found. Haven't they written quite a few?

I think everyone likely to read this blog knew this would happen. But to say we knew it would happen ultimately is not to say we are not happy to see even giants felled by an algorithm rejected, not select(ed).

NY Times Select(s) Death over Charade
Posted by john at 03:35 PM | Comments (0) | TrackBacks (0)

August 08, 2007

Download all query stats for this site (including subfolders) john

I get the feeling that most people, even in our industry, using Google Webmaster Tools for themselves or a client aren't scrolling far enough on the Query Stats page to reach this link:

Screenshot: the "Download all query stats for this site (including subfolders)" link

What you get if you click is rather unwieldy, sure, especially if you are dealing with a very large site, but the payoff is large to the same degree. We are beginning to view it more and more here as a kind of matrix for how Google views your site architecturally, especially in light of GSI now having been moved to an undisclosed location. Actually, now that I've said it I'm a bit afraid it, too, will be taken away...

Download all query stats for this site (including subfolders)
Posted by john at 02:59 PM | Comments (0) | TrackBacks (0)

August 01, 2007

Google Supplemental Index Label Formally Dropped erik

Yesterday's announcement from Google's Official Webmaster Central Blog represents a formal declaration of what Matt Cutts hinted at in a recent SEOMoz post: The "Supplemental Result" label that used to appear for pages in Google's "junior varsity" index will no longer appear.

Do not misinterpret this as "Supplemental Result pages no longer exist." They still do, if you read the Google post with subtlety. The gist of it is that Google crawlers are now able to cruise through these pages with more frequency and reliability. This apparently negates the need for a special label: while the two indexes are certainly not treated the same, the differences appear to be waning.

Personally, I don't really mind that this is disappearing, because even in the best cases, it wasn't always easy to determine if pages were "really" in it, and people seldom agreed on what caused pages to be there (despite several Googlers saying exactly what got you there). Still, the presence of Supplemental pages forced webmasters to figure out some better ways to organize their sites, which is the silver lining.

With a good analytics program, you have the capability of seeing how many pages on your site are performing well or not performing at all. Seeing them labeled as "Supplemental" was a shortcut to diagnosis, but its absence is hardly cause for panic.

Google Supplemental Index Label Formally Dropped
Posted by erik at 07:06 AM | Comments (0) | TrackBacks (0)

July 06, 2007

Google Weighs in on Image Replacement (sIFR) erik

On the rare occasion when an engine expresses an actual opinion on a real technique, it's a welcome, welcome sight. So imagine my glee when I read Google Webmaster Central Blog's take on dealing with Flash.

While John lamented the mixed signals just last month, we've been asking the question internally for years: From a best practices standpoint, does image replacement stand safely in the DMZ of glorified CSS, or does it boldly encroach on the characteristic of "showing engines one thing and users another"?

And that's just the beginning. The real problem, when you're using image replacement, is not the insertion of stylized copy, but instead, what you do with the HTML text you're replacing. Some systems simply let it lie underneath the script-spawned Flash layer, while some use "hidden" status in CSS, while still others pull it off the visible screen and hard-code it somewhere in the -5000px range -- each of which is detectable and grounds for a good spanking if your motives are anything but pure. Traditionally, that left us with the worries of trusting the algorithm to detect our motives.

But forget all that, because today we know, and we know it based on the way all such things are Known -- because it's mentioned in an official Google blog, midstream in a list of "practical suggestions" about how to deal with Flash:

sIFR: Some websites use Flash to force the browser to display headers, pull quotes, or other textual elements in a font that the user may not have installed on their computer. A technique like sIFR still lets non-Flash readers read a page, since the content/navigation is actually in the HTML -- it's just displayed by an embedded Flash object.

This proclamation, coming on the heels of Independence Day, is fitting, because no longer are we bound by the tyranny of not knowing on whose side of the fight sIFR truly sits.

Google Weighs in on Image Replacement (sIFR)
Posted by erik at 07:20 AM | Comments (0) | TrackBacks (0)

June 26, 2007

Checking Supplemental Index Status for URLs in Large Sites erik

For sites with fewer than 1000 pages, it's possible (if monotonous) to see which URLs are in Google's Supplemental Index. Simply run a site: command for your domain (example) and scroll through the results pages until you start to see "Supplemental Result" next to some of the URLs.

But what if your site has 50,000 pages and the supplemental results don't start until the final 10,000? Even the fairly common site:domain.com *** -view query isn't totally accurate, and it's still subject to the 1000 URL display limit.

Depending on which case you find yourself in, it can be either tedious or impossible to detect whether a specific URL is Supplemental.

Using our blog site as an example, suppose I suspect -- but can't confirm -- that an old post about Yahoo Sitemaps is in the SI. A simple info: query doesn't tell you whether the URL is supplemental or not. For example, the following shot came from the query:

[info:http://seoblog.intrapromote.com/2006/11/an_update_on_ya.html]

An info: query does not display Supplemental status

Instead, a quick way to check Supplemental status is to pull a unique string from the URL in question (such as a folder or filename) and tack it into an inurl:-filtered site: query. In other words, the following shot came from this query, in which I added the filename (minus extension) into the inurl: command:

[inurl:an_update_on_ya site:seoblog.intrapromote.com]

Using inurl: in a site: command will show Supplemental status

In this result, note the Supplemental Index status.

The bottom line is to find an inurl: string that will quickly filter down the site: query results so that your specific URL shows up quickly.
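If you check a lot of URLs this way, the query can be assembled automatically. The Python sketch below uses the filename minus its extension as the inurl: string, as in the example above. The function name is made up for illustration, and for some URLs a folder name may filter better than the filename.

```python
import posixpath
from urllib.parse import urlsplit


def supplemental_check_query(url):
    """Build an inurl:-filtered site: query from a page URL."""
    parts = urlsplit(url)
    filename = posixpath.basename(parts.path)     # e.g. an_update_on_ya.html
    stem, _extension = posixpath.splitext(filename)
    return f"inurl:{stem} site:{parts.netloc}"


print(supplemental_check_query(
    "http://seoblog.intrapromote.com/2006/11/an_update_on_ya.html"
))
```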

Checking Supplemental Index Status for URLs in Large Sites
Posted by erik at 08:50 AM | Comments (5) | TrackBacks (0)

June 14, 2007

The Google Supplemental Index Inbound Link(s) Threshold john

Wagon Rider Pat Fusco penned a great primer this month on the issue of Google’s Supplemental Results as they are tied to duplicate content, if you’d like to orient yourself first. Our own Wagoneer Doug offers some points on Getting Out of Hell Free (that is, at least, without requiring direct payments to Google), the last method of which involved examining backlinks, and the fact that a great number of pages in the GSI we have examined across the massive sites we spend time with each day seem to share the commonality of zero inbound links.

We are finding, increasingly, that the distance from zero to one in terms of inbound links to a page seems to be much more of a threshold for exiting the Google Supplemental index than, say, 2 to 100. This is not to say that 1 gets you out, bada bing, but that there is a great deal more love granted from Google on that single giant step from nil to 1 than there seems to be on the next link steps a page takes out of infancy.

A baby’s first steps are much more exciting and remarkable than the subsequent toddling around the room that follows, and it may be helpful to think of Google watching a page with no links in the same manner.

The Google Supplemental Index Inbound Link(s) Threshold
Posted by john at 08:49 PM | Comments (3) | TrackBacks (0)

June 12, 2007

A Quick Route to the Google Text Cache erik

I am constantly looking at cached copies of web pages through Google's cache. It's a wonderful way to double-check that your content is showing up the way you want it to and quickly tell if Googlebot has noticed any recent changes you've made. It's also a great way to sniff out some rather "brave" techniques on behalf of your competition.

As a primer, you can view the cached version of any web page by typing cache:URL in a Google search box, where URL is any full URL string. For example,

cache:www.united.com

will show a cached copy of the home page of United.com:

United's Google cache

But beyond the normal cached version, I prefer to look at the text-only cache. The pink rectangle above shows the link to the text-only cached version. Clicking this shows you a version of the page much more like what a robot really sees.

But sometimes because of the site layout, the box at the top of the cached page doesn't appear. Consequently, the link to the "text only" version of the cached copy is hidden -- as in this sample shot from the BMW USA home page:

Because of the way BMW is laid out, the link to the text-only cache is hidden

When this happens, the text-only cached version is still not hard to find. Simply append

&strip=1

to the URL of the regular cached version, and you'll see the text-only cache.

Now to take this to the next level (if you're a Firefox user ... and you are, right?), here's how to use Firefox Quicksearch Bookmarks to find the cache of a page so fast you'll amaze your friends and stun your competition.

(If the concept of Quicksearch Bookmarks is new to you, get some background here and here. It's a way to search any site from the Firefox address field. Trust me: If you search a lot, you will LOVE this.)

When you create a new Quicksearch bookmark, here's the data to use:

Name: Google Text Cache
Location: http://72.14.209.104/search?q=cache%3A%s&strip=1
Keyword: tc

So typing this in Firefox's Address field:

tc www.yahoo.com

will show you this. Pretty cool, huh?
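The Quicksearch bookmark is just a URL template with %s standing in for whatever you type after the keyword. Here is a small Python sketch of that substitution, using the Location value from above. The host in that value is the one from the bookmark as written and may not answer today; percent-encoding the typed text is an assumption about how the substitution behaves.

```python
from urllib.parse import quote

# Location value from the Quicksearch bookmark above.
TEXT_CACHE_TEMPLATE = "http://72.14.209.104/search?q=cache%3A%s&strip=1"


def text_cache_url(page):
    """Fill the bookmark template: %s becomes the percent-encoded page address."""
    return TEXT_CACHE_TEMPLATE.replace("%s", quote(page, safe=""))


print(text_cache_url("www.yahoo.com"))
```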

A Quick Route to the Google Text Cache
Posted by erik at 02:24 PM | Comments (0) | TrackBacks (0)

June 05, 2007

SEO Case Study: Press Release Archives erik

We recently worked on the press release archive for a pretty large company and I wanted to show an informal case study about what happened.

Here's the initial structure:

The original structure of the press release archive

Each of the smallest document icons represents a specific press release. The issues with this architecture are fairly obvious when displayed graphically. It's a linear linking format, with the main press page showing the most current releases. You need to click a "next" button to hit a page linking to releases 11-20, 21-30, etc. etc. So the oldest releases were literally dozens of clicks away from the home page.

With this setup, here are the baseline stats. I can't use real numbers, but I'm hopeful that your algebra knowledge is current enough that variables will suffice:

  • Total releases indexed: X (representing about 6% of total possible pages)
  • Search traffic (monthly visits from search engines where the landing page is a press release): ~Y visits / month

Here's the modified structure:

The modified press archive structure

The main press page now has child pages devoted to releases from specific years, as well as pages devoted to releases based on their subject area. Consequently, each release has links from at least two internal pages, and no release is further away from the home page than 3 clicks.
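To make the click-depth difference concrete, here is a small Python sketch that models both structures as link graphs and measures the deepest release with a breadth-first search. The 300 releases, 10 releases per page, and 10 year pages are invented numbers, not the client's, and the subject-area pages are left out to keep the model small.

```python
from collections import deque


def click_depths(links, start="home"):
    """Breadth-first search: fewest clicks from start to each reachable page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def linear_archive(n_releases, per_page=10):
    """Original structure: paginated list pages chained by a 'next' link."""
    links = {"home": ["press-1"]}
    n_pages = (n_releases + per_page - 1) // per_page
    for p in range(1, n_pages + 1):
        first = (p - 1) * per_page + 1
        last = min(p * per_page, n_releases)
        links[f"press-{p}"] = [f"release-{i}" for i in range(first, last + 1)]
        if p < n_pages:
            links[f"press-{p}"].append(f"press-{p + 1}")
    return links


def hub_archive(n_releases, n_years=10):
    """Modified structure: the press page links to year pages, which link to releases."""
    links = {"home": ["press"], "press": [f"year-{y}" for y in range(n_years)]}
    for y in range(n_years):
        links[f"year-{y}"] = []
    for i in range(1, n_releases + 1):
        links[f"year-{i % n_years}"].append(f"release-{i}")
    return links


def deepest_release(links):
    """Largest click depth among release pages."""
    depths = click_depths(links)
    return max(d for page, d in depths.items() if page.startswith("release-"))


print(deepest_release(linear_archive(300)))  # deepest release in the paginated chain
print(deepest_release(hub_archive(300)))     # deepest release with year hub pages
```

In this toy model the paginated chain puts the oldest release 31 clicks from the home page, while the hub layout keeps every release within 3.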

Stats 45 days after implementation:

  • Total releases indexed: 16.8X (representing >98% of total possible press releases)
  • Search traffic (monthly visits from search engines where the landing page is a press release): ~2.4Y visits / month

A couple important notes:

Don't fool yourself. This is content that people actually search for in pretty significant numbers. If you write press releases just to write press releases -- and your content doesn't have the pent-up demand to justify it -- don't expect results like these. Jill Whalen wrote a smart article about this very topic at Search Engine Land.

Sitemaps aren't a back door. For 6+ months prior to the change implementation, all press release files were included in a Google XML sitemap file. So don't expect a sitemap feed to dramatically increase your indexing if your site's architecture doesn't back it up.

SEO Case Study: Press Release Archives
Posted by erik at 03:50 PM | Comments (1) | TrackBacks (0)

June 01, 2007

Flash, Javascript, CSS, Ajax, sIFR, and Textual Image Replacement... Oh My! john

Not just because I am somewhat easily confused, but as our title today suggests, the overlap, literally and figuratively, among all of these web