
November 18, 2008

Duplicate Content Penalties in the Age of Mash-Ups, Feeds, and APIs

I sat in on an excellent Pubcon session last week that was called “Getting Rid of Duplicate Content Once and for All.”

For those of you not familiar with the issue, it’s a big one in the SEO community (a site search for “duplicate content” on SEO forum WebmasterWorld yields 63K+ results).  Duplicate Content has long been one of those hazy issues that nobody could really give you a clear answer on.

The most common questions go like this: Can syndicating out your site’s content adversely affect your site’s ability to attract organic search traffic?  Can scrapers outrank you for your own content?  Does the so-called “duplicate content penalty” exist for sites that syndicate too much of their content from elsewhere?  And can duplicate content on one page of a site hurt the organic search rankings of unique content on that same domain?

The panel was moderated by Rand Fishkin, one of the better known SEOs, and the panelists were Ben D’Angelo, engineer and Director of Duplicate Content at Google, Priyank Garg of Yahoo!, and Derrick Wheeler, an SEO from Microsoft.

Even the rumor of something called a Google-applied “duplicate content penalty” should be a frightening thing for the social media crowd.  Frankly, the social web is very much about duplicate content nowadays: from RSS feeds, to reblogging comments, to exposing and consuming APIs, to mashing up services.  If Google were to penalize sites in the organic SERPs for either exposing content to other sites or consuming content from them, it could put a real damper on the innovation of remixed services.

This summary post is based on what I’ve seen with my own sites, what I’ve read on the subject, and the official positions of Google and Yahoo! as presented in last week’s sessions.

The Mechanics of Duplicate Content

Duplicate content, similar or identical content that appears on more than one url, is problematic for search engines.  Search engine users don’t like to see the same content under different pages in the organic search results, and crawling and indexing duplicate content is an unnecessary drain on search engine resources.

Both Google and Yahoo! attempt to deal with duplicate content the same way:

Step One is to cluster similar pages (from either the same domain or different domains)

Step Two is to decide which of those pages is the “best” representative of that cluster

Stuff can go wrong during both steps of this process.  The wrong pages might be clustered together (meaning content that is not actually duplicate gets treated as if it were).  And Google / Yahoo! might choose the wrong page as the best (i.e. “original”) source of the content.

The guiding principle for both Google and Yahoo! is to display one url in the index for one piece of content.  So if something goes wrong with either Step One or Step Two, the original producer of the content will be left out of the organic search results.
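Neither engineer described how the clustering or the selection actually works, so the following is only a toy sketch of the two steps, under my own assumptions: similarity is measured as word-shingle overlap (Jaccard), and the "best" page is the one with the most inbound links. The function names, the 0.8 threshold, and the inbound link signal are all hypothetical and are not anything Google or Yahoo! confirmed.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, threshold=0.8):
    """Step One: group URLs whose text is similar enough.

    pages maps url -> page text. The threshold is an assumed value.
    """
    parent = {url: url for url in pages}

    def find(url):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[url] != url:
            parent[url] = parent[parent[url]]
            url = parent[url]
        return url

    sigs = {url: shingles(text) for url, text in pages.items()}
    for a, b in combinations(pages, 2):
        if jaccard(sigs[a], sigs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for url in pages:
        clusters.setdefault(find(url), []).append(url)
    return list(clusters.values())


def pick_representative(cluster, inbound_links):
    """Step Two: pick one URL per cluster.

    Assumed signal: most inbound links wins, shorter URL breaks ties.
    """
    return max(cluster, key=lambda url: (inbound_links.get(url, 0), -len(url)))
```

Note what happens in Step Two if a copy has more inbound links than the original: the copy wins, and the original publisher is the one filtered out. That is the failure the questions at the top of this post are worried about.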

Types of Duplicate Content

There are two types of duplicate content:

Duplicate Content on a single domain

Duplicate Content on different domains

Duplicate content within your domain is a hassle for two primary reasons: 1) it unnecessarily taxes your servers as search engine crawlers hit these pages; and 2) it can use up your site’s quota of indexed pages, pushing good, unique pages out.

Duplicate content on a single domain can result from any number of factors.  Perhaps the site owner has chosen to point multiple URLs at a single piece of content.  Maybe the site owner has forgotten to redirect the non-www domain to www, or vice versa.  (It’s important not to have both www.mysite.com and http://mysite.com available to search engine crawlers, because it can dilute your ability to rank.)  Pages with pagination, sorting options, or other parameters in the URL are notorious sources of duplicate content.  No matter how you sort the page (title, size, color), the content on the page is still the same.  You only want to show one page to the search engines.

More examples include print-friendly pages and URLs that change depending on which link you hit (www.rateitall.com vs. www.rateitall.com/default.aspx).
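One way to find these duplicates on your own site is to collapse every URL variant to a single canonical form and see which ones collide. Here is a minimal sketch of that idea using Python's standard urllib.parse module. The parameter names in CONTENT_NEUTRAL_PARAMS, the list of default documents, and the /lists/top10.aspx path in the usage note are assumptions for illustration; you would substitute whatever your own site actually uses.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed examples of parameters that change presentation but not content.
CONTENT_NEUTRAL_PARAMS = {"sort", "order", "print", "sessionid"}
# Assumed file names that serve the same content as the bare directory URL.
DEFAULT_DOCUMENTS = {"default.aspx", "index.html", "index.htm"}


def canonical_url(url, preferred_host="www.rateitall.com"):
    """Collapse www/non-www, default documents, and sort-style parameters."""
    parts = urlsplit(url if "://" in url else "http://" + url)

    # Fold the bare domain and the www domain into the preferred host.
    host = parts.netloc.lower()
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host

    # Treat /default.aspx and friends as the directory itself.
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"

    # Drop content-neutral parameters and sort the rest for a stable order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in CONTENT_NEUTRAL_PARAMS
    )
    return urlunsplit((parts.scheme, host, path, urlencode(query), ""))
```

With this sketch, www.rateitall.com and www.rateitall.com/default.aspx both come out as http://www.rateitall.com/, and a made-up URL like http://rateitall.com/lists/top10.aspx?sort=title&id=7 comes out as http://www.rateitall.com/lists/top10.aspx?id=7. Any two crawled URLs that map to the same canonical form are candidates for a redirect or a noindex tag.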

Duplicate content on your domain and someone else’s matters primarily to the original publisher of the content.  If you are using somebody else’s content, you should not expect to outrank them in the organic search results.  But if someone else is using your content, either with or without your consent, you probably expect your page to outrank the other domain’s.  When this doesn’t happen, it’s a problem for you and an opportunity for those who are using your content.

Duplicate content on multiple domains can also come from a variety of sources, many of which are familiar to readers of this blog.  APIs.  Widgets.  RSS Feeds.  Mash-ups.  Reblogging.   And then there are the scrapers – those that crawl your site, oftentimes without your consent, and pull content back.

What Can be Done to Avoid Unnecessary Duplicate Content?

There are a number of tools available to site owners to deal with unnecessary duplicate content – and by unnecessary, I mean the accidental kind.

In terms of your own domain, Google offers Google Webmaster Tools – a dashboard that alerts you to pages that have duplicate title tags or meta descriptions.  This can be helpful in identifying duplicates due to things like sorting or URL parameters, which you can then (theoretically) stamp out using noindex tags.
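If you go the noindex route, it is worth verifying that the tag really is present on every duplicate page. Below is a small sketch that reads the robots meta tag out of a page's HTML using Python's built-in html.parser. The class and function names are my own, and fetching the pages is left out. It only tells you what the page says, not whether a search engine has acted on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives found in the page, e.g. {'noindex'}."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A duplicate page that is meant to stay out of the index should return a set containing "noindex"; an empty set means the tag is missing.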

However, once a deep page gets indexed, it is very hard to get it unindexed.  My site had about 100K duplicate pages indexed, due to a sorting option on our top ten lists.  We have had noindex, nofollow tags on all the dupe pages for months now, and we still have thousands of duplicate pages in the index.

Yahoo! offers a much better solution for this type of issue – a tool that dynamically rewrites URLs related to “content neutral” duplicates (like parameters in the URL) and concentrates all of the URLs’ link juice into the primary URL.  This is a big deal, especially as the Yahoo! engineer whom I heard describe it spoke specifically of a Yahoo! index page quota – meaning duplicate pages caused by URL parameters might be keeping your good pages out of the index.

If you are syndicating out content via feeds or APIs, Google suggests that you require a link back to your site to show the search engines where the original source of the content is.  However, I know from personal experience that this does not always work.

For folks taking your content without your consent, you can block them from crawling your sites and issue a DMCA takedown request.

Google vs. Yahoo: a Subtle Difference in Mindset

One of the things that struck me in listening to the Google and Yahoo! reps discuss this issue side by side was a difference in tone.  The Google rep was much less ideological – he spoke of how Google treats duplicate content with filters and not penalties (generally).  He did, however, seem to speak of duplicate content as a problem that needs to be addressed, as opposed to a core characteristic of the mashup generation.

The Yahoo! rep was much more confrontational.  He used terms like “dodgy” and “abusive” to describe duplicate content conditions, and spoke out against “weaving” content together to win more organic search traffic (mashups??).

Conclusion

I walked out of the session thinking to myself that both Google and Yahoo! need to adjust their attitudes about duplicate content, and Yahoo! more than Google.  Duplicate content is not a scourge.  It is not an indication of dodgy or abusive behavior.  And convincing webmasters that they need to avoid proliferating duplicate content seems to be both an impossible task AND bad for the web. 

However, the search engines need to do a much better job at figuring out the original source of the content.  One of the suggestions thrown out during a discussion after the Pubcon session was that the search engines agree on a “not original source” tag that API and feed providers could require as part of their TOS.

I am pretty sure that Google does not often penalize sites for duplicate content.  They might filter the duplicate content out of the index, but I don’t think they penalize the associated domains.

I’m not so sure about Yahoo!.  Listening to the Y! rep speak, it’s very clear that he views duplicate content as an intentional drain of Y! resources.  I would not be surprised if Y! makes frequent use of a duplicate content penalty.

As a lover of the distributed web, I’m looking forward to the search engines A) learning that dupe content is not the enemy; and B) figuring out how to better identify the source.

About

  • My name is Lawrence Coburn and I'm the CEO of RateItAll - a distributed consumer review company.
