Monday, December 18, 2006
At the recent Search Engine Strategies conference in freezing Chicago, many of us Googlers were asked questions about duplicate content. We recognize that there are many nuances and a bit of confusion on the topic, so we'd like to help set the record straight.
What is duplicate content?
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it's unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and—worse yet—linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries.
What isn't duplicate content?
Though we do offer a handy translation utility, our algorithms won't view the same article written in English and Spanish as duplicate content. Similarly, you shouldn't worry about occasional snippets (quotes and otherwise) being flagged as duplicate content.
Why does Google care about duplicate content?
Our users typically want to see a diverse cross-section of unique content when they do searches.
In contrast, they're understandably annoyed when they see substantially the same content within
a set of search results. Also, webmasters become sad when we show a complex URL (example.com/contentredir?value=shorty-george&lang=en) instead of the pretty URL they prefer.
What does Google do about it?
During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in "regular" and "printer" versions and neither set is blocked in robots.txt or via a noindex meta tag, we'll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering rather than ranking adjustments ... so in the vast majority of cases, the worst thing that'll befall webmasters is to see the "less desired" version of a page shown in our index.
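As a sketch of the blocking approach mentioned above, a robots.txt rule could keep crawlers away from printer-friendly copies; the /printer/ directory here is a hypothetical layout, so adjust the path to match your own site:

```
# robots.txt: keep all crawlers out of a hypothetical /printer/ directory
User-agent: *
Disallow: /printer/
```

Alternatively, a `<meta name="robots" content="noindex">` tag in the head of each printer-friendly page keeps that copy out of the index while still allowing it to be crawled.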
How can Webmasters proactively address duplicate content issues?
- Block appropriately: Rather than letting our algorithms determine the "best" version of a document, you may wish to help guide us to your preferred version. For instance, if you don't want us to index the printer versions of your site's articles, disallow those directories or make use of regular expressions in your robots.txt file.
- Use 301s: If you have restructured your site, use 301 redirects ("RedirectPermanent") in your .htaccess file to smartly redirect users, the Googlebot, and other spiders.
- Be consistent: Endeavor to keep your internal linking consistent; for example, don't link to /page/ and /page and /page/index.htm.
- Use TLDs: To help us serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. We're more likely to know that http://www.example.de contains Germany-focused content, for instance, than http://www.example.com/de or http://de.example.com.
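The 301 suggestion above can be sketched for an Apache server; the old and new paths are hypothetical, so substitute your own:

```apache
# .htaccess: permanently redirect an old article URL to its new home.
# "Redirect permanent" (mod_alias) sends an HTTP 301 status.
Redirect permanent /old-articles/juggling.html http://www.example.com/articles/juggling.html
```

With this in place, requests for the old URL receive a 301 response, so browsers, the Googlebot, and other spiders follow the new URL, and we consolidate indexing onto it.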