A note on unsupported rules in robots.txt
Tuesday, July 02, 2019
Yesterday we announced that we're
open-sourcing Google's production robots.txt parser.
It was an exciting moment that paves the way for potential future Search open source
projects! Feedback is helpful, and we're eagerly collecting questions from
developers and
webmasters alike. One question
stood out, which we'll address in this post:
Why isn't a code handler for other rules like crawl-delay included in the code?
The internet draft we published yesterday provides an
extensible architecture for rules that are not part of the standard. This means that if a
crawler wanted to support its own line like unicorns: allowed,
it could. To demonstrate how this would look in a parser, we included a very common line,
sitemap, in our open-source robots.txt parser.
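As a rough illustration of that extension pattern, here is a minimal sketch in Python. It is not the open-sourced C++ parser; the RobotsParser class, the register_extension hook, and the example URL are hypothetical. It only shows the general idea: handle the rules defined in the draft, and let a crawler plug in handlers for its own lines, such as sitemap or unicorns.

```python
# Minimal sketch of an extensible robots.txt line parser (illustrative only;
# class and method names are hypothetical, not the open-sourced C++ API).

class RobotsParser:
    # Rules defined by the Robots Exclusion Protocol internet draft.
    STANDARD_RULES = {"user-agent", "allow", "disallow"}

    def __init__(self):
        self.rules = []        # (key, value) pairs for standard rules
        self.extensions = {}   # extra line handlers, keyed by rule name

    def register_extension(self, key, handler):
        """Let a crawler support its own line, e.g. 'unicorns: allowed'."""
        self.extensions[key.lower()] = handler

    def parse(self, robots_txt):
        for raw_line in robots_txt.splitlines():
            line = raw_line.split("#", 1)[0].strip()   # drop comments
            if not line or ":" not in line:
                continue
            key, value = (part.strip() for part in line.split(":", 1))
            key = key.lower()
            if key in self.STANDARD_RULES:
                self.rules.append((key, value))
            elif key in self.extensions:
                self.extensions[key](value)
            # Unknown lines with no registered handler are simply ignored.


sitemaps = []
parser = RobotsParser()
# sitemap is not part of the core standard, so it plugs in as an extension,
# exactly the way a made-up "unicorns" line could.
parser.register_extension("sitemap", sitemaps.append)
parser.register_extension("unicorns", lambda value: print("unicorns:", value))

parser.parse("""
user-agent: *
disallow: /private/
sitemap: https://example.com/sitemap.xml
unicorns: allowed
""")
print(parser.rules)   # [('user-agent', '*'), ('disallow', '/private/')]
print(sitemaps)       # ['https://example.com/sitemap.xml']
```

Routing unrecognized keys to registered handlers keeps the core parser limited to the draft's rules while still letting individual crawlers honor extra lines.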
While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular,
we focused on rules unsupported by the internet draft, such as
crawl-delay, nofollow, and
noindex. Since these rules were never documented by Google,
naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage
was contradicted by other rules in all but 0.001% of all robots.txt files on the internet.
These mistakes hurt websites' presence in Google's search results in ways we don't think
webmasters intended.
In the interest of maintaining a healthy ecosystem and preparing for potential future open source
releases, we're retiring all code that handles unsupported and unpublished rules (such as
noindex) on September 1, 2019. For those of you who relied on the
noindex indexing rule in the
robots.txt file, which controls crawling, there are a number of
alternative options:
noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the
noindex rule is the most effective way to remove URLs from
the index when crawling is allowed (see the sketch after this list).
404 and 410 HTTP status codes:
Both status codes mean that the page does not exist, which will drop such URLs from Google's
index once they're crawled and processed.
Password protection: Unless markup is used to indicate
subscription or paywalled content,
hiding a page behind a login will generally remove it from Google's index.
Disallow in robots.txt: Search engines can only index pages
that they know about, so blocking the page from being crawled usually means its content won't
be indexed. While the search engine may also index a URL based on links from other pages,
without seeing the content itself, we aim to make such pages less visible in the future.
Search Console Remove URL tool:
The tool is a quick and easy method to remove a URL temporarily from Google's search results.
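To make the first two alternatives concrete, here is a minimal sketch using only Python's standard library: one path serves a normal page with an X-Robots-Tag: noindex response header (the HTML equivalent is a robots meta tag with its content set to noindex), and another answers with a 410 Gone status for content that has been removed. The paths, port, and handler are illustrative, not a recommended production setup.

```python
# Minimal sketch of two of the alternatives above, using only Python's
# standard library. The paths and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/internal-report":
            # Crawling stays allowed, but the noindex header keeps the URL
            # out of the index (the header form of the robots meta tag).
            body = b"<html><body>Internal report</body></html>"
            self.send_response(200)
            self.send_header("X-Robots-Tag", "noindex")
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        elif self.path == "/retired-page":
            # 410 Gone signals the page no longer exists, so the URL is
            # dropped from the index once it is recrawled and processed.
            self.send_response(410)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), DemoHandler).serve_forever()
```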
For more guidance about how to remove information from Google's search results, visit our
Help Center.
If you have questions, you can find us on Twitter
and in our Webmaster Community,
both offline and online.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],[],[[["\u003cp\u003eGoogle open-sourced their robots.txt parser and is retiring support for undocumented and unpublished rules (like \u003ccode\u003enoindex\u003c/code\u003e) on September 1, 2019.\u003c/p\u003e\n"],["\u003cp\u003eUnsupported rules like \u003ccode\u003ecrawl-delay\u003c/code\u003e, \u003ccode\u003enofollow\u003c/code\u003e, and \u003ccode\u003enoindex\u003c/code\u003e were never documented by Google and their usage is contradicted by other rules in almost all robots.txt files.\u003c/p\u003e\n"],["\u003cp\u003eWebmasters relying on the \u003ccode\u003enoindex\u003c/code\u003e directive in robots.txt should switch to alternatives like \u003ccode\u003enoindex\u003c/code\u003e in robots meta tags, \u003ccode\u003e404/410\u003c/code\u003e status codes, or password protection.\u003c/p\u003e\n"],["\u003cp\u003eGoogle provides alternative options for removing URLs from search results, including disallowing crawling in robots.txt and using the Search Console Remove URL tool.\u003c/p\u003e\n"],["\u003cp\u003eDevelopers and webmasters can provide feedback and ask questions through GitHub, Twitter, and the Webmaster Community.\u003c/p\u003e\n"]]],["Google open-sourced its robots.txt parser, allowing for custom rules like \"unicorns: allowed.\" The parser will retire code handling unsupported rules like `noindex` on September 1, 2019. Alternatives to `noindex` in robots.txt include `noindex` in meta tags, 404/410 HTTP status codes, password protection, `Disallow` in robots.txt, and the Search Console Remove URL tool. Google analyzed robots.txt rule usage and found unsupported rules are rarely used effectively.\n"],null,["# A note on unsupported rules in robots.txt\n\nTuesday, July 02, 2019\n\n\nYesterday we announced that we're\n[open-sourcing Google's production robots.txt parser](/search/blog/2019/07/repp-oss).\nIt was an exciting moment that paves the road for potential Search open sourcing projects in the\nfuture! Feedback is helpful, and we're eagerly collecting questions from\n[developers](https://github.com/google/robotstxt) and\n[webmasters](https://twitter.com/googlesearchc) alike. One question\nstood out, which we'll address in this post:\n\n\nWhy isn't a code handler for other rules like crawl-delay included in the code?\n\n\n[The internet draft](/search/blog/2019/07/rep-id) we published yesterday provides an\nextensible architecture for rules that are not part of the standard. This means that if a\ncrawler wanted to support their own line like `unicorns: allowed`,\nthey could. To demonstrate how this would look in a parser, we included a very common line,\nsitemap, in our [open-source robots.txt parser](https://github.com/google/robotstxt).\n\n\nWhile open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular,\nwe focused on rules unsupported by the internet draft, such as\n`crawl-delay`, `nofollow`, and\n`noindex`. Since these rules were never documented by Google,\nnaturally, their usage in relation to Googlebot is very low. 
Digging further, we saw their usage\nwas contradicted by other rules in all but 0.001% of all robots.txt files on the internet.\nThese mistakes hurt websites' presence in Google's search results in ways we don't think\nwebmasters intended.\n\n\nIn the interest of maintaining a healthy ecosystem and preparing for potential future open source\nreleases, we're retiring all code that handles unsupported and unpublished rules (such as\n`noindex`) on September 1, 2019. For those of you who relied on the\n`noindex` indexing rule in the\n`robots.txt` file, which controls crawling, there are a number of\nalternative options:\n\n- **[`noindex`](/search/docs/crawling-indexing/block-indexing)\n in robots `meta` tags:** Supported both in the HTTP response headers and in HTML, the `noindex` rule is the most effective way to remove URLs from the index when crawling is allowed.\n- **[`404` and `410` HTTP status codes](https://en.wikipedia.org/wiki/HTTP_404):** Both status codes mean that the page does not exist, which will drop such URLs from Google's index once they're crawled and processed.\n- **Password protection:** Unless markup is used to indicate [subscription or paywalled content](/search/docs/appearance/structured-data/paywalled-content), hiding a page behind a login will generally remove it from Google's index.\n- **`Disallow` in `robots.txt`:** Search engines can only index pages that they know about, so blocking the page from being crawled usually means its content won't be indexed. While the search engine may also index a URL based on links from other pages, without seeing the content itself, we aim to make such pages less visible in the future.\n- **[Search Console Remove URL tool](https://support.google.com/webmasters/answer/1663419):** The tool is a quick and easy method to remove a URL temporarily from Google's search results.\n\n\nFor more guidance about how to remove information from Google's search results, visit our\n[Help Center](/search/docs/guides/advanced/remove-information?ref_topic=1724262).\nIf you have questions, you can find us on [Twitter](https://twitter.com/googlesearchc)\nand in our [Webmaster Community](https://support.google.com/webmasters/community),\nboth [offline](/search/events/search-central-live) and online.\n\n\nPosted by [Gary Illyes](https://garyillyes.com/+)"]]