Friday, February 17, 2023
Over the last few months we noticed an uptick in website owners and some content delivery networks (CDNs) attempting to use 404 and other 4xx client errors (but not 429) to reduce Googlebot's crawl rate.
The short version of this blog post is: please don't do that; we have documentation about how to reduce Googlebot's crawl rate. Read that instead and learn how to effectively manage Googlebot's crawl rate.
Back to basics: 4xx errors are for client errors
The 4xx errors servers return to clients are a signal from the server that the client's request was wrong in some sense. Most of the errors in this category are pretty benign: "not found" errors, "forbidden", "I'm a teapot" (yes, that's a thing). They don't suggest that anything is wrong with the server itself.
The one exception is 429, which stands for "too many requests". This error is a clear signal to any well-behaved robot, including our beloved Googlebot, that it needs to slow down because it's overloading the server.
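To make that concrete, here's a minimal sketch (in Go, not something prescribed by this post) of serving 429 with a Retry-After hint once a crude, illustrative per-minute request budget is exceeded; the threshold and window are made-up values, not recommendations:

```go
// Minimal sketch: answer with 429 + Retry-After once a request budget is hit.
package main

import (
	"net/http"
	"sync"
	"time"
)

var (
	mu       sync.Mutex
	requests int
)

func handler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	requests++
	overLimit := requests > 100 // hypothetical per-minute budget
	mu.Unlock()

	if overLimit {
		// A well-behaved crawler reads this as "slow down and retry later".
		w.Header().Set("Retry-After", "120")
		http.Error(w, "too many requests", http.StatusTooManyRequests)
		return
	}
	w.Write([]byte("hello"))
}

func main() {
	// Crude fixed-window limiter: reset the counter every minute.
	go func() {
		for range time.Tick(time.Minute) {
			mu.Lock()
			requests = 0
			mu.Unlock()
		}
	}()
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```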
Why 4xx errors are bad for rate limiting Googlebot (except 429)
Client errors are just that: client errors. They generally don't suggest an error with the server: not that it's overloaded, not that it's encountered a critical error and is unable to respond to the request. They simply mean that the client's request was bad in some way. There's no sensible way to equate, for example, a 404 error with the server being overloaded.
Imagine if that were the case: you get an influx of 404 errors from your friend accidentally linking to the wrong pages on your site, and in turn Googlebot slows down its crawling. That would be pretty bad. The same goes for 403, 410, and 418.
And again, the big exception is the 429 status code, which translates to "too many requests".
What rate limiting with 4xx does to Googlebot
All 4xx HTTP status codes (again, except 429) will cause your content to be removed from Google Search. What's worse, if you also serve your robots.txt file with a 4xx HTTP status code, it will be treated as if it didn't exist. If you had a rule there that disallowed crawling your dirty laundry, now Googlebot also knows about it; not great for either party involved.
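If you're unsure how your robots.txt is currently being served, a quick check of its status code is enough to spot this problem. Here's a minimal sketch in Go; example.com is a placeholder for your own host, and the interpretation in the comment simply restates the behavior described above:

```go
// Minimal sketch: print the HTTP status code your robots.txt returns.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// example.com is a placeholder; use your own host.
	resp, err := http.Get("https://example.com/robots.txt")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()

	fmt.Println("robots.txt status:", resp.StatusCode)
	if resp.StatusCode >= 400 && resp.StatusCode < 500 {
		// As described above: a 4xx robots.txt is treated as if the file
		// didn't exist, so its disallow rules stop protecting anything.
		fmt.Println("4xx: robots.txt will be treated as if it didn't exist")
	}
}
```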
How to reduce Googlebot's crawl rate, the right way
We have extensive documentation about how to reduce Googlebot's crawl rate and also about how Googlebot (and Search indexing) handles the different HTTP status codes; be sure to check them out. In short, you want to do either of these things:
- Use Search Console to temporarily reduce crawl rate.
- Return a 500, 503, or 429 HTTP status code to Googlebot when it's crawling too fast (see the sketch after this list).
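As a rough illustration of the second option, here's a minimal load-shedding sketch in Go; isOverloaded is a hypothetical placeholder for whatever overload signal your server actually has (queue depth, CPU, upstream latency), and the Retry-After value is just an example:

```go
// Minimal sketch: shed load with 503 + Retry-After instead of a misleading 404.
package main

import (
	"net/http"
	"runtime"
)

// isOverloaded is a hypothetical placeholder; a real server would look at
// queue depth, CPU, or upstream latency instead.
func isOverloaded() bool {
	return runtime.NumGoroutine() > 10000
}

func handler(w http.ResponseWriter, r *http.Request) {
	if isOverloaded() {
		// 503 plus a Retry-After hint tells crawlers to back off for a while.
		w.Header().Set("Retry-After", "300") // seconds; illustrative value
		http.Error(w, "service unavailable", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("normal response"))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```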
If you need more tips or clarifications, catch us on Twitter or post in our help forums.