This guide describes how to optimize Google's crawling of very large and frequently updated sites.
If your site does not have a large number of pages that change rapidly, or if your pages seem to be crawled the same day that they are published, you do not need to read this guide; merely keeping your sitemap up to date and checking your index coverage regularly should be adequate.
If you have content that's been available for a while but has never been indexed, this is a different problem; use the URL Inspection tool instead to find out why your page isn't being indexed.
Who this guide is for
This is an advanced guide and is intended for:
- Large sites (1 million+ unique pages) with content that changes moderately often (once a week), or
- Medium or larger sites (10,000+ unique pages) with very rapidly changing content (daily).
Please note that the numbers given here are a rough estimate to help you classify your site. These are not exact thresholds.
The web is a nearly infinite space, exceeding Google's ability to explore and index every available URL. As a result, there are limits to how much time Googlebot can spend crawling any single site. The amount of time and resources that Google devotes to crawling a site is commonly called the site's crawl budget. Note that not everything crawled on your site will necessarily be indexed; each page must be evaluated, consolidated, and assessed to determine whether it will be indexed after it has been crawled.
Crawl budget is determined by two main elements: crawl capacity limit and crawl demand.
Crawl capacity limit
Googlebot wants to crawl your site without overwhelming your servers. To prevent this, Googlebot calculates a crawl capacity limit, which is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a site, as well as the time delay between fetches. This is calculated to provide coverage of all your important content without overloading your servers.
The crawl capacity limit can go up and down based on a few factors:
- Crawl health: If the site responds quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
- Limit set by site owner in Search Console: Website owners can optionally reduce Googlebot's crawling of their site. Note that setting higher limits won't automatically increase crawling.
- Google's crawling limits: Google has a lot of machines, but not infinite machines. We still need to make choices with the resources that we have.
Crawl demand
Google typically spends as much time as necessary crawling a site, given its size, update frequency, page quality, and relevance compared to other sites.
The factors that play a significant role in determining crawl demand are:
- Perceived inventory: Without guidance from you, Googlebot will try to crawl all or most of the URLs that it knows about on your site. If many of these URLs are duplicates, or should not be crawled for some other reason (removed, unimportant, and so on), this wastes a lot of Google crawling time on your site. This is the factor that you can positively control the most.
- Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
- Staleness: Our systems want to recrawl documents frequently enough to pick up any changes.
Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.
Taking crawl capacity and crawl demand together, Google defines a site's crawl budget as the set of URLs that Googlebot can and wants to crawl. Even if the crawl capacity limit isn't reached, if crawl demand is low, Googlebot will crawl your site less.
Follow these best practices to maximize your crawling efficiency:
- Manage your URL inventory: Use the appropriate tools to tell Google which pages to crawl and which not to crawl. If Google spends too much time crawling URLs that aren't appropriate for the index, Googlebot might decide that it's not worth the time to look at the rest of your site (or increase your budget to do so).
- Consolidate duplicate content. Eliminate duplicate content to focus crawling on unique content rather than unique URLs.
- Block crawling of URLs that shouldn't be indexed. Some pages might be important to users but shouldn't appear in Search results; for example, infinite scrolling pages that duplicate information on linked pages, or differently sorted versions of the same page. If you can't consolidate them as described in the previous bullet, block these unimportant (for search) pages using robots.txt or the URL Parameters tool (for duplicate content reached by URL parameters). Don't use noindex, as Google will still request the page, then drop it when it sees the noindex tag, wasting crawling time. Don't use robots.txt to temporarily free up crawl budget for other pages; use robots.txt only to block pages or resources that you think we shouldn't crawl at all. Google won't shift this freed-up crawl budget to other pages unless Google is already hitting your site's serving limit.
- Return 404/410 for permanently removed pages. Google won't forget a URL that it knows about, but a 404 is a strong signal not to crawl that URL again. Blocked URLs, however, will stay part of your crawl queue much longer, and will be recrawled when the block is removed.
- Eliminate soft 404s. Soft 404s will continue to be crawled, and waste your budget. Check the Index Coverage report for soft 404 errors.
- Keep your sitemaps up to date. Google reads your sitemap regularly, so be sure to include all the content that you want Google to crawl. If your site includes updated content, we recommend including the <lastmod> tag.
- Avoid long redirect chains, which have a negative effect on crawling.
- Make your pages efficient to load. If Google can load and render your pages faster, we might be able to read more content from your site.
- Monitor your site crawling. Monitor whether your site had any availability issues during crawling, and look for ways to make your crawling more efficient.
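The sitemap advice above can be illustrated with a minimal sitemap entry that uses the <lastmod> tag; the URL and date here are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/widgets/blue-widget</loc>
    <!-- Date of the last meaningful content change, not the last regeneration -->
    <lastmod>2024-05-01</lastmod>
  </url>
</urlset>
```

Update the <lastmod> value only when the page content changes meaningfully; refreshing it on every deploy dilutes the signal.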
Here are the key steps to monitoring your site's crawl profile:
- See if Googlebot is encountering availability issues on your site.
- See whether you have pages that aren't being crawled, but should be.
- See whether any parts of your site need to be crawled more quickly than they already are.
- Improve your site's crawl efficiency.
- Handle overcrawling of your site.
Improving your site availability won't necessarily increase your crawl budget; Google determines the best crawl rate based on the crawl demand, as described previously. However, availability issues do prevent Google from crawling your site as much as it might want to.
Use the Crawl Stats report to see Googlebot's crawling history for your site. The report shows when Google encountered availability issues on your site. If availability errors or warnings are reported for your site, look for instances in the Host availability graphs where Googlebot requests exceeded the red limit line, click into the graph to see which URLs were failing, and try to correlate those with issues on your site.
- Read the documentation for the Crawl Stats report to learn how to find and handle some availability issues.
- Block pages from crawling if they shouldn't be crawled. (See manage your inventory)
- Increase page loading and rendering speed. (See Improve your site's crawl efficiency)
- Increase your server capacity. If Google consistently seems to be crawling your site at its serving capacity limit, but you still have important URLs that aren't being crawled or updated as often as needed, providing more serving resources might enable Google to request more pages on your site. Check your host availability history in the Crawl Stats report to see whether Google's crawl rate seems to cross the limit line often. If so, increase your serving resources for a month and see whether crawling requests increased during that same period.
Google spends as much time as necessary on your site in order to index all the high-quality, user-valuable content that it can find. If you think that Googlebot is missing important content, either it doesn't know about the content, the content is blocked from Google, or your site availability is throttling Google's access (or Google is trying not to overload your site).
Search Console doesn't provide a crawl history for your site that can be filtered by URL or path, but you can inspect your site logs to see whether specific URLs have been crawled by Googlebot. Whether or not those crawled URLs have been indexed is another story.
Remember that for most sites, new pages take at least several days to be noticed; don't expect same-day crawling of new URLs unless you run a time-sensitive site such as a news site.
If you are adding pages to your site and they are not being crawled in a reasonable amount of time, either Google doesn't know about them, the content is blocked, your site has reached its maximum serving capacity, or you are out of crawl budget.
- Tell Google about your new pages: update your sitemaps to reflect new URLs.
- Examine your robots.txt rules to confirm that you're not accidentally blocking pages.
- If all your non-crawled pages have URL parameters, it's possible that your pages were excluded because of settings in the URL Parameters tool; unfortunately there isn't a way to check for such an exclusion, which is why we typically recommend against using that tool.
- Review your crawling priorities (a.k.a. use your crawl budget wisely). Manage your inventory and improve your site's crawling efficiency.
- Check that you're not running out of serving capacity. Googlebot will scale back its crawling if it detects that your servers are having trouble responding to crawl requests.
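To rule out an accidental robots.txt block, you can test rules locally with Python's standard-library robots.txt parser. The rules and URLs below are hypothetical; in practice you would point the parser at your live file with set_url() and read():

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
rules = """
User-agent: Googlebot
Disallow: /cart/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A product page should be crawlable; a cart URL should not.
print(parser.can_fetch("Googlebot", "https://example.com/widgets/blue-widget"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/cart/checkout"))        # False
```

This is only a local sanity check; it tells you what your rules say, not what Google has most recently fetched.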
Note that pages might not be shown in search results, even if crawled, if there isn't sufficient value or user demand for the content.
If we're missing new or updated pages on your site, perhaps it's because we haven't seen them, or haven't noticed that they are updated. Here is how you can help us be aware of page updates.
Note that Google strives to check and index pages in a reasonably timely manner. For most sites, this is three days or more. Don't expect Google to index pages the same day that you publish them unless you are a news site or have other high-value, extremely time-sensitive content.
Examine your site logs to see when specific URLs were crawled by Googlebot.
To learn the indexing date, use the URL Inspection tool or do a Google search for URLs that you updated.
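The log check above can be sketched as a small script over combined-format access logs. The log lines and paths here are made up, and matching on the user-agent string alone is only a first pass; verifying that a request really came from Googlebot also requires a reverse DNS lookup:

```python
import re

# Hypothetical access-log lines in Apache combined format.
log_lines = [
    '66.249.66.1 - - [12/May/2024:06:25:24 +0000] "GET /widgets/blue-widget HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [12/May/2024:06:26:01 +0000] "GET /widgets/blue-widget HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '66.249.66.1 - - [12/May/2024:07:02:13 +0000] "GET /cart/checkout HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Capture the timestamp and the requested path from each line.
pattern = re.compile(r'\[(?P<time>[^\]]+)\] "GET (?P<path>\S+)')

def googlebot_fetches(lines, path):
    """Return timestamps at which a Googlebot user agent fetched the given path."""
    hits = []
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = pattern.search(line)
        if match and match.group("path") == path:
            hits.append(match.group("time"))
    return hits

print(googlebot_fetches(log_lines, "/widgets/blue-widget"))
# ['12/May/2024:06:25:24 +0000']
```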
- Use a news sitemap if your site has news content. Ping Google when your sitemap is posted or has changed.
- Use the <lastmod> tag in sitemaps to indicate when an indexed URL has been updated.
- Use a simple URL structure to help Google find your pages.
- Provide standard, crawlable <a> links to help Google find your pages.
Avoid these common sitemap mistakes:
- Submitting the same, unchanged sitemap multiple times per day.
- Expecting that Googlebot will crawl everything in a sitemap, or crawl them immediately. Sitemaps are useful suggestions to Googlebot, not absolute requirements.
- Including URLs in your sitemaps that shouldn't appear in search. This can waste your crawl budget on pages that shouldn't be indexed.
Increase your page loading speed
Google's crawling is limited by bandwidth, time, and availability of Googlebot instances. If your server responds to requests quicker, we might be able to crawl more pages on your site. That said, Google only wants to crawl high quality content, so simply making low quality pages faster won't encourage Googlebot to crawl more of your site; conversely, if we think that we're missing high-quality content on your site, we'll probably increase your budget to crawl that content.
Here's how you can optimize your pages and resources for crawling:
- Prevent large but unimportant resources from being loaded by Googlebot using robots.txt. Be sure to block only non-critical resources, that is, resources that aren't important to understanding the meaning of the page (such as decorative images).
- Make sure that your pages are fast to load.
- Watch out for long redirect chains, which have a negative effect on crawling.
- Both the time to respond to server requests and the time needed to render pages matter, including the load and run time for embedded resources such as images and scripts. Be aware of large or slow resources required for indexing.
Wasting server resources on unnecessary pages can reduce crawl activity from pages that are important to you, which may cause a significant delay in discovering great new or updated content on a site.
Exposing many URLs on your site that shouldn't be crawled by search can negatively affect a site's crawling and indexing. Typically these URLs fall into the following categories:
- Faceted navigation and session identifiers: (Faceted navigation is typically duplicate content from the site; session identifiers and other URL parameters that simply sort or filter the page don't provide new content.) Use robots.txt to block faceted navigation pages. If you find that Google is crawling a significant number of essentially duplicate URLs with different parameters on your site, consider blocking parameterized duplicate content.
- Duplicate content: Help Google identify duplicate content to avoid unnecessary crawling.
- Soft 404 pages: Return a 404 code when a page no longer exists.
- Hacked pages: Be sure to check the Security Issues report and fix or remove any hacked pages you find.
- Infinite spaces and proxies: Block these from crawling with robots.txt.
- Low quality and spam content: Good to avoid, obviously.
- Shopping cart pages, infinite scrolling pages, and pages that perform an action (such as "sign up" or "buy now" pages).
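A robots.txt sketch that blocks several of the URL categories listed above; the paths and parameter names are hypothetical, so adjust them to your own URL structure:

```
User-agent: *
# Faceted navigation and session parameters that only sort or filter content
Disallow: /*?sort=
Disallow: /*?sessionid=
# Action and cart pages
Disallow: /cart/
Disallow: /signup
# Infinite calendar-style spaces
Disallow: /calendar/
```

Googlebot supports the * wildcard in robots.txt paths; test your rules before deploying, since an overly broad pattern can block important pages.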
- Use robots.txt if you don't think we should be crawling a resource or page at all.
- Don't add or remove pages or directories from robots.txt regularly as a way of "freeing up" additional crawl budget for your site. Use robots.txt only for pages or resources that shouldn't appear on Google in the long run.
- Don't rotate sitemaps or use other temporary hiding mechanisms to "free up more budget."
Googlebot has algorithms to prevent it from overwhelming your site with crawl requests. However, if you find that Googlebot is overwhelming your site, there are a few things you can do.
Monitor your server for excessive Googlebot requests to your site.
In an emergency, we recommend the following steps to slow down an overwhelming crawl from Googlebot:
- Return 503/429 HTTP result codes temporarily for Googlebot requests when your server is overloaded. Googlebot will retry these URLs for about 2 days. Note that returning "no availability" codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so you should take the following additional actions.
- Reduce the Googlebot crawl rate for your site. This can take up to 2 days to take effect, and requires Search Console property owner permissions. Do this only if you see long-term, repeated overcrawling from Google in the Crawl Stats report, in the Host availability > Host utilization chart.
- When the crawl rate goes down, stop returning 503/429 for crawl requests; returning 503 for more than 2 days will cause Google to drop the 503 URLs from the index.
- Monitor your crawling and your host capacity over time, and if appropriate, increase your crawl rate again, or allow the default crawling rate.
- If the problematic crawler is one of the AdsBot crawlers, the problem is likely that you have created Dynamic Search Ad targets for your site that Google is trying to crawl. This crawl will reoccur every 2 weeks. If you don't have the server capacity to handle these crawls you should either limit your ad targets or get increased serving capacity.
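The first emergency step above can be sketched as a tiny WSGI application that returns 503 with a Retry-After header while an overload flag is set. The overloaded() check is a hypothetical stand-in for your real load monitoring:

```python
def overloaded():
    # Hypothetical stand-in: wire this to your actual load monitoring.
    return True

def app(environ, start_response):
    if overloaded():
        # Temporary "no availability" response. Don't serve this for more
        # than a couple of days, or Google may slow or stop crawling.
        start_response("503 Service Unavailable",
                       [("Retry-After", "3600"),
                        ("Content-Type", "text/plain")])
        return [b"Temporarily overloaded, please retry later."]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]
```

Serving 503 or 429 at the edge (load balancer or CDN) is usually preferable in a real emergency, since it sheds load before requests reach the application.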
Compressing my sitemaps can increase my crawl budget.
- False: Zipped sitemaps still have to be fetched from the server, so sending compressed sitemaps doesn't save Google much crawling time or effort.
Google prefers fresher content, so I'd better keep tweaking my page
- Content is rated by quality, regardless of age. Create and update your content as necessary, but there's no additional value in making pages artificially appear to be fresh by making trivial changes and updating the page date.
Google prefers old content (it has more weight) over fresh content
- False: If your page is useful, it's useful, whether it's new or old.
Google prefers clean URLs and doesn't like query parameters
- False: we can crawl parameters. However, remember to block pages with parameters that point to duplicate content.
Small sites aren't crawled as often as big ones
- False: If a site has important content that changes often, we crawl it often, regardless of the size.
The closer your content is to the homepage the more important it is to Google
- Partly true: Your site's homepage is often the most important page on your site, and so pages linked directly to the homepage may be seen as more important, and therefore crawled more often. However, this doesn't mean that these pages will be ranked more highly than other pages on your site.
The faster your pages load and render, the more Google is able to crawl
- True... in that our resources are limited by a combination of time and number of crawling bots. If you can serve us more pages in a limited time, we will be able to crawl more of them. However, we might devote more time crawling a site that has more important information, even if it is slower. It's probably more important for you to make your site faster for your users than to make it faster to increase your crawl coverage. It's much simpler to help Google crawl the right content than it is to crawl all your content every time.
- Note that crawling a site involves both retrieving and rendering the content. Time spent rendering the page counts as much as time spent requesting the page. So making your pages faster to render will also increase the crawl speed.
URL versioning is a good way to encourage Google to recrawl my pages
- Partly true: Using a versioned URL for your page in order to entice Google to crawl it again sooner will probably work, but often this is not necessary, and it will waste crawl resources if the page has not actually changed. In general, a sitemap with a <lastmod> value is the best way to indicate updated content to Google. If you do use versioned URLs to indicate new content, change the URL only when the page content has changed meaningfully.
Site speed and errors affect my crawl budget
- True: Making a site faster improves the users' experience while also increasing crawl rate. For Googlebot a speedy site is a sign of healthy servers, so it can get more content over the same number of connections. On the flip side, a significant number of 5xx HTTP result codes (server errors) or connection timeouts signal the opposite, and crawling slows down.
- We recommend paying attention to the Crawl Stats report in Search Console and keeping the number of server errors low.
Crawling is a ranking factor
- False: Improving your crawl rate will not necessarily lead to better positions in Search results. Google uses many signals to rank the results, and while crawling is necessary for a page to be in search results, it's not a ranking signal.
Alternate URLs and embedded content count in the crawl budget
- True: Any URL that Googlebot crawls counts against a site's crawl budget. Alternate URLs, like AMP or hreflang, and embedded content, such as CSS and JavaScript (including XHR fetches), may have to be crawled and will consume the site's crawl budget.
I can control Googlebot with the "crawl-delay" directive
- False: The non-standard "crawl-delay" robots.txt directive is not processed by Googlebot.
The nofollow directive affects crawl budget
- Partly true: Any URL that is crawled affects crawl budget, so even if your page marks a URL as nofollow it can still be crawled if another page on your site, or any page on the web, doesn't label the link as nofollow.