Websites are visited not only by humans, but also by search engine web crawlers. Learn how to improve search accuracy and ranking for your website.
- Determine the URL structure of your web page.
- Responsive design is most recommended.
- Use rel='canonical' and rel='alternate' for separate desktop/mobile sites.
- Use the Vary HTTP header for a single URL dynamically serving separate desktop/mobile HTML.
- Use noindex for pages whose access you want to limit to those who know the URL.
- Use a relevant authentication mechanism for pages you want to keep private.
Give search engines your site structure
How your website appears in search results is important to multi-device site design. This guide helps you optimize your website for search engines based on its URL structure.
Are you planning to build a responsive web page? Is there a mobile-specific version with a separate URL? Are you serving both the desktop version and the mobile version from the same URL? Regardless, you can always do a better job of optimizing your website for search engines.
Give your site a URL structure
There are several ways to serve content to different devices. The three most common methods are:
Responsive web design: serves the same HTML from one URL and uses CSS media queries to determine how the content is rendered on the client side. For example, Desktop and Mobile: http://www.example.com/
Separate mobile site: redirects users to a different URL depending on the user-agent. For example, Desktop: http://www.example.com/ Mobile: http://m.example.com/
Dynamic serving: serves different HTML from one URL depending on the user-agent. For example, Desktop and Mobile: http://www.example.com/
The best approach is to use responsive web design, though many websites use other methods.
Determine which URL structure suits your web page. Then try the respective best practices to optimize it for search engines.
We recommend responsive web design
The benefits of making your website responsive are:
- Friendlier for user sharing.
- Quicker page load without redirects.
- Single URL for search results.
Learn to build websites with responsive web design at Responsive Web Design Basics.
Use link[rel=alternate] when serving separate URLs
Serving similar contents on a desktop version and a mobile version at different URLs may cause confusion for both users and search engines because it's not obvious to viewers that they are intended to be identical. You should indicate:
- That the content of the two URLs is identical.
- Which is the mobile version.
- Which is the desktop (canonical) version.
This information helps search engines better index content and ensures that users find what they're looking for in a format that works for their device.
Use alternate for desktop
When serving the desktop version, indicate that there's a mobile version on another URL by adding a link tag with a rel="alternate" attribute that points to the mobile version in the href attribute.
<title>...</title> <link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/">
Use canonical for mobile
When serving the mobile version, indicate that there's a desktop (canonical) version on another URL by adding a link tag with a rel="canonical" attribute that points to the desktop version in the href attribute. To help search engines understand that the mobile version is explicitly for small screens, the rel="alternate" link on the desktop page carries a media attribute with a value of "only screen and (max-width: 640px)".
<title>...</title> <link rel="canonical" href="http://www.example.com/">
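As a rough illustration of how the two annotations pair up, here is a minimal Python sketch that generates both link tags from a desktop/mobile URL pair. The function names and the MOBILE_MEDIA constant are hypothetical helpers introduced for this example; only the tag shapes and the media query come from the examples above.

```python
from html import escape

# Media query taken from the example above; adjust the breakpoint to your design.
MOBILE_MEDIA = "only screen and (max-width: 640px)"


def desktop_alternate_link(mobile_url: str, media: str = MOBILE_MEDIA) -> str:
    """Build the tag for the desktop page, pointing at the mobile URL."""
    return '<link rel="alternate" media="{}" href="{}">'.format(
        escape(media, quote=True), escape(mobile_url, quote=True)
    )


def mobile_canonical_link(desktop_url: str) -> str:
    """Build the tag for the mobile page, pointing at the desktop URL."""
    return '<link rel="canonical" href="{}">'.format(
        escape(desktop_url, quote=True)
    )


if __name__ == "__main__":
    print(desktop_alternate_link("http://m.example.com/"))
    print(mobile_canonical_link("http://www.example.com/"))
```

Generating both tags from the same URL pair is one way to keep the two pages pointing at each other consistently.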
Use the Vary HTTP header
Serving different HTML based on device type reduces unnecessary redirects, serves optimized HTML, and provides a single URL for search engines. It also has several disadvantages:
- There may be intermediate proxies between a user's browser and the server. Unless the proxy knows that the content varies depending on user agent, it may serve unexpected results.
- Changing contents depending on user agent risks being considered "cloaking", which is a violation of Google’s Webmaster Guidelines.
When you let search engines know that the content varies depending on user agent, they can optimize search results for the user agent that is sending queries.
To indicate that the URL serves different HTML depending on user agent, provide Vary: User-Agent in the HTTP header. This allows search indexing to treat desktop and mobile versions separately, and intermediate proxies to cache those versions correctly.
HTTP header for http://www.example.com/:
HTTP/1.1 200 OK
Content-Type: text/html
Vary: User-Agent
Content-Length: 5710
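The following is a minimal sketch of dynamic serving with the Vary header, written as a Python WSGI application using only the standard library. The is_mobile check, the two HTML bodies, and the fetch helper are assumptions made for illustration; real user-agent detection is considerably more involved.

```python
from wsgiref.util import setup_testing_defaults

# Placeholder documents standing in for the real desktop and mobile HTML.
DESKTOP_HTML = b"<!DOCTYPE html><title>Desktop</title>"
MOBILE_HTML = b"<!DOCTYPE html><title>Mobile</title>"


def is_mobile(user_agent: str) -> bool:
    # Deliberately naive substring check, for illustration only.
    return "Mobi" in user_agent


def app(environ, start_response):
    """Serve different HTML from one URL and declare that it varies."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    body = MOBILE_HTML if is_mobile(user_agent) else DESKTOP_HTML
    start_response("200 OK", [
        ("Content-Type", "text/html"),
        ("Vary", "User-Agent"),  # tells caches and crawlers the body depends on the UA
        ("Content-Length", str(len(body))),
    ])
    return [body]


def fetch(user_agent: str):
    """Call the app in-process and return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    print(fetch("Mozilla/5.0 (iPhone) Mobile Safari"))
    print(fetch("Mozilla/5.0 (X11; Linux x86_64)"))
```

The header is sent on every response, not only on the mobile one, so that a cache holding the desktop version also knows it must not serve it to every user agent.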
To learn more about building URL structure across desktop and mobile, read about building smartphone-optimized websites.
Control crawling and indexing from search engines
Being listed properly on search engines is critical to delivering your website to the world, but poor configuration can cause unexpected content to be included in the results. This section helps you avoid such problems by explaining how crawlers work and how they index websites.
Sharing information has no better place than the web. When you publish a document, it's immediately available to the entire world. The page will be visible to anyone who knows the URL. That's where search engines come in. They need to be able to find your website.
However, there are some cases where you don't want people to find those documents even though you want to put them on the web. For example, a blog's admin page is something only certain people should have access to. There's no benefit to letting people find those pages through search engines.
This section also explains how to restrict certain pages from appearing in search results.
The difference between "crawl" and "index"
Before you learn how to control search results, you need to understand how search engines interact with your web page. From your site's point of view, there are roughly two things search engines do to your site: crawling and indexing.
Crawling is when a search engine bot fetches your web page to analyze its content. The content is stored in the search engine's database and can be used to populate search result details, rank pages, and discover new pages by following links.
Indexing is when a search engine stores a website's URL and any associated information in its database so it is ready to serve as a search result.
Control crawling with robots.txt
You can use a text file called robots.txt to control how well-behaved crawlers access your web page. robots.txt is a simple text file describing how you want search bots to crawl your site. (Not all crawlers necessarily respect robots.txt. Imagine that anyone can create their own stray crawlers.)
Place robots.txt at the root directory of your website's host. For example, if your site's host is http://pages.example.com/, then the robots.txt file should be located at http://pages.example.com/robots.txt. If the domain has different schemes, subdomains, or other ports, they are considered different hosts and should each have a robots.txt in their root directory.
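To make the per-host rule concrete, here is a small Python sketch that derives the robots.txt location for any page URL. The function name robots_txt_url is a hypothetical helper; it simply keeps the scheme, host, and port and replaces everything else.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and netloc (host plus optional port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in (
        "http://pages.example.com/blog/post.html",
        "https://pages.example.com/blog/post.html",
        "http://pages.example.com:8080/blog/post.html",
        "http://www.example.com/blog/post.html",
    ):
        print(url, "->", robots_txt_url(url))
```

Each of the four URLs in the usage loop maps to a different robots.txt, because each differs in scheme, port, or subdomain.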
Here's a quick example:
User-agent: *
Disallow: /
This indicates that you want to disallow all bots from crawling your entire website.
Here's another example:
User-agent: Googlebot
Disallow: /nogooglebot/
You can specify the behavior per bot (user agent) by indicating a user-agent name. In the above case, you are disallowing the user agent called Googlebot from crawling /nogooglebot/ and all contents below this directory.
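One way to check how a well-behaved crawler would read these rules is Python's standard urllib.robotparser module. The sketch below parses the second example in memory; the allowed helper, the "OtherBot" name, and the page URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The per-bot example from above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot: str, url: str) -> bool:
    """Return True if the parsed rules let `bot` crawl `url`."""
    return parser.can_fetch(bot, url)


if __name__ == "__main__":
    base = "http://pages.example.com"
    print(allowed("Googlebot", base + "/nogooglebot/page.html"))  # False
    print(allowed("Googlebot", base + "/index.html"))             # True
    print(allowed("OtherBot", base + "/nogooglebot/page.html"))   # True
```

Only Googlebot is blocked, and only under /nogooglebot/; a bot with no matching rule is allowed everywhere.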
You can learn more about each search engine's bots on its help pages.
Depending on which crawlers your robots.txt is targeting, search engine providers may provide a tool to test robots.txt. For example, Google provides a validator that you can use to test your robots.txt. Yandex provides a similar tool.
Control search indexing with meta tags
If you don't want your web page to show up in search results, robots.txt isn't the solution. You need to allow those pages to be crawled, and explicitly indicate that you don't want them to be indexed. There are two solutions:
To indicate you don't want an HTML page to be indexed, use a specific kind of <meta> tag, one with its attributes set as name="robots" and content="noindex":
<!DOCTYPE html> <html><head> <meta name="robots" content="noindex" />
By changing the value of the name attribute to a specific user agent name, you can narrow the scope. For example, name="googlebot" (case insensitive) indicates that you don't want Googlebot to index the page.
<!DOCTYPE html> <html><head> <meta name="googlebot" content="noindex" />
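To verify that a page actually carries the tag you intended, a check along the following lines may help. It is a minimal sketch using Python's standard html.parser; the class and function names are hypothetical, and it only looks at the name and content attributes described above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> content values, keyed by lowercased name attribute."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # e.g. {"robots": {"noindex"}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name and content:
            values = {v.strip() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def is_noindex(html: str, bot: str = "googlebot") -> bool:
    """Return True if the page asks all robots, or `bot` specifically, not to index it."""
    collector = MetaCollector()
    collector.feed(html)
    found = collector.directives
    return ("noindex" in found.get("robots", set())
            or "noindex" in found.get(bot.lower(), set()))


if __name__ == "__main__":
    page = '<!DOCTYPE html> <html><head> <meta name="googlebot" content="noindex" />'
    print(is_noindex(page, "googlebot"))
    print(is_noindex(page, "bingbot"))
```

The comparison is done in lowercase because the name value is case insensitive.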
The robots meta tag accepts other options as well; see each search engine's documentation for the values it supports.
To indicate that you don't want resources such as images, stylesheets, or script files to be indexed, add X-Robots-Tag: noindex in an HTTP header.
HTTP/1.1 200 OK
X-Robots-Tag: noindex
Content-Type: text/html; charset=UTF-8
If you want to narrow the scope to a specific user agent, insert the user agent name before noindex.
HTTP/1.1 200 OK
X-Robots-Tag: googlebot: noindex
Content-Type: text/html; charset=UTF-8
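The two header forms above can be interpreted with a few lines of Python. This is a simplified sketch: the function name is hypothetical, and it handles only the unscoped form and the "user agent: directives" form shown here, not directives whose own values contain a colon.

```python
def header_blocks_indexing(x_robots_tag: str, bot: str) -> bool:
    """Return True if an X-Robots-Tag value tells `bot` not to index the resource."""
    value = x_robots_tag.strip().lower()
    scope, sep, rest = value.partition(":")
    if sep:
        # Scoped form, e.g. "googlebot: noindex": applies to the named bot only.
        if scope.strip() != bot.lower():
            return False
        value = rest
    directives = {d.strip() for d in value.split(",")}
    return "noindex" in directives


if __name__ == "__main__":
    print(header_blocks_indexing("noindex", "googlebot"))
    print(header_blocks_indexing("googlebot: noindex", "googlebot"))
    print(header_blocks_indexing("googlebot: noindex", "bingbot"))
```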
To learn more about X-Robots-Tag, see each search engine's help pages.
Don't rely on robots.txt to control search indexes.
Examples by content type
What are the best solutions to control crawling and indexing? Here are some example solutions for different types of pages.
Fully accessible and searchable by anyone
Most of the pages on the web are of this type.
- No robots meta tags required.
Limited access by people who know the URL
- Login page for a blog admin console.
- Private content shared by passing a URL for novice internet users.
In this case, you don't want search engines to index those pages.
- Use noindex meta tags for HTML pages.
- Use X-Robots-Tag: noindex for non-HTML resources (images, PDF, etc.).
Restricted access from authorized people
In this case, even if someone finds the URL, the server refuses to present the result without a proper credential. For example:
- Privately shared content on a social network.
- Enterprise expense system.
Search engines should neither crawl nor index these types of pages.
- Return response code 401 "Unauthorized" for an access without a proper credential (or redirect the user to a login page).
- Don't use robots.txt to disallow crawling these pages. Otherwise, 401 can't be detected.
The restriction mechanism here can be an IP address, a cookie, basic auth, OAuth, etc. How to implement such authentication/authorization depends on your infrastructure and is beyond this article's scope.
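Purely to illustrate the 401 behaviour, and not as a recommended authentication scheme, here is a minimal Python WSGI sketch. The bearer-token check, EXPECTED_TOKEN, and the request helper are all assumptions invented for this example; a real site would plug in whatever mechanism its infrastructure provides.

```python
import hmac
from wsgiref.util import setup_testing_defaults

# Hypothetical shared secret, for illustration only.
EXPECTED_TOKEN = "example-token"


def private_app(environ, start_response):
    """Return 401 unless the request carries the expected credential."""
    supplied = environ.get("HTTP_AUTHORIZATION", "")
    expected = "Bearer " + EXPECTED_TOKEN
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        body = b"Unauthorized"
        start_response("401 Unauthorized", [
            ("Content-Type", "text/plain"),
            ("WWW-Authenticate", "Bearer"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    body = b"Private content"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def request(authorization=None):
    """Call the app in-process and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    if authorization is not None:
        environ["HTTP_AUTHORIZATION"] = authorization
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(private_app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(request())                        # crawler or visitor without a credential
    print(request("Bearer example-token"))  # authorized request
```

A crawler that reaches this URL without a credential sees the 401 status rather than the page content.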
Request a page removal from a search engine
You might want to remove a search result when:
- The page no longer exists.
- A page was accidentally indexed that includes confidential information.
Major search engines provide a way to send a request to remove such pages. The process usually involves the following steps:
- Make sure the page you want removed:
  - Is already deleted from your server and returns 404
  - Is configured not to be indexed (ex: noindex)
- Go to the request page on each search engine. (Google and Bing require you to register and validate ownership of your website.)
- Send a request.
Check out the concrete steps on the respective search engine's help pages.
Appendix: List of crawler user agents