Create a robots.txt file
A robots.txt file lives at the root of your site. So, for site
the robots.txt file lives at
www.example.com/robots.txt. robots.txt is a plain
text file that follows the
Robots Exclusion Standard.
A robots.txt file consists of one or more rules. Each rule blocks (or allows) access for a
given crawler to a specified file path in that website.
Here is a simple robots.txt file with two rules, explained below:
# Group 1 User-agent: Googlebot Disallow: /nogooglebot/ # Group 2 User-agent: * Allow: / Sitemap: http://www.example.com/sitemap.xml
The user agent named "Googlebot" is not allowed to crawl the
http://example.com/nogooglebot/directory or any subdirectories.
- All other user agents are allowed to crawl the entire site. This could have been omitted and the result would be the same; the default behavior is that user agents are allowed to crawl the entire site.
The site's sitemap file is located at
See the syntax section for more examples.
Basic robots.txt guidelines
Here are some basic guidelines for robots.txt files. We recommend that you read the full syntax of robots.txt files because the robots.txt syntax has some subtle behavior that you should understand.
Format and location
You can use almost any text editor to create a robots.txt file. The text editor should be able to create standard UTF-8 text files. Don't use a word processor; word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes, which can cause problems for crawlers.
Format and location rules:
- The file must be named robots.txt.
- Your site can have only one robots.txt file.
The robots.txt file must be located at the root of the website host to
which it applies. For instance, to control crawling on all URLs below
http://www.example.com/, the robots.txt file must be located at
http://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at
http://example.com/pages/robots.txt). If you're unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.
A robots.txt file can apply to subdomains (for example,
http://website.example.com/robots.txt) or on non-standard ports (for example,
- A robots.txt file must be an UTF-8 encoded text file (which includes ASCII). Using other character sets is not possible.
- A robots.txt file consists of one or more groups.
- Each group consists of multiple rules or directives (instructions), one directive per line.
- A group gives the following information:
- Who the group applies to (the user agent)
- Which directories or files that agent can access
- Which directories or files that agent cannot access
- Groups are processed from top to bottom, and a user agent can match only one rule set, which is the first, most specific rule that matches a given user agent.
The default assumption is that a user agent can crawl any page or directory
not blocked by a
Rules are case-sensitive. For instance,
Disallow: /file.aspapplies to
http://www.example.com/file.asp, but not
- Comments are any content after a
The following directives are used in robots.txt files:
User-agent:[Required, one or more per group] The directive specifies the name of the automatic client known as search engine crawler that the rule applies to. This is the first line for any rule group. Google user agent names are listed in the Google list of user agents. Using an asterisk (
*) as in the example below will match all crawlers except the various AdsBot crawlers, which must be named explicitly. Examples:
# Example 1: Block only Googlebot User-agent: Googlebot Disallow: / # Example 2: Block Googlebot and Adsbot User-agent: Googlebot User-agent: AdsBot-Google Disallow: / # Example 3: Block all but AdsBot crawlers User-agent: * Disallow: /
Disallow:[At least one or more
Allowentries per rule] A directory or page, relative to the root domain, that you don't want the user agent to crawl. If the rule refers to a page, it should be the full page name as shown in the browser; if it refers to a directory, it should end in a
Allow:[At least one or more
Allowentries per rule] A directory or page, relative to the root domain, that may be crawled by the user agent just mentioned. This is used to override a
Disallowdirective to allow crawling of a subdirectory or page in a disallowed directory. For a single page, the full page name as shown in the browser should be specified. In case of a directory, the rule should end in a
Sitemap:[Optional, zero or more per file] The location of a sitemap for this website. The sitemap URL must be a fully-qualified URL; Google doesn't assume or check http/https/www.non-www alternates. Sitemaps are a good way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl. Learn more about sitemaps. Example:
Sitemap: https://example.com/sitemap.xml Sitemap: http://www.example.com/sitemap.xml
All directives, except
sitemap, support the
* wildcard for a path
prefix, suffix, or entire string.
Lines that don't match any of these directives are ignored.
Another example file
A robots.txt file consists of one or more groups, each beginning with a
User-agent line that specifies the target of the groups. Here is a file with two
groups; inline comments explain each group:
# Block googlebot from example.com/directory1/... and example.com/directory2/... # but allow access to directory2/subdirectory1/... # All other directories on the site are allowed by default. User-agent: googlebot Disallow: /directory1/ Disallow: /directory2/ Allow: /directory2/subdirectory1/ # Block the entire site from anothercrawler. User-agent: anothercrawler Disallow: /
Full robots.txt syntax
You can find the full robots.txt syntax here. Please read the full documentation, as the robots.txt syntax has a few tricky parts that are important to learn.
Useful robots.txt rules
Here are some common useful robots.txt rules:
|Disallow crawling of the entire website. Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled.||
User-agent: * Disallow: /
|Disallow crawling of a directory and its contents by following the directory name with a forward slash. Remember that you shouldn't use robots.txt to block access to private content: use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.||
User-agent: * Disallow: /calendar/ Disallow: /junk/
|Allow access to a single crawler||
User-agent: Googlebot-news Allow: / User-agent: * Disallow: /
|Allow access to all but a single crawler||
User-agent: Unnecessarybot Disallow: / User-agent: * Allow: /
Disallow crawling of a single web page by listing the page after the slash:
User-agent: * Disallow: /private_file.html
Block a specific image from Google Images:
User-agent: Googlebot-Image Disallow: /images/dogs.jpg
Block all images on your site from Google Images:
User-agent: Googlebot-Image Disallow: /
Disallow crawling of files of a specific file type (for example,
User-agent: Googlebot Disallow: /*.gif$
Disallow crawling of an entire site, but show AdSense ads on those pages,
and disallow all web crawlers other than
User-agent: * Disallow: / User-agent: Mediapartners-Google Allow: /
To match URLs that end with a specific string, use
User-agent: Googlebot Disallow: /*.xls$