Crawling through HTML forms
Friday, April 11, 2008
Google is constantly trying new ideas to improve our coverage of the web. We already do some
pretty smart things like scanning JavaScript and Flash to discover links to new web pages, and
today, we would like to talk about another new technology we've started experimenting with
recently.
In the past few months we have been exploring some HTML forms to try to discover new web pages and
URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when
we encounter a <form> element on a high-quality site, we might choose to do a
small number of queries using the form. For text boxes, our computers automatically choose words
from the site that has the form; for
select menus, check boxes, and radio buttons on the form, we choose from among the values listed in
the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that
correspond to a possible query a user may have made. If we ascertain that the web page resulting
from our query is valid, interesting, and includes content not in our index, we may include it
in our index much as we would include any other web page.
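To make the value-selection and URL-generation step concrete, here is a minimal sketch in Python. The form dictionary, the list of words drawn from the site, and the cap on generated URLs are all hypothetical illustrations; the post does not describe the actual data structures or limits used.

```python
# Minimal sketch: turn a parsed GET form into a handful of candidate URLs.
# The `form` structure and `site_words` list are hypothetical stand-ins.
import itertools
from urllib.parse import urlencode, urljoin

def candidate_urls(page_url, form, site_words, max_urls=10):
    """Yield a small number of GET URLs a user might have produced.

    `form` is assumed to look like:
        {"action": "/search", "method": "get",
         "inputs": [{"type": "text", "name": "q"},
                    {"type": "select", "name": "cat",
                     "values": ["books", "music"]}]}
    """
    if form.get("method", "get").lower() != "get":
        return  # only GET forms are considered

    choices = []
    for field in form["inputs"]:
        if field["type"] == "text":
            # for text boxes, pick words that already appear on the site
            choices.append([(field["name"], word) for word in site_words])
        elif field["type"] in ("select", "checkbox", "radio"):
            # for these controls, pick from the values listed in the HTML
            choices.append([(field["name"], value) for value in field["values"]])

    action = urljoin(page_url, form.get("action", ""))
    for combo in itertools.islice(itertools.product(*choices), max_urls):
        yield action + "?" + urlencode(dict(combo))
```

With words such as "gardening" and "roses" taken from the hosting page, a sketch like this would emit URLs such as https://example.com/search?q=gardening&cat=books, each corresponding to a query a visitor might plausibly have run.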
Needless to say, this experiment follows good Internet citizenry practices. Only a small number of
particularly useful sites receive this treatment, and our crawl agent, the
ever-friendly Googlebot,
always adheres to robots.txt, nofollow, and noindex rules. That means that if a search
form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate.
Similarly, we only retrieve GET forms and avoid forms that require any kind of user
information. For example, we omit any forms that have a password input or that use terms commonly
associated with personal information such as logins, userids, contacts, etc. We are also mindful
of the impact we can have on web sites and limit ourselves to a very small number of fetches for a
given site.
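As a rough illustration of those safeguards, the sketch below filters forms and checks generated URLs against robots.txt using Python's standard library. The field-name blocklist, the hypothetical form structure (same shape as in the earlier sketch), and the per-site fetch budget are illustrative assumptions, not a description of Googlebot's actual rules.

```python
# Illustrative safety checks: GET-only forms, no user-information fields,
# and robots.txt compliance. The blocklist below is a guess, not Google's list.
from urllib import robotparser

PERSONAL_FIELD_NAMES = {"login", "userid", "username", "password", "contact", "email"}
MAX_FETCHES_PER_SITE = 5  # hypothetical per-site fetch budget

def form_is_eligible(form):
    """Accept only GET forms that ask for no user information."""
    if form.get("method", "get").lower() != "get":
        return False
    for field in form["inputs"]:
        if field["type"] == "password":
            return False
        if field.get("name", "").lower() in PERSONAL_FIELD_NAMES:
            return False
    return True

def url_is_allowed(url, robots_url, user_agent="Googlebot"):
    """Check a generated URL against the site's robots.txt before fetching."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)
```

A crawler built along these lines would also count fetches per host and stop once its small per-site budget is exhausted, keeping its footprint on any one site minimal.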
The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that
are already part of the crawl, so this change doesn't reduce PageRank for your other pages. As
such, it should only increase the exposure of your site in Google. This change also does not
affect the crawling, ranking, or selection of other web pages in any significant way.
This experiment is part of Google's broader effort to increase its coverage of the web. In fact,
HTML forms have long been thought to be the gateway to large volumes of data beyond the normal
scope of search engines. The terms Deep Web, Hidden Web, or
Invisible Web
have been used collectively to refer to such content that has so far been invisible to search
engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead
search engine users to documents that would otherwise not be easily found in search engines, and
provide webmasters and users alike with a better and more comprehensive search experience.
Written by Jayant Madhavan and Alon Halevy, Crawling and Indexing Team
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],[],[[["\u003cp\u003eGoogle is experimenting with a new technology that uses HTML forms to discover new web pages and URLs that were previously unindexed.\u003c/p\u003e\n"],["\u003cp\u003eThis process involves automatically selecting values for form inputs, generating URLs, and crawling the resulting pages to find valuable content to add to the index.\u003c/p\u003e\n"],["\u003cp\u003eGoogle's form-based crawling adheres to robots.txt rules and avoids forms that require personal information, limiting its impact on websites and user privacy.\u003c/p\u003e\n"],["\u003cp\u003eThis experiment aims to increase Google's coverage of the web, including the "Deep Web" or "Invisible Web" containing content not typically found by search engines.\u003c/p\u003e\n"],["\u003cp\u003eThis change should increase website exposure and provide a better search experience for users without negatively impacting PageRank or existing search results.\u003c/p\u003e\n"]]],["Google's experiment involves crawling HTML forms on high-quality sites to discover new web pages. Using text boxes, select menus, checkboxes, and radio buttons, Google's computers choose values to generate and crawl potential user query URLs. Valid, unique content found is added to the index. This method only uses `GET` forms, respects `robots.txt`, `nofollow`, and `noindex` directives, and avoids forms requesting personal information. The experiment aims to uncover \"Deep Web\" content and improve search results without negatively impacting existing web page rankings.\n"],null,["# Crawling through HTML forms\n\nFriday, April 11, 2008\n\n\nGoogle is constantly trying new ideas to improve our coverage of the web. We already do some\npretty smart things like scanning JavaScript and Flash to discover links to new web pages, and\ntoday, we would like to talk about another new technology we've started experimenting with\nrecently.\n\n\nIn the past few months we have been exploring some HTML forms to try to discover new web pages and\nRLs that we otherwise couldn't find and index for users who search on Google. Specifically, when\nwe encounter a `\u003cform\u003e` element on a high-quality site, we might choose to do a\nsmall number of queries using the form. For text boxes, our computers automatically choose words\nfrom the site that has the form; for\nselect menus, check boxes, and radio buttons on the form, we choose from among the values of\nthe HTML. Having chosen the values for each input, we generate and then try to crawl URLs that\ncorrespond to a possible query a user may have made. If we ascertain that the web page resulting\nfrom our query is valid, interesting, and includes content not in our index, we may include it\nin our index much as we would include any other web page.\n\n\nNeedless to say, this experiment follows good Internet citizenry practices. Only a small number of\nparticularly useful sites receive this treatment, and our crawl agent, the\n[ever-friendly Googlebot](/search/blog/2008/03/first-date-with-googlebot-headers-and),\nalways adheres to robots.txt, `nofollow`, and `noindex` rules. 
That means that if a search\nform is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate.\nSimilarly, we only retrieve `GET` forms and avoid forms that require any kind of user\ninformation. For example, we omit any forms that have a password input or that use terms commonly\nassociated with personal information such as logins, userids, contacts, etc. We are also mindful\nof the impact we can have on web sites and limit ourselves to a very small number of fetches for a\ngiven site.\n\n\nThe web pages we discover in our enhanced crawl do not come at the expense of regular web pages that\nare already part of the crawl, so this change doesn't reduce PageRank for your other pages. As\nsuch it should only increase the exposure of your site in Google. This change also does not\naffect the crawling, ranking, or selection of other web pages in any significant way.\n\n\nThis experiment is part of Google's broader effort to increase its coverage of the web. In fact,\nHTML forms have long been thought to be the gateway to large volumes of data beyond the normal\nscope of search engines. The terms Deep Web, Hidden Web, or\n[Invisible Web](https://www.amazon.com/Invisible-Web-Uncovering-Information-Sources/dp/091096551X/ref=sr_1_1?ie=UTF8&s=books&qid=1207179068)\nhave been used collectively to refer to such content that has so far been invisible to search\nengine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead\nsearch engine users to documents that would otherwise not be easily found in search engines, and\nprovide webmasters and users alike with a better and more comprehensive search experience.\n\nWritten by Jayant Madhavan and Alon Halevy, Crawling and Indexing Team"]]