Around 25 years ago, the rules governing robots.txt files were defined in the Robots Exclusion Protocol (REP), which is still considered an unofficial standard.
Although search engines have largely endorsed the REP over the last 25 years, developers have often interpreted it in their own way because it is unofficial. Moreover, it has become outdated over time, failing to cater to many of today's use cases.
Even Google admitted that the standard's ambiguity makes it difficult for website owners to implement the rules correctly.
The tech giant has now proposed a solution by documenting how the REP should be applied on the modern web, and has submitted the resulting draft to the Internet Engineering Task Force (IETF) for evaluation.
> In 25 years, robots.txt has been widely adopted – in fact, over 500 million websites use it! While user-agent, disallow, and allow are the most popular lines in all robots.txt files, we've also seen rules that allowed Googlebot to "Learn Emotion" or "Assimilate The Pickled Pixie". pic.twitter.com/tmCApqVesh
>
> – Google Webmasters (@googlewmc), July 1, 2019
According to Google, the draft reflects extensive real-world experience of relying on robots.txt rules, drawing on Googlebot, various other crawlers, and the more than half a billion websites that depend on the REP. These rules give website publishers the power to decide what they would like to be crawled on their site and shown to interested users.
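For illustration, a minimal robots.txt built from the common user-agent, disallow, and allow lines might look like the sketch below; the crawler names and paths are hypothetical, not taken from Google's draft.

```
# Hypothetical example: let Googlebot crawl everything except one directory,
# while still exposing a single file inside it.
User-agent: Googlebot
Disallow: /internal/
Allow: /internal/press-kit.html

# All other crawlers stay out of the directory entirely.
User-agent: *
Disallow: /internal/
```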
It should be noted that the draft doesn't change the already defined rules; it simply updates them to suit the modern web.
The updated rules include (but are not limited to):
- Robots.txt is no longer limited to HTTP; it can be used with any URI-based transfer protocol.
- Developers must parse at least the first 500 kibibytes of a robots.txt file (see the sketch after this list).
- To give website owners flexibility in updating their robots.txt, crawlers should observe a maximum caching time of 24 hours, or the value of a cache directive where one is available.
- When server failures make a previously accessible robots.txt file unreachable, known disallowed pages are not crawled for a reasonably long period of time.
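As a rough illustration of the parsing and caching rules above, here is a minimal Python sketch that fetches a robots.txt file, reads at most the first 500 KiB, and derives a cache lifetime from the HTTP Cache-Control header, falling back to 24 hours. It uses only the standard library; the function name and the example.com URL are placeholders, and this is not Googlebot's actual implementation.

```python
import urllib.request
import urllib.robotparser

MAX_ROBOTS_BYTES = 500 * 1024          # parse at least the first 500 KiB, per the draft
DEFAULT_CACHE_SECONDS = 24 * 60 * 60   # fall back to a 24-hour cache lifetime

def fetch_robots(url):
    """Fetch robots.txt, truncate it to 500 KiB, and return (parser, cache TTL in seconds)."""
    with urllib.request.urlopen(url) as resp:
        body = resp.read(MAX_ROBOTS_BYTES)               # ignore anything beyond 500 KiB
        cache_control = resp.headers.get("Cache-Control", "")

    # Honour a max-age directive if the server sent one, otherwise cache for 24 hours.
    ttl = DEFAULT_CACHE_SECONDS
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            try:
                ttl = int(directive.split("=", 1)[1])
            except ValueError:
                pass

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(body.decode("utf-8", errors="replace").splitlines())
    return parser, ttl

# Hypothetical usage:
# parser, ttl = fetch_robots("https://example.com/robots.txt")
# print(parser.can_fetch("MyCrawler", "https://example.com/private/page"), ttl)
```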