Google Works To Make Robots Exclusion Protocol A Real Standard
Google’s webmaster Twitter channel is publishing a series of hourly posts about the Robots Exclusion Protocol. In short, an hour ago Google announced that after 25 years as a de-facto standard, it has worked with Martijn Koster, webmasters, and other search engines to make the Robots Exclusion Protocol an official standard.
Here are the posts so far, starting at 3am and continuing every hour:
It’s 1994 and crawlers are overwhelming servers. To help webmasters, Martijn Koster (@makuk66), a webmaster himself, proposes a protocol to control what URLs crawlers may access on sites.
https://t.co/HiRsEgc2xO
— Google Webmasters (@googlewmc) July 1, 2019
The robots.txt protocol is very simple, yet incredibly effective: by specifying a user-agent and rules for it, webmasters have granular control over what crawlers may access. It doesn’t matter if it’s a single URL, a certain file type, or a whole site – robots.txt works for each. pic.twitter.com/fOlFFE2yMi
— Google Webmasters (@googlewmc) July 1, 2019
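For illustration, a minimal robots.txt along those lines might look like this (ExampleBot and the paths are hypothetical; the * wildcard and $ end-of-URL anchor are common extensions the new draft covers):

```
# Rules for every crawler: one specific URL and one file type are off limits.
User-agent: *
Disallow: /private/draft.html
Disallow: /*.pdf$

# A hypothetical crawler may fetch /public/ but nothing else.
User-agent: ExampleBot
Allow: /public/
Disallow: /
```

Because the most specific (longest) matching path wins, the Allow line lets ExampleBot into /public/ even though the site-wide Disallow would otherwise block it.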
It’s been 25 years, and the Robots Exclusion Protocol never became an official standard. While it was adopted by all major search engines, it didn’t cover everything: does a 500 HTTP status code mean that the crawler can crawl anything or nothing? pic.twitter.com/imqoVQW92V
— Google Webmasters (@googlewmc) July 1, 2019
Today we’re announcing that after 25 years of being a de-facto standard, we worked with Martijn Koster (@makuk66), webmasters, and other search engines to make the Robots Exclusion Protocol an official standard! https://t.co/Kcb9flvU0b
— Google Webmasters (@googlewmc) July 1, 2019
In 25 years, robots.txt has been widely adopted – in fact, over 500 million websites use it! While user-agent, disallow, and allow are the most popular lines in all robots.txt files, we’ve also seen rules that allowed Googlebot to “Learn Emotion” or “Assimilate The Pickled Pixie”. pic.twitter.com/tmCApqVesh
— Google Webmasters (@googlewmc) July 1, 2019
But there are also lots of typos in robots.txt files. Most people miss colons in the rules, and some misspell them. What should crawlers do with a rule named “Dis Allow”? pic.twitter.com/nZEIyPYI9R
— Google Webmasters (@googlewmc) July 1, 2019
To help developers create parsers that reflect the Robots Exclusion Protocol requirements, we’re releasing our robots.txt parser as open source!
Updated to cover all corner cases, the parser ensures that Googlebot only crawls what it’s allowed to. https://t.co/NmbLRzDkHF
— Google Webmasters (@googlewmc) July 1, 2019
Happy 25th birthday, robots.txt! You make the Internet a better place. You’re the real MVP! pic.twitter.com/vxvZTcHpR3
— Google Webmasters (@googlewmc) July 1, 2019
Google said “it doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web. Notably:”
- Any URI-based transfer protocol can use robots.txt. For example, it’s no longer limited to HTTP and can be used for FTP or CoAP as well.
- Developers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers.
- A new maximum caching time of 24 hours (or the cache directive value, if available) gives website owners the flexibility to update their robots.txt whenever they want, while crawlers aren’t overloading websites with robots.txt requests. For example, in the case of HTTP, Cache-Control headers could be used to determine caching time (see the sketch after this list).
- The specification now provisions that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
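To make those provisions concrete, here is a minimal, hypothetical Python sketch of how a crawler might fetch robots.txt while honoring the 500 kibibyte parsing limit and the 24-hour default cache (preferring an HTTP Cache-Control max-age when present). It uses Python’s standard urllib.robotparser for the rule matching rather than Google’s newly open-sourced parser, and the function and crawler names are made up:

```python
import time
import urllib.request
import urllib.robotparser

MAX_ROBOTS_BYTES = 500 * 1024         # spec: parse at least the first 500 kibibytes
DEFAULT_CACHE_SECONDS = 24 * 60 * 60  # spec: cache for up to 24 hours by default

_cache = {}  # origin -> (fetched_at, max_age, parser)

def get_robots_parser(origin):
    """Fetch and cache robots.txt for an origin like 'https://example.com'."""
    now = time.time()
    if origin in _cache:
        fetched_at, max_age, parser = _cache[origin]
        if now - fetched_at < max_age:
            return parser  # still fresh; don't re-request robots.txt

    # Error handling omitted: per the draft, a 4xx generally means no
    # restrictions, while server errors mean assuming a complete disallow.
    with urllib.request.urlopen(origin + "/robots.txt") as resp:
        # Only the first 500 KiB must be parsed; ignore anything beyond that.
        body = resp.read(MAX_ROBOTS_BYTES).decode("utf-8", errors="replace")
        # Use the server's Cache-Control max-age if it sent one.
        max_age = DEFAULT_CACHE_SECONDS
        for part in resp.headers.get("Cache-Control", "").split(","):
            part = part.strip()
            if part.startswith("max-age="):
                try:
                    max_age = int(part.split("=", 1)[1])
                except ValueError:
                    pass

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(body.splitlines())
    _cache[origin] = (now, max_age, parser)
    return parser

# Usage: decide whether a (hypothetical) crawler may fetch a URL.
rp = get_robots_parser("https://example.com")
print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))
```

A real crawler would also need the failure handling from the last point above, for example keeping the last known rules and treating previously disallowed pages as off limits while robots.txt is unreachable.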
This was a big deal for the folks at Google and their partners to make happen:
After 25 years, robots.txt is slowly becoming a standard! A real one! It took so much work to make this happen, but we’re finally there! https://t.co/8yoiyOOZJM
— Gary “鯨理” Illyes (@methode) July 1, 2019
robots.txt is 25 years old! For a trip down memory lane: https://t.co/5dCVVNAIBd pic.twitter.com/PHx0wCcROx
— Martijn Koster (@makuk66) July 1, 2019
Just to be clear – nothing is changing for you with this announcement:
No, nothing at all
— Gary “鯨理” Illyes (@methode) July 1, 2019