Google wants to turn the decades-old Robots Exclusion Protocol (REP) into an official internet standard — and it’s making its own robots.txt parser open source as part of the push.
The REP, which was proposed as a standard by Dutch software engineer Martijn Koster back in 1994, has become the de facto standard websites use to tell automated crawlers which parts of a site should not be processed.
Google’s Googlebot crawler, for example, scans the robots.txt file when indexing websites to check for special instructions on what sections it should ignore — and if there is no such file in the root directory, it will assume that it’s fine to crawl (and index) the whole site.
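As an illustration, a minimal robots.txt placed in a site's root might look like the following (the paths here are hypothetical, chosen just to show the syntax):

```
# Rules that apply to all crawlers
User-agent: *
Disallow: /admin/

# Rules specific to Google's crawler
User-agent: Googlebot
Disallow: /drafts/
```

Crawlers that honor the REP read the group matching their user agent and skip the listed paths; if the file is absent, as noted above, the whole site is treated as fair game.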
These files aren’t always used purely for crawling instructions, though — they can also be stuffed with keywords for search engine optimization purposes, among other uses.
It’s worth noting that not all crawlers respect robots.txt files. The Internet Archive, for example, elected to stop honoring them for its Wayback Machine archiving tool a couple of years ago, while more malicious crawlers simply ignore the REP altogether.
While the REP is often referred to as a “standard,” it has never in fact become a true internet standard, as defined by the Internet Engineering Task Force (IETF) — the internet’s not-for-profit open-standards organization.