robots.txt won't protect you from LLM crawlers: a sign is not a lock

Since large language models took off, many companies have been asking the same question: "How do we stop our content from ending up in LLM training data?" And almost always the same answer comes back: "We'll put it in robots.txt." That is well-intentioned, but it rests on a misunderstanding of what robots.txt actually is. The short version: robots.txt is a sign, not a lock. Anyone who treats it as protection is relying on the politeness of strangers.

What robots.txt actually is

robots.txt is a text file in the root of a website that tells crawlers which areas they should and should not visit. Martijn Koster proposed the idea in 1994, when the web was still small and a few search-engine crawlers were accidentally overloading servers. It was a pragmatic agreement among the well-meaning: "Please leave these paths alone."

The crucial part is already in the name of the mechanism: Robots Exclusion Protocol. It is a protocol for voluntary self-restraint, not access protection. Technically, robots.txt prevents nothing. It is a request, not a command. Anyone who can request a URL can fetch the content, regardless of what robots.txt says. The file is respected only when the crawler chooses to respect it.
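To make the voluntary part concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URL and the user agent "ExampleBot" are made up for illustration. What matters is where the check lives: inside the crawler's own code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an illustrative site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def polite_crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a rule-abiding crawler would conclude from robots.txt.

    This check runs on the client. The server is never asked, and a
    crawler that does not call this function is not stopped by it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/private/report.html",
        "https://example.com/index.html",
    ):
        verdict = polite_crawler_may_fetch(ROBOTS_TXT, "ExampleBot", url)
        print(url, "->", "may fetch" if verdict else "asked not to fetch")
```

A crawler that skips this function and requests the URL directly gets the same response as any other visitor. Nothing on the server side changes because of the file.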

What the RFC says

For a long time robots.txt was only a de facto standard, a convention without an official specification. Only in 2022 did the IETF formalise it as RFC 9309. Anyone who reads the RFC finds the weakness confirmed, not fixed.

The RFC cleanly describes the format, the syntax, and how crawlers should fetch and interpret the file. But nowhere does it make compliance binding or enforceable. On the contrary: it describes rules that crawlers are requested to honor, and its security considerations state plainly that the protocol is not a substitute for real content security measures. An RFC that standardises a format does not give that format enforcement power. It describes how to ask politely, not how to compel.

Put differently: even the official standard says the standard only holds if the other side plays along.

Why this helps especially little against LLM crawlers

With classic search engines the politeness agreement worked quite well for a long time, because the search engines had a self-interest in it: Google does not want to be seen as the actor that ignores robots.txt. With data collection for language models, that logic crumbles for several reasons.

1. There is no single crawler. You can list GPTBot, Google-Extended, CCBot, ClaudeBot and a dozen more. But you can only block what you know. New actors with new user agents appear constantly, faster than you can maintain your robots.txt. You are playing a game where you are always one move behind.

2. The user agent is freely chosen. A crawler that does not want to follow the rules simply identifies itself as a normal browser. There is no obligation to label oneself honestly. A malicious or merely careless crawler does not ignore robots.txt loudly; it does so invisibly.

3. The data is often already gone. Huge public datasets like Common Crawl have archived the web over years. A Disallow set today changes nothing about what has already been collected, copied and passed on into training corpora. You are closing a door everyone walked through long ago.

4. Third parties collect and resell. Even if the big, visible provider behaves, there is a whole industry of scrapers that harvest data and sell it as datasets. Their business model is precisely to do what robots.txt forbids.

5. The line between "crawling" and "fetching" blurs. When an LLM-based assistant fetches a specific URL at a user's request to summarise it, some providers do not consider that "crawling" in the robots.txt sense. So your content can land in the LLM's context even when the classic training crawler is locked out.
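Points 1 and 2 can be checked with a short sketch, again using Python's standard urllib.robotparser. The robots.txt below blocks the four agents named in point 1. The user agent "FreshScraper/0.1" and the browser-style string are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A blocklist of the agents you happen to know about.
BLOCKLIST_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /
"""

# Hypothetical user agents: one listed bot, one unknown newcomer,
# and one crawler that simply claims to be a browser.
AGENTS = [
    "GPTBot/1.0",
    "FreshScraper/0.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
]


def blocklist_verdict(user_agent: str, url: str = "https://example.com/articles/1") -> bool:
    """Return True if robots.txt, read honestly, permits this agent."""
    parser = RobotFileParser()
    parser.parse(BLOCKLIST_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in AGENTS:
        print(agent.split("/")[0], "->", "allowed" if blocklist_verdict(agent) else "disallowed")
```

Only the listed bot is told to stay away, and only if it announces itself under that name. The newcomer and the self-declared browser are not covered by any rule, so even a parser that follows the file to the letter lets them through.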

The most dangerous mistake: robots.txt as secrecy

There is a particularly common and particularly risky misuse: trying to "hide" sensitive paths via Disallow. The opposite happens. robots.txt is publicly retrievable, so anyone can read it. Whoever enters Disallow: /admin-backup/ there has just handed the whole world a map of the interesting hiding places. You hide nothing; you point the way.
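How little effort reading that map takes is shown by the following sketch. It is a deliberately simple line parser, not a full robots.txt implementation, and apart from /admin-backup/ the paths in the sample file are invented.

```python
# Hypothetical robots.txt of a site that tries to "hide" things.
LEAKY_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin-backup/
Disallow: /internal/reports/   # invented path for illustration
Allow: /
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(LEAKY_ROBOTS_TXT):
        print("worth a look:", path)
```

Anyone curious can produce the same list by opening /robots.txt in a browser.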

Rule of thumb: what really should not be public belongs behind authentication, not in a list of things people are kindly asked not to look at.
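As a contrast to the sign, here is a minimal sketch of a lock: a server-side check that decides based on credentials, not on what the client claims to be. The function name, the bearer-token scheme and the token value are assumptions for illustration. A real deployment would rely on the authentication layer of its web framework or reverse proxy.

```python
import hmac


def status_for_request(
    path: str,
    headers: dict,
    valid_token: str,
    protected_prefix: str = "/admin-backup/",
) -> int:
    """Return the HTTP status a server enforcing authentication would send.

    The decision depends only on the credentials in the request. The
    User-Agent header and robots.txt play no role.
    """
    if not path.startswith(protected_prefix):
        return 200  # public content
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {valid_token}"
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(supplied.encode(), expected.encode()):
        return 200
    return 401  # no valid credentials, no content


if __name__ == "__main__":
    token = "example-token"  # hypothetical secret
    print(status_for_request("/admin-backup/db.sql", {"User-Agent": "Mozilla/5.0"}, token))
    print(status_for_request("/admin-backup/db.sql", {"Authorization": f"Bearer {token}"}, token))
```

The difference to robots.txt is who makes the decision: here it is the server on every request, not the visitor.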

What actually protects

robots.txt is not useless; it is just made for something else: managing crawl load and crawl budget, and steering well-behaved search engines. For that purpose it makes sense and should be maintained. To protect against unwanted data collection, you need real mechanisms:

Authentication and paywalls: content behind a login is unreachable for crawlers without valid credentials. That is the only truly hard boundary.
Rate limiting and bot detection: throttle or block suspicious access patterns instead of asking politely.
WAF and server-side rules: lock out known malicious actors at the network level.
Legal means: terms of use and copyright help against serious actors with an address, but little against anonymous scrapers abroad.
The most uncomfortable truth: what stands publicly on the web is effectively copyable. Truly sensitive content does not belong unprotected on the open web.
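To illustrate the rate-limiting item from the list above, here is a minimal sketch of a sliding-window limiter. The class name, the limits and the use of a client identifier such as an IP address are assumptions. Production setups usually do this in a reverse proxy, WAF or shared store instead of in process memory.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Forget requests that have slid out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the server would answer with HTTP 429 here
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    for second in (0.0, 1.0, 2.0, 3.0, 10.0):
        print(second, limiter.allow("203.0.113.7", now=second))
```

Unlike a Disallow line, this rule applies whether or not the client has read it. A determined scraper can still spread requests over many addresses, which is why rate limiting complements authentication and does not replace it.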

Our take

robots.txt is a sensible agreement among the well-meaning and a useful tool for crawl management. But it is not a security mechanism, and the RFC that standardises it says so itself. The notion that all actors obey a voluntary sign is wishful thinking, especially in a market where data is the raw material. Whoever wants to seriously protect their content builds locks, not signs.

If you are thinking about how to handle the visibility of your content to LLMs, from crawl management to real access control, talk to us. We will help you separate what is a request from what is a boundary.