I Audited 70 Companies' llms.txt Files. Most Don't Have One.
Everyone's writing tutorials on llms.txt. Nobody's checking the homework. I probed 70 sites that should already have one — AI labs, dev infra, docs platforms, big SaaS. 31 had a real llms.txt. 12 of those 31 don't follow the format their own community defined. Here's the data.
The /llms.txt proposal has been around since late 2024. Two years in, every SEO blog has a tutorial on it. Every "GEO" thinkpiece mentions it as a checkbox. WP Engine has published five articles about it in the last month. Mintlify built a feature for it.
What nobody's done is check whether the people who should have one actually do, and whether what they shipped is any good.
So I ran a script over 70 sites — the AI labs, the docs platforms, the dev infrastructure giants, the modern SaaS darlings, the WordPress media — and pulled their /llms.txt. Here's what I found.
TL;DR: 31 of 70 (44%) had a real llms.txt. The other 56% either returned 404, served an HTML fallback (SPA routes catching the path), or 403'd a bot. Of the 29 unique files that exist (after deduping mirror domains), 12 (41%) don't follow the format llmstxt.org itself defined. File sizes range from 648 bytes to 280 KB — a 432× spread. The companies you'd expect to lead — OpenAI, Hugging Face, Google AI, Mintlify, GitBook, ReadMe — don't have one.
What I tested, and how
I picked 70 sites across eight buckets:
- AI labs (Anthropic, OpenAI, Mistral, Cohere, Perplexity, Hugging Face, Google AI, Replicate, ElevenLabs)
- AI coding tools (Cursor, Codeium, Windsurf, Claude, Aider)
- Docs platforms (Mintlify, GitBook, ReadMe, Redocly, Docusaurus)
- Dev infra (Vercel, Supabase, Stripe, Cloudflare, Fly, Netlify, Railway, Render, Bun, Deno)
- Modern SaaS (Linear, Notion, Retool, Raycast, Framer, Airtable, Segment, Amplitude, Mixpanel, PostHog)
- WordPress ecosystem (Kinsta, WP Engine, Yoast, ManageWP, Cloudways, WordPress.com, WP Tavern)
- llms.txt origin (llmstxt.org, fast.ai, AnswerAI)
- Community / publishing (dev.to, Hacker Noon, Substack, Indie Hackers, GitHub, GitLab)
For each, I fetched https://<domain>/llms.txt with a regular browser User-Agent, no special headers. I parsed the body to check whether it was actual markdown (starts with # heading) or HTML (a SPA fallback catching the route).
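The probe amounts to a few lines of Python. This is a hedged reconstruction of the approach described above, not the exact script; the `classify_body` and `probe` names and the heuristics are illustrative.

```python
# Sketch of the probe: fetch /llms.txt with a browser User-Agent and
# classify the body as real markdown, an HTML fallback, or unclear.
from urllib import request

BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

def classify_body(body: str) -> str:
    """Rough classification of a /llms.txt response body."""
    stripped = body.lstrip()
    if stripped.lower().startswith(("<!doctype", "<html")):
        return "html_fallback"   # SPA catch-all serving the app shell
    if stripped.startswith("# "):
        return "real"            # markdown with a leading H1
    return "unclear"             # inspect the first 200 chars by hand

def probe(domain: str) -> str:
    req = request.Request(f"https://{domain}/llms.txt",
                          headers={"User-Agent": BROWSER_UA})
    try:
        with request.urlopen(req, timeout=10) as resp:
            return classify_body(resp.read().decode("utf-8", "replace"))
    except Exception as exc:     # 403 / 404 / 530 / timeouts all land here
        return f"error:{exc.__class__.__name__}"
```

The "starts with an H1" check is deliberately crude; it's the same heuristic described above, which is why the first-200-characters column exists as a manual backstop.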
Then I measured seven dimensions on each real sample:
- File size in bytes
- Lines
- Number of H2 sections
- Total markdown links
- Percentage of links pointing to deep internal pages vs. marketing root vs. external
- Whether it follows the # Title + > intro blockquote format from the spec
- First 200 characters (to spot HTML fallbacks the heuristic missed)
That's it. No login, no API key, no special crawler. Just plain HTTP GET, the same any LLM crawler would do.
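The measurement pass is just as simple. Roughly, something like the sketch below, assuming naive regex definitions of "H2 section" and "markdown link" (the `measure` helper is illustrative, not the script I ran):

```python
# Per-file metrics for a fetched llms.txt body.
import re

def measure(body: str) -> dict:
    lines = body.splitlines()
    return {
        "bytes": len(body.encode("utf-8")),
        "lines": len(lines),
        "h2_sections": sum(1 for l in lines if l.startswith("## ")),
        "md_links": len(re.findall(r"\[[^\]]*\]\(([^)\s]+)\)", body)),
        "first_200": body[:200],   # to eyeball HTML fallbacks by hand
    }
```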
Finding 1: More than half of the obvious candidates don't have one
| Outcome | Count | % |
|---|---|---|
| Real llms.txt | 31 | 44% |
| 404 | 22 | 31% |
| HTML fallback (200 but SPA-served) | 9 | 13% |
| 403 / 530 / timeout | 8 | 11% |
The 44% headline already sounds bad. But the picture is worse than that, because the "should obviously have one" cohort is the loudest part of the miss list:
- OpenAI: 403 on the root, 404 on platform.openai.com. They've built the most-talked-about LLM in the world and don't help it find their own docs.
- Hugging Face: 404. The platform whose entire reason for existing is hosting models that read text.
- Google AI (ai.google.dev): 302 redirect, no llms.txt at the destination.
- Mintlify (the company that sells "llms.txt as a feature" to its customers): 404 on their own marketing site, 530 on their docs site.
- GitBook, ReadMe: both 404.
- fast.ai: 404. Jeremy Howard's own site. He's the one who proposed llms.txt.
- WP Engine, Kinsta, WP Tavern, ManageWP: 404 or HTML fallback. The same WordPress-hosting companies running 5-article-a-month content series about GEO and AI search ranking don't have one.
A reasonable counter-argument: "an llms.txt only helps if your site is what an LLM would link to in an answer; OpenAI doesn't need to optimize for ChatGPT citing OpenAI." Fine. But that doesn't explain Mintlify, whose entire pitch right now is "add llms.txt with one click using our platform." If they don't dogfood their own feature, you should ask why.
The HTML fallback group (9 sites returning 200 but with an HTML body) is also revealing. Cohere docs, Cursor docs, Codeium, Windsurf, Claude.ai, Kinsta, Notion, Segment, Deno all fall here. These are companies on Next.js / Vue / similar SPAs where the framework's catch-all route is returning the app shell for /llms.txt. No one noticed. No one's monitoring. An LLM crawler hitting their /llms.txt gets an HTML page that doesn't even mention the word "llms.txt."
This is the most damning category. 404 is a deliberate non-answer. HTML fallback is an unaware non-answer. You ran a tool, you copied a snippet from your CMS, you assumed it worked, and it doesn't.
Finding 2: 41% of the files that exist don't follow the format
The llms.txt spec, as defined by llmstxt.org itself, is dead simple. There are exactly two required structural elements:
- A single # Title H1 at the top.
- A > blockquote line right after the H1, summarizing what the site is.
That's it. The rest is sections, links, optional ## H2 groupings — all free-form.
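The compliance check is correspondingly small. A minimal sketch, under one reading of the spec (H1 on the first non-empty line, blockquote on the next; the `follows_spec` name is mine):

```python
# True if the body has the two required elements: an H1 title,
# immediately followed (ignoring blank lines) by a > blockquote intro.
def follows_spec(body: str) -> bool:
    lines = [l for l in body.splitlines() if l.strip()]
    return (len(lines) >= 2
            and lines[0].startswith("# ")
            and lines[1].startswith("> "))
```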
Of 29 unique samples (after deduping cloudflare.com / www.cloudflare.com and cursor.com / www.cursor.com):
- 17 (59%) follow the H1 + blockquote pattern.
- 12 (41%) skip the blockquote. Some skip the H1 too.
The non-compliant list is not who you'd expect:
| Site | Sections | Links | What's wrong |
|---|---|---|---|
| docs.anthropic.com | 3 | 1,415 | No blockquote intro |
| docs.stripe.com | 26 | 528 | No blockquote intro |
| render.com | 6 | 295 | No blockquote intro |
| bun.sh | 2 | 317 | No blockquote intro |
| netlify.com | 10 | 83 | No blockquote intro |
| developers.cloudflare.com | 9 | 103 | No blockquote intro |
| docs.mistral.ai | 1 | 75 | Single section, no intro |
| elevenlabs.io | 7 | 49 | No blockquote intro |
| cursor.com | 20 | 0 | No blockquote, and has 0 markdown links — just raw URL list |
| replicate.com | 7 | 1 | No blockquote intro, near-zero links |
| amplitude.com | 7 | 0 | No blockquote, 0 links |
| supabase.com | 2 | 19 | No blockquote intro |
Anthropic — the company whose docs platform is literally the reference customer for the LLM ecosystem — ships 152 KB of links with no opening blockquote. Stripe ships 93 KB of structured docs index with no intro line. These aren't beginner sites. They have technical writers.
What this looks like in practice: someone on the platform team auto-generated an llms.txt from the docs sitemap, shipped it, and never read the spec. The crawler hitting it gets a sea of links with no context about what the site is.
Cursor's is the strangest. Twenty sections, beautifully nested, zero markdown links — every entry is a raw URL on its own line:
## Get Started
- https://cursor.com/docs.md
- https://cursor.com/docs/get-started/quickstart.md
- https://cursor.com/docs/models-and-pricing.md
- https://cursor.com/docs/models/claude-4-6-sonnet.md
It also has at least one concatenation bug: https://cursor.comhttps://cursor.com/changelog.md shows up unescaped. If an LLM crawler was strict about markdown link parsing, half of Cursor's file would be invisible to it.
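You can see the problem directly with a strict parser. A minimal sketch: a markdown-link regex applied to Cursor-style raw URL lines finds nothing, while the same regex on a proper markdown link extracts the URL.

```python
# A strict markdown-link matcher: [label](https://...). Raw URLs on
# their own lines never match, so a strict crawler sees zero links.
import re

MD_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

cursor_style = """## Get Started
- https://cursor.com/docs.md
- https://cursor.com/docs/get-started/quickstart.md
"""

print(MD_LINK.findall(cursor_style))  # → [] : raw URLs are invisible
```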
Finding 3: File sizes vary by 432×, and nobody agrees what this file is for
The smallest real llms.txt: 648 bytes. That's llmstxt.org — the site that defined the spec — pointing to three docs.
The largest: 279,743 bytes. That's PostHog, dumping what looks like their entire docs sitemap, with prose annotations, 54 H2 sections, and 2,555 markdown links pointing to .md variants of every doc page.
Between those two extremes, a 432× spread:
| Bucket | Count | Examples |
|---|---|---|
| Tiny (<1 KB) | 1 | llmstxt.org (648 B) |
| Small (1-5 KB) | 8 | railway.app (1.4 KB), supabase.com (1.3 KB), amplitude.com (2.8 KB), replicate.com (3.4 KB), mistral.ai (4.9 KB), cohere.com (4.8 KB), hackernoon.com (4.9 KB), answerai.com (1.8 KB) |
| Medium (5-20 KB) | 9 | linear.app, cursor.com, framer.com, yoast.com, wordpress.com, cloudflare.com, elevenlabs.io, docs.mistral.ai, developers.cloudflare.com |
| Large (20-100 KB) | 9 | docs.perplexity.ai (22 KB), github.com (28 KB), bun.sh (33 KB), render.com (36 KB), netlify.com (20 KB), stripe.com (64 KB), cloudways.com (64 KB), docs.stripe.com (93 KB), redocly.com (98 KB) |
| Huge (100 KB+) | 2 | docs.anthropic.com (152 KB), posthog.com (280 KB) |
Median: 9.4 KB. Mean: 34.7 KB. The mean is misleadingly high because of the two huge outliers — most sites are actually quite small.
This spread maps directly to two different philosophies of what llms.txt is. The spec text is ambivalent on this, so people just picked:
Philosophy A (the spec example): llms.txt is a tiny pointer file. A few hundred bytes. A title, a one-line summary, and a handful of links to the most important pages. The LLM is supposed to crawl those pages separately. Sites like llmstxt.org (648 B), supabase.com (1.3 KB), railway.app (1.4 KB), answerai.com (1.8 KB) follow this.
Philosophy B (the index dump): llms.txt is a comprehensive sitemap-for-LLMs. Every documented page gets a link. PostHog (280 KB) and Anthropic docs (152 KB) ship more or less their entire docs map this way. Stripe (93 KB) is the same idea slightly trimmed.
There's no consensus on which is right. Mintlify's product (when it works) emits Philosophy B. Hand-written ones tend toward Philosophy A. A 280 KB file is presumably a lot for a model to ingest as context; a 648 B file is presumably too little for a model to actually find anything useful.
The honest answer is nobody knows yet which works better, because nobody has published data on whether either approach causes more LLM citations or traffic.
Finding 4: Link quality is actually fine — but link quantity hides the misses
The good news: when sites do put links in their llms.txt, the links are mostly useful.
Across all 7,557 markdown links in the 29 samples:
- 73.8% are deep internal pages (paths with two or more segments — docs.example.com/api/v2/something)
- 23.3% are external (mostly references to standards, SDKs, GitHub repos)
- 2.9% point to marketing roots (the homepage or top-level category pages — /pricing, /about)
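The rule behind those percentages can be sketched like this. The two-segment threshold for "deep" is an assumption of this audit, not a standard, and `classify_link` is an illustrative helper:

```python
# Classify one link relative to the site that published the llms.txt.
from urllib.parse import urlparse

def classify_link(url: str, site_host: str) -> str:
    parsed = urlparse(url)
    host = parsed.netloc or site_host          # relative links stay internal
    if host != site_host:
        return "external"
    segments = [s for s in parsed.path.split("/") if s]
    # Two or more path segments counts as a deep docs page; anything
    # shallower is treated as a marketing root (/, /pricing, /about).
    return "deep" if len(segments) >= 2 else "marketing_root"
```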
That 2.9% marketing-root number is encouraging. The fear with llms.txt was that companies would use it as a marketing-funnel SEO play — link the AI to your homepage and your pricing page. They mostly aren't. The actual links point at useful documentation.
The bad news is that count is heavily skewed by a few big files. PostHog alone contributes 2,555 of those 7,557 links — a third of the total dataset. Anthropic, Redocly, Stripe, Cloudways add another 3,000+. Most sites have between 0 and 100 links.
Four sites have zero markdown links: amplitude.com, cohere.com, cursor.com, hackernoon.com. Cursor and Hackernoon use raw URL strings (technically usable by some crawlers, technically not markdown). Amplitude and Cohere just describe their products in prose, which is arguably useful context but not what the spec is for.
Finding 5: The WordPress ecosystem is almost entirely missing
Out of seven WordPress-ecosystem candidates I tested:
- Cloudways, Yoast, WordPress.com: have one. ✅
- Kinsta: returns HTML fallback (you'd think they had one — they didn't). ❌
- WP Engine, ManageWP, WP Tavern: 404. ❌
This is striking because WP Engine in particular has been publishing aggressively about "GEO" — Generative Engine Optimization. Five separate articles in the last month, including "Technical GEO: Ensuring Sites are Machine Readable." The author of those articles works for a company whose own marketing site doesn't have llms.txt.
Kinsta's HTML fallback is the more interesting case. They publish content telling agencies how to make sites AI-friendly. They probably believe they have an llms.txt because someone enabled a plugin. The plugin doesn't work — the Next.js router catches /llms.txt and serves the React shell. Nobody on their team checked.
Across the WordPress ecosystem, the gap between content about llms.txt and ownership of llms.txt is roughly total.
Finding 6: The "HTML fallback" problem is bigger than it looks
The nine sites returning 200 OK with HTML are worth highlighting separately because, from any monitoring tool's perspective, they look like success.
If you run curl -I against kinsta.com/llms.txt, you get a 200 response. Your CDN logs say everyone who hit /llms.txt got data back. Nothing looks broken.
But the body is the React app. An LLM that tries to parse the response as markdown gets a 50 KB blob of <div class="..."> and JavaScript bundle URLs. It has no way to know this wasn't deliberate.
This affects:
- Cohere docs and Cursor docs (sister sites of two of the most-cited real samples; their docs subdomains' catch-all routes serve the app shell)
- Codeium, Windsurf, Claude.ai — AI coding tools missing their own llms.txt
- Notion (the SaaS most likely to be cited by an LLM)
- Segment, Kinsta, Deno
Most of these are running modern JavaScript frameworks where a catch-all route handles unknown paths. The fix is one line of config to make /llms.txt 404 explicitly, or to serve a real file. Nobody's done it.
If your site is built on Next.js / Nuxt / Astro / similar, this is the first thing to check. Hit your own /llms.txt with curl and read the bytes. If you see <!DOCTYPE html>, you have a problem regardless of whether you intended to have an llms.txt.
Finding 7: We have no idea if any of this works
The hardest part of writing this audit was admitting that we don't yet have evidence llms.txt does anything useful.
I looked for it. Specifically:
- Is there public data showing that sites with llms.txt get cited more often in Perplexity, ChatGPT, or Claude answers than equivalent sites without?
- Are LLM crawlers actually requesting /llms.txt? (Cloudflare publishes some bot traffic data, but not specifically for this path.)
- Has any A/B-tested case study been published showing that adding llms.txt changed citation share, referral traffic, or anything measurable?
The answer to all three, as of mid-2026, is "not really." There are anecdotes. There are tutorials. There are case studies that read like vendor marketing. There is no rigorous public evidence that llms.txt moves the needle.
It's possible — likely, even — that big LLM providers are reading it as a hint, the same way some search engines read sitemap.xml. It's also possible they're not, and the entire conversation around this file is the SEO industry filling the silence about how AI answer engines actually rank sources.
The interesting fact is that even the companies whose business model would directly benefit from llms.txt working — the model providers — mostly aren't writing one for their own docs. That's either because they know something the SEO industry doesn't, or because they're as confused as everyone else.
What I think this means
A few cautious takeaways. Not advice — there's already a glut of llms.txt advice on the internet. Just observations from staring at 31 real files for a day.
1. The standard is winning by default, not by adoption. It's the only proposal in this space with public buy-in. But "I've heard of it" and "I've added it to my site" are not the same conversation. Most sites haven't added it. Most that have, half-added it.
2. The Philosophy A vs B divide is the actual open question, not "should you add llms.txt at all." A 648-byte pointer file and a 280-kilobyte sitemap dump can't both be the correct interpretation of the same spec. Until someone publishes real data on which retrieves better, every "best practices" article is guessing.
3. The HTML fallback problem is the single highest-leverage fix. If you run a modern JavaScript framework and you've never checked your own /llms.txt response body, you probably have a silent miss. One curl will tell you. One line of routing config will fix it.
4. Following the spec is almost free, and almost nobody does it. Forty-one percent of the files that exist skip the blockquote intro. The crawler reading these files has to guess what the site is about from the URL list alone. Adding a single sentence in > blockquote form takes ten seconds and probably matters more than the elaborate hierarchy below it. Probably.
5. The gap between "writes about llms.txt" and "has llms.txt" is enormous, especially in the WordPress and hosting ecosystem. This is either an opportunity (everyone else is asleep) or a tell (the people closest to SEO have done the math and skipped it). Take your pick.
The samples
If you want to look at the source data: I'm publishing the 31 real llms.txt files I pulled, plus the probe script and the analysis output, under a public mirror. The format and methodology are simple enough that you can rerun this audit in five minutes against your own list of suspect sites.
Worth doing on your own site, at least. The HTML fallback bug is silent, the spec is short, and the upside — if llms.txt ever does matter — is just a properly-formatted text file away.
If you find your own audit produces wildly different conclusions, I'd genuinely like to see it. Especially if anyone has citation-share data showing this stuff actually works. So far, on this question, the marketplace of opinions is loud and the marketplace of evidence is silent.
About the data: 70 candidate sites probed on 2026-05-16 using vanilla HTTP GET with a desktop browser User-Agent. 31 returned real llms.txt files. After deduping mirror domains (cloudflare.com / www.cloudflare.com, cursor.com / www.cursor.com), 29 unique samples. Probe script, raw responses, and per-site analysis are mirrored.
Method limits: Only the root /llms.txt path was probed. Some sites may host llms.txt at a subdomain or under a versioned path. I didn't check /llms-full.txt (the optional verbose variant from the spec). Sites that block automated requests by IP or User-Agent (Reddit, parts of Indie Hackers) may have real files I couldn't see. The HTML-fallback class is a false-negative risk that probably understates real adoption by 1-3 sites.