How I Built an llms.txt For My Clients (And Why It Beats Your Sitemap)

I stopped obsessing over sitemaps and started shipping llms.txt files instead. Here is how I built them for real clients and what actually matters for LLMs.

Why I Stopped Caring About Sitemaps (And Started Shipping llms.txt)

I used to treat sitemaps like a sacred SEO artifact. You generate the XML, wire it into Search Console, whisper a small prayer to the Google gods, move on.

Then LLMs started eating the web.

Suddenly, I cared a lot less about how crawlers discover pages for ranking, and a lot more about how models discover context for answers. Different game. Same web. New rules.

So I started putting a simple text file at the root of client sites:

/llms.txt

Not a standard. Not blessed by anyone in a suit. Just a convention that makes sense for how I actually build with LLMs.

It is basically a robots.txt for context. For models. For RAG systems. For anyone trying to turn a website into a reliable answer engine instead of a random word salad.

I now ship this by default. Key2Control got one. WBU got one. More are coming.

This post is me walking through how I structured them, what goes inside, and why I think it matters more than a sitemap.

What llms.txt Is For (And What It Is Not)

llms.txt is not magic. It does not force OpenAI or Anthropic to obey you. No crawler police will show up.

I see it as one thing: a contract for structured, machine-readable guidance about how models should treat your site.

That contract is useful for:

  • Your own RAG pipelines and agents.
  • Third-party tools that index your site for AI features.
  • Future LLM crawlers that want to act politely.

If you build AI features on top of websites, you can either guess and scrape, or you can agree on a small convention. I prefer the second option. Fewer surprises. Fewer nasty edge cases later.

The Basic Shape Of llms.txt

I like boring formats. Plain text. Key-value style. Commented. Easy to parse with a 20-line script.

Here is the minimal skeleton I use:

# llms.txt for example.com

version: 1
owner: Your Company Name
contact: ai@example.com
updated: 2025-01-01

[index]
allow: /
disallow: /admin

[sitemaps]
url: https://example.com/sitemap.xml

[priority]
# higher number = more important
page: https://example.com/docs 10
page: https://example.com/blog 6

[rag]
# sections that are safe, current, and canonical for RAG
include: https://example.com/docs/**
include: https://example.com/blog/**
exclude: https://example.com/archive/**

Looks like INI. Not really INI. I do not care. It is simple enough that any sensible engineer can parse it.

Key idea: I am not describing all pages. That is what XML sitemaps already do. I am describing how the content should be used by models.
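
To make the “20-line script” claim concrete, here is a minimal sketch of a parser in Python. It handles only the skeleton above; parse_llms_txt and its output shape are my own choices, not any spec.

def parse_llms_txt(text):
    # Minimal parser for the skeleton above; no validation, no spec behind it.
    data = {"_meta": {}}
    section = "_meta"
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]          # e.g. [rag] -> "rag"
            data.setdefault(section, [])
            continue
        key, _, value = line.partition(":")
        if section == "_meta":
            data["_meta"][key.strip()] = value.strip()
        else:
            data[section].append((key.strip(), value.strip()))
    return data

rules = parse_llms_txt(open("llms.txt").read())
print(rules["rag"])   # [('include', 'https://example.com/docs/**'), ...]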

How LLMs (And My Own Tools) Use It

Right now, I mostly use llms.txt in my own pipelines. But I am designing it like something a generic LLM crawler could use:

  • index: Guides crawling scope via allow / disallow. Think soft robots.txt.
  • sitemaps: Points to canonical discovery mechanisms.
  • priority: Helps ranking inside my own vector stores.
  • rag: Defines safe, current, canonical content for retrieval systems.

That last one matters the most.

When I build a RAG stack for a client, I do not want archived, half-broken, GDPR-unsafe, or outdated pages sneaking into answers. The llms.txt file is where I draw a hard boundary.

The structure gives me three layers of control:

  • Discovery: What URLs should I even bother crawling?
  • Eligibility: Which of those URLs are allowed to feed LLM context?
  • Priority: If there is conflicting info, which URLs win?

This is the part traditional sitemaps do badly. XML sitemaps care about existence and freshness. LLMs care about trust, context boundaries, and conflicting sources.
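
Those three layers map onto three small checks. A sketch reusing parse_llms_txt from earlier; treating a trailing ** as “anything under this prefix” is my own convention, not part of any spec.

from urllib.parse import urlsplit

def _matches(url, pattern):
    # Trailing "**" means "anything under this prefix" (my convention)
    return url.startswith(pattern[:-2]) if pattern.endswith("**") else url == pattern

def should_crawl(url, rules):
    # Discovery: respect [index] disallow entries
    path = urlsplit(url).path or "/"
    disallowed = [v for k, v in rules.get("index", []) if k == "disallow"]
    return not any(path.startswith(d) for d in disallowed)

def rag_eligible(url, rules):
    # Eligibility: an exclude always beats an include
    entries = rules.get("rag", [])
    if any(_matches(url, v) for k, v in entries if k == "exclude"):
        return False
    return any(_matches(url, v) for k, v in entries if k == "include")

def priority_of(url, rules, default=1):
    # Priority: lines look like "page: <url-or-pattern> <weight>"
    for k, v in rules.get("priority", []):
        if k == "page":
            pattern, _, weight = v.rpartition(" ")
            if _matches(url, pattern.strip()):
                return int(weight)
    return default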

Real Example 1: Key2Control

Key2Control builds software for governance, risk, and compliance. A lot of the content is very structural. Think frameworks, process descriptions, and policy mapping.

Perfect for LLMs. Also very easy to screw up context.

When we started adding AI layers on top of their site, I noticed a pattern: generic crawlers mashed together marketing fluff, product docs, and legacy support notes. Answers looked confident and slightly wrong. The worst combo.

So we added an llms.txt at the root of their marketing and docs domain. The core constraints I wanted:

  • Models should treat framework documentation as canonical.
  • Older blog posts should be second-class citizens.
  • Anything related to specific customer configs needed to stay out.

Here is a simplified flavour of what their llms.txt looks like, with URLs anonymised a bit:

# llms.txt for key2control.nl

version: 1
owner: Key2Control B.V.
contact: ai@key2control.nl
updated: 2025-01-10

[index]
allow: /
disallow: /wp-admin

[sitemaps]
url: https://www.key2control.nl/sitemap_index.xml

[priority]
page: https://www.key2control.nl/oplossing/   9
page: https://www.key2control.nl/kennisbank/  8
page: https://www.key2control.nl/blog/        5

[rag]
include: https://www.key2control.nl/kennisbank/**
include: https://www.key2control.nl/oplossing/**
exclude: https://www.key2control.nl/blog/2018/**
exclude: https://www.key2control.nl/demo/**

[notes]
text: Content in /kennisbank/ and /oplossing/ is reviewed and product-team approved.
text: Blog posts before 2020 may be outdated and should be treated as low-trust sources.

This changed how I built their RAG pipeline.

Instead of “crawl everything under the domain and dump into a vector DB”, the flow became (sketched in code after this list):

  • Fetch /llms.txt.
  • Resolve includes / excludes from the [rag] section.
  • Boost vectors from /kennisbank/ and /oplossing/ by the weight in [priority].
  • Fall back to blog content only if the top-k results from canonical sections are weak.
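
In code, that flow is roughly the sketch below. discover_urls, fetch_and_embed, and the vector_store interface are stand-ins for whatever your crawler and vector DB actually expose; the priority cutoff of 8 and min_score of 0.75 are tuning choices, not part of the file.

import urllib.request

def build_index(domain, fetch_and_embed, vector_store):
    raw = urllib.request.urlopen(f"https://{domain}/llms.txt").read().decode()
    rules = parse_llms_txt(raw)                      # parser from earlier
    for url in discover_urls(domain, rules):         # stand-in crawler helper
        if not rag_eligible(url, rules):             # [rag] include/exclude
            continue
        for chunk, vec in fetch_and_embed(url):      # stand-in embedder
            vector_store.add(vec, {"url": url, "text": chunk,
                                   "priority": priority_of(url, rules)})

def answer_context(query, vector_store, k=5, min_score=0.75):
    hits = vector_store.search(query, k=k)           # stand-in search API
    canonical = [h for h in hits if h.meta["priority"] >= 8]
    # Fall back to blog content only when canonical hits are weak
    if canonical and canonical[0].score >= min_score:
        return canonical
    return hits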

The quality jump was obvious.

LLM answers started quoting the knowledge base structure correctly. Less hand-holding. Fewer “actually, that is from a legacy version” corrections.

Key detail: none of this required Key2Control to learn about embeddings or chunking. They just own a small text file with rules written in plain language. I wrote the schema. They control the contract.

Real Example 2: WBU

WBU, a client in the sports and training space, had a different problem.

They have lots of pages for clinics, events, schedules, and coaching content. Some are evergreen. Some expire after a weekend.

LLMs do not understand time unless you hit them over the head with it. So they would happily suggest an event that happened three months ago as if it were coming up next week.

For WBU, llms.txt was less about structure and more about temporal validity.

Here is a stripped-down example of what we set up:

# llms.txt for wbu-baseball.nl

version: 1
owner: WBU Baseball
contact: ai@wbu-baseball.nl
updated: 2025-01-05

[index]
allow: /
disallow: /wp-admin

[sitemaps]
url: https://wbu-baseball.nl/sitemap.xml

[rag]
include: https://wbu-baseball.nl/training/**
include: https://wbu-baseball.nl/coaching/**
include: https://wbu-baseball.nl/events/**
exclude: https://wbu-baseball.nl/events/archive/**

[validity]
# patterns with expiry semantics
path: https://wbu-baseball.nl/events/**   type=dated
path: https://wbu-baseball.nl/training/** type=evergreen

[notes]
text: For events pages (type=dated), prefer information with a future date.
text: Do not recommend events with dates in the past, except as examples.

Then I wired the RAG side to respect these hints (there is a sketch after this list):

  • When a query mentions “upcoming”, “this month”, or specific dates, I bias towards type=dated content with dates in the future.
  • When events are in the past, I still keep the content indexed for analytics and recap questions, but retrieval ranks them lower for recommendation-style prompts.
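
Here is a sketch of that bias. It assumes each indexed chunk carries a date extracted at index time, and a page_type helper that reads the [validity] section; both are my assumptions, not something the file enforces.

from datetime import date

UPCOMING_HINTS = ("upcoming", "this month", "this week", "next")

def temporal_rerank(query, hits, rules, today=None):
    # Down-rank past-dated event pages for recommendation-style queries
    today = today or date.today()
    wants_upcoming = any(h in query.lower() for h in UPCOMING_HINTS)
    def adjusted(hit):
        score = hit.score
        if page_type(hit.meta["url"], rules) == "dated":   # hypothetical helper reading [validity]
            event_date = hit.meta.get("date")              # assumed datetime.date set at index time
            if wants_upcoming and event_date and event_date < today:
                score *= 0.3   # keep indexed for recaps, rank low for recommendations
        return score
    return sorted(hits, key=adjusted, reverse=True)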

The llms.txt file is where this policy lives. Not hidden in the retrieval code.

That makes it maintainable. The WBU team can change rules without touching my pipeline, and future AI tools can read the same signals. They control how their site should behave as a knowledge source, not just as a marketing object.

What Goes Into A Good llms.txt (Step By Step)

Here is the rough process I follow for new clients now. Not theoretical. This is literally what I do in my editor.

1. Map Content Types, Not Just URLs

I start by listing content types, in plain language.

Example categories I have actually written down:

  • Product docs (canonical, high trust).
  • Marketing pages (persuasive, sometimes vague).
  • Knowledge base or FAQ (structured answers).
  • Blog posts (mixed quality, temporal).
  • Events or time-bound info.
  • Admin, dashboards, customer-only content.

Then I map those to URL patterns. That mapping becomes the backbone of the [rag], [priority], and optional [validity] sections.

2. Decide What Models Should Ignore

I am more aggressive than most here. I assume LLMs will happily hallucinate from any weak signal, so I prefer a smaller, cleaner context set.

Typical disallow or exclude entries:

  • Old campaign landing pages.
  • Archive sections that refer to deprecated products.
  • Experimental subdomains.
  • Anything with personal data, internal dashboards, or customer-specific details.

This is where llms.txt beats sitemaps for me. The XML sitemap is often full of SEO junk that you would never want an LLM to quote as truth. Here, I cut it out explicitly.

3. Rank Your Own Content Honestly

I added [priority] out of frustration. Too many systems treat all pages as equal once they are in a vector store. I think that is wrong.

So I ask clients a simple question:

If two pages conflict, which one should the model trust?

Then we encode that in [priority]:

[priority]
page: https://example.com/docs/**    10
page: https://example.com/faq/**      8
page: https://example.com/blog/**     5
page: https://example.com/archive/**  2

Higher number means “trust this first”.

My retrieval code reads that and applies a weight to embeddings or to ranking scores. If two chunks are equally relevant, the higher priority wins.
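
One way to apply that weight, as a sketch. Here alpha is a tuning knob I made up, not part of the file format; it controls how much priority can tip a tie while relevance still dominates.

def weighted_score(similarity, priority, alpha=0.15, max_priority=10):
    # Blend relevance with llms.txt priority; relevance still dominates
    return (1 - alpha) * similarity + alpha * (priority / max_priority)

# Two equally relevant chunks: /docs (priority 10) outranks /archive (priority 2)
print(weighted_score(0.82, 10))   # 0.847
print(weighted_score(0.82, 2))    # 0.727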

4. Add Human Notes For Future Agents

I like the [notes] section. It is free-form text: lines that tools can show to human operators or even feed into system prompts.

Real examples I have used:

  • text: Pricing on the website is not final. Always ask the sales team before quoting.
  • text: Do not provide legal advice. The content describes internal policies only.
  • text: For dates, treat the Dutch version of the page as canonical.

This is not machine-enforced policy yet. It is a hint. But even hints help when you are debugging weird answers or plugging the site into multiple AI systems.
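
If you want to surface those hints automatically, something like this works. A sketch, again using the parser from earlier:

def notes_to_prompt(rules):
    # Turn [notes] lines into soft guidance for a system prompt
    notes = [v for k, v in rules.get("notes", []) if k == "text"]
    if not notes:
        return ""
    hints = "\n".join(f"- {n}" for n in notes)
    return f"Site owner guidance (treat as hints, not ground truth):\n{hints}"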

Why I Think llms.txt Matters More Than A Sitemap

Google cares about your sitemap. LLMs barely care.

They care about:

  • Which chunks they see.
  • How those chunks are weighted.
  • What they are allowed to say based on those chunks.

Sitemaps are about discovery and freshness. That is it. They do not describe trust, priority, usage constraints, or temporal behavior.

When I build AI-heavy features, I treat the sitemap as a hint for crawling and treat llms.txt as the specification for how the site should behave in an AI context.

So my core position is simple:

  • If you care mostly about SEO, ship a sitemap.
  • If you care about being a good knowledge source for models, ship an llms.txt.
  • If you care about both, link the sitemap from llms.txt and keep them in sync.

How I See This Evolving

Right now, llms.txt is just a convention I use and share with a few other builders. It works because it is simple.

Could this become a broader standard? Maybe. I am not betting on a committee to save us.

What I actually expect:

  • More tools will start checking for /llms.txt when you plug in a URL.
  • Frameworks will ship tiny middlewares that read it and auto-tune their crawlers and RAG pipelines.
  • Clients will start demanding control over how their content gets used in LLM-based products.

That last point is the real driver. Companies hate that random AI tools scrape their site and remix it without any knob they can turn.

llms.txt is not a legal solution. It is just a technical one. A small knob. Still better than nothing.

If You Want To Try This Yourself

If you own a site and you are plugging it into LLMs, I would start ridiculously small:

  1. Create /llms.txt at your root.
  2. Add version, owner, contact, and updated.
  3. Add a [rag] section that includes only the content you would happily see quoted as an answer.
  4. Optionally add [priority] if you know which sections are canonical.

You can ignore the rest until you actually need it.

Then wire your own scripts or agents to respect those rules. Use it first for yourself. If it works, share the convention with others.

That is how I ended up shipping it for Key2Control and WBU. I needed a simple, honest interface between websites and models. XML sitemaps were not it. So I wrote a text file instead.

Sometimes the right tool is literally just a better txt file.
