Robots.txt Explained: What It Is, How It Works, and How to Get It Right

📌 Key Takeaways

A robots.txt file tells search engine crawlers which pages or sections of your site they should and should not access.
It must be placed in the root directory of your domain and is always accessible at yourdomain.com/robots.txt.
Robots.txt controls crawling - not indexing. A blocked page can still appear in search results if other sites link to it.
One misconfigured line can block your entire site from Google. Always test before deploying changes.
In 2026, robots.txt is also used to control AI training bots - a new and important consideration for site owners.

What Is a Robots.txt File?

A robots.txt file is a plain text file that sits at the root of your website and instructs search engine crawlers - bots like Googlebot and Bingbot - on which pages or sections of your site they are and are not permitted to crawl.

Think of it as a set of house rules posted at the front door of your website. When a search engine crawler arrives at your domain, the very first thing it does is request your robots.txt file. It reads the instructions, then behaves accordingly before crawling anything else.

Not every website has one. If no robots.txt file exists, crawlers assume they have full access to your entire site - which is not always what you want. According to Google Search Central, Googlebot downloads and parses a site's robots.txt before crawling the site itself.

"Robots.txt is one of the simplest files on your website - and one of the most dangerous to get wrong. A single misplaced line can block Google from crawling your entire site."

Why Robots.txt Matters for SEO

Robots.txt plays a quiet but critical role in your site's technical SEO. Here is why it deserves your attention:

Crawl Budget Management - Google allocates a limited crawl budget to each website based on its authority and size. If crawlers waste time on admin pages, duplicate URLs, or internal search results, they may never reach your important content. Robots.txt lets you redirect that budget to pages that actually matter. Google Search Central recommends this approach specifically for large sites with crawl budget concerns.
Blocking Non-Public Pages - Staging environments, login pages, internal search results, and thank-you pages do not belong in search results. Robots.txt stops compliant crawlers from fetching them, though it does not by itself guarantee they stay out of the index.
Preventing Duplicate Content - URL parameters, session IDs, and faceted navigation can generate thousands of near-identical URLs. Blocking these with robots.txt keeps your index clean and prevents dilution of your ranking signals.
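Protecting Sensitive Areas - Directories like /wp-admin/, /cgi-bin/, or internal API endpoints should never be crawled. Robots.txt keeps well-behaved crawlers out of them, but the file is public and is not a security control - use authentication for anything truly sensitive.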
Controlling Multimedia Indexing - Unlike meta robots tags, robots.txt works for non-HTML resources like PDFs, images, and videos - giving you control over whether crawlers can fetch these assets at all.

That said, robots.txt is not a silver bullet. It controls crawling, not indexing. A page that is blocked in robots.txt can still appear in search results if external sites link to it - Google just will not be able to read its content.

How Robots.txt Actually Works

When a crawler like Googlebot arrives at your domain, it follows a consistent sequence before crawling a single page:

Request the robots.txt file

The crawler visits yourdomain.com/robots.txt first. If the file returns a 200 OK response, it reads and parses the instructions. If it returns a 404, the crawler assumes full access to the site.

Find the matching user-agent block

The crawler identifies the directive group that applies to it - either its specific user-agent name (e.g. Googlebot) or the wildcard * that applies to all bots. If both exist, the specific one takes precedence.

Apply the rules using longest-match logic

When multiple directives could apply to a URL, Google uses the "longest match" principle - the most specific rule wins. For example, Disallow: /blog/drafts/ overrides a broader Allow: /blog/ for URLs inside that subdirectory.

Begin crawling within the allowed boundaries

The crawler proceeds to crawl all permitted pages, following internal links and discovering new content - while skipping any URL paths marked as disallowed.

One important nuance: paths in robots.txt rules are case-sensitive. The paths /Blog/ and /blog/ are treated as different directories. A rule written for /Blog/ will not block /blog/.
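The sequence above can be sketched in a few lines of Python. This is a simplified illustration, not Google's actual parser: it ignores wildcards, the Rule class and both function names are made up for this example, and the tie-breaking and case-insensitive user-agent matching are assumptions flagged in the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    allow: bool  # True for an Allow line, False for a Disallow line
    path: str    # path prefix from the rule, compared case-sensitively


def select_group(groups: dict[str, list[Rule]], user_agent: str) -> list[Rule]:
    """Return the rules for a named crawler, falling back to the '*' group."""
    for name, rules in groups.items():
        # Assumption: user-agent names are compared case-insensitively.
        if name != "*" and name.lower() == user_agent.lower():
            return rules
    return groups.get("*", [])


def is_allowed(rules: list[Rule], path: str) -> bool:
    """Longest match: the matching rule with the longest path decides."""
    matches = [r for r in rules if r.path and path.startswith(r.path)]
    if not matches:
        return True  # no rule applies, so the URL may be crawled
    # Assumption: on a tie in length, Allow wins over Disallow.
    best = max(matches, key=lambda r: (len(r.path), r.allow))
    return best.allow


# Hypothetical rule set mirroring the /blog/ example above.
groups = {
    "*": [Rule(allow=False, path="/private/")],
    "Googlebot": [
        Rule(allow=True, path="/blog/"),
        Rule(allow=False, path="/blog/drafts/"),
    ],
}

googlebot_rules = select_group(groups, "googlebot")
print(is_allowed(googlebot_rules, "/blog/post-1"))       # covered by Allow: /blog/
print(is_allowed(googlebot_rules, "/blog/drafts/idea"))  # the longer Disallow decides
print(is_allowed(googlebot_rules, "/Blog/drafts/idea"))  # different case, no rule matches
```

Sorting by path length rather than file order is the point of the sketch: the order in which Allow and Disallow lines appear inside a group does not change the outcome.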

Robots.txt Syntax and Directives Explained

A robots.txt file is made up of groups of directives. Each group starts with a User-agent line and is followed by one or more Allow or Disallow rules. Here are all the directives you need to know:

User-agent

Specifies which crawler the following rules apply to. Use * to target all bots, or a specific name to target one crawler.

User-agent: *          # applies to all crawlers
User-agent: Googlebot  # applies to Google only

Disallow

Tells the crawler which paths it must not access. An empty Disallow: value means the crawler is allowed to access everything.

Disallow: /wp-admin/        # blocks the WordPress admin area
Disallow: /private/         # blocks an entire directory
Disallow: /page.html        # blocks a single page
Disallow: /                 # blocks the entire website ⚠️

Allow

Explicitly permits a specific path, even within a broader disallowed directory. This is particularly useful when you want to block a folder but allow one file inside it.

User-agent: Googlebot
Disallow: /private/
Allow: /private/public-report.pdf    # this single file is still crawlable

Sitemap

Points crawlers to your XML sitemap so they can discover all your important pages efficiently. This is a best-practice addition to every robots.txt file.

Sitemap: https://www.yourdomain.com/sitemap.xml

Crawl-delay

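Asks crawlers to wait a set number of seconds between requests, reducing server load. Note that Googlebot ignores this directive, and the crawl rate limiter that Search Console used to offer was retired in early 2024. If Googlebot is overloading your server, Google's guidance is to temporarily return 500, 503, or 429 responses instead. Some other crawlers, such as Bingbot, do honor Crawl-delay.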

Crawl-delay: 10    # wait 10 seconds between requests

Wildcards: * and $

Two special characters give you pattern-matching power in robots.txt:

Asterisk (*) - matches any sequence of characters. Disallow: /*.pdf$ blocks all PDF files.
Dollar sign ($) - matches the end of a URL. Disallow: /*.php$ blocks any URL ending in .php.
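One way to reason about these two characters is to translate a pattern into a regular expression. The sketch below is illustrative only: pattern_to_regex and matches are hypothetical helper names, and real crawlers may handle edge cases such as percent-encoding differently.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored_end = pattern.endswith("$")
    body = pattern[:-1] if anchored_end else pattern
    # Escape each literal piece, then join the pieces with "any characters".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored_end else ""))


def matches(pattern: str, url_path: str) -> bool:
    """Return True if the pattern matches from the start of the URL path."""
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*.pdf$", "/files/report.pdf"))             # ends in .pdf
print(matches("/*.pdf$", "/files/report.pdf?download=1"))  # does not end in .pdf
print(matches("/private/", "/private/notes.html"))         # plain prefix match
```

Without the trailing $, a pattern is a prefix match, so Disallow: /*.pdf would also cover URLs that merely contain .pdf somewhere after the first slash.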

Real-World Robots.txt Examples

Let us look at robots.txt configurations for common real-world scenarios:

Allow all crawlers full access

User-agent: *
Allow: /

Sitemap: https://www.yourdomain.com/sitemap.xml

This is the simplest configuration. No restrictions - crawlers can access everything. The sitemap line helps Google discover your content faster.

Standard WordPress site

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.yourdomain.com/sitemap.xml

Blocks the WordPress admin panel from crawlers while keeping the AJAX endpoint accessible, which some themes and plugins require for frontend functionality.

E-commerce site with filtered URLs

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?sessionid=

Sitemap: https://www.yourdomain.com/sitemap.xml

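Blocks crawling of cart, checkout, and account pages. The wildcard rules prevent crawlers from following URL parameter variations generated by sorting and filtering - a major source of duplicate content on e-commerce sites. Note that a pattern like /*?sort= only matches when sort is the first parameter; to catch it in later positions (for example ?page=2&sort=price), add a matching rule such as Disallow: /*&sort=.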

Block a specific bot entirely

User-agent: AhrefsBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://www.yourdomain.com/sitemap.xml

Blocks a specific crawler (in this case Ahrefs' bot) from your site entirely while allowing all other crawlers normal access.

Block AI training bots (2026)

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Allow: /

Sitemap: https://www.yourdomain.com/sitemap.xml

Asks major AI platforms not to use your content for model training, while keeping your site fully accessible to standard search engine crawlers. Compliance is voluntary, so this only works for bots that honor robots.txt.
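If the list of AI user-agents you block changes often, generating the file from a list can be less error-prone than editing it by hand. A minimal sketch, assuming a hypothetical helper named build_robots_txt and the placeholder sitemap URL used throughout this article:

```python
def build_robots_txt(blocked_agents: list[str], sitemap_url: str) -> str:
    """Build a robots.txt that blocks the listed agents and allows everyone else."""
    # One group per blocked agent, each disallowing the whole site.
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Catch-all group for every other crawler.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap_url}\n"


print(build_robots_txt(
    ["GPTBot", "Google-Extended", "anthropic-ai"],
    "https://www.yourdomain.com/sitemap.xml",
))
```

Keeping the agent names in one list also makes it easier to review what is blocked when new crawlers appear.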

What to Block - and What to Leave Alone

Knowing what to block is just as important as knowing the syntax. Here is a practical reference:

Pages and directories you should typically block:

Admin and login pages (/wp-admin/, /admin/, /login/)
Staging and development environments
Thank-you and confirmation pages (/thank-you/, /order-received/)
Internal search result pages (/?s=, /search/)
URL parameter variations that create duplicate content (?sort=, ?ref=)
User account and profile pages (/my-account/, /dashboard/)
Cart and checkout pages
Printer-friendly page versions

Pages you should never block:

Your homepage or any core landing page
CSS, JavaScript, or image files that Google needs to render your pages properly
Your XML sitemap (reference it in robots.txt instead)
Any page you want to rank in search results
Canonical versions of pages with duplicate content

Blocking CSS and JavaScript files is one of the most common and damaging robots.txt mistakes. If Googlebot cannot render your pages correctly, it cannot evaluate them - which directly harms your rankings.

Robots.txt vs Meta Robots Tags: What Is the Difference?

These two tools are often confused, but they serve different purposes and operate at different levels:

Robots.txt - Works at the site level. It controls whether a crawler can access a URL at all. If a page is disallowed in robots.txt, the crawler will not visit it, cannot read its content, and cannot follow links on it. However, the URL can still appear in search results if it receives external backlinks.
Meta Robots Tag - Works at the page level via an HTML tag (<meta name="robots" content="noindex">). It controls whether a crawled page should appear in search results. For this tag to work, the page must be crawlable - if a page is blocked in robots.txt, crawlers never fetch it and so never see its meta robots tag.

The practical implication: if you want to prevent a page from appearing in search results, use a noindex meta tag - not robots.txt. If you use robots.txt to block a page, Google cannot read the meta tag and the page may still surface in results based on external links pointing to it.

Use robots.txt to control crawl access and manage crawl budget. Use meta robots tags to control what appears in search results. For a deep dive on meta robots and x-robots-tag, Ahrefs has an excellent reference guide covering every directive and use case.

How to Create a Robots.txt File

Open a plain text editor

Create your robots.txt in Notepad, TextEdit (plain text mode), VS Code, or any code editor. Never use a word processor like Microsoft Word - it adds invisible formatting characters that break the file.

Write your directives

Start with your User-agent block, followed by your Disallow and Allow rules. Add a blank line between different user-agent groups. End with your Sitemap line.

Name the file exactly "robots.txt"

The filename is case-sensitive. It must be lowercase: robots.txt - not Robots.txt, ROBOTS.TXT, or any other variation. Each host, including each subdomain, is governed by a single robots.txt file of its own.

Upload to your root directory

Place the file in the top-level root of your domain - the same directory as your homepage. It must be accessible at exactly https://yourdomain.com/robots.txt. A file placed in a subdirectory will not be found by crawlers.

Test before going live

Use Google Search Console's robots.txt report to confirm that Google can fetch and parse your file, and the URL Inspection tool to check that specific URLs are being blocked or allowed as intended.
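To make the location rule concrete, here is a small sketch that derives the robots.txt address crawlers will request for any page URL. The function name robots_txt_url is made up for this example, and the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (with any port); replace the path and
    # drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/post?ref=home"))
print(robots_txt_url("https://shop.yourdomain.com/cart/"))
```

Because only the scheme and host survive, a subdomain such as shop.yourdomain.com resolves to a different robots.txt address than the main site.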

How to Test Your Robots.txt File

Never deploy a robots.txt change without testing it first. Here are the tools to use:

Google Search Console - robots.txt Report - Shows which robots.txt files Google found for your site, when they were last fetched, and any warnings or errors in them. To check whether Googlebot can access a specific URL under your current rules, use the URL Inspection tool.
Fetch your live file directly - Visit yourdomain.com/robots.txt in your browser. If you see the file contents, it is accessible. If you get a 404, crawlers will assume full site access.
Screaming Frog SEO Spider - Crawls your site the way Googlebot does and highlights pages blocked by your robots.txt. Their guide on crawling with robots.txt rules is also worth reading for staging environments.
Ahrefs Site Audit - Flags robots.txt issues as part of a full technical audit, including blocked resources, syntax errors, and crawl budget concerns.
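For a quick local pre-check before you upload anything, Python's standard library includes a basic robots.txt parser. The sketch below uses the rules from the e-commerce example with placeholder URLs; check_urls is a hypothetical helper name. Treat the result as a rough sanity check only: this parser is a simplified implementation and may not reproduce Google's longest-match or wildcard behavior.

```python
from urllib.robotparser import RobotFileParser

# Draft rules to check, taken from the e-commerce example (wildcard lines omitted).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
"""


def check_urls(robots_txt: str, user_agent: str, urls: list[str]) -> dict[str, bool]:
    """Map each URL to True if the user agent may fetch it under the draft rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network request
    return {url: parser.can_fetch(user_agent, url) for url in urls}


results = check_urls(
    ROBOTS_TXT,
    "Googlebot",
    [
        "https://www.yourdomain.com/cart/",
        "https://www.yourdomain.com/blog/robots-txt-guide",
    ],
)
for url, allowed in results.items():
    print("ALLOWED" if allowed else "BLOCKED", url)
```

Running a check like this against a short list of your most important URLs makes it harder to ship a rule that blocks them by accident.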

Common Robots.txt Mistakes That Hurt SEO

Small errors in robots.txt can have catastrophic consequences. Here are the most common mistakes to avoid:

Blocking the entire site - Disallow: / under User-agent: * tells every crawler to leave your site entirely. This is the single most damaging robots.txt error - and it happens more often than you think, especially after a CMS migration or redesign.
Blocking CSS and JavaScript - If Googlebot cannot load your stylesheets and scripts, it cannot render your pages and assess their quality. This can significantly harm rankings, especially for JavaScript-heavy sites.
Using robots.txt to hide sensitive data - Robots.txt is a public file. Any path you list in it is visible to anyone. Do not rely on it to protect truly sensitive information - use server-level authentication instead.
Expecting robots.txt to prevent indexing - A blocked page can still appear in search results if external sites link to it. Use noindex meta tags for pages you want removed from search results.
Syntax errors - Missing a colon, misspelling a directive, or using the wrong case in a path can cause a rule to be ignored or to match the wrong URLs. Always validate with Google Search Console after making changes.
Blocking pages you are trying to rank - Accidentally disallowing key landing pages, blog posts, or product pages is more common than it sounds. After every change, crawl your site with a tool like Screaming Frog to verify nothing critical has been blocked.
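A few of these mistakes are mechanical enough to catch with a short script before the file ever reaches production. The following is a minimal sketch rather than a full validator: lint_robots_txt is a hypothetical helper, and it only looks for missing colons, unrecognised directive names, and a site-wide block under User-agent: *.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots_txt(text: str) -> list[str]:
    """Return human-readable warnings for a few common robots.txt mistakes."""
    warnings = []
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen an Allow/Disallow line
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {number}: missing colon")
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name not in KNOWN_DIRECTIVES:
            warnings.append(f"line {number}: unknown directive '{name}'")
        elif name == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value)
        elif name in ("allow", "disallow"):
            in_rules = True
            if name == "disallow" and value == "/" and "*" in current_agents:
                warnings.append(f"line {number}: blocks the entire site for all crawlers")
    return warnings


draft = "User-agent: *\nDisallow /private/\nDisalow: /tmp/\nDisallow: /\n"
for warning in lint_robots_txt(draft):
    print(warning)
```

A check like this fits naturally into a deployment script, where a non-empty warning list can stop the upload.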

Robots.txt and AI Bots in 2026

In 2023, Google introduced Google-Extended - a separate user-agent token that gives site owners control over whether their content can be used to train Google's AI systems, including Gemini. Since then, virtually every major AI company has introduced its own crawler with a distinct user-agent name. Search Engine Journal covered this shift in detail as it unfolded.

This has added a new strategic dimension to robots.txt. Site owners now need to decide not just which search engines can crawl their content, but whether AI companies can use it to train large language models - often without compensation or attribution.

Common AI crawler user-agents in 2026:

GPTBot - OpenAI's web crawler for training data
Google-Extended - Controls use by Google's AI products (separate from Googlebot)
anthropic-ai - Anthropic's crawler
PerplexityBot - Perplexity AI's crawler
Claude-Web - Anthropic's browsing agent
Applebot-Extended - Controls use by Apple's AI features

For a regularly maintained list of all known AI crawlers, the community-maintained ai-robots-txt repository on GitHub is the most comprehensive reference available. nohacks.co also published a full 2026 AI user-agent landscape guide worth bookmarking.

If you want to block AI training bots without affecting your search engine rankings, add specific Disallow: / blocks for each AI user-agent while keeping your rules for Googlebot and Bingbot open. Because these AI user-agents are separate from the search crawlers, blocking them should not affect your search visibility.

Blocking AI training crawlers is now a legitimate and increasingly common business decision - particularly for publishers, media companies, and content-driven brands protecting their intellectual property.

Robots.txt Best Practices Checklist

Place your robots.txt file in the root directory, accessible at yourdomain.com/robots.txt
Always include a Sitemap: directive pointing to your XML sitemap
Use User-agent: * as a catch-all rule, then add specific bot blocks as needed
Never block CSS, JavaScript, or image files that Google needs to render your pages
Use noindex meta tags - not robots.txt - to remove pages from search results
Validate every change with Google Search Console's robots.txt report
Keep the file as simple as possible - complexity increases the risk of errors
Check your robots.txt after every major CMS update, migration, or site redesign
Remember the file is public - never list sensitive directory names you want to hide
Add specific blocks for AI training crawlers if you want to protect your content from being used in model training

Conclusion: Simple File, Serious Consequences

Robots.txt is one of the smallest files on your website - typically just a few lines of plain text. But its impact on your SEO is anything but small. Configured correctly, it protects your crawl budget, keeps your index clean, and steers crawlers away from pages that do not belong in search. Misconfigured, it can quietly block your entire site from Google without a single visible warning.

The good news is that robots.txt is not complicated once you understand how its directives interact. Start with a simple, conservative configuration. Block only what genuinely needs to be blocked. Test every change before it goes live. And revisit your file after every major site update to make sure nothing critical has been accidentally restricted.

In 2026, also take a moment to consider your position on AI training bots. It is a decision that is entirely within your control - and your robots.txt file is exactly where that control is exercised.