
How to Create a Robots.txt File: A Modern SEO Guide

Written by the LLMrefs Team. Last updated March 6, 2026.

A robots.txt file is surprisingly simple to create. It's just a plain text file, named exactly robots.txt, that you upload to the root directory of your website. Inside, you'll use straightforward commands—like User-agent to name a specific bot and Disallow to block it—to control where crawlers can and can't go. This has become absolutely critical for both modern SEO and managing how AI models interact with your content.

Why Robots.txt Is Your First Line of Defense

Diagram illustrating how robots.txt controls web crawlers like GoogleBot, Bingbot, and GPTBot on a website.

Think of your robots.txt file as the bouncer at the front door of your website. It’s the very first place search engines like Google and Bing—and the new wave of AI crawlers like GPTBot—look when they arrive. They're checking for the house rules.

This small text file hands you the reins, letting you tell bots precisely how to behave on your domain. By setting clear instructions, you can steer them toward your most important content and fence off sensitive or low-value areas. Getting this right is a fundamental step for both traditional SEO and the newer field of Generative Engine Optimization (GEO).

Preserve Your Crawl Budget

Every search engine allocates a "crawl budget" to your site, which is basically the amount of time and resources they'll spend crawling your pages. If you don't provide any direction, bots will wander everywhere, often wasting that budget on useless pages.

A practical example of wasted budget is a bot crawling endless variations of your internal search results. A simple Disallow: /search/ rule is an immediate fix. Other common culprits include:

  • Admin login areas (/wp-admin/)
  • Temporary staging or dev folders (/staging/)

Every bit of budget wasted on these dead-end pages is a missed opportunity for your valuable product, service, or blog pages to be crawled and indexed. A well-crafted robots.txt ensures crawlers spend their time where it actually counts.
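Before relying on a rule like Disallow: /search/, you can sanity-check it with Python's standard-library robots.txt parser. This is a quick sketch; the bot name and URLs are placeholders for illustration:

```python
from urllib.robotparser import RobotFileParser

# A draft of the crawl-budget rule described above.
rules = """\
User-agent: *
Disallow: /search/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Internal search results are blocked; normal pages stay crawlable.
print(rp.can_fetch("AnyBot", "https://example.com/search/blue-widgets"))  # False
print(rp.can_fetch("AnyBot", "https://example.com/blog/my-post"))         # True
```

Note that urllib.robotparser matches paths by simple prefix and does not understand the * and $ wildcards some crawlers support, so use it to test plain-prefix rules like this one.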

A site without a robots.txt lets bots run wild, straining your server and wasting crawl equity. A strategic file, however, acts as a traffic cop, protecting private areas and shining a spotlight on your most important pages.

Control Access for AI Crawlers

The explosion of Large Language Models (LLMs) has unleashed a new breed of crawlers designed to scrape massive amounts of data for training AI systems. While not all are malicious, unrestricted access means your proprietary content could be ingested and used in ways you never intended.

Your robots.txt file is your primary tool for setting boundaries with these AI bots. Interestingly, the protocol itself started as a "gentleman's agreement" back in 1994 to bring some order to the Wild West of early web crawlers. Its voluntary nature is still a major factor today.

Unfortunately, many bots simply ignore the rules—some experts in forum discussions estimate that as many as 80% of bots are non-compliant. As new bots appear constantly, it’s vital to use every tool at your disposal. You can see how some services are now taking active measures to block AI crawlers that don't play by the rules. A proactive stance, starting with your robots.txt, gives you a crucial layer of control in this new reality.

Crafting Your First Robots.txt File

Diagram showing a robots.txt file located at the website root, disallowing the /staging/ directory for all user-agents.

Alright, let's roll up our sleeves and actually build a robots.txt file. This isn't complex code; it's a plain text file you can create with any basic text editor, like Notepad on Windows or TextEdit on Mac (switch TextEdit to plain text first via Format > Make Plain Text, since rich-text formatting will break the file). The key is following two critical rules. Get these wrong, and search engines won't even know your file exists.

  • The Name: Your file must be named robots.txt in all lowercase. No creativity allowed—not Robots.txt or robotstxt.txt.
  • The Location: It has to live in the root directory of your domain. This means it’s accessible at yourdomain.com/robots.txt.

Think of it like leaving a note on your front door. If you hang it on the back fence or call it "a note for visitors," no one will see it.

Understanding the Core Directives

Your robots.txt file is built using a few simple commands called directives. Each one gives crawlers a specific instruction. Once you get the hang of the main ones, you can build a powerful and effective file.

Here’s a quick rundown of the essential commands you'll use. Think of these as the building blocks for any rule you want to create.

Core Robots.txt Directives Explained

  • User-agent: names which web crawler the rules apply to. The asterisk (*) is a wildcard for all bots. Example: User-agent: Googlebot or User-agent: *
  • Disallow: tells the specified user-agent not to crawl a particular URL path. Example: Disallow: /private/
  • Allow: overrides a Disallow rule, creating an exception for a specific subfolder or page. Example: Allow: /public/documents/
  • Sitemap: points crawlers to your XML sitemap, helping them find all the pages you want indexed. Example: Sitemap: https://www.yourdomain.com/sitemap.xml

These directives work in tandem. The User-agent is like addressing an envelope, while the Disallow and Allow lines are the specific instructions inside.

Building Your First Set of Rules

Let's put this into practice. Our goals are simple: block an unfinished staging area, keep bots out of our internal search results, and make sure a specific public folder is still crawlable.

First, we address the bots. User-agent: * sets a default rule for every compliant crawler.

User-agent: *

This line simply says, "Hey, everyone, listen up!"

Next, we block our staging directory so half-finished pages don't get accidentally indexed. A simple Disallow rule is the perfect fix.

User-agent: *
Disallow: /staging/

Now, no compliant bot will touch any URL that starts with /staging/. Similarly, we prevent bots from wasting time on internal search result pages by adding another rule.

User-agent: *
Disallow: /staging/
Disallow: /search/

Pro Tip: Specificity wins. When crawlers see multiple rules that could apply to a URL, they'll follow the most specific one. For example, an Allow: /images/logo.png rule will always beat a broader Disallow: /images/ rule.

Creating Exceptions with the Allow Directive

Sometimes, a broad Disallow is a bit too much. What if you've blocked an entire /private/ directory but need Google to access one important file inside, like a public report? The Allow directive provides a practical way to create an exception.

Here’s a practical example:

User-agent: *
Disallow: /private/
Allow: /private/annual-report.pdf

With this setup, all crawlers are blocked from the /private/ folder except for that one file, annual-report.pdf. It’s a great way to get granular control without writing dozens of individual rules.
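Under the current Robots Exclusion Protocol (RFC 9309), when both an Allow and a Disallow rule match the same URL, the rule with the longest matching pattern wins, and Google documents that ties go to the less restrictive (Allow) rule. Here's a minimal Python sketch of that precedence logic for plain-prefix rules; the function name and rule list are illustrative, not a real library API:

```python
def most_specific_verdict(rules, path):
    """rules: list of (directive, pattern) tuples, e.g. ("Disallow", "/private/").
    Returns True if the path is allowed, using longest-match precedence:
    the longest matching pattern wins, and Allow wins ties."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if pattern and path.startswith(pattern):
            if len(pattern) > best_len or (len(pattern) == best_len and directive == "Allow"):
                best_len, allowed = len(pattern), (directive == "Allow")
    return allowed

rules = [("Disallow", "/private/"), ("Allow", "/private/annual-report.pdf")]
print(most_specific_verdict(rules, "/private/annual-report.pdf"))  # True
print(most_specific_verdict(rules, "/private/notes.txt"))          # False
```

The longer Allow pattern outranks the shorter Disallow, which is exactly why the annual report stays crawlable while everything else in /private/ is blocked.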

Finally, always give crawlers a map. Including your XML sitemap is a must, as it complements your robots.txt by actively guiding crawlers to your important content. For more on why this is so important, check out this excellent overview of a website sitemap.

A complete, solid starting file might look like this:

User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /search/

Sitemap: https://www.yourdomain.com/sitemap.xml

With just a few lines, you've built a solid foundation. You're now protecting sensitive areas, improving crawl efficiency, and guiding search engines to your most valuable pages. You don't need to be a developer to make a real difference with this one small file.

Using Advanced Directives and Wildcards

A sketch illustrating advanced robots.txt rules, highlighting effective disallow patterns for optimizing crawl budget.

Once you have the basics down, you can start using your robots.txt file for some serious surgical work. This is where you move beyond simple page blocking and start strategically guiding crawlers to optimize your site's crawl budget. For large sites, especially e-commerce stores and big publications, this is essential.

Advanced directives and wildcards are the tools for the job. They allow you to write a few flexible rules that can replace hundreds of manual entries, making your file cleaner and much more powerful.

Mastering Wildcards for Efficient Rules

Think of wildcards as shorthand for matching URL patterns. Instead of listing every single URL, you can create a rule that applies to a whole category of them. The two you’ll lean on most are the asterisk (*) and the dollar sign ($).

  • The Asterisk (*): This is a true "catch-all" that matches any string of characters.
  • The Dollar Sign ($): This anchors your rule to the end of a URL. It's how you tell bots, "This rule only applies if the URL ends exactly like this," preventing accidental blocking of similar pages.

Combining these two gives you real precision.

Practical Examples of Wildcard Usage

Let's put this into a real-world context. Here's how you can solve common crawling problems with a single line.

Blocking All Files of a Specific Type

Imagine your site has thousands of old .pdf files you don't want search engines to index. Listing them all by hand would be impractical. A wildcard makes it easy.

User-agent: *
Disallow: /*.pdf$

This practical example tells all crawlers to ignore any URL ending in .pdf. The * matches any filename, while the $ ensures it only applies to files ending in .pdf and not, for example, a blog post located at /my-guide-to-pdfs/.

Handling Messy URL Parameters

This is a game-changer for e-commerce sites. Faceted navigation often creates thousands of URL variations with parameters like ?sort=price. These are mostly duplicate content and can absolutely chew through your crawl budget.

A single wildcard rule can block every parameterized URL on your site, pushing Google toward the clean, canonical versions. It's one of the most impactful optimizations you can make, but treat it as a blunt instrument: if any parameterized URLs on your site matter (paginated category pages, for instance), carve out exceptions with Allow rules first.

The rule itself couldn't be simpler:

User-agent: *
Disallow: /*?*

This tells bots to ignore any URL containing a question mark (?), the universal marker for URL parameters. (The trailing * is actually redundant, since robots rules already match as prefixes; Disallow: /*? behaves identically.) Suddenly, all that duplicate content becomes invisible to compliant crawlers.
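Crawlers that honor wildcards effectively treat these patterns like simple regular expressions: * matches any run of characters, and a trailing $ anchors the match to the end of the path. This sketch translates a robots pattern into a Python regex so you can test which URLs a wildcard rule would catch; the helper name is made up for illustration:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.
    '*' matches any run of characters; a trailing '$' anchors the match
    to the end of the URL path; everything else is literal."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/downloads/report.pdf")))  # True
print(bool(pdf_rule.match("/my-guide-to-pdfs/")))     # False

param_rule = robots_pattern_to_regex("/*?*")
print(bool(param_rule.match("/shirts?sort=price")))   # True
print(bool(param_rule.match("/shirts/")))             # False
```

You can see both behaviors from the article here: the $ keeps the PDF rule from swallowing a blog post about PDFs, and the ? pattern catches any parameterized URL.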

The Crawl-delay Directive: A Word of Caution

You might also come across Crawl-delay. In theory, it asks bots to wait a set number of seconds between hitting each page.

User-agent: *
Crawl-delay: 10

This would ask crawlers to wait 10 seconds between requests. Seems useful, right?

The catch is that major crawlers have moved past it. Googlebot completely ignores the Crawl-delay directive and relies on its own algorithms to determine a safe crawl rate; Google has also retired the old crawl-rate limiter in Search Console. Some smaller bots still respect Crawl-delay, but if Googlebot itself is straining your server, the supported options are to report the problem to Google or to temporarily return 429 or 503 responses, which signals Googlebot to slow down.
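If you do keep a Crawl-delay line for smaller compliant bots, you can at least read it back programmatically. Python's urllib.robotparser exposes it via crawl_delay(); the bot name below is a placeholder:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 10"])

# Returns the delay that applies to this agent, or None if none is set.
print(rp.crawl_delay("SomeSmallBot"))  # 10
```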

Finally, remember that robots.txt relies on the goodwill of the crawler. While reputable bots play by the rules, many newer AI crawlers ignore these directives. A well-formed robots.txt is your first line of defense, but it's wise to look at other tools for auditing and control. LLMrefs, for example, offers tools to help site owners check their AI crawlability and create clearer instructions for this new era of web bots.

Optimizing for AI Crawlers and the Future Web

It's not just Google and Bing knocking on your door anymore. A new wave of crawlers from AI companies is scraping data to train Large Language Models (LLMs). This new traffic brings a new challenge: bots like GPTBot can hammer your server and hoover up your content without permission.

For most website owners, the first order of business is taking back control. If you're seeing a flood of traffic from these unfamiliar bots, blocking them is a quick and effective fix you can implement right now.

How to Block Common AI Bots

You can show these bots the door using User-agent and Disallow directives. Knowing the right bot name is the key.

Here are practical, copy-paste examples to block some of the most common AI data collectors:

# Block OpenAI's GPTBot
User-agent: GPTBot
Disallow: /

# Block Google's AI Crawler (for Vertex AI)
User-agent: Google-Extended
Disallow: /

# Block Anthropic's ClaudeBot
User-agent: ClaudeBot
Disallow: /

# Block Common Crawl's CCBot
User-agent: CCBot
Disallow: /

You can add these blocks right into your file. Each one targets a specific bot and gives it a blanket "no entry" command for your entire site. It's a solid first step.
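Because each block targets a named user-agent, it's worth confirming the rules hit only the bots you intend, without locking out regular search crawlers. A quick check with Python's standard-library robots.txt parser (the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The AI-bot blocks from above, with no rules for other crawlers.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(rp.can_fetch("ClaudeBot", "https://example.com/blog/post"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

Googlebot stays allowed because no rule group names it; only the listed AI bots get the sitewide block.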

Recent data shows that GPTBot is the most-blocked bot, with 5.14% of sites disallowing it, and ClaudeBot is close behind at 4.26%. But flip that coin over—it means that over 95% of websites are still wide open to these bots by default.

Thinking Beyond Blocking: A More Nuanced Strategy

While an all-out block offers immediate peace of mind, a more forward-thinking approach is needed—one that gives these bots specific instructions.

This is where new ideas and tools are making a difference. Teams like LLMrefs are building utilities to help site owners handle this new reality.

The goal is shifting from a "Keep Out" sign for all AI to a detailed rulebook. This lets you protect your valuable data while still participating in the AI ecosystem where it makes sense for your brand.

For instance, you might be fine with an AI summarizing your public blog posts but want to block it from training on your paid course materials. A standard robots.txt file can't really manage that distinction—it controls access, not use.

Using Tools to Create Granular AI Directives

To get that finer level of control, new tools are emerging. One practical development is the LLMs.txt generator from LLMrefs, which helps you create rules for AI models, letting you signal exactly how your content can (and can't) be used.

Here's the kind of control it gives you:

  • Define Usage Policies: You can get specific, stating whether your content is fair game for training, requires attribution, or is completely off-limits.
  • Generate Custom Rules: The tool produces the correct syntax for you, so there's nothing to hand-write.
  • Adopt New Standards: It helps you get ahead of the curve by implementing emerging best practices for AI governance.

With a tool like this, you stop just reacting to AI crawlers and start proactively managing how your digital property is used. It's about setting clear boundaries to protect your work while staying visible in the new world of AI-driven search.

How to Test and Safely Deploy Your File

Once your robots.txt file is written, don't just upload it and walk away. A single misplaced wildcard can make your entire site invisible to search engines, so testing your file isn't a suggestion; it's non-negotiable.

Validate Your Rules with Google Search Console

The first place to go is the robots.txt report in Google Search Console (it replaced the retired standalone robots.txt Tester). It's the official source, showing you when Google last fetched your file and flagging any parse errors or warnings it found.

For URL-level checks, use Search Console's URL Inspection tool or a third-party robots.txt tester: plug in a URL from your site and confirm whether it's allowed or blocked. For example, test your main contact page to ensure it's Allowed and a blocked admin page to confirm it's Blocked. This sanity check is perfect for validating your rules.
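You can run the same kind of allowed-versus-blocked spot check locally before anything goes live. This sketch feeds a draft of the file to Python's standard-library parser; the domain and paths are placeholders for your own:

```python
from urllib.robotparser import RobotFileParser

# Draft rules to check before uploading.
draft = """\
User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /search/
"""

rp = RobotFileParser()
rp.parse(draft.splitlines())

# URL -> whether we EXPECT it to be crawlable.
checks = {
    "https://www.yourdomain.com/contact/": True,
    "https://www.yourdomain.com/admin/login": False,
    "https://www.yourdomain.com/staging/new-page": False,
}
for url, expected in checks.items():
    actual = rp.can_fetch("Googlebot", url)
    flag = "OK" if actual == expected else "PROBLEM"
    print(f"{flag}: {url} -> {'Allowed' if actual else 'Blocked'}")
```

Keeping a small checklist like this around means every future edit to the file gets the same regression test.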

Run a Test Crawl

Another great way to pressure-test your file is with a desktop SEO crawler like Screaming Frog. You can configure the crawler to obey your new robots.txt file before you push it live.

Running a test crawl provides a real-world preview of what a search engine can access. This proactive check helps you catch "gotcha" moments before they can do real damage, like accidentally blocking critical CSS files.

This whole process of managing bots, especially with the rise of AI, really boils down to a simple, continuous loop: audit, block, and monitor.

A clear diagram illustrating the AI bot control process with three steps: audit, block, and monitor.

Think of it less as a one-and-done task and more as an ongoing cycle of refinement based on what you see happening on your server.

Safely Uploading Your File

After you've triple-checked everything, it's time to upload the robots.txt file to the root directory of your website—the highest-level folder.

Here are practical methods for uploading:

  • Using an FTP Client: A tool like FileZilla lets you connect to your server. From there, just drag and drop the file into the main folder (often called public_html or www).
  • Through Your CMS: Platform support varies. On WordPress, SEO plugins such as Yoast let you edit robots.txt from the dashboard; Shopify exposes it through the robots.txt.liquid theme template; some hosted platforms, like Squarespace, generate the file automatically and allow only limited changes.

A quick but critical reminder: The filename has to be exactly robots.txt (all lowercase), and it must be a plain text file saved with UTF-8 encoding. This ensures every bot can read it.

Of course, robots.txt only works for compliant bots. To get a better handle on which bots are actually hitting your site, LLMrefs offers a helpful AI crawl checker that audits who is crawling your content.

Clearing Up Common Robots.txt Questions

Once you get the hang of the basic directives, the real-world questions start to surface. Let's tackle some of the most frequent points of confusion.

What Happens If I Have No Robots.txt File?

If your site is missing a robots.txt file, search engine crawlers assume they have an all-access pass. They'll explore every single part of your site they can find.

This isn't a "penalty," but it is a missed opportunity. You're giving up control over your crawl budget and leaving your content fully exposed to any data scraper or AI bot that comes along. The takeaway is simple: always create at least a basic robots.txt file.

Can I Block Just One Page with Robots.txt?

Absolutely. To block a single page, you just need a precise Disallow rule.

Here's a practical example to block a page at yourdomain.com/private-offer.html:

User-agent: *
Disallow: /private-offer.html

That's it. This tells all bots not to crawl that specific URL.

Crucial Tip: Remember, Disallow only stops crawling. If that page is already in Google's index, it can still show up in search results. For truly sensitive content, you need to use a 'noindex' meta tag on the page itself and consider password protection for good measure.

How Do I Add a Sitemap to My Robots.txt?

Adding your sitemap is one of the easiest and most valuable things you can do. It hands search engines a direct map to all the URLs you want them to index.

All you do is add a single line with the full, absolute URL to your sitemap. Putting it at the bottom is common practice.

Sitemap: https://www.yourdomain.com/sitemap.xml

This directive is recognized by all major search engines. If you use multiple sitemaps (e.g., one for pages, one for posts), just add a Sitemap: line for each one.
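Since Python 3.8, the standard-library parser will also report the Sitemap lines it finds, which makes an automated check easy. A quick sketch (the sitemap URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /search/

Sitemap: https://www.yourdomain.com/sitemap.xml
Sitemap: https://www.yourdomain.com/posts-sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
print(rp.site_maps())
```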

Will a Disallow Rule Remove a Page from Google?

No, and this is probably the most common misunderstanding. A Disallow rule only tells Googlebot not to crawl a page in the future.

If the page is already indexed, it can stay in the search results, often with a message like, "No information is available for this page."

To actually get a page removed from the index, you need the noindex meta tag. Here's the process:

  1. Add <meta name="robots" content="noindex"> to the HTML <head> of the page you want to remove.
  2. Make sure your robots.txt file allows crawling of that page. This is critical because Googlebot has to be able to visit the page to see the noindex instruction.
  3. After you've confirmed in Google Search Console that the page has been de-indexed, you can then add the Disallow rule to your robots.txt to stop bots from crawling it going forward.
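To confirm step 1 took effect, you can scan a page's HTML for the robots meta tag. A minimal sketch using Python's standard-library HTML parser; the sample HTML and class name are illustrative:

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the content of any <meta name="robots"> tags on a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name", "").lower() == "robots":
            self.directives.append(attr_map.get("content", ""))

html = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
finder = RobotsMetaFinder()
finder.feed(html)
print(finder.directives)  # ['noindex']
```

In practice you would fetch the live page's HTML and feed it to the finder; if "noindex" shows up in the directives list, the tag is in place and Googlebot can see it, provided robots.txt still allows the crawl.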

Managing how crawlers interact with your site has never been more important. LLMrefs gives you an ally here: the platform shows exactly how AI answer engines are mentioning your brand and competitors, and turns messy data into clear share-of-voice metrics to help you build a strategy for the next era of search. Discover the tools you need to optimize for AI at https://llmrefs.com.