Robots.txt

Instructions for search engine crawlers

What is Robots.txt?

Robots.txt is a text file at the root of a website (/robots.txt) that tells search engine crawlers which pages they can and cannot access. It's part of the Robots Exclusion Protocol, a standard followed by most well-behaved bots.

We check whether a robots.txt file exists, what directives it contains (User-agent, Disallow, Allow, Sitemap), and whether it blocks all crawlers or specific ones.
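To make that check concrete, here is a minimal sketch in Python using the standard library's urllib.robotparser. The domain is a placeholder, and a production check would also handle timeouts and redirects:

from urllib.robotparser import RobotFileParser

# Placeholder domain; substitute the site under review.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches the file; a missing file is treated as "allow everything"

# Is the homepage crawlable for all user agents ("*")?
print(rp.can_fetch("*", "https://example.com/"))

# Sitemap directives, if any (Python 3.8+); None when absent.
print(rp.site_maps())

Note that urllib.robotparser answers questions about the parsed rules; to distinguish "no robots.txt" from "a permissive robots.txt", a separate HTTP request for the file's status code would be needed.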

Why Does Robots.txt Matter?

The presence and content of robots.txt can reveal information about a site:

  • Has robots.txt: Indicates intentional configuration and SEO awareness
  • Has sitemap: Shows the site wants to be indexed properly
  • Blocks all crawlers: May indicate the site doesn't want to be found in search engines, which is unusual for legitimate businesses
  • No robots.txt: Not necessarily concerning, but common among hastily created sites

Phishing sites often either have no robots.txt or block all crawlers to avoid detection by search engines. However, many legitimate sites also fall into these categories.

How to Interpret This Signal

  • Positive: Sitemap listed in robots.txt
  • Neutral: No robots.txt found
  • Attention: Blocks all crawlers
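A rough translation of these three labels into code might look like the following sketch; the function name and error handling are illustrative, not part of any fixed API:

from urllib.robotparser import RobotFileParser

def classify_robots(site: str) -> str:
    """Map a site's robots.txt to one of the signal labels above.

    Illustrative heuristic only; a real check would also handle
    redirects, slow hosts, and malformed files.
    """
    rp = RobotFileParser()
    rp.set_url(site.rstrip("/") + "/robots.txt")
    try:
        rp.read()
    except OSError:
        return "Neutral"  # unreachable: treat like no robots.txt

    if rp.site_maps():               # a Sitemap: directive is present
        return "Positive"
    if not rp.can_fetch("*", "/"):   # e.g. User-agent: * / Disallow: /
        return "Attention"
    return "Neutral"

print(classify_robots("https://example.com"))

A missing robots.txt (HTTP 404) is parsed as "allow everything", so it falls through to Neutral, matching the interpretation above.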

Example Robots.txt

A typical robots.txt file looks like this:

User-agent: *
Disallow: /admin/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml

This tells all crawlers (*) to avoid the /admin/ and /private/ paths, while pointing them to a sitemap for better indexing.
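As a quick sanity check, the same file can be fed to Python's urllib.robotparser from a string (a sketch, assuming the example above):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "/admin/page"))  # False: under a disallowed path
print(rp.can_fetch("*", "/index.html"))  # True: not disallowed
print(rp.site_maps())                    # ['https://example.com/sitemap.xml']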
