A robots.txt file tells search engines which parts of a website they should and shouldn’t crawl. This is useful when you want to keep a page or an asset out of search results. However, blocking a URL that is also listed in your sitemap can trigger a warning in Google Search Console: “Sitemap contains URLs which are blocked by robots.txt”. If you blocked the URL intentionally, you can ignore the warning. If you’re new to robots.txt, though, it is worth checking what’s going on.
This guide gives you a brief introduction to what a robots.txt file is and how it works, and then shows how to solve a “URLs blocked by robots.txt” error.
What Is Robots.txt?
Robots.txt files are text-based files that tell search engines what they should and shouldn’t crawl. When you publish a new page or post to your website, search engine bots crawl this content in order to index it in search results. However, if you have some parts of your website that you don’t want to have indexed you can tell search bots to skip them so that they don’t appear on the results page.
For example, let’s say you’re running an exclusive giveaway for your newsletter subscribers and you don’t want other visitors to stumble across it on the search engine results page. In this case, you could add a rule to your robots.txt file that disallows that page. Once the bots come to your site and read your robots.txt, they’ll find the disallow rule for your giveaway page telling them not to crawl it.
Creating a robots.txt file is fairly simple. It can be done manually or, as on many WordPress websites, generated automatically by a plugin. There are various rules you can define within your robots.txt file, and which ones you choose will depend on your own requirements. A typical WordPress website without any custom modifications to the robots.txt file will contain the following:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
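To see why the Allow line can still take effect even though its parent directory is disallowed, here is a minimal Python sketch of the "most specific rule wins" precedence that Google documents for robots.txt. It is an illustration under stated assumptions, not a full parser: the is_allowed helper and the WORDPRESS_RULES list are hypothetical names introduced here, only plain path prefixes are handled, and wildcards (* and $) are ignored.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, path_prefix) pairs belonging to a
    single User-agent group. The longest matching prefix wins; if an
    Allow and a Disallow rule match with equal length, Allow wins.
    Wildcards are NOT handled in this simplified sketch.
    """
    best_len = -1
    allowed = True  # no matching rule at all means crawling is permitted
    for directive, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The default WordPress rules from above, as (directive, prefix) pairs.
WORDPRESS_RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

if __name__ == "__main__":
    for path in ("/wp-admin/options.php", "/wp-admin/admin-ajax.php", "/blog/"):
        print(path, "->", "allowed" if is_allowed(WORDPRESS_RULES, path) else "blocked")
```

Under these assumptions, everything in /wp-admin/ is blocked except admin-ajax.php, because the Allow rule is the longer, more specific match. Not every parser behaves this way: some apply rules in file order instead, so results from other tools may differ.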
To learn more about robots.txt files, read our complete What Is a Robots.txt file guide.
Solving Sitemap Contains URLs Which Are Blocked by Robots.txt
Blocked sitemap URLs are typically caused by an improperly configured robots.txt file. Whenever you disallow something, make sure you know exactly what the rule covers; otherwise, this warning will appear and web crawlers may no longer be able to crawl parts of your site.
Here are a few things you should check when attempting to solve a “sitemap contains URLs which are blocked by robots.txt” error:
- Check for any Disallow rules within your robots.txt file. The robots.txt file should be located in your site’s root directory, so that it is served at the /robots.txt path.
- If you’ve recently migrated from HTTP to HTTPS, make sure that you created a new property for the HTTPS version and that the robots.txt file is available via HTTPS.
- Use the robots.txt Tester available within Search Console to check which warnings or errors are being generated.
- Your robots.txt file could be cached, so give Google some time to recrawl your sitemap. If you found and fixed any issues, also try re-submitting the sitemap within Search Console.
- Try manually telling Google to crawl your site. You can do this by navigating to your Search Console property > Crawl > Fetch as Google. Add the URL path which Google was warning you about and click Fetch. Once reloaded, click Request Indexing > Crawl only this URL.
- Clear your website’s cache. This includes any caching plugins you have activated as well as your CDN (if you’re using one). Here’s how to purge your cache with KeyCDN.
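Before waiting on a recrawl, it can help to reproduce the check locally. The sketch below uses only Python’s standard library to list which sitemap URLs a given robots.txt would block. Everything specific in it is an assumption made for illustration: the blocked_sitemap_urls helper, the example.com URLs, and the /giveaway/ path are hypothetical, and the sample data is inlined instead of being fetched from a live site.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# XML namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="Googlebot"):
    """Return the sitemap URLs that `robots_txt` disallows for `user_agent`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical sample data; replace with the contents of your own files.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /giveaway/\n"

SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/giveaway/</loc></url>"
    "<url><loc>https://example.com/blog/post</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for url in blocked_sitemap_urls(SAMPLE_ROBOTS, SAMPLE_SITEMAP):
        print("Blocked by robots.txt:", url)
```

Treat the output as a first pass only. Python’s urllib.robotparser applies rules in file order and does not expand wildcards, so its verdict can differ from Google’s for files that mix Allow and Disallow rules or use * patterns. Any URL it reports should either be removed from the sitemap or unblocked in robots.txt, depending on which one you intended.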
Once you have amended your robots.txt file, it will likely take some time for Google to recrawl your site. If you’re sure that you’ve removed all of the conflicting Disallow rules, you’ll probably just need to wait for Google to do its thing. Hopefully, this guide has given you some suggestions you hadn’t already tried for dealing with a “sitemap contains URLs which are blocked by robots.txt” error.