robots.txt

4th Mar

What is a robots.txt File?


A robots.txt file, also known as the robots exclusion protocol or standard, is a small text file that sits at the root of a website. Designed to work with search engines, it has become an SEO boost waiting to be used. The robots.txt file acts as a guideline for search engine crawlers, telling them which pages, files or folders may be crawled and which may not.

To view a robots.txt file, simply type in the root domain and then add /robots.txt to the end of the URL.
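As a minimal sketch of that rule, the snippet below derives the robots.txt address from any page URL on a site. The function name robots_txt_url and the example.com address are illustrative assumptions, not part of any tool mentioned in this article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host, then point at /robots.txt in the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post?id=7"))
```

The path, query string and fragment of the page URL are dropped on purpose, because the file is only looked up at the root of the host.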


Why Is robots.txt Important for Your Website?

  • It helps prevent search engine bots from crawling duplicate pages.
  • It helps keep parts of the website private (i.e. out of the search results).
  • It helps prevent crawlers from overloading the server with requests.
  • It helps prevent wastage of Google’s “crawl budget.”

How to Find Your robots.txt File?

How to Create a robots.txt File?

Where to Save Your robots.txt File?

Basic Format of robots.txt


Let’s Understand the robots.txt Format Line by Line

1. User-agent

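Note: Per Google’s documentation and RFC 9309, crawlers match user-agent names case-insensitively, so “googlebot” and “Googlebot” select the same group. It is the paths in Allow and Disallow rules that are case sensitive. Using the spelling the crawler documents is still the clearer choice. The following example works, but does not use Google’s documented spelling:

User-agent: googlebot
Disallow:

The preferred form would be:

User-agent: Googlebot
Disallow: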

2. Sitemap Directive

3. Wildcard/Regular Expressions

The above rule blocks all bots except Googlebot from crawling the site.
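The rule itself is not reproduced in the text above, so the following is only a sketch of one rule set that has the described effect, checked with Python's standard urllib.robotparser module. The RULES text, the can_crawl helper and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: Googlebot may crawl everything, every other bot nothing.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(user_agent, url, rules=RULES):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl("Googlebot", "https://example.com/page.html"))
print(can_crawl("SomeOtherBot", "https://example.com/page.html"))
```

An empty Disallow value means nothing is disallowed for that group, while "Disallow: /" covers every path on the site.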

4. Some Starter Tips:
5. Non-Standard robots.txt Directives

Most Commonly Used robots.txt Commands

Uses of a robots.txt File

Page Type Description

Web page

For web pages, robots.txt can be used to regulate crawl traffic and keep crawlers away from unimportant or near-duplicate pages on the website.
robots.txt should not be used to hide web pages from Google. If other pages link to the blocked page with descriptive text, Google can still index its URL without ever crawling the page.

Media files

robots.txt can be used to manage crawl traffic, and to prevent image, video and audio files from appearing in the Google search results. This, however, doesn’t stop other users or pages from linking to the file in question.

Resource file

robots.txt can be used to block resource files like certain images, scripts, or style files.

Without such resources, Google's crawler might find it harder to understand the web page, which could hurt the page's rankings.

Why Your WordPress Needs a robots.txt File

Every search engine bot has a maximum crawl limit for each website, i.e. X number of pages to be crawled in a crawl session. If the bot is unable to get through all the pages on a website, it will come back and continue crawling in the next session, which can delay the indexing of your pages and hamper your website’s rankings.

This can be fixed by disallowing search bots from crawling unnecessary pages such as admin pages, private data etc.

Disallowing unnecessary pages saves the crawl quota for the site, which in turn helps search engines crawl more of the important pages and index them faster than before.

A default WordPress robots.txt should look like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

WordPress creates a virtual robots.txt file in the server’s main folder when the website is created.

Thisismywebsite.com -> website
Thisismywebsite.com/robots.txt -> to access robots.txt file

You should see something similar to this. It’s a very basic robots.txt file:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-admin/admin-ajax.php
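The Allow line only has an effect because Google documents that the most specific (longest) matching rule wins, with Allow winning a tie. The sketch below illustrates that precedence for the rules above. The WORDPRESS_RULES list and the is_allowed helper are hypothetical names, and the matching is plain prefix matching without wildcard support.

```python
# The basic WordPress rules from above as (directive, path) pairs.
WORDPRESS_RULES = [
    ("disallow", "/wp-admin/"),
    ("disallow", "/wp-includes/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]


def is_allowed(rules, path):
    """Apply longest-match precedence; a path with no matching rule is allowed."""
    best = None
    for directive, pattern in rules:
        if pattern and path.startswith(pattern):
            # Longer patterns win; on equal length, allow (True) beats disallow.
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


print(is_allowed(WORDPRESS_RULES, "/wp-admin/admin-ajax.php"))
print(is_allowed(WORDPRESS_RULES, "/wp-admin/options.php"))
```

Parsers that simply apply the first matching rule in file order can give a different answer for admin-ajax.php, which is one reason to test a file against the crawler you actually care about.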

To add more rules, create a new text file named “robots.txt” and upload it as a replacement for the virtual file. This can be done in any text editor, as long as the file is saved in .txt format.

Creating a New WordPress robots.txt File:

Below we explain 3 methods of implementing robots.txt.

Method 1: Yoast SEO

The most popular SEO plug-in for WordPress is Yoast SEO, due to its ease of use and performance.

Yoast SEO allows the optimization of our posts and pages to ensure the best usage of our keywords.

It’s Doable in 3 Simple Steps

Step 1. Enable the advanced settings toggle on the Features tab of the Yoast dashboard.

Note: Yoast SEO has its own default rules, which override any existing virtual robots.txt file.

Step 2. Go to Tools and then File Editor. You will see the .htaccess file and a robots.txt creation button. Clicking “create robots.txt” opens a text area where the robots.txt file can be modified.


Step 3. Save the robots.txt document so that your changes are retained.


Method 2. Through the All in One SEO Plug-in

All in One SEO is very similar to the plug-in above, though lighter and faster, and creating a robots.txt file with it is just as easy as it is in Yoast SEO.

Step 1: Navigate to All in One SEO and open the Feature Manager page on the dashboard.

Step 2: Inside, there is a tool labelled robots.txt, with a bright Activate button right under it.


Step 3: A new robots.txt screen should pop up; clicking on it will allow you to add new rules, make changes or delete certain rules altogether.

Note: Changes cannot be made to the robots.txt file directly using this plug-in. Instead, you add or remove input fields grouped by user-agent, and the plug-in updates the robots.txt file automatically.

Step 4: All in One SEO also allows blocking of “bad bots” directly from the plug-in.


Method 3. Create a new robots.txt file and upload it using FTP

Step 1: Creating a .txt file is easy: simply open Notepad and type in your desired commands.

Step 2: Save the file as robots.txt, using the .txt file type.


Step 3: Once a file has been created and saved, the website should be connected via FTP.

Step 4: Establish the FTP connection to the site.

Step 5: Navigate to the public_html folder.

Step 6: All that is left to do is to upload the robots.txt file from your system onto the server.


Step 7: This can be done by simply dragging and dropping the file, or by right clicking on the file using the FTP client’s local
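For readers who prefer a script to an FTP client, the steps of Method 3 can be sketched with Python's standard ftplib module. The host name, credentials and function names below are placeholders and assumptions; the public_html folder comes from Step 5 and may be named differently on your host.

```python
from ftplib import FTP
from pathlib import Path

# The basic rules shown earlier in this article.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""


def write_robots_txt(path, rules_text=RULES):
    """Steps 1 and 2: save the rules as a plain-text file."""
    target = Path(path)
    target.write_text(rules_text, encoding="utf-8")
    return target


def upload_robots_txt(host, user, password, local_path, remote_dir="public_html"):
    """Steps 3 to 7: connect via FTP, open the web root and upload the file."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(remote_dir)
        with open(local_path, "rb") as handle:
            ftp.storbinary("STOR robots.txt", handle)


# Example usage with placeholder values (not run here):
# local_file = write_robots_txt("robots.txt")
# upload_robots_txt("ftp.example.com", "username", "password", local_file)
```

Plain FTP sends credentials unencrypted, so if your host offers FTPS or SFTP, that is usually the better choice.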

Testing in Google Search Console

1) When you create or update a robots.txt file, Google picks up the changes automatically. Alternatively, the file can be submitted to the Google Search Console to test it before you make the changes live.


2) The Google Search Console is a collection of tools provided by Google to monitor how your content appears in search.

3) In the Search Console there is an editor field where we can test our robots.txt.


4) The platform checks the file for technical errors and points out any that it finds.
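A rough local check can catch simple mistakes before the file is submitted. The sketch below is not how the Search Console tester works; it is a small, assumed set of checks (unknown directive names, rules placed before any User-agent line, paths without a leading slash), and lint_robots_txt and KNOWN_FIELDS are illustrative names.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) pairs for suspicious lines."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown directive '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in {"allow", "disallow"}:
            if not seen_user_agent:
                problems.append((number, "rule appears before any User-agent line"))
            elif value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/' or '*'"))
    return problems


sample = "User-agent: *\nDisalow: /tmp/\nDisallow: tmp\n"
for line_number, message in lint_robots_txt(sample):
    print(line_number, message)
```

Crawl-delay is included in the known list only because some crawlers accept it; it is a non-standard directive.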

  • For the website to perform well in search, make sure that search engine bots crawl only the important and relevant information.

  • A properly configured robots.txt points bots to the best parts of the domain, which supports better search engine rankings.

Error and warning reports related to robots.txt in Google Search Console

After any robots.txt update, regularly check the coverage report in the Google Search Console for issues.

Some Common Issues Are:
  1. Submitted URL blocked by robots.txt - This error typically occurs when a URL blocked by robots.txt is also present in your XML sitemap.


    Solution #1 – Remove the blocked URL from the XML sitemap.

    Solution #2 – Check for any disallow rules within the robots.txt file, and either allow that particular URL or remove the disallow rule.

    You can choose either solution depending on your priority and needs as to whether you want to block it or not.

  2. Indexed, though blocked by robots.txt

    This is a warning related to robots.txt. It basically means you tried to exclude a page or resource from Google’s search results by disallowing it in robots.txt, which isn’t the correct solution. Google found the URL from other sources and indexed it.

    Solution - Remove the crawl block and instead use a noindex meta robots tag or an X-Robots-Tag HTTP header to prevent indexing.
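The first issue can be checked before Google reports it. The sketch below compares a list of sitemap URLs against a robots.txt using Python's standard urllib.robotparser. The function name and the sample data are assumptions, and the standard-library parser applies the first matching rule and ignores wildcards, so its verdict can differ from Google's on more complex files.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(robots_lines, sitemap_urls, user_agent="Googlebot"):
    """Return the sitemap URLs that user_agent is not allowed to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]


robots_lines = ["User-agent: *", "Disallow: /private/"]
sitemap_urls = [
    "https://example.com/",
    "https://example.com/private/report.html",
]
print(blocked_sitemap_urls(robots_lines, sitemap_urls))
```

Each URL the function returns should either be removed from the sitemap (Solution #1) or unblocked in robots.txt (Solution #2).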

How to Manage Crawl Budget With robots.txt?

  • Crawl budget is an important SEO concept that is often neglected. It is the rate at which search engine crawlers go over the pages of your domain.

  • The crawl rate is “a tentative balance” between Googlebot’s desire to crawl a domain and the need to ensure the server is not overloaded.


Optimising Crawling Budgets with robots.txt

  • Enable crawling of important pages in robots.txt.
  • Within robots.txt disallow crawling of unnecessary pages and resources.

Bonus info:

Other Techniques to Optimize Crawl Budget:

  • Keep an eye out for redirect chains.
  • Use HTML wherever you can, as most crawlers are still improving how they index Flash and XML.
  • Make sure internal links do not point to http:// URLs that are then redirected to the https:// version.
  • 404 and 401 errors take up a huge chunk of a domain’s crawl budget. Don’t ever block a 404 URL in robots.txt, otherwise search engines will never crawl it and will never learn that it’s a 404 page that needs to be deindexed.
  • Each unique URL is counted as a separate page, so multiple URLs for the same content lead to wastage of crawl budget.
  • Keep your sitemaps updated; that makes it easier for crawlers to discover and understand your pages quickly.
  • <link rel="alternate" hreflang="lang_code" href="url_of_page" /> should be included in the page’s header. Even though Google can find alternate language versions of any page, it is better to clearly indicate the language or region of specific pages to avoid wastage of crawl budget.
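To make the redirect-chain tip concrete, the sketch below follows a URL through a table of known redirects and returns every hop. The redirects dictionary stands in for data you would collect from a crawl or from server logs; it and the redirect_chain helper are assumptions made for illustration.

```python
def redirect_chain(start_url, redirects, max_hops=10):
    """Follow start_url through the redirects mapping and return each hop."""
    chain = [start_url]
    seen = {start_url}
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in seen:  # redirect loop: stop instead of cycling forever
            break
        seen.add(target)
    return chain


redirects = {
    "http://example.com/a": "https://example.com/a",
    "https://example.com/a": "https://example.com/b",
}
hops = redirect_chain("http://example.com/a", redirects)
print(len(hops) - 1, "redirects:", " -> ".join(hops))
```

Any chain with more than one redirect is a candidate for pointing the first URL straight at the final destination.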

Meta Robot Tags vs robots.txt


The meta robots tag provides extra, page-specific functions that can’t be implemented in a robots.txt file. robots.txt lets us control the crawling of web pages and resources by search engines. Meta robots tags, on the other hand, let us control the indexing of a page and the crawling of the links on that page. Meta tags are most efficient when used to exclude single files or pages, whereas robots.txt works best when used to disallow whole sections of a site.

The difference between the two lies in how they function. robots.txt is the standard way of communicating with crawlers and other bots, and it sets rules that tell crawlers which areas of the website shouldn’t be crawled.

Meta robots tags are exactly what the name suggests: tags. They tell the search engine what to index and follow and what not to. Both can be used together, as neither one has authority over the other, but a crawler must be able to crawl a page in order to see its meta robots tag.

The meta robots tag should be placed in the <head> section of the page and would look like: <meta name="robots" content="noindex">

Most common meta robots parameters

  1. Follow: Signals to search engines that they can follow the links on the page in order to discover other pages.
    Example: <meta name="robots" content="follow">
    Note: This is assumed by default on all pages – you generally won’t need to add this parameter.
  2. Nofollow: Prevents search engine bots from following any links on the page.
    Example: <meta name="robots" content="nofollow">
    Note: It’s unclear and highly inconsistent between search engines whether this attribute prevents them from following links, or prevents them from assigning any value to those links.
  3. Index: Allows search engines to add the page to their index, so that it can be discovered by people searching for content similar to yours.
    Example: <meta name="robots" content="index">
    Note: This is assumed by default on all pages – you generally won’t need to add this parameter.
  4. Noindex: Disallows search engines from adding the page to their indexes, and as a result from showing it in search results.
    Example: <meta name="robots" content="noindex">
  5. Noimageindex: Tells a crawler not to index any images on a page.
  6. Noarchive: Search engines should not show a cached link to this page on a SERP.
  7. Unavailable_after: Search engines should no longer index this page after a particular date. Reasons may include deletion, redirection etc.
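To see which of these parameters a page actually declares, the tag can be read back out of the HTML. The sketch below uses Python's standard html.parser module; the class and function names are assumptions, and it only looks at <meta name="robots"> tags, not at bot-specific tags or the X-Robots-Tag header.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            # The content attribute holds a comma-separated list of directives.
            for token in attributes.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of meta robots directives declared in an HTML document."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(robots_directives(page)))
```

An empty result means the page relies on the defaults, i.e. index and follow.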

What Not to Do in a robots.txt File

  • Block CSS, JavaScript and image files: Blocking these resources in robots.txt causes harm, as Google will not be able to load the page completely. This means it cannot see what the page looks like, how it is structured and so on. Google may mark the page as not mobile friendly because critical resources are blocked by robots.txt, and you may lose rankings.
  • Use wildcards carelessly and de-index your site: For example, you may want to block something but don’t know exactly how the directives work, and you end up blocking the whole site or important pages. This may result in your web pages, or even the whole site, being deindexed by search engines.
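One way to avoid the wildcard mistake is to test a pattern against a handful of real paths before publishing it. The sketch below assumes the wildcard rules that Google documents, where * matches any sequence of characters and a trailing $ anchors the end of the URL. The helper names are hypothetical and other crawlers may interpret patterns differently.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if the pattern applies to the given URL path."""
    return pattern_to_regex(pattern).match(path) is not None


paths = ["/", "/blog/post-1/", "/files/guide.pdf"]
for pattern in ["/*.pdf$", "/*"]:
    print(pattern, [path for path in paths if matches(pattern, path)])
```

A Disallow rule with the pattern /* matches every path in the list, which is exactly the kind of accidental site-wide block described above.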

The Right Solution for Every Business

Do you want your business to touch new heights? If you do, we can certainly help your business with the perfect blend of SEO and custom software solutions. In fact, we helped many businesses in achieving massive success over the years with our solutions.
