“How” we work with robots is an idea that we are all going to have to get used to sooner or later. If you are a website owner or have been speaking to your digital agency, there is a good chance you have begun this conversation already. Today we are not talking about HAL (Heuristically Programmed ALgorithmic Computer) from 2001: A Space Odyssey; we are talking about robots.txt and your website.
Put simply, robots.txt is just a text file that sits in the root folder of your website. It uses a protocol, or a set of rules, called the Robots Exclusion Standard, and helps you communicate with any computer or bot that wishes to crawl your website.
For a business website owner, the primary purpose of robots.txt is to tell search engines which areas of the site to crawl and index and which areas to leave alone. Every search engine will only ever spend a limited budget (electricity and computing power) on crawling your site. After all, there are over a billion websites, so it’s vital to give the search engine’s crawlers good directions on where the valuable content is.
The server that hosts your website holds internal files, public-facing files and administrative files. The objective of robots.txt is to present a series of recommendations to search engine crawlers about which files and folders are worth exploring.
There is a general format for these robots.txt files:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

User-agent: [user-agent name]
Allow: [URL string to be crawled]

Sitemap: [URL of your XML Sitemap]
Examples of user agents are Googlebot, Googlebot-News and Googlebot-Image (check out the list of all user agents here). The “*” symbol is a wildcard that means all bots.
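To see how named user-agent groups and the “*” wildcard interact, here is a minimal sketch using Python's standard urllib.robotparser module. The file contents, the paths (/private-images/ and /tmp/), the example.com domain and the bot name ExampleBot are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot-Image, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /private-images/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named group applies to Googlebot-Image...
image_bot_blocked = not parser.can_fetch(
    "Googlebot-Image", "https://example.com/private-images/team.png")

# ...while any other bot falls back to the "*" group.
other_bot_allowed = parser.can_fetch(
    "ExampleBot", "https://example.com/private-images/team.png")
other_bot_blocked = not parser.can_fetch(
    "ExampleBot", "https://example.com/tmp/cache.html")

print(image_bot_blocked, other_bot_allowed, other_bot_blocked)
```

Note that a bot matching a named group follows that group's rules only, so in this sketch Googlebot-Image is not bound by the /tmp/ rule in the wildcard group.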
As a significant number of businesses both big and small are running WordPress installations, below is a typical example of a robots.txt file for WordPress to help illustrate how robots.txt functions:
This robots file allows all crawlers to find the /wp-content/uploads/ folder, which is great for indexing page content and media. At the same time, it disallows the /wp-content/plugins/ and /wp-admin/ folders. Essentially, these are the administrative sections of the site (plugin components and user administration), and there is no need for the search engine to waste time exploring these folders or presenting them within the search engine index. Of note, the file also includes the address of the sitemap, which is a simple step that is surprisingly often overlooked.
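A file matching this description can be sketched and checked with Python's standard urllib.robotparser module. The rules here are reconstructed from the description (using /wp-content/plugins/, the standard WordPress plugin folder), and the example.com domain, the sitemap URL and the file paths being tested are placeholders, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Sketch of a WordPress-style robots.txt; the domain is a placeholder.
WORDPRESS_ROBOTS_TXT = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(WORDPRESS_ROBOTS_TXT.splitlines())

def allowed(path):
    """Return True if a generic crawler may fetch this path."""
    return parser.can_fetch("*", "https://example.com" + path)

for path in ["/wp-content/uploads/2020/01/logo.png",
             "/wp-content/plugins/some-plugin/readme.txt",
             "/wp-admin/options.php"]:
    print(path, "->", "allowed" if allowed(path) else "disallowed")

# site_maps() (Python 3.8+) lists the Sitemap entries found in the file.
print(parser.site_maps())
```

Running a quick check like this before publishing a robots.txt file is one way to confirm that the rules do what you intended.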
As a site owner, you might be asking whether this is necessary and what the advantages are. While it’s true that major search engines can explore and index your site without a robots.txt file, the idea here is to make indexation as efficient as possible for search engines. The more hospitable you make your website to crawlers, the better your chances of thorough indexation. Imagine you are invited into someone’s house but they don’t tell you where the bathroom is. Finding such an important space is going to be more time consuming, and no one wants that!
It is best to understand the robots.txt file as advice rather than enforcement: there is no obligation for search engine bots or any other bots to obey these rules, although the major search engines generally do. Using robots.txt for SEO gives you finer control over how your website is presented to search engines.
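Because compliance is voluntary, the check lives in the crawler's own code, not on your server. The sketch below shows what a well-behaved bot might do before requesting pages. The function name allowed_urls, the bot name PoliteBot and the URLs are hypothetical, and a bot that ignores the rules would simply skip this step.

```python
from urllib.robotparser import RobotFileParser

def allowed_urls(robots_txt, user_agent, urls):
    """Keep only the URLs that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

# Hypothetical rules and crawl candidates.
robots_txt = "User-agent: *\nDisallow: /wp-admin/\n"
candidates = [
    "https://example.com/about/",
    "https://example.com/wp-admin/users.php",
]

# A polite bot filters its crawl queue first; nothing forces it to.
crawl_queue = allowed_urls(robots_txt, "PoliteBot", candidates)
print(crawl_queue)
```

This is why robots.txt should not be relied on to protect sensitive areas: it only guides bots that choose to read it.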
When it comes to other issues such as duplicate content, legacy pages and security, substantially more intricate strategies are needed, and these require a knowledgeable digital agency to execute.
Next time you are in conversation with a digital strategist or a developer, or are building your own website, be sure to take this file, its advantages and best practice into consideration. Proceed with caution, though, to get the most out of it and stay aligned with what’s critically important: your presence in search engines.