2021-01-28 | ~4 min read | 648 words
We recently discovered that our robots.txt file wasn’t configured properly, and pages we were expecting to be excluded were showing up in search engines.
In investigating the fix, I found several noteworthy resources:
- For `robots.txt` itself, robotstxt.org was quite helpful in explaining how it works.
- In addition to the `robots.txt` approach, you can also provide directives to robots via `<meta>` tags.
- InFrontDigital had a nice summary of the differences between using a `robots.txt` and the meta tag:
> In general terms, if you want to deindex a page or directory from Google’s Search Results then we suggest that you use a “Noindex” meta tag rather than a robots.txt directive as by using this method the next time your site is crawled your page will be deindexed, meaning that you won’t have to send a URL removal request. However, you can still use a robots.txt directive coupled with a Webmaster Tools page removal to accomplish this.
>
> Using a meta robots tag also ensures that your link equity is not being lost, with the use of the ‘follow’ command.
>
> Robots.txt files are best for disallowing a whole section of a site, such as a category whereas a meta tag is more efficient at disallowing single files and pages. You could choose to use both a meta robots tag and a robots.txt file as neither has authority over the other, but “noindex” always has authority over “index” requests.
On the other hand, if your concern is bandwidth, you should use the `robots.txt` file to prevent the robot from crawling your site at all.
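As a concrete sketch of that approach (the `/drafts/` path and `ExampleBot` name are made-up examples, not from our actual file):

```txt
# Keep all crawlers out of an entire section of the site
User-agent: *
Disallow: /drafts/

# Keep one particular crawler out of the whole site
User-agent: ExampleBot
Disallow: /
```

Note that `Disallow` only asks crawlers not to fetch those URLs; it prevents crawling, not necessarily indexing, which is why the quote above recommends the meta tag for deindexing.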
If you use the `<meta>` approach, there are a few ways to tailor it.

You can specify which robots to target. For example, `<meta name="robots">` is a generic name meant to apply to all robots, while `<meta name="googlebot">` would apply to Google’s crawler only.
You can also supply various directives. Per MDN, the list of directives is:
| Value | Description | Used by |
|---|---|---|
| `index` | Allows the robot to index the page (default). | All |
| `noindex` | Requests the robot to not index the page. | All |
| `follow` | Allows the robot to follow the links on the page (default). | All |
| `nofollow` | Requests the robot to not follow the links on the page. | All |
| `all` | Equivalent to `index, follow`. | |
| `none` | Equivalent to `noindex, nofollow`. | |
| `noarchive` | Requests the search engine not to cache the page content. | Google, Yahoo, Bing |
| `nosnippet` | Prevents displaying any description of the page in search engine results. | Google, Bing |
| `noimageindex` | Requests this page not to appear as the referring page of an indexed image. | |
| `nocache` | Synonym of `noarchive`. | Bing |
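To take one row from the table, asking search engines not to keep a cached copy of a page would look something like this (a minimal sketch):

```html
<head>
  <meta name="robots" content="noarchive" />
</head>
```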
It’s worth noting that how any crawler responds to these directives is ultimately determined by the crawler; using these directives is not a guarantee that a crawler will respect them. The same is true of a `robots.txt` file.
Google also maintains its own documented list of supported directives.
You can provide multiple directives in a few different ways:

- Use multiple `<meta>` tags to specify different crawlers.
- Supply multiple comma-separated directives in a single tag.

For example, to not allow any robots to index the page while still allowing Google’s crawler to follow links, you could do:
```html
<head>
  <meta name="robots" content="noindex" />
  <meta name="googlebot" content="follow" />
</head>
```
Taking the second approach, you can supply multiple directives simultaneously in a single tag, for example preventing indexing while allowing link following:
```html
<head>
  <meta name="robots" content="noindex,follow" />
</head>
```