Errors in a website's technical setup can result in large numbers of low-quality, unimportant pages being indexed. What can you do when it's too late, and pages that shouldn't be indexed already have been?

Monitoring the volume of pages indexed for a website is important to ensure that it stays search-engine friendly. Recently, on a website I work on, I noticed that the number of crawled pages reported in some of the tools I use to monitor it had increased dramatically. I discovered that this was because the website's search functionality had been updated and the URLs generated for internal search result pages had changed.
The following explanation assumes basic familiarity with the robots.txt file and with the metatags used to communicate information about web pages to search engines.
How can hundreds of duplicate pages end up in Google’s index?
On the website I was working on, robots.txt had been set up in the past with instructions telling web crawlers to exclude search result pages. However, at some point the website’s internal search module had been updated in a way that changed the URL strings generated for search result pages. A corresponding update to robots.txt should have been made at the same time, but it was not:
- Old internal search result URLs:
mywebsite.com/search/?query=...
- New internal search result URLs:
mywebsite.com/search?query=...
- Robots.txt directive that stopped crawling of the old URLs but not the new ones:
Disallow: /search/
As a result, Google’s crawler had started to crawl and index the search results.
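For reference, a single broader rule would have covered both the old and new URL formats. This is just a sketch, assuming nothing else under /search on the site needs to be crawled:

Disallow: /search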
According to Google’s webmaster guidelines, pages of low value to visitors should be excluded from those you make available for indexing. This applies to any inevitable duplicates of pages that you may have, and is particularly relevant to the result pages generated by a website’s internal search functionality:
Use the robots.txt file on your web server to manage your crawling budget by preventing crawling of infinite spaces such as search result pages.
In case there is still any doubt: Google tells us explicitly that internal search results should not be made available for crawling or indexing.
Note that this should be done using robots.txt rather than by setting a noindex metatag, so that the unnecessary pages are never served to Google’s crawler in the first place.
What happens when your website’s search results get indexed by Google
On my website, the outdated robots.txt rule resulted in internal search result pages being indexed with exactly the same meta descriptions as the real pages. This caused three main issues:
- Duplicate content: meta descriptions that should have been associated with a single unique landing or product page were now also associated with long URLs generated by the internal search functionality
- Too many pages on the website: Google was spending time crawling and indexing innumerable internal search pages (an ‘infinite space’), which made the website seem bloated with too much content. This is a perfect example of wasting what is known as search engine ‘crawl budget’, as important pages were being crawled less often for updates as a result.
- Lost visibility and traffic: I am convinced that this issue, which made the website appear to have duplicate content, too many pages, too much thin content, and too many extremely long URLs, cost the website a few ranking positions and led to an overall drop in organic search traffic.
This was also a poor experience for people looking for the website’s content on Google, as the internal search result pages would sometimes appear on the first page of Google’s results alongside the original page.
How to check whether search pages are being indexed
Check your robots.txt file to see whether a ‘Disallow’ rule covering your search result URLs is present. For example:
Disallow: /search/
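If you prefer to check programmatically whether your actual search result URLs are covered by that rule, Python’s standard library can parse a robots.txt file and report whether a given URL is blocked. This is a minimal sketch; the domain and the example search URL are placeholders for your own site:

from urllib import robotparser

# Point the parser at your site's robots.txt file (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.yourwebsite.com/robots.txt")
rp.read()

# can_fetch() returns False if the URL is disallowed for the given user agent.
print(rp.can_fetch("Googlebot", "https://www.yourwebsite.com/search?query=example"))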
Then check the pages that Google has actually indexed by using search operators.
For example, to get a list of all of the pages that have been indexed, type the following into Google’s search box:
site:www.yourwebsite.com
And to quickly see whether search pages have been indexed, add the string generated by your website’s search functionality to the query:
site:www.yourwebsite.com inurl:/search/
How to remove your web pages from Google’s index
What should you do when your website’s search results, or other pages that shouldn’t be indexed, have already been indexed? Unfortunately, it is not simply a case of adding to robots.txt the instruction that should have been there in the first place, telling crawlers to ignore those pages or directories.
This is because adding the ‘disallow’ instruction after pages have already been indexed does not remove them from the index; it only stops new crawls of those pages. Google does not immediately understand that these pages need to be removed, and it could take a very long time for that to happen. (Similarly, adding both a robots.txt exclusion and noindex tags to the pages that need to be removed will not work: “If the page is blocked by a robots.txt file, the crawler will never see the noindex tag, and the page can still appear in search results”.)
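For reference, the noindex instruction referred to in that quote is usually set as a metatag in each page’s HTML head. The markup below is a generic example rather than anything specific to a particular platform:

<meta name="robots" content="noindex">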
For these reasons, it’s necessary to find a way to inform Google that the search result pages that have already been indexed should be removed. I would recommend the following steps:
- Get a list of all of the indexed URLs that need to be removed from the index. A tool to help with this task is the MozBar extension for the Chrome browser; if you have a large volume of pages to deal with, set your Google results preferences to the maximum of 100 results per page so that each CSV download contains up to 100 URLs.
- Set a ‘noindex’ metatag on all of these URLs
- Create a new orphan page on your website with links to all of these URLs (see the sketch after this list for one way to generate such a page)
- Submit this new orphan page in Google Search Console (previously known as Google Webmaster Tools) and ask Google to crawl the page and all of the links on it. As Google’s bot revisits each of the indexed pages, it should see the newly added ‘noindex’ metatag and remove the page from the index.
- Only once you are sure that all of the pages have been removed from the index, add the disallow instruction to robots.txt to ensure that crawlers ignore your website’s search result pages from then on.
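As a rough illustration of step 3, the short Python sketch below reads a plain-text list of URLs and writes one or more simple orphan pages linking to them. The filenames (“urls_to_deindex.txt” and the generated “orphan-page-N.html” files) are assumptions made for the example rather than part of the process described above; use whatever list you exported from your own tools.

CHUNK_SIZE = 100  # keep each orphan page small enough to be crawled in full

# Read the URLs to de-index from a plain-text file, one URL per line.
with open("urls_to_deindex.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Write one orphan page per chunk of URLs.
for i in range(0, len(urls), CHUNK_SIZE):
    chunk = urls[i:i + CHUNK_SIZE]
    links = "\n".join(f'<li><a href="{u}">{u}</a></li>' for u in chunk)
    html = ("<!DOCTYPE html>\n<html><head><title>Pages to remove</title></head>\n"
            f"<body><ul>\n{links}\n</ul></body></html>\n")
    with open(f"orphan-page-{i // CHUNK_SIZE + 1}.html", "w") as out:
        out.write(html)

Each generated page can then be uploaded to your site and submitted in Google Search Console as described above.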
If you have dozens, hundreds, or thousands of pages to de-index, you may need to repeat steps 1 to 4 above several times, as the Google crawler may not immediately crawl all of the links on your page if there are too many of them.