Even after you’ve uploaded a sitemap to help Googlebot navigate your web pages, search bots can discover additional pages you don’t need — or want — indexed. Index bloat can affect your search rankings. Learn how it hinders your SEO and how to fix index bloat so Google focuses on crawling and serving your most important pages.
What Is Index Bloat?
Index bloat happens when a search crawler indexes pages you don’t want appearing in search results. This wastes your limited crawl budget, which could be better used for indexing strategically important pages.
Pages that contribute to index bloat include:
- Filtered product pages that use URL parameters
- Internal search results
- Printer-friendly versions of pages
- Thank you or confirmation pages
- Test URLs and placeholders
- Thin, low-quality content
Why Index Bloat Matters
Index bloat inflates your search engine presence with content that doesn’t serve a purpose or isn’t of interest to readers.
When search bots index these unnecessary pages, it’s:
- Harder for search engines to rank your pages. Search crawlers need to understand your website to best match content to user queries and rank it effectively. Pages without a clear, logical purpose make it more difficult for Google and other search engines to understand and retrieve information.
- Detrimental to search engine rankings. Pages with similar content compete with each other when they target the same keywords. Low-quality pages or duplicate content may not rank well or engage users, which can affect the overall authority of your site.
- An inefficient use of crawl budget. Index bloat means search bots waste limited crawl budget collecting information Google doesn’t need, taking time and resources away from the pages you want to rank.
Understanding Crawl Budget
Every website has a crawl budget, which is the number of URLs Googlebot will crawl during each visit. Once Googlebot hits its crawl budget, it moves on to the next domain.
A couple of things determine your site’s crawl budget:
- Site health. Errors and slow server response times result in a smaller crawl budget.
- Demand. Sites with popular pages tend to be crawled more often.
Ideally, search bots will spend their time on the pages you want to rank and are important to users. You can maximize crawl budget by telling search engine crawlers how you want them to treat individual pages and which URLs they don’t need to crawl or index.
Index Bloat SEO: 4 Reasons Why Index Bloat Happens
Common causes of index bloat include:
1. Accidental Page Duplication
There are a few reasons duplicate content exists. Ecommerce sites may let customers filter products by price or color, generating web pages that shouldn’t be added to a search engine index. Other dynamically created pages include archives organized by date, internal search results, and blog category pages. The creation or existence of these pages is problematic only when they’re indexed. Use of tracking URLs and URL parameters may also lead to duplicate content.
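As a hypothetical illustration, a single product listing can spawn many parameterized URLs that all serve near-identical content (the domain and parameters below are invented examples):

```text
https://example.com/shoes/                        # the canonical listing
https://example.com/shoes/?color=red              # filter parameter
https://example.com/shoes/?sort=price_asc         # sort parameter
https://example.com/shoes/?color=red&sort=price_asc
https://example.com/shoes/?utm_source=newsletter  # tracking parameter
```

Each variant shows essentially the same products, but if all five are indexed, they compete with each other and consume crawl budget.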
2. Missing or Incorrect Robots.txt File
By default, search crawlers can visit any page on your site. You can use a robots.txt file to prevent search engines from adding certain URLs and subdirectories to their indices. Commands such as “allow” and “disallow” indicate whether you want specific URL paths followed. This file should be placed in a site’s top-level directory so crawlers read the file before accessing your pages.
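A minimal robots.txt might look like the sketch below. The paths shown are hypothetical examples, not rules every site needs:

```text
# Applies to all crawlers
User-agent: *
# Block internal search results and filtered parameter URLs
Disallow: /search/
Disallow: /*?color=

# Point crawlers to the sitemap
Sitemap: https://example.com/sitemap.xml
```

Note that disallowing a path stops crawling, not indexing: a URL blocked here can still appear in results if other sites link to it, which is why the meta robots tag discussed later is sometimes the better tool.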
3. Incorrect Pagination
Long articles or lengthy product categories can be easier to view when content is separated into multiple pages. Users navigate between the content using links to next, previous, or individual page numbers.
Search engines can sometimes index each page of a series as separate content if the pagination isn’t set up correctly. This can cause page three of a five-page post to appear in search engine results pages (SERPs) instead of a complete article, for example. Since it’s not useful for a reader to land in the middle of a piece of content, this is unnecessary indexing.
4. Poorly Performing or Thin Content
Each page on your site should have a clear purpose. Your site gets bloated when it contains thin content or underperforming pages that don’t enhance user experience. These pages should be pruned regularly.
Diagnose Index Bloat: Check Your Indexed URLs
Google Search Console reports the number of pages that have been added to the search index. The biggest indicator of index bloat? The number of indexed pages is larger than it should be.
How To Check Your Crawl Report in Google Search Console
Use the Index Coverage Report to see a summary of all of the pages that Google has crawled on your site. This tool tells you which pages are valid and indexed and which are excluded from the index.
Start by looking at how many pages are valid — which means they’ve been added to the index — and compare the number of indexed pages to the number of pages submitted on your XML sitemap.
If you have significantly more pages indexed than expected, your site may have index bloat.
For more information, expand the valid category of the report by clicking on ‘Valid’ under details.
This will generate a list of URLs that have been indexed. If you see URLs you don’t want included in search results, you can add noindex meta robots tags, delete and redirect the URLs, or submit the URLs to the URL removal tool in Google Search Console.
How To Fix Index Bloat
To fix index bloat, you’ll need to remove internal links, give crawl bots instructions on which pages to index, use canonical tags, and delete excess content from your site. Once you’ve identified which unnecessary pages are in the Google index, you can determine how best to deal with them and request removal from Google SERPs.
Remove Internal Links
If you plan to noindex content, removing internal links to it limits Google’s ability to find and index it. Google uses internal links to discover new content on your site; when you remove that path, Google turns its attention to other internal links on the page and crawls those instead.
If you want to delete your extraneous pages, removing internal links to those pages will reduce the chance of broken links and provide you with an opportunity to link to more relevant content that you want Google to index.
Update or Install Robots.txt
Create a robots.txt file if your site doesn’t already have one. It’s good practice to regularly review an existing robots.txt file and update its directives to ensure search crawlers visit the right pages.
A robots.txt file blocks search engine bots from accessing a subdirectory. For example, ours blocks Google from crawling user-generated search results. If our robots.txt file didn’t do this, Google might access, crawl, and index thousands of pages we wouldn’t want to show up in search results and exhaust its crawl budget.
Use Meta Robots Tags and X-Robots-Tags
The robots meta tag can be added to an HTML document to provide instructions about that specific page without making changes to the site-wide robots.txt file. It gives you greater control over how an individual page is crawled. You can even leave instructions for specific crawlers (“Googlebot” or “bingbot”), and exclude pages from Google image, video, and news searches. A meta robots tag should only be used on pages that aren’t blocked by your robots.txt file. If you inadvertently add a noindex tag to a page that’s blocked via your robots.txt file, Google won’t be able to read the directive.
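A meta robots tag is placed in the page’s head section. The snippets below are generic examples of the standard syntax:

```html
<!-- In the <head> of a page you want excluded from the index,
     while still letting crawlers follow its links -->
<meta name="robots" content="noindex, follow">

<!-- Target a specific crawler instead of all bots -->
<meta name="googlebot" content="noindex">
```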
The X-Robots-Tag is an HTTP response header. It has the same functionality as a meta robots tag and controls the indexing of images, videos, PDFs, and other non-HTML files.
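Because a PDF has no HTML head to hold a meta tag, the directive travels in the response headers instead. A server returning a noindexed PDF would respond with something like this, and on Apache the header can be set with a snippet such as the hypothetical one below:

```text
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex

# Apache configuration sketch: apply the header to all PDFs
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```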
Add Canonical Tags
Canonical tags ensure Google doesn’t index all instances of similar or duplicate content. These tags are placed in the head section of a web page’s HTML and tell Google which URL you prefer to treat as the master copy of the page in search results.
When you divide related content across a series of pages, use pagination best practices so Google understands the relationships between the pages. Create a single master page containing all of the content, such as a “view all” page for a product category that’s spread across several pages. You can add a canonical tag to ask Google to index this page in search results instead of partial listings of products.
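In practice, each paginated page carries a canonical tag pointing at the consolidated version. The URLs below are hypothetical:

```html
<!-- On each paginated page, e.g. /widgets?page=3,
     pointing to the "view all" version of the category -->
<link rel="canonical" href="https://example.com/widgets/view-all">
```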
Remove or Consolidate Pages
Poor-performing content that delivers little organic traffic also contributes to index bloat. These include forgotten pages with outdated content or pages with similar content.
However, before you start deleting pages, create a plan. Content pruning should be done thoughtfully to avoid a negative impact on SEO and site authority.
A content audit can help determine whether to fix index bloat by merging pages and consolidating keywords or removing pages altogether. You also need to use permanent (301) redirects to ensure users don’t end up on dead pages. Proper redirects carry over any link equity the pages have built up and ensure the correct URLs are indexed.
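A permanent redirect can be set up at the server level. The paths below are hypothetical examples, shown for both Apache and nginx:

```text
# Apache (.htaccess): permanently redirect a pruned page
Redirect 301 /old-thin-page https://example.com/consolidated-guide

# nginx equivalent
location = /old-thin-page {
    return 301 https://example.com/consolidated-guide;
}
```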
Remove Indexed Pages With URL Removal Tool
You can request the removal of specific URLs from Google in Google Search Console. On the left-hand side, select ‘Removals.’
Click ‘New Request,’ and enter your URL.
Note the instructions — you have about six months to delete the URL or noindex it. If you fail to update your robots.txt or meta robots tag and choose to keep the URL, Google will crawl and index the page again. Remember to also remove internal links pointing to any page you want removed from Google’s index.
Need a Site Audit?
Support your growing business with a website optimized to pull in organic traffic. Our SEO audit service can identify the obstacles that hinder your online visibility and prevent your site from ranking well in search results. Request a free consultation and find out how to make the most of your digital presence.