Fine-tuning which parts of your WordPress blog are indexed by Google, without a plugin

The problem with WordPress’s dynamically-generated index pages

One of the big drawbacks of WordPress is the incredible number of pages it generates, matching all sorts of combinations including tags, categories, months and years. While it’s convenient to have the ability to generate these pages on-the-fly when the corresponding queries are entered, it’s not always realised that these pages will normally also get indexed by Google in the same way as individual posts and pages.

This matters more than one might at first think, for two reasons:

  • firstly, it means the index of one’s site will end up being rather messy, with a lot of duplicate and overlapping content, making it rather difficult for someone who’s looking for an item to find it;
  • secondly, duplicate content has long been one of the factors in Google’s page-ranking algorithm, and a rather negative one: if your site’s cluttered up with pages basically recycling identical content, you’ll be doing your page rank no favors.

The drawbacks of plugins

As with everything in WordPress, there are plugins (plugins for everything under the sun) that claim to help you solve this issue, and plugins are the best solution for many people. The trouble with plugins, however, is that by installing them you’re effectively giving up control of one aspect of your site setup to a third party. Because a plugin will never be custom built with your needs in mind, and because plugins are often written to cater for every possible requirement, you’ll almost always end up with unnecessary bloat.

One of the most irritating things about plugins is the habit they have of hooking into one’s HTML, inserting inline styles that pollute the markup and slow down loading time.

Because of this, it’s good practice to regularly run down one’s list of installed WordPress plugins, and check on which ones can be dispensed with. Simple tasks can usually be accomplished as well, if not better, and certainly more cleanly, by googling the issue and then writing a few lines of your own code, armed with what you have found.

How to use conditional statements in WordPress

WordPress makes abundant use of PHP conditional statements, which in their most basic form look like this:

<?php if ( condition ) { ?> ...... <?php } ?>

Using WordPress-specific conditional statements, you can (and probably do in your blog) target certain parts of your site to apply formatting or trigger certain behaviour. Because the site is dynamic and everything is displayed on the fly, information is fetched from the database and displayed only after the conditions stated in the tag have been applied. Popular statements include is_page(), is_category() and is_single(). The WordPress Codex provides a full list of conditional tags.

Conditional tags are often misused in WordPress. Database queries are very resource-intensive and ought to be reserved for content that changes constantly and cannot be known in advance. Using conditional statements to display your name, for instance, when you’re the only author on your blog, is a bit absurd, yet occurs quite a lot. I prefer to hard-code as much of my theme as possible, using html in preference to PHP, which speeds the site up considerably.

PHP, however, comes into its own for one purpose: fine-tuning what HTML ends up in your site’s source code. I use this extensively in my own header, which lets me call different CSS and scripts depending on which page is displayed:

<!-- Stylesheets -->
<?php if ( is_front_page() ) { ?>
<link rel="stylesheet" href="/home.css" />
<?php } else { ?>
<link rel="stylesheet" href="/style.css" />
<?php } ?>

Fine-tuning which of your content gets indexed by Google using PHP conditional tags

You can use exactly the same structure to determine which content is indexed by Google. There’s certainly no harm in having a robots.txt file as well, but the trouble is that not all robots obey its rules: even Google has been reported to ignore certain robots rules.

<!-- Meta tags -->
<?php if ( is_front_page() || is_page( 'about' ) || is_single() ) { ?>
<meta name="robots" content="index, follow" />
<?php } else { ?>
<meta name="robots" content="noindex, nofollow" />
<?php } ?>

With this in your header.php file, you do not even need to have a separate robots.txt.

The above code will tell robots to index the front page (which WordPress allows you to have as a separate page, as it is on this site), the About page and individual blog posts. Everything else is given a <meta name="robots" content="noindex, nofollow"> tag, meaning your duplicate content won’t be indexed. Your page rank won’t be wasted on pages whose purpose is to help internal navigation and not external indexation. And only your fresh, relevant content will be indexed.
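The same rule can also live in the theme’s functions.php instead of header.php. What follows is only a sketch under stated assumptions: it assumes WordPress 5.7 or later, where core prints the robots meta tag itself and exposes its directives through the wp_robots filter, and the function name mytheme_robots is made up for illustration.

```php
<?php
// functions.php -- sketch only; "mytheme_robots" is a hypothetical name.
// Assumes WordPress 5.7+, where core outputs the robots meta tag and
// passes its directives through the 'wp_robots' filter as an array.
function mytheme_robots( $robots ) {
    // Same condition as the header.php snippet:
    // front page, About page and individual posts stay indexable.
    if ( is_front_page() || is_page( 'about' ) || is_single() ) {
        return $robots; // leave core's default directives untouched
    }

    // Everything else (tag, category and date archives, search results...)
    // is marked noindex, nofollow.
    $robots['noindex']  = true;
    $robots['nofollow'] = true;

    return $robots;
}
add_filter( 'wp_robots', 'mytheme_robots' );
```

Pages that match the condition simply carry no noindex directive, which robots treat as permission to index. If you go this way, drop the hard-coded meta tag from header.php so that each page ends up with a single robots tag.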

Targeting only part of each page’s content for indexing with googleon and googleoff tags

But indexing only certain pages may not be enough. On each page, only part of the content is actually relevant for indexing purposes. Navigation text, sidebar items about similar posts, tweets, and other tangential content (or, to use the HTML5 term, asides) serve no useful purpose when indexed on the same page as an article featuring original content about one subject. If you let these parts of your blog post pages get indexed, they will show up in searches even though they aren’t actually relevant to the main article on that page.

An article by Perishable Press, Tell Google to Not Index Certain Parts of Your Page, alerted me to the fact that there’s an accepted way of fine-tuning the content you designate for indexing even within one page: googleon and googleoff tags [i].

<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->

By adding googleon and googleoff tags to my single.php, front_page.php and about.php templates (but you could target any content you wanted), so that robots index only the actual title and text of the blog post within the template, I can be confident that only relevant content will show up in searches [ii].
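As a rough illustration of where the comment pairs might go, here is a stripped-down single.php. It is a minimal sketch and not the actual template used on this site; it assumes a theme with a standard sidebar.php, and the markup around the post is invented for the example.

```php
<?php
// single.php -- minimal sketch; the <article> markup is illustrative only.
get_header(); ?>

<?php while ( have_posts() ) : the_post(); ?>
<article>
<h1><?php the_title(); ?></h1>
<?php the_content(); ?>
</article>
<?php endwhile; ?>

<!--googleoff: index-->
<?php get_sidebar(); // similar posts, tweets and other asides ?>
<!--googleon: index-->

<?php get_footer(); ?>
```

The post title and text sit outside any googleoff region, while the sidebar is wrapped in one. Navigation that is output from header.php would need its own googleoff/googleon pair inside that file.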

  1. There are four types of googleon/googleoff tags: (1) index: content surrounded by “googleoff: index” will not be indexed by Google; (2) anchor: anchor text for any links within a “googleoff: anchor” area will not be associated with the target page; (3) snippet: content surrounded by “googleoff: snippet” will not be used to create snippets for search results; (4) all: content surrounded by “googleoff: all” is treated with all attributes: index, anchor, and snippet.
  2. These tags are specific to Google, so other search engines will continue to crawl your pages from top to bottom.