For the love of URLs

Aula Polska is a regular meetup of Warsaw’s entrepreneurial community that I loved attending until… its website had a facelift. It took me a while to realize that I wasn’t seeing announcements of new meetings anymore. They had broken their URLs, including the RSS feed I subscribed to. I scoffed: how dumb! And then I thought, “wait a second… did I check the URL for Michał’s Bites’ RSS feed after moving to WordPress?” I was the pot calling the kettle black.

The process that breaks links around the web is known as link rot – an epidemic that degrades the online experience:

  • readers who bookmarked pages cannot find them again,
  • links from other websites stop working, and
  • search engines cannot crawl the missing pages, so those pages drop out of the results and lose the ranking they had earned.

Clearly, the damage hits everyone who ever interacts with your content. Luckily, you also have the power to save everyone the trouble by showing some love for your URLs. In particular, when changing CMS platforms:

  • strive to keep the URLs identical,
  • set up permanent (301) redirects for URLs that inevitably change,
  • clearly signal pages that are truly removed and offer alternatives.
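
The three rules above can be sketched as a single lookup. This is a minimal illustration, not how WordPress or any plugin does it: the `MOVED` and `GONE` tables, the paths in them and the `resolve` function are all hypothetical names made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical old-to-new path mapping; replace with your site's real paths.
MOVED = {
    "/rss.xml": "/feed/",
    "/archive": "/blog/",
}

# Hypothetical pages that were removed on purpose and have no successor.
GONE = {"/contest-2012"}


def resolve(url, live_paths):
    """Return (status, location) for a requested URL.

    live_paths is the set of paths the new platform actually serves.
    """
    path = urlsplit(url).path or "/"
    if path in live_paths:
        return 200, path  # URL kept identical: nothing to do
    key = path.rstrip("/") or "/"  # treat "/archive/" like "/archive"
    if key in MOVED:
        return 301, MOVED[key]  # permanent redirect to the new home
    if key in GONE:
        return 410, None  # deliberately removed
    return 404, None  # unknown URL
```

Running every URL of the old site through a function like this before the switch gives a quick list of what would break.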

In March 2013 I had to move Michał’s Bites from Posterous and chose a self-hosted WordPress platform, partly because it allowed me to configure and maintain identical URLs for most pages, particularly the posts and tags. The Safe Redirect Manager plugin took care of the remaining pieces. I forgot about the RSS feed because it’s somewhat removed from sight and only ever accessed by feed readers. Neither clicking through the new platform nor running an automated link checker brought it up.

What would have helped were server logs, parsed by any reasonable tool such as AWStats. Put your new platform online, then keep an eye on the reports, particularly the 404 errors section. You’ll see the most commonly requested but missing URLs. Fix or redirect them where possible.
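
If no log analyzer is at hand, a rough count of missing URLs can be pulled from the raw access log. The sketch below assumes lines in the Common or Combined Log Format used by default in Apache and nginx; the `top_missing` function and its regular expression are my own illustration, not part of AWStats.

```python
import re
from collections import Counter

# Picks the request path and status code out of a log line such as:
# 1.2.3.4 - - [10/Mar/2013:13:55:36 +0100] "GET /rss.xml HTTP/1.1" 404 512
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def top_missing(lines, limit=10):
    """Return the most requested paths that answered with a 404."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts.most_common(limit)
```

Feeding it an open log file, for example `top_missing(open("access.log"))` with a file name of your own, lists the paths worth fixing first.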

Some pages will be truly gone – removed, with no counterpart in your new website structure. These should correctly return a 404 (Not Found) or 410 (Gone) HTTP status code. You can still go the extra mile and make the experience of hitting links like those slightly more bearable.

GitHub's 404 page

A 404 page can be fun, even self-mocking, but above all it should give users a way to move forward from there:

  • At the very least, a big, prominent link to your home page.
  • A list of the latest or most popular articles.
  • A search box for the website’s content.
  • Or a list of pages related to the one that’s gone now, if you can reliably set up that kind of intelligence. Links to random pages help nobody.

Don’t expect miracles – most readers will still bounce and head somewhere else upon hitting a 404, but at least they may smile and leave with a pleasant memory. Some might stay and dive into your content.

Remember, the Internet is forever and once you publish something, everybody’s free to link to, index and bookmark that piece of content. Help them find it again – by any means they choose – and you’ll be building relationships that last.

The crazy state of structured data markup

You want to look good for Google. You want it to understand your website, so that it comes up in results often and stands out. You also want Facebook and Twitter to display your links prominently in their crowded timelines. Because you want all that, you’ll likely turn to structured data markup, like I did with the new design of Michał’s Bites. And then you’ll shake your head in disbelief.

Designing a fresh look for Michał’s Bites nudged me to look into Schema.org – a joint project of Bing, Google, Yahoo! and Yandex, meant to help them make sense of the contents of a page. In the words of its publishers:

On-page markup enables search engines to understand the information on web pages and provide richer search results in order to make it easier for users to find relevant information on the web.

And while choosing the right entity for your particular page element isn’t always straightforward, it’s easy to mark it up:

<article itemscope itemtype="http://schema.org/BlogPosting">
  <h1 itemprop="name headline">The crazy state of structured data markup</h1>
  <section itemprop="articleBody">
    <p>...</p>
  </section>
  <p>Written by <span itemprop="author" itemscope itemtype="http://schema.org/Person"><span itemprop="name">Michał Paluchowski</span></span></p>
</article>

It’s consistent. Most entities have a way to mark up a name, url or description. Some have unique properties, such as the articleBody of a BlogPosting above. It’s readable for both computers and humans.
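
To see the “readable for computers” part in action, the markup can be walked with Python’s standard `html.parser` module. This is only a rough outline reader, assuming nothing about how search engines actually process the page: it lists the item types and property names it meets and ignores nesting and values. The `MicrodataOutline` class is a name invented for the example.

```python
from html.parser import HTMLParser


class MicrodataOutline(HTMLParser):
    """Collect itemtype values and itemprop names in document order."""

    def __init__(self):
        super().__init__()
        self.types = []
        self.props = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("itemtype"):
            self.types.append(attributes["itemtype"])
        if attributes.get("itemprop"):
            # One attribute may carry several names, e.g. "name headline".
            self.props.extend(attributes["itemprop"].split())


outline = MicrodataOutline()
outline.feed(
    '<article itemscope itemtype="http://schema.org/BlogPosting">'
    '<h1 itemprop="name headline">Title</h1>'
    '<section itemprop="articleBody"><p>...</p></section>'
    '<span itemprop="author" itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Author</span></span>'
    '</article>'
)
print(outline.types)
print(outline.props)
```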

But wait, there’s more.

Since I’m using WordPress, some of its widgets output markup of the microformats brand. These serve a purpose similar to that of Schema.org, but with a much smaller dictionary of entities. Little did I know that Google scanned that markup too, and it started complaining via Webmaster Tools that my implementation was incorrect:

Google Webmaster Tools Microformats Error

I complied, included the missing markup, and the code became:

<article class="h-entry" itemscope…
  <h1 class="p-name" itemprop=...

There’s still more. Facebook developed its Open Graph protocol, and Twitter has its Cards, both of which – with some extra markup – allow me to control and improve the way content appears in the services’ respective timelines. Otherwise Facebook may display a random snippet of text with a link, starting with something as “meaningful” as “Comments closed”.

This meant adding yet more markup to my code:

<meta property="og:type" content="article">
<meta property="og:title" content="The crazy state of structured data markup">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="The crazy state of structured data markup">

Now my content was nicely highlighted on Twitter (although the card doesn’t always show up).

As a consequence, I have the same data three or four times in the page:

  1. reader-visible HTML, with the overhead of Schema.org and Microformats markup,
  2. META markup for Facebook,
  3. META markup for Twitter.

It’s like adding extra CSS for some older versions of Internet Explorer with <!--[if lt IE 9]>. Overhead and waste.
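
The duplication in the page itself stays, but it can at least be kept out of the source you maintain by generating every tag from a single value. A minimal sketch, assuming the template layer can call a helper; `social_meta` is a hypothetical function, not a WordPress API.

```python
from html import escape


def social_meta(title, og_type="article", card="summary"):
    """Build the Open Graph and Twitter Card tags from one title."""
    tags = [
        ("property", "og:type", og_type),
        ("property", "og:title", title),
        ("name", "twitter:card", card),
        ("name", "twitter:title", title),
    ]
    return "\n".join(
        # escape() keeps quotes and angle brackets in the title from breaking the attribute
        f'<meta {attribute}="{key}" content="{escape(value, quote=True)}">'
        for attribute, key, value in tags
    )


print(social_meta("The crazy state of structured data markup"))
```

The title now lives in one place, so a typo fixed there is fixed in every tag.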

There is, perhaps, an end to this in sight. The W3C, just a few months ago, published a draft specification of Microdata, the itemscope, itemtype and itemprop syntax that the Schema.org markup above is written in, as part of HTML5.

I like the Schema.org specification best, because it’s rich, consistent and impossible to confuse with other markup. Using the class attribute for structured data is logically sound (<article class="blog-post"> speaks well of the type of the article), but if you place it in any slightly complex web layout, maintained by many people, it’s easy to mix and confuse with styling values. That means it’s likely to be accidentally removed or changed.

At the same time, Schema.org markup gets added right onto the content markup, without duplicating the data in the <head> or elsewhere, which would be another maintenance nightmare waiting to happen.

Both Facebook and Twitter are certainly powerful enough to enforce their own solutions (think Facebook’s latest announcement of Hack), but if Google and Bing were able to come together and agree on one standard, I’m sure other big names can join the party too. The fewer standards on the market, the broader the adoption, the easier the parsing and, ultimately, the better the content served to end users.