Web Authoring Statistics: Metadata

The <meta> element

The meta element has few attributes:

content, http-equiv, name, lang, scheme, charset, value, "" (the empty string), "and", and "the".

The first thing to notice is the huge number of markup errors involving the meta element. Markup such as:

<meta name=description value=the best site for hot air balloons>

...which results in a meta element with eight attributes, and which doesn't help anyone (least of all the search engines it's aimed at, since the second attribute should have been content, not value, and therefore the entire element is likely to be ignored). These markup errors explain why value, "", "and", and "the" show up as "attributes" on the graph above, as well as most of the thousands of other attributes that aren't shown here.
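To see where the eight attributes come from, the markup can be fed to a tolerant parser. The following is a minimal sketch using Python's standard html.parser module; the class and function names are made up for this illustration, and this is not necessarily how the survey itself counted attributes.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records the attribute list of every <meta> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.meta_attrs = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_attrs.append(attrs)


def meta_attribute_names(markup):
    """Return the attribute names of the first <meta> element in markup."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return [name for name, _value in collector.meta_attrs[0]]


# The broken markup from the text: every unquoted word becomes an attribute.
broken = "<meta name=description value=the best site for hot air balloons>"
names = meta_attribute_names(broken)
print(len(names), names)
```

Only name and value carry a value here; the remaining six words are parsed as valueless attributes. Quoting the value and using content, as in <meta name="description" content="the best site for hot air balloons">, yields the intended two attributes.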

Before those we have the mysterious charset attribute. This comes from another, just as common, markup error:

<meta http-equiv=Content-Type content=text/html; charset=windows-1252>

If there are no quotes around the content attribute's value, this looks like an element with three attributes, the third of which is called charset.
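The effect of the missing quotes can be checked by parsing both spellings and comparing the resulting attributes. This is a rough sketch with Python's standard html.parser; the helper names are hypothetical, and real browsers use their own parsing algorithm, although they split these two examples the same way.

```python
from html.parser import HTMLParser


class MetaAttrs(HTMLParser):
    """Collects the attributes of each <meta> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.found.append(dict(attrs))


def first_meta(markup):
    """Parse markup and return the attributes of its first <meta> element."""
    parser = MetaAttrs()
    parser.feed(markup)
    parser.close()
    return parser.found[0]


# Unquoted: the space after "text/html;" ends the content attribute's value.
unquoted = first_meta(
    "<meta http-equiv=Content-Type content=text/html; charset=windows-1252>"
)
# Quoted: the whole string, including the charset parameter, stays in content.
quoted = first_meta(
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
)
print(sorted(unquoted))
print(sorted(quoted))
```

In the unquoted case the content attribute holds only "text/html;" and charset appears as a separate third attribute, which is what the chart picked up.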

Continuing up the chart we see scheme and lang. Further research will be needed to find out how scheme is used.

Finally we have the three most commonly used attributes of the meta element, present on most pages: http-equiv and name (the two possible key attributes for the metadata) and content (the value attribute).

Here are the most common values for the http-equiv attribute:

content-type, content-language, pragma, expires, content-style-type, imagetoolbar, cache-control, content-script-type, pics-label, and keywords.

...and for the name attribute:

keywords, description, robots, generator, author, revisit-after, copyright, progid, distribution, and language.

Both together:

content-type, keywords, description, robots, generator, author, revisit-after, content-language, copyright, progid, pragma, expires, distribution, content-style-type, language, rating, publisher, title, resource-type, and imagetoolbar.

Content-Type is naturally the most-used value, since it's the standard way of giving the character encoding of an HTML page.
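As an illustration of how a consumer might read the encoding out of such a content value, here is a small sketch. It borrows Python's email.message.Message class purely as a convenient parser for Content-Type style strings; that choice and the function name are assumptions of this example, and browsers follow their own encoding-detection rules.

```python
from email.message import Message


def charset_from_content(content_value):
    """Return the charset parameter of a Content-Type style string, or None."""
    msg = Message()
    msg["Content-Type"] = content_value
    # get_content_charset() returns the charset parameter in lowercase,
    # or None when the parameter is absent.
    return msg.get_content_charset()


print(charset_from_content("text/html; charset=windows-1252"))
print(charset_from_content("text/html"))
```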

Next we have two name values: keywords, which these days is mostly useless, ironically, and description, which is still somewhat useful.

Next, with progressively less usage, come four more name values: robots, to control whether spiders should index the page or follow any of its links; generator, used to indicate what tool was used to generate the page; author, used to give the name of the author; and revisit-after, supposedly used to tell search engines how often to recrawl the page. To our knowledge only one search engine has ever supported revisit-after, and that search engine was never widely used; at this point, it is nothing more than a good luck charm. A remarkably widely used one. More pages use the completely worthless <meta name="revisit-after"> than use the <em> element!

Next is the Content-Language value (used on the http-equiv attribute). Almost as many people use this as specify the lang attribute on the html element. The HTML5 spec currently allows the http-equiv attribute only for the one case of setting the character encoding, which can't really be dropped, as the graph above demonstrates. However, http-equiv="Content-Language" is supported by at least one browser, and as we see here, it is widely used, so maybe http-equiv should not be removed after all.

Next we have the last sane name value worth talking about, namely copyright. This, and the fact that copyright is a really popular class name, suggests that either <meta name="copyright"> should be an official way of giving the copyright, or that the Web needs a <copyright> element, or both.

progid seems to be a sign of pages made by Microsoft editors (yes, that's a lot of Microsoft-generated pages... or a lot of copy-and-paste from pages made by Microsoft tools).

The http-equiv values pragma and expires are attempts at bypassing caches without having to set the HTTP headers correctly. These uses are probably unnecessary: in any scenario where there is a legitimate reason to limit caching, the author is going to have enough control over the server to send the appropriate headers. In addition, the meta tags can't be considered reliable (e.g. proxies and transparent caches aren't going to honour them).
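For comparison, sending the real headers takes very little code once the author controls the server. The following is a minimal sketch of a WSGI application in Python; the function name, the response body, and the particular header values are choices made for this example, not a recommendation from the survey.

```python
def app(environ, start_response):
    """Minimal WSGI application that asks caches not to store the response."""
    body = b"<!DOCTYPE html><title>Fresh every time</title>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Real HTTP headers, visible to browsers, proxies and transparent caches.
        ("Cache-Control", "no-store"),
        # Legacy HTTP/1.0 counterpart of the cache directive above.
        ("Pragma", "no-cache"),
        # An invalid date, which caches treat as already expired.
        ("Expires", "0"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be served with the standard library, for example wsgiref.simple_server.make_server("127.0.0.1", 8000, app).serve_forever(), where the address and port are arbitrary.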

The distribution value is supposedly used to control who can access the document. Search engine "optimisers" tell people to set it to "global" to ensure that search engines index their pages.

The http-equiv="Content-Style-Type" value corresponds to an HTTP header, defined by HTML4, that supposedly controls the language the page uses in the style attribute. In theory, it has no default value, and so any page that uses the style attribute must specify it. Since there is only one language that can be used in the style attribute, it can only ever usefully be set to one value, text/css. And since all browsers assume that it is set to text/css by default, there is really no reason to give it. We were very surprised to see this many people set it. The Content-Script-Type value is the same but for event handler attributes like onclick.

The presence of the imagetoolbar value among the top ten most frequently used http-equiv values is probably a sign to Microsoft that people don't like their popup image toolbar. One of the common name values is mssmarttagspreventparsing, too. Sorry guys!

The Dublin Core people can take some comfort from the fact that although their keywords didn't appear in the top ten chart above, they were quite well featured in the next few dozen. Here are the ten most used dc.foo values, most popular first: dc.title, dc.language, dc.creator, dc.subject, dc.publisher, dc.description, dc.identifier, dc.date, dc.format, dc.rights. In fact the order maps relatively closely to the frequency of similar metadata in other constructs, like class names or rel values. Nice to know people are consistent!

Finally, it is worth noting the confusion caused by having two "key" attributes on the meta element. The most common http-equiv value, ignoring the magic content-type incantation, is content-language. The 19th most common name value is also content-language. In fact this name value is specified once for every five occurrences of the http-equiv value! One in five pages specifying its language using meta is confused as to which key attribute is appropriate! Similarly, the most common name value is the same as the 10th most common http-equiv value, keywords. The http-equiv form is specified once for every fifty or so occurrences of the name form.
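A comparison like this boils down to tallying the same key under both attributes. Below is a rough sketch of such a tally in Python; the sample pages are invented for illustration, the names are hypothetical, and the sketch counts occurrences rather than pages, so it should not be read as the survey's actual method.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeyTally(HTMLParser):
    """Counts lowercased values of name and http-equiv on <meta> elements."""

    def __init__(self):
        super().__init__()
        self.counts = {"name": Counter(), "http-equiv": Counter()}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        for attr, value in attrs:
            # Skip valueless attributes such as a bare "name".
            if attr in self.counts and value:
                self.counts[attr][value.strip().lower()] += 1


def tally(pages):
    """Return combined name and http-equiv counters for a list of HTML strings."""
    totals = {"name": Counter(), "http-equiv": Counter()}
    for page in pages:
        parser = MetaKeyTally()
        parser.feed(page)
        parser.close()
        for key in totals:
            totals[key].update(parser.counts[key])
    return totals


# Hypothetical sample pages, invented for illustration only.
sample_pages = [
    '<meta http-equiv="Content-Language" content="en">',
    '<meta name="content-language" content="en">',
    '<meta http-equiv="content-language" content="fr">',
    '<meta name="keywords" content="balloons">',
]
counts = tally(sample_pages)
print(counts["http-equiv"]["content-language"], counts["name"]["content-language"])
```

Dividing one counter entry by the other gives the kind of ratio quoted above for content-language and keywords.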