“Why is your website HTML 4.01?” (9th May 2006)

Sean Fraser from Elementary Group Standards [website offline since early 2011] e-mailed me that question. Here’s my reply, as I sent it. Following it is my lengthy comment on 456 Berea Street, on a page which raised much awareness about Sean’s survey of 50 ‘elite’ standardista websites.

Response to Sean Fraser

Hi Sean, thanks for your interest. I’ll be happy to answer questions on this subject.

It takes me a long time to write articles because I have to fit them in around my day job. I’m not a professional blogger!

Site Surgeon actually uses XHTML 1.1 sent as application/xhtml+xml to devices which support it. Devices which do not mention that MIME type in their HTTP Accept header get an HTML 4.01 Strict document. The script doesn’t check “q values” and has potential bugs, but is available here:

[These scripts are pointless so I removed the page some time later.]

It always sends the XHTML 1.1 version to the W3C validator.
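As a rough illustration of the negotiation described above, here is a minimal sketch in Python. It is not the removed script. The function names are hypothetical, the `W3C_Validator` user-agent substring is an assumption about how the validator identifies itself, and the q-value handling goes one step beyond what the original script did (it treats `q=0` as "not accepted").

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"


def accepted_types(accept_header):
    """Parse an HTTP Accept header into a {media_type: q} mapping."""
    result = {}
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when it is not given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not accepted"
        result[media_type] = q
    return result


def choose_content_type(accept_header, user_agent=""):
    """Return the Content-Type to serve for this request."""
    # Assumption: the validator's User-Agent contains "W3C_Validator".
    if "W3C_Validator" in user_agent:
        return XHTML
    # Serve XHTML only when the type is explicitly listed with q > 0.
    # Wildcards such as */* are deliberately ignored.
    if accepted_types(accept_header).get(XHTML, 0.0) > 0:
        return XHTML
    return HTML
```

A response chosen this way depends on both request headers, so it would normally be sent with a matching `Vary` header so that caches keep the two variants apart.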

I apply this technique on Site Surgeon to demonstrate the (lack of) differences between correctly authored XHTML and the equivalent HTML. It also shows clients that I can work with modularised XHTML if they need to. In truth, HTML 4.01 Strict would be the better choice for this website since it only uses the elements and attributes available in HTML.

I do use HTML exclusively on my other websites, though:

I do this because HTML performs slightly better than XHTML when sent as text/html. XHTML 1.0 “may” be sent as text/html but it has:

Items 2 and 3 are treated as invalid markup by HTML user agents. These slightly slow rendering times, as the user agent must perform some (fairly simple) markup corrections. They add filesize, too.

In the text/html environment, many end tags are optional (or even forbidden). Leaving these out can reduce page sizes by a significant amount, especially if there are lots of list items or any data tables.

As such, an XHTML 1.0 page will always be slightly less efficient than the equivalent HTML, even without using all the optimisations HTML allows. Because XHTML 1.0 uses the same elements and attributes as HTML, it can only do the things HTML can do. There is no advantage to using XHTML in the text/html environment.

Since HTML is more efficient in terms of download speeds and processing times, that tips the balance for me.

Longer Reply on 456 Berea Street

To cut a long story short, HTML is the best choice. Indeed, that short list of proofs is what this longer reply became after I boiled it down.

Summary

  1. An XHTML DOCTYPE doesn’t make browsers process your document using XHTML rules. Only Content-Type can do that and only in browsers which support it (many don’t).
  2. XHTML 1.0 compatible with Appendix C is limited to the elements, attributes and techniques of HTML 4.01.
  3. XHTML 1.1 must not be sent with a Content-Type of text/html because it is not compatible with HTML rules.
  4. HTML 4.01 is always more efficient than the equivalent XHTML.
  5. HTML does everything XHTML 1.0 compatible with Appendix C can do, yet is more efficient.
  6. HTML is the better format for use in text/html documents.

Content-Type

The key to this is the Content-Type header being used.

When a server sends a file to a browser, it first sends a few lines of text explaining what it is about to send. These lines of text are called the HTTP Response Headers.

The HTTP Response Headers for this page are:

Transfer-Encoding: chunked
Date: Tue, 20 Jun 2006 11:02:23 GMT
Content-Type: text/html; charset=iso-8859-1
Server: Apache/2.2.0
X-Powered-By: PHP/5.1.2
Vary: Accept,User-Agent

200 OK

You can set these up using your server configuration files. For Apache, these include the httpd.conf and .htaccess files. For example, to make the server include a Content-Type header for all HTML files (.htm or .html), you’d do something like this:

AddType 'text/html; charset=utf-8' .htm .html

This applies to all types of file. To send all Cascading Style Sheet (CSS) files (.css) with the correct Content-Type header, you’d use something like this:

AddType 'text/css; charset=utf-8' .css

For PNG images (.png) you’d use something like this:

AddType 'image/png' .png

For XHTML documents (.xhtml) you’d use something like this:

AddType 'application/xhtml+xml' .xhtml

The Content-Type header tells the browser what format the data it is about to receive is in. The browser decides how to handle the data according to this header.

When the text/html header is used, the browser processes the document using the rules of HTML.

When the application/xhtml+xml header is used, the +xml part means browsers which support it will process the document using XML rules. The /xhtml part means they can treat the elements as being part of the XHTML namespace (<p> means ‘paragraph’, etc) as standardised in RFC3023.

DOCTYPE

The DOCTYPE does not make browsers switch from HTML rules to XHTML rules. Only Content-Type has this effect. If you send a document which uses XHTML markup but uses a Content-Type of text/html, browsers will attempt to process it using HTML rules.
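To make the point concrete, here is a deliberately simplified model in Python of how a browser might pick its parser. The function name `parser_for` is hypothetical and real browsers are more involved, but the shape of the decision is the same: the only input is the Content-Type, and the DOCTYPE never enters into it.

```python
def parser_for(content_type):
    """Pick a parser from the Content-Type header value alone."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        return "html"
    if media_type in ("text/xml", "application/xml") or media_type.endswith("+xml"):
        return "xml"
    return "other"


# The same XHTML-flavoured markup gets a different parser depending
# only on the header it was served with.
print(parser_for("text/html; charset=iso-8859-1"))  # html
print(parser_for("application/xhtml+xml"))          # xml
```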

HTML is Here to Stay

Many devices do not support application/xhtml+xml, so you must provide a text/html version to make sure everyone can access your website. Your website will mainly be processed using HTML rules because most people are using devices which do not support the rules of XHTML.

IE 7.0 will not support the rules of XHTML, so HTML will remain the mainstream for some years. HTML browsers will be using the web indefinitely.

Furthermore, HTML5 could become a more practical format for commercial use than XHTML 2. This means that XHTML rules may never become the mainstream. Instead, the HTML rules may simply be developed every few years in new versions of HTML, much like they were during the 1990s.

HTML is more Efficient

When you write an XHTML 1.0 page compatible with Appendix C and send it as text/html, your markup is processed using HTML rules. This means your pages have a fair amount of needless baggage:

In the HTML 4 Elements Table, you can see that the “Start Tag” and “End Tag” of many elements are “Optional”. Optional tags have been allowed in HTML since HTML 2.0 and are a fundamental part of the language, so they are safe to use.

On pages with many paragraphs, tables or lists, these add up to a significant amount. Around 5% of filesize can sometimes be saved by using HTML and removing the optional end tags.
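One way to check the saving on your own pages is to measure it. The following is a rough Python sketch with hypothetical names. It uses a naive regular expression rather than a real HTML parser, and it assumes every listed end tag can safely be dropped wherever it appears. That is not quite true in general: for example, loose text directly after a `</p>` would be absorbed into the paragraph, so treat the figure it reports as an estimate.

```python
import re

# A subset of the elements whose end tag is optional in HTML 4.01.
OPTIONAL_END_TAGS = (
    "p", "li", "dt", "dd", "tr", "td", "th",
    "thead", "tbody", "tfoot", "colgroup", "option",
)

_END_TAG = re.compile(r"</(?:%s)\s*>" % "|".join(OPTIONAL_END_TAGS), re.IGNORECASE)


def strip_optional_end_tags(html):
    """Remove the optional end tags listed above from an HTML string."""
    return _END_TAG.sub("", html)


def saving(html):
    """Return (bytes saved, percentage saved) for a non-empty HTML string."""
    before = len(html.encode("utf-8"))
    after = len(strip_optional_end_tags(html).encode("utf-8"))
    return before - after, 100.0 * (before - after) / before


sample = "<ul><li>One</li><li>Two</li></ul>"
print(strip_optional_end_tags(sample))  # <ul><li>One<li>Two</ul>
print(saving(sample)[0])                # 10
```

The tiny sample exaggerates the percentage because it is almost all markup. A real page with more text content would show a much smaller proportion.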

Correction from Tommy Olsson

The correction below, which I accept, was prompted by my misreading of RFC3023. (The bottom of page 5, specifically.)

@Ben: Good summary, but there’s one error in what you wrote. The Content-Type header is only used (by browsers) to choose which parser to use (XML or SGML). That header does not make an XHTML document XHTML, only XML (despite the xhtml in the MIME media subtype).

The thing that says that an XHTML document is really XHTML is the xmlns attribute, with the correct value, on the root element. Of course, that’s ignored for non-XML documents, so it takes a combination of Content-Type: application/xhtml+xml and the proper xmlns attribute.

You can use application/xml, or even text/xml, as the Content-Type and still have the document recognised as XHTML, provided that you have the correct xmlns attribute.
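Tommy’s point can be seen with any namespace-aware XML parser. The sketch below uses Python’s standard `xml.etree.ElementTree` module. The helper name `is_xhtml` is hypothetical, and the check only looks at the root element, which is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_xhtml(xml_text):
    """True if the root element is <html> in the XHTML namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as "{namespace}localname".
    return root.tag == "{%s}html" % XHTML_NS


with_ns = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test</title></head><body><p>Hello</p></body></html>"
)
without_ns = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>"

print(is_xhtml(with_ns))     # True
print(is_xhtml(without_ns))  # False
```

Both documents are well-formed XML and parse without error. Only the one carrying the xmlns attribute has elements that an XML processor can recognise as XHTML.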