“Why is your website HTML 4.01?” (9th May 2006)

Sean Fraser from Elementary Group Standards [website offline since early 2011] e-mailed me that question. Here’s my reply, as I sent it. Following that is my lengthy comment on 456 Berea Street, on a page which raised much awareness about Sean’s survey of 50 ‘elite’ standardista websites.

Response to Sean Fraser

Hi Sean, thanks for your interest. I’ll be happy to answer questions on this subject.
It takes me a long time to write articles because I have to fit them in around my day job. I’m not a professional blogger!
Site Surgeon actually uses XHTML 1.1 sent as application/xhtml+xml to devices which support it. Devices which do not mention that MIME type in their HTTP Accept header get an HTML 4.01 Strict document. The script doesn’t check “q values” and has potential bugs, but is available here:
[These scripts are pointless so I removed the page some time later.]
It always sends the XHTML 1.1 version to the W3C validator.
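The original script is no longer available, so the following is only a minimal Python sketch of the negotiation described above, not the code Site Surgeon ran. The function name `choose_content_type` and the `W3C_Validator` User-Agent substring are assumptions made for illustration. Unlike the script described, this sketch treats an explicit `q=0` as a refusal, but it still does not compare q values between types.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def choose_content_type(accept_header: str, user_agent: str = "") -> str:
    """Pick the Content-Type to send, based on the request headers."""
    # Assumption: the W3C validator identifies itself with this substring.
    if "W3C_Validator" in user_agent:
        return XHTML_TYPE

    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != XHTML_TYPE:
            continue
        # The type is mentioned; honour an explicit q=0 (meaning "not acceptable").
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            return XHTML_TYPE

    # Devices that do not mention the XHTML MIME type get HTML 4.01 Strict.
    return HTML_TYPE
```

Because the result depends on both request headers, a server doing this would normally also send `Vary: Accept,User-Agent`, as in the response headers shown later on this page.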
I apply this technique on Site Surgeon to demonstrate the (lack of) differences between correctly authored XHTML and the equivalent HTML. It also shows clients that I can work with modularised XHTML if they need me to. In truth, HTML 4.01 Strict would be the better choice for this website since it only uses the elements and attributes available in HTML.
I do use HTML exclusively on my other websites, though:

Project Cerbera
Calthorpe Park School

I do this because HTML performs slightly better than XHTML when sent as text/html. XHTML 1.0 “may” be sent as text/html but it has:

A slightly longer DOCTYPE than HTML.
Extra baggage in the form of XML-specific attributes such as xmlns and xml:lang. These are ignored by HTML devices, so are dead weight.
“Space-slash” character pairs added to elements which do not need to be closed in the text/html environment. More dead weight.

Items 2 and 3 are treated as invalid markup by HTML user agents. These slightly slow rendering times as the user agent must perform some (fairly simple) markup corrections. They add filesize, too.
In the text/html environment, many end tags are optional (or even forbidden). Leaving these out can reduce page sizes by a significant amount, especially if there are lots of list items or any data tables.
As such, an XHTML 1.0 page will always be slightly less efficient than the equivalent HTML, even without using all the optimisations HTML allows. Because XHTML 1.0 uses the same elements and attributes as HTML, it can only do the things HTML can do. There is no advantage to using XHTML in the text/html environment.
Since HTML is more efficient in terms of download speeds and processing times, that tips the balance for me.

Longer Reply on 456 Berea Street

To cut a long story short, HTML is the best choice. Indeed, that short list of proofs is what this longer reply became after I boiled it down.

Summary

An XHTML DOCTYPE doesn’t make browsers process your document using XHTML rules. Only Content-Type can do that and only in browsers which support it (many don’t).
XHTML 1.0 compatible with Appendix C is limited to the elements, attributes and techniques of HTML 4.01.
XHTML 1.1 must not be sent with a Content-Type of text/html because it is not compatible with HTML rules.
HTML 4.01 is always more efficient than the equivalent XHTML.
HTML does everything XHTML 1.0 compatible with Appendix C can do, yet is more efficient.
HTML is the better format for use in text/html documents.

Content-Type

The key to this is the Content-Type header being used.
When a server sends a file to a browser, it first sends a few lines of text explaining what it is about to send. These lines of text are called the HTTP Response Headers.
The HTTP Response Headers for this page are:
Transfer-Encoding: chunked
Date: Tue, 20 Jun 2006 11:02:23 GMT
Content-Type: text/html; charset=iso-8859-1
Server: Apache/2.2.0
X-Powered-By: PHP/5.1.2
Vary: Accept,User-Agent

200 OK
You can set these up using your server configuration files. For Apache, these include the httpd.conf and .htaccess files. For example, to make the server include a Content-Type header for all HTML files (.htm or .html), you’d do something like this:
AddType 'text/html; charset=utf-8' .htm .html
This applies to all types of file. To send all Cascading Style Sheet (CSS) files (.css) with the correct Content-Type header, you’d use something like this:
AddType 'text/css; charset=utf-8' .css
For PNG images (.png) you’d use something like this:
AddType 'image/png' .png
For XHTML documents (.xhtml) you’d use something like this:
AddType 'application/xhtml+xml' .xhtml
The Content-Type header tells the browser the format of the data it is about to receive. The browser decides how to handle the data according to this header.
When the text/html header is used, the browser processes the document using the rules of HTML.
When the application/xhtml+xml header is used, the +xml part means browsers which support it will process the document using XML rules. The /xhtml part means they can treat the elements as being part of the XHTML namespace (<p> means ‘paragraph’, etc.) as standardised in RFC3023.

DOCTYPE

The DOCTYPE does not make browsers switch from HTML rules to XHTML rules. Only Content-Type has this effect. If you send a document which uses XHTML markup but uses a Content-Type of text/html, browsers will attempt to process it using HTML rules.
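As a rough illustration of that rule, here is a small Python sketch of how a user agent might pick a parser. The function name `parser_for` and its return values are hypothetical, and real browsers are far more involved. The point is simply that the DOCTYPE is not an input to the decision at all.

```python
def parser_for(content_type: str) -> str:
    """Return which set of rules a user agent would apply to the document."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()

    if media_type == "text/html":
        return "html"
    # application/xhtml+xml and other XML types go to the XML parser.
    if media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
        return "xml"
    return "other"
```

A page with an XHTML 1.0 DOCTYPE sent as `text/html; charset=iso-8859-1` still gets `"html"` here, because only the header is consulted.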

HTML is Here to Stay

Many devices do not support application/xhtml+xml, so you must provide a text/html version to make sure everyone can access your website. Your website will mainly be processed using HTML rules because most people are using devices which do not support the rules of XHTML.
IE 7.0 will not support the rules of XHTML, so HTML will remain the mainstream for some years. HTML browsers will be using the web indefinitely.
Furthermore, HTML5 could become a more practical format for commercial use than XHTML 2. This means that XHTML rules may never become the mainstream. Instead, the HTML rules may simply be developed every few years in new versions of HTML, much as they were during the 1990s.

HTML is more Efficient

When you write an XHTML 1.0 page compatible with Appendix C and send it as text/html, your markup is processed using HTML rules. This means your pages have a fair amount of needless baggage:

Slashes in <img> and <meta> tags are treated as invalid attributes or as garbage characters.
Your DOCTYPE is a little longer than the HTML equivalent.
You have an xmlns attribute adding filesize. This attribute is not valid HTML, so it is ignored when sent in a text/html document.
You have xml:lang as well as lang. All of the xml:lang attributes are redundant since HTML browsers will use the lang attribute.
You have lots of tags which are not required (especially end tags).

In the HTML 4 Elements Table, you can see that the “Start Tag” and “End Tag” of many elements are “Optional”. Optional tags have been allowed in HTML since HTML 2.0 and are a fundamental part of the language, so they are safe to use.
On pages with many paragraphs, tables or lists these add up to be significant. Around 5% of filesize can sometimes be saved by using HTML and removing the optional end tags.
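To estimate the saving for a particular page, something like the following Python sketch could be used. The names `strip_optional_end_tags` and `percent_saved` are made up for illustration, and the tag list is only a subset of the elements whose end tags HTML 4.01 marks as optional. The regular expression is deliberately naive: it does not skip comments or script content, and blindly removing an end tag can change the meaning of a document where text follows it directly, so treat this as a measuring tool and not as a minifier.

```python
import re

# A subset of the elements whose end tags are optional in HTML 4.01.
OPTIONAL_END_TAGS = ("p", "li", "dt", "dd", "tr", "td", "th", "option")
END_TAG_RE = re.compile(
    r"</(?:%s)\s*>" % "|".join(OPTIONAL_END_TAGS), re.IGNORECASE
)


def strip_optional_end_tags(html: str) -> str:
    """Remove the optional end tags listed above (naive, regex-based)."""
    return END_TAG_RE.sub("", html)


def percent_saved(html: str) -> float:
    """Percentage of bytes saved by dropping the optional end tags."""
    original = len(html.encode("utf-8"))
    if original == 0:
        return 0.0
    stripped = len(strip_optional_end_tags(html).encode("utf-8"))
    return 100.0 * (original - stripped) / original
```

Running `percent_saved` over a saved copy of a page gives a quick figure to compare against the rough 5% mentioned above. The result will vary a lot with how list-heavy or table-heavy the page is.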

Correction from Tommy Olsson

The correction below, which I accept, was prompted by my misreading of RFC3023 (the bottom of page 5, specifically).

@Ben: Good summary, but there’s one error in what you wrote. The Content-Type header is only used (by browsers) to choose which parser to use (XML or SGML). That header does not make an XHTML document XHTML, only XML (despite the xhtml in the MIME media subtype).
The thing that says that an XHTML document is really XHTML is the xmlns attribute, with the correct value, on the root element. Of course, that’s ignored for non-XML documents, so it takes a combination of Content-Type: application/xhtml+xml and the proper xmlns attribute.
You can use application/xml, or even text/xml, as the Content-Type and still have the document recognised as XHTML, provided that you have the correct xmlns attribute.
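Tommy’s point can be sketched in Python with the standard library’s XML parser. This is only an illustration, assuming the document has already been handed to an XML parser because of its Content-Type; the function name `is_xhtml` is hypothetical. What marks the document as XHTML is the namespace on the root element, not the MIME subtype.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_xhtml(document: str) -> bool:
    """True if the text is well-formed XML whose root is html in the XHTML namespace."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Not well-formed XML, so an XML parser would reject it outright.
        return False
    # ElementTree reports namespaced tags as "{namespace}localname".
    return root.tag == "{%s}html" % XHTML_NS
```

The same markup without the xmlns attribute parses as XML but is not recognised as XHTML, which matches the correction above.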
