As part of my ongoing series on client-side web programming, I decided to write a guide to HTML.
To many programmers, this may seem like a big waste of time. Nowadays, everyone has worked with HTML. Plenty of programmers (myself included) started coding personal websites well before they even glanced at a programming language like C or Java. The comments on most websites accept certain HTML tags, so even non-programmers know how to mark up text with <em> tags. What’s the point in telling people what they know already?
The answer is that many people think they know HTML, but few actually do. HTML has been around for a long time, and many practices that used to be common (or necessary “hacks”) are simply bad practices today. Also, the HTML5 standard is fairly new, and many people aren’t used to using it. People who learned HTML before 2006 or so are probably using it wrong. Their HTML is bad, and they should feel bad.
This, as it turns out, includes me. Researching this article has led me to tags that I didn’t know existed, and to discussions about Web standards that I had never read before. So, I wrote this guide for me as much as for all of you. Still, I hope you all find it helpful – and if not, or if I make mistakes, then please contact me to let me know.
First, let’s take a look at what HTML is – and what it is not.
HTML: the Web’s semantic markup language
HTML stands for HyperText Markup Language. As the name suggests, an HTML page is simply textual information, like a page in a book (hence the name). That information is marked up by HTML tags, specially-formatted codes that annotate the information.
Most HTML tags have an opening and closing version, and surround the information that they mark up. The information inside an HTML tag is called the tag’s content. HTML tags can also have attributes: name/value pairs that specify data about the tag itself. The combination of an HTML tag, its attributes, and its content is called an HTML element. I will go over the specific syntax a little later.
An HTML tag’s content may include other HTML tags, in which case their respective elements are said to be nested. Nesting elements results in a hierarchical, tree-like structure. If it helps, you can think of it like a file system, where each folder (element) can contain files (textual information), but also other folders (nested elements). This is relevant when constructing the DOM, which I will talk about a little later.
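To make the vocabulary concrete, here is a minimal sketch in Python (using only the standard library's html.parser module) that walks a small piece of markup and prints its nesting. The sample markup, the OutlinePrinter class, and the outline function are all made up for illustration; this is not how a browser is actually implemented.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: a <p> element with one attribute (class),
# whose content includes text and a nested <em> element.
SAMPLE = '<p class="intro">HTML is <em>semantic</em> markup.</p>'


class OutlinePrinter(HTMLParser):
    """Collects one indented line per opening tag or text run."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many elements are currently open
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the opening tag.
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag ends the element, so step back out one level.
        self.depth -= 1

    def handle_data(self, data):
        # Text content is indented under the element that contains it.
        if data.strip():
            self.lines.append("  " * self.depth + repr(data))


def outline(markup):
    parser = OutlinePrinter()
    parser.feed(markup)
    parser.close()
    return parser.lines


print("\n".join(outline(SAMPLE)))
```

The printed outline lists the <p> tag first, with its text and the nested <em> element indented beneath it, much like files and sub-folders inside a folder.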
So far, HTML is not very different from typesetting markup languages (like LaTeX), the markup formats used by word processors (like Rich Text Format), or the “lightweight markup languages” used by many Web sites (e.g. wikicode, BBCode, or WordPress shortcodes). In fact, HTML was originally derived from one such markup language, called SGML (Standard Generalized Markup Language).
But there is one significant difference between HTML and these other markup languages. HTML tags should be semantic: they should describe the meaning of the information (and not, say, how it appears to the user). Thinking of HTML tags semantically might be the hardest part of writing good HTML. After all, we experience a web page through its presentation, so it’s understandable to think about the underlying HTML code in the same way. It’s a hard habit to break, but break it you should.
Text is marked up so that readers can easily understand the information. A piece of software that reads HTML is called a user agent, because it acts as an agent for a user. These “users” are not only humans, but also machines. In the case of Web pages, the most important “machines” are the Web crawlers used by search engines. Writing HTML that can be easily parsed by search engines is one of the foundations of SEO (Search Engine Optimization). Of course, there’s more to SEO than that (some of it ethically questionable), but the proper use of HTML goes a surprisingly long way.
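To see why semantic markup matters to a machine, here is a rough sketch of a toy, non-human user agent, again in Python with the standard html.parser module. It is nothing like a real search engine crawler; the SemanticReader class, the read function, and both sample fragments are invented for this example. It simply records the text found directly inside <h1> and <em> tags.

```python
from html.parser import HTMLParser

# Hypothetical fragment, marked up semantically: <h1> says "this is the
# main heading" and <em> says "this is emphasized".
SEMANTIC = "<h1>Tea Guide</h1><p>Steep for <em>three</em> minutes.</p>"

# The same text, marked up only for appearance (big font, italics).
PRESENTATIONAL = (
    '<font size="6">Tea Guide</font><p>Steep for <i>three</i> minutes.</p>'
)


class SemanticReader(HTMLParser):
    """A toy user agent: records text by the semantic tag that contains it."""

    WATCHED = ("h1", "em")

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.found = {tag: [] for tag in self.WATCHED}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the matching opening tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Only text directly inside a watched tag is recorded.
        if self.open_tags and self.open_tags[-1] in self.WATCHED:
            self.found[self.open_tags[-1]].append(data)


def read(markup):
    reader = SemanticReader()
    reader.feed(markup)
    reader.close()
    return reader.found


print(read(SEMANTIC))
print(read(PRESENTATIONAL))
```

With the semantic fragment, the reader finds the heading and the emphasized word. With the presentational fragment, it finds nothing, even though both would look much the same in a browser.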
Among humans, HTML pages are usually viewed using a web browser. This is a stand-alone piece of software that is installed on the user’s computer or mobile device. (You know, like the one you’re using to read this article.) There are a huge number of different web browsers, and each may use a different layout engine – the part of the program that translates the text of the HTML file to the page that the user actually sees. Unfortunately, because layout engines are vendor-specific, the same HTML page may look different to different users, and some tags (mainly HTML5 tags) may not be supported at all.
Here are the most common layout engines, and the web browsers that use them:
- WebKit (including KHTML and Blink)
- By far the most common layout engine, it is used by Safari, Opera (as of 2013), and Chrome. KHTML is the rendering engine used by Konqueror, the browser for KDE-based Linux systems. WebKit was forked from KHTML in 2002 (by Apple). Blink, in turn, was forked from WebKit in 2013 (by Google) and is used by newer versions of Chrome and Opera. All are open-source, and so far as I know, HTML renders the same in all.
- Trident
- The engine used by Internet Explorer, a.k.a. “the web developer’s worst nightmare.” Also the only proprietary layout engine in this list.
- Gecko
- The layout engine used by Mozilla Firefox and, for you old-timers, Netscape 6 and 7.
- Presto
- The layout engine used by Opera before February 2013 (Opera 15), when they switched to WebKit (then Blink).
These layout engines are not just used by web browsers, of course. It is common to send email in HTML format, and the engine used to display the email will be determined by the recipient’s email client (so Outlook uses Trident, while Thunderbird uses Gecko).
Increasingly, users are viewing HTML using their mobile devices, and not just through web browsers. Most programming APIs for mobile devices include classes that can render HTML (e.g. WebView for Android). Moreover, plenty of mobile applications are actually web applications, built using frameworks like PhoneGap or Apache Cordova. In these situations, the layout engine will depend upon the phone’s operating system.
This variety of user agents is by design. The creators of HTML were designing for longevity, and they were very successful.
A Brief History of HTML
HTML was created by Tim Berners-Lee in 1989, and has gone through many, many iterations since then. Berners-Lee did not place any copyright or patent restrictions upon HTML, believing that it should be free for the world to use. He may have lost some money by doing so, but the result has been of immense benefit to the world. HTML caught fire rapidly, as more and more people were connected to the Internet.
But the fact that HTML was quickly, and widely, adopted has its drawbacks. HTML was initially designed for academics, who are interested in things like correct data or accurate citations, and not so much about how an article looks. This is not true of the “real world.” HTML quickly reached the hands of artists, magazine publishers, or businesses, and these people care much more about how a website appears to the user. The people who wrote HTML were often graphic designers, not programmers. The ease of starting a personal website attracted many amateurs. Word processing programs quickly added functionality to output HTML, and they too made appearance a priority, since users expected the HTML output to look exactly the same as a paper printout.
The end result was the widespread use of terrible HTML. Tags were not nested properly, and closing tags were left off, resulting in what is colloquially known as tag soup. The semantics of HTML tags were ignored; tags were used according to how they rendered in (competing, proprietary) web browsers. Fonts and text styles were specified in the HTML itself. Tables were used for layout, not for tabular data. HTML’s focus on text data meant that multimedia websites required proprietary plug-ins, like Flash, QuickTime, or RealVideo. And so on. Making matters worse, widespread use meant that terrible HTML became a kind of de facto standard, and sometimes made it into books on the subject.
Pretty much everyone recognized these problems, and new Web technologies were created in order to solve them. In 1994, Berners-Lee founded the W3C (the World Wide Web Consortium), a multi-stakeholder (and increasingly multi-national) group designed to create and maintain Web standards. In 1996, they developed CSS (Cascading Style Sheets), so that a web page’s appearance could be separated from its content. This would solve many of the most egregious problems with bad HTML (and would have solved them more quickly, had CSS standards been adopted universally across all browsers).
Another significant standards organization is WHATWG (the Web Hypertext Application Technology Working Group). This group was formed in 2004 by Apple, Mozilla, and Opera, though it is open to everyone. It was formed as a reaction against the W3C’s focus on XML over HTML, and continued developing HTML standards, particularly HTML5. Eventually, the W3C’s HTML working group merged its standards with the WHATWG standards, and both continue working on the new HTML5 standard in parallel.
There have been several revisions of HTML over the years. As of this writing, there are three “flavors” of HTML that are in widespread use. From most-used to least-used, they are:
- HTML5
- This is the newest version of HTML. In fact, the standard was only finalized in October 2014. This does not mean that it is a bad idea to use it – most browsers have supported HTML5 since around 2010, and roughly half of all websites use HTML5 right now.
Part of the reason that HTML5 was developed was to move proprietary, third-party multimedia capabilities into the open HTML standard. These would be things like audio, video, or drawing on a canvas area. Unsurprisingly, most articles about HTML5 focus on these capabilities. But HTML5 also added many other kinds of tags, and redefined some old tags to have new meanings.
- XHTML 1.0
- XHTML is a variant of HTML that uses the syntax of XML (the eXtensible Markup Language). XML is very similar to HTML, and shares the same basic tag syntax, but the tag names are not defined by any universal standard. Instead, they must be defined by a DTD (Document Type Definition), which tells XML clients what the tags mean. XHTML, then, is XML whose DTD is written such that the XML tag definitions match the HTML tag types.
In practice, this means very little to HTML designers, except that the syntax is a little stricter. Tags must be lower-case, you can’t leave off the closing tag, and tags must be nested properly. These are things that good HTML designers should have been doing anyway.
Because XHTML is also XML, it can be used by any tools that can transform XML documents, particularly XSLT (eXtensible Stylesheet Language Transformations). But XSLT never really caught on, and is becoming less popular by the minute, so I won’t mention it again.
There is actually a newer version of XHTML (version 1.1), but it is not widely used. There were not many differences between 1.0 and 1.1, so most HTML designers stuck with XHTML 1.0. Additionally, XHTML 1.0 comes in more than one version; the two that matter most are strict and transitional (there is also a frameset version). The strict version is the standardized version of XHTML 1.0. The transitional version was designed for designers who were transitioning to XHTML from the more forgiving syntax of HTML (hence the name). Few people use the transitional version anymore, and the things it forgives are bad design anyway.
- HTML 4.01
- This is the latest finalized HTML4 standard, the one used prior to HTML5. HTML 4.01 is now considered deprecated, but the standard does specify the tags that are used in XHTML 1.0, so it is still a good idea to know it. Since the HTML specification is forgiving, valid XHTML 1.0 is also valid HTML 4.01.
(Note: the links are to the W3C specification for that version.)
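The stricter XHTML syntax described above (closing tags required, proper nesting) comes from XML itself, so a generic XML parser can serve as a rough check. The sketch below uses Python's standard xml.etree.ElementTree module. The is_well_formed helper and the sample fragments are my own inventions, and the check only tests XML well-formedness; it does not validate against the XHTML DTD. In particular, plain XML only requires that the case of opening and closing tags match, while the lower-case rule comes from XHTML's own definitions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments for illustration only.
examples = {
    "properly closed and nested": "<p>One<br />Two <em>three</em></p>",
    "missing closing tag": "<p>One<br>Two</p>",
    "improperly nested": "<p><em>One</p></em>",
    "mismatched case": "<P>One</p>",
}

for label, markup in examples.items():
    print(f"{label}: {is_well_formed(markup)}")
```

Only the first fragment passes. A forgiving HTML parser would accept all four, which is exactly the leniency that XHTML takes away.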
Because HTML is a living language, many tags or attributes are either not in widespread use, or have been deprecated in subsequent HTML versions. But that doesn’t mean that they’ve entirely disappeared, especially if you have to deal with legacy HTML code (which you eventually will). In this article, I will alert you when tags and attributes fall into these categories:
- Deprecated: These are tags or attributes that have been officially removed from the HTML standard, often in HTML5, but sometimes in HTML 4.01. You should not use these tags, and if you run across them in legacy code, you should remove them or change them to other tags.
- Discouraged: These are tags or attributes that are not officially deprecated, but are still a bad idea to use for one reason or another (usually because they’re archaic, or non-semantic). I recommend that you treat them as deprecated.
- Uncommon: These are tags or attributes that are neither deprecated nor discouraged, but were never widely adopted by web designers for one reason or another. There is nothing wrong with using them, but most people won’t need to.
- XHTML/HTML5 only: Certain tags (mainly in the header section) are only applicable if you are writing XHTML. They are not deprecated or discouraged, and may in fact be necessary for valid XHTML, but you don’t need them if you’re writing HTML5 (or HTML 4.01). Conversely, there are many tags that are new to HTML5, and will not be recognized by the browser in XHTML or HTML 4.01 documents. (A tag that is valid HTML 4.01 but not valid HTML5 is deprecated.)
- Non-standard: As explained above, both the W3C and WHATWG are pushing for HTML5 standards. Those standards usually agree, but in some cases, a tag is supported in one standard but not another. (Usually it’s part of WHATWG and not W3C.) Other tags may be in both standards, but the definitions of the tags differ. These tags are therefore not “standard” (since a standard needs to be agreed upon), but still have some browser support. Naturally, I’ll give the details when I discuss these tags. My suggestion is to treat these tags as deprecated.
Obviously, the middle two categories are fairly subjective, and depend upon the ever-evolving consensus of Web designers (and possibly my own biases). I’ll do my best to explain why something is considered discouraged or uncommon, or why the W3C deprecated it.
Because HTML is a markup language, it is not “executed” per se. There are no variables, no control structures (such as while loops), and no functions. HTML pages are static: once a page is rendered in the browser, it is fixed. No text can be added or removed, and the page’s appearance will never change.
“Hold on,” you are probably saying right now, “that can’t be right. Nearly every Web site I visit is dynamic in some way. Images change when my mouse hovers over them. Search suggestions appear as I type. I can drag-and-drop page elements. If web pages were static, I couldn’t do any of that.”
Behind the scenes, the Web browser translates the HTML document into the DOM (Document Object Model). This is an abstract data structure, where each HTML element is translated into a node object. Because of the format of an HTML document, and the fact that tags may be nested, the DOM ends up having a tree-like structure, so tree terminology is used. A parent node contains another node in its contents; a node that is contained in another element is a child of that element; and two nodes with the same parent are siblings of each other. If a node is the parent of a distinct section of a document (like a table), it is sometimes called the root of that section. The root of the entire DOM tree is called the document root.
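The tree terminology can be tried out without a browser. The sketch below uses Python's standard xml.dom.minidom module, a small DOM implementation for XML, as a stand-in. This is an assumption on my part for illustration: it is not the browser's DOM, it only accepts well-formed markup, and the sample document is made up.

```python
from xml.dom import minidom

# Hypothetical, well-formed document used only for illustration.
# There is no whitespace between tags, so no stray text nodes appear.
MARKUP = (
    "<html><body>"
    "<h1>Title</h1>"
    "<p>First</p>"
    "<p>Second</p>"
    "</body></html>"
)

doc = minidom.parseString(MARKUP)

root = doc.documentElement   # the document root: <html>
body = root.firstChild       # <body> is a child of <html>

# The three elements inside <body> are its children.
children = [node.tagName for node in body.childNodes]

h1 = body.childNodes[0]
first_p = h1.nextSibling     # nodes with the same parent are siblings

print(root.tagName)
print(children)
print(first_p.parentNode.tagName)  # the parent of the first <p> is <body>
print(first_p.firstChild.data)     # the text inside an element is a node too
```

Here <body> is the parent of the heading and both paragraphs, those three nodes are siblings of each other, and <html> is the document root.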