A Guide to HTML

As part of my ongoing series on client-side web programming, I decided to write a guide to HTML.

To many programmers, this may seem like a big waste of time. Nowadays, everyone has worked with HTML. Plenty of programmers (myself included) started coding personal websites well before they even glanced at a programming language like C or Java. The comments on most websites accept certain HTML tags, so even non-programmers know how to mark up text with <strong> or <em> tags. What’s the point in telling people what they know already?

The answer is that many people think they know HTML, but few actually do. HTML has been around for a long time, and many practices that used to be common (or necessary “hacks”) are simply bad practices today. Also, the HTML5 standard is fairly new, and many people aren’t used to using it. People who learned HTML before 2006 or so are probably using it wrong. Their HTML is bad, and they should feel bad.

This, as it turns out, includes me. Researching this article has led me to tags that I didn’t know existed, and to discussions about Web standards that I had never read before. So, I wrote this guide for me as much as for all of you. Still, I hope you all find it helpful – and if not, or if I make mistakes, then please contact me to let me know.

One more thing. The last couple of paragraphs notwithstanding, I think it’s great that HTML is used by non-programmers. HTML can (and should) be used by people who have never heard of a closure, and don’t know the difference between pass-by-value and pass-by-reference. The fact that it can be understood by graphic designers or content authors is one of its strengths. So, I’m not going to assume any knowledge of programming at all, and only a basic knowledge of how the Internet works. In particular, I do not expect the reader to know JavaScript, and I won’t cover it here. (If you’re so inclined, you can read A Programmer’s Tour Of Javascript for that.)

First, let’s take a look at what HTML is – and what it is not.

HTML: the Web’s semantic markup language

HTML stands for HyperText Markup Language. As the name suggests, an HTML page is simply textual information, like a page in a book (hence the name). That information is marked up by HTML tags, specially-formatted codes that annotate the information.

Most HTML tags have an opening and closing version, and surround the information that they mark up. The information inside an HTML tag is called the tag’s content. HTML tags can also have attributes: name/value pairs that specify data about the tag itself. The combination of an HTML tag, its attributes, and its content is called an HTML element. I will go over the specific syntax a little later.
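As a quick preview of that syntax, here is a minimal sketch of the terms above. The text and the URL are placeholders I made up for illustration.

```html
<!-- An element: opening tag, content, closing tag -->
<p>HTML stands for HyperText Markup Language.</p>

<!-- An element with one attribute (name: href, value: the URL) -->
<a href="https://example.com/">An example link</a>
```

In the second line, the a tag, its href attribute, and the text between the opening and closing tags together make up one element.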

An HTML tag’s content may include other HTML tags, in which case their respective elements are said to be nested. Nesting elements results in a hierarchical, tree-like structure. If it helps, you can think of it like a file system, where each folder (element) can contain files (textual information), but also other folders (nested elements). This is relevant when constructing the DOM, which I will talk about a little later.
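To make the folder analogy concrete, here is a small sketch of nested elements, using a list inside a list. The list items are just example text.

```html
<!-- The outer list is like a folder; each item is like a file or sub-folder -->
<ul>
  <li>Fruit
    <!-- A list nested inside a list item: a folder inside a folder -->
    <ul>
      <li>Apples</li>
      <li>Pears</li>
    </ul>
  </li>
  <li>Vegetables</li>
</ul>
```

Each inner tag is opened and closed entirely inside its outer tag, which is what keeps the structure tree-like.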

So far, HTML is not very different from typesetting markup languages (like LaTeX), the markup formats used by word processors (like Rich Text Format), or the “lightweight markup languages” used by many Web sites (e.g. wikicode, BBCode, or WordPress shortcodes). In fact, HTML was originally derived from one such markup language, called SGML (Standard Generalized Markup Language).

But there is one significant difference between HTML and these other markup languages. HTML tags should be semantic: they should describe the meaning of the information (and not, say, how it appears to the user). Thinking of HTML tags semantically might be the hardest part of writing good HTML. After all, we experience a web page through its presentation, so it’s understandable to think about the underlying HTML code in the same way. It’s a hard habit to break, but break it you should.
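Here is a rough illustration of the difference, using a chapter heading and a paragraph. Both halves are my own made-up example, and the first half uses the old font tag only to show what not to do.

```html
<!-- Presentational: describes how the text should look -->
<font size="5"><b>Chapter One</b></font><br><br>
It was a dark and stormy night.<br><br>

<!-- Semantic: describes what the text is -->
<h1>Chapter One</h1>
<p>It was a dark and stormy night.</p>
```

The two versions may look similar in a browser, but only the second tells a user agent that "Chapter One" is a heading and that the sentence after it is a paragraph.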

Text is marked up so that the information is easy for its readers to understand. A piece of software that reads HTML is called a user agent, because it acts as an agent for a user. These “users” are not only humans, but also machines. In the case of Web pages, the most important “machines” are the Web crawlers used by search engines. Writing HTML that can be easily parsed by search engines is one of the foundations of SEO (Search Engine Optimization). Of course, there’s more to SEO than that (some of it ethically questionable), but the proper use of HTML goes a surprisingly long way.

Among humans, HTML pages are usually viewed using a web browser. This is a stand-alone piece of software that is installed on the user’s computer or mobile device. (You know, like the one you’re using to read this article.) There are a huge number of different web browsers, and each may use a different layout engine – the part of the program that translates the text of the HTML file to the page that the user actually sees. Unfortunately, because layout engines are vendor-specific, the same HTML page may look different to different users, and some tags (mainly HTML5 tags) may not be supported at all.

Here are the most common layout engines, and the web browsers that use them:

WebKit (including KHTML and Blink)
By far the most common layout engine, it is used by Safari, Opera (as of 2013), and Chrome. KHTML is the rendering engine used by Konqueror, the browser for KDE-based Linux systems. WebKit was forked from KHTML in 2001 (by Apple). Blink, in turn, was forked from WebKit in 2013 (by Google) and is used by newer versions of Chrome and Opera. All are open-source, and so far as I know, HTML renders the same in all.
Trident
The engine used by Internet Explorer, a.k.a. “the web developer’s worst nightmare.” Also the only proprietary layout engine in this list.
Gecko
The layout engine used by Mozilla Firefox and, for you old-timers, Netscape 6 and 7.
Presto
The layout engine used by Opera before February 2013 (Opera 15), when they switched to WebKit (then Blink).

These layout engines are not just used by web browsers, of course. It is common to send email in HTML format, and the engine used to display the email will be determined by the recipient’s email client (so Thunderbird uses Gecko, while Outlook 2007 and later use Microsoft Word’s rendering engine rather than Trident).

Increasingly, users are viewing HTML using their mobile devices, and not just through web browsers. Most programming APIs for mobile devices include classes that can render HTML (e.g. WebView for Android). Moreover, plenty of mobile applications are actually web applications, built using frameworks like PhoneGap or Apache Cordova. In these situations, the layout engine will depend upon the phone’s operating system.

Luckily, most HTML tags are standardized across layout engines. (The same cannot be said for CSS or JavaScript, unfortunately.) Furthermore, the HTML specification demands that browsers be incredibly forgiving. Pages will still render, even if the tags are badly formatted (e.g. tags are improperly nested, or a closing tag is missing). Any tags that the browser does not recognize are simply rendered as if the tags weren’t there. This means that even bad designers can make websites that work (for better or worse). More importantly, it makes HTML forward compatible: if a new version of HTML is being rendered on an old browser, it will just ignore the new tags, and the rest of the page will render as usual.

This is done on purpose. The creators of HTML were designing for longevity, and they were very successful.
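To see how forgiving browsers are, consider this deliberately bad sketch. The "sparkle" tag is a name I invented; it is not part of any HTML standard.

```html
<!-- Unknown tag: the browser ignores the tag and still shows its content -->
<p>This sentence has a <sparkle>made-up tag</sparkle> in it.</p>

<!-- Improperly nested: </em> closes before </strong> -->
<p>This is <em><strong>badly nested</strong></em> text, written wrong as
<em><strong>this</em></strong>.</p>

<!-- Missing closing tag: the em element is never closed -->
<p>This emphasis is <em>never closed.</p>
```

A browser should still display all three paragraphs. That does not make the markup correct; it only means the user sees something instead of an error page.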

A Brief History of HTML

Tim Berners-Lee proposed the World Wide Web in 1989 and created HTML shortly afterward; it has gone through many, many iterations since then. Berners-Lee did not place any copyright or patent restrictions upon HTML, believing that it should be free for the world to use. He may have lost some money by doing so, but the result has been of immense benefit to the world. HTML caught fire rapidly, as more and more people were connected to the Internet.

But the fact that HTML was quickly, and widely, adopted has its drawbacks. HTML was initially designed for academics, who are interested in things like correct data or accurate citations, and not so much about how an article looks. This is not true of the “real world.” HTML quickly reached the hands of artists, magazine publishers, or businesses, and these people care much more about how a website appears to the user. The people who wrote HTML were often graphic designers, not programmers. The ease of starting a personal website attracted many amateurs. Word processing programs quickly added functionality to output HTML, and they too made appearance a priority, since users expected the HTML output to look exactly the same as a paper printout.

The end result was the widespread use of terrible HTML. Tags were not nested properly, and closing tags were left off, resulting in what is colloquially known as tag soup. The semantics of HTML tags were ignored; tags were used according to how they rendered in (competing, proprietary) web browsers. Fonts and text styles were specified in the HTML itself. Tables were used for layout, not for tabular data. HTML’s focus on text data meant that multimedia websites required proprietary plug-ins, like Flash, QuickTime, or RealVideo. And so on. Making matters worse, widespread use meant that terrible HTML became a kind of de facto standard, and sometimes made it into books on the subject.

Pretty much everyone recognized these problems, and new Web technologies were created in order to solve them. In 1994, Berners-Lee founded the W3C (the World Wide Web Consortium), a multi-stakeholder (and increasingly multi-national) group designed to create and maintain Web standards. In 1996, they developed CSS (Cascading Style Sheets), so that a web page’s appearance could be separated from its content. This would solve many of the most egregious problems with bad HTML (and would have solved them more quickly, had CSS standards been adopted universally across all browsers).

Another significant standards organization is WHATWG (the Web Hypertext Application Technology Working Group). This group was formed in 2004 by Apple, Mozilla, and Opera, though it is open to everyone. It was formed as a reaction against the W3C’s focus on XML over HTML, and continued developing HTML standards, particularly HTML5. Eventually, the W3C’s HTML working group merged its standards with the WHATWG standards, and both continue working on the new HTML5 standard in parallel.

There have been several revisions of HTML over the years. As of this writing, there are three “flavors” of HTML that are in widespread use. From most-used to least-used, they are:

HTML5
This is the newest version of HTML. In fact, the standard was only finalized in October 2014. This does not mean that it is a bad idea to use it – most browsers have supported HTML5 since around 2010, and roughly half of all websites use HTML5 right now.

Part of the reason that HTML5 was developed was to move proprietary, third-party multimedia capabilities into the open HTML standard. These would be things like audio, video, or drawing on a canvas area. Unsurprisingly, most articles about HTML5 focus on these capabilities. But HTML5 also added many other kinds of tags, and redefined some old tags to have new meanings.

XHTML 1.0
XHTML is a variant of HTML that uses the syntax of XML (the eXtensible Markup Language). XML is very similar to HTML, and shares the same basic tag syntax, but the tag names are not defined by any universal standard. Instead, they must be defined by a DTD (Document Type Definition), which tells XML clients what the tags mean. XHTML, then, is XML whose DTD is written such that the XML tag definitions match the HTML tag types.

In practice, this means very little to HTML designers, except that the syntax is a little stricter. Tags must be lower-case, you can’t leave off the closing tag, and tags must be nested properly. These are things that good HTML designers should have been doing anyway.

Because XHTML is also XML, it can be used by any tools that can transform XML documents, particularly XSLT (eXtensible Stylesheet Language Transformations). But XSLT never really caught on, and is becoming less popular by the minute, so I won’t mention it again.

There is actually a newer version of XHTML (version 1.1), but it is not widely used. There were not many differences between 1.0 and 1.1, so most HTML designers stuck with XHTML 1.0. Additionally, XHTML 1.0 comes in more than one version; the two that matter here are strict and transitional. The former is the standardized version of XHTML 1.0. The latter was designed for designers who were transitioning to XHTML from the more forgiving syntax of HTML (hence the name). Few people use the transitional version anymore, and the things it forgives are bad design anyway.

HTML 4.01
This is the latest finalized HTML4 standard, the one used prior to HTML5. HTML 4.01 is now considered deprecated, but the standard does specify the tags that are used in XHTML 1.0, so it is still a good idea to know it. Since the HTML specification is forgiving, valid XHTML 1.0 is also valid HTML 4.01.

(Note: the links are to the W3C specification for that version.)
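To give a feel for the newest flavor, here is a minimal sketch of a complete HTML5 page. I have written it using the stricter XHTML habits described above (lower-case tags, every tag closed, properly nested). The file name "clip.mp4" is a placeholder, not a real file.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>A minimal HTML5 page</title>
  </head>
  <body>
    <h1>A minimal HTML5 page</h1>
    <!-- video is one of the multimedia tags added in HTML5 -->
    <video src="clip.mp4" controls="controls">
      <!-- Shown only by browsers that do not recognize the video tag -->
      <p>Your browser does not support the video element.</p>
    </video>
  </body>
</html>
```

The first line is what marks the page as HTML5; XHTML 1.0 and HTML 4.01 documents start with longer declarations that point to their DTDs.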

Because HTML is a living language, many tags or attributes are either not in widespread use, or have been deprecated in subsequent HTML versions. But that doesn’t mean that they’ve entirely disappeared, especially if you have to deal with legacy HTML code (which you eventually will). In this article, I will alert you when tags and attributes fall into these categories:

  • Deprecated: These are tags or attributes that have been officially removed from the HTML standard, often in HTML5, but sometimes in HTML 4.01. You should not use these tags, and if you run across them in legacy code, you should remove them or change them to other tags.
  • Discouraged: These are tags or attributes that are not officially deprecated, but are still a bad idea to use for one reason or another (usually because they’re archaic, or non-semantic). I recommend that you treat them as deprecated.
  • Uncommon: These are tags or attributes that are neither deprecated nor discouraged, but were never widely adopted by web designers for one reason or another. There is nothing wrong with using them, but most people won’t need to.
  • XHTML/HTML5 only: Certain tags (mainly in the header section) are only applicable if you are writing XHTML. They are not deprecated or discouraged, and may in fact be necessary for valid XHTML, but you don’t need them if you’re writing HTML5 (or HTML 4.01). Conversely, there are many tags that are new to HTML5, and will not be recognized by the browser in XHTML or HTML 4.01 documents. (A tag that is valid HTML 4.01 but not valid HTML5 is deprecated.)
  • Non-standard: As explained above, both the W3C and WHATWG are pushing for HTML5 standards. Those standards usually agree, but in some cases, a tag is supported in one standard but not another. (Usually it’s part of WHATWG and not W3C.) Other tags may be in both standards, but the definitions of the tags differ. These tags are therefore not “standard” (since a standard needs to be agreed upon), but still have some browser support. Naturally, I’ll give the details when I discuss these tags. My suggestion is to treat these tags as deprecated.

Obviously, the middle two categories are fairly subjective, and depend upon the ever-evolving consensus of Web designers (and possibly my own biases). I’ll do my best to explain why something is considered discouraged or uncommon, or why the W3C deprecated it.

The Separation of Concerns: HTML, CSS, and JavaScript

Because HTML is a markup language, it is not “executed” per se. There are no variables, no control structures (if/else or switch/case statements, for or while loops), and no functions. HTML pages are static: once a page is rendered in the browser, it is fixed. No text can be added or removed, and the page’s appearance will never change.

“Hold on,” you are probably saying right now, “that can’t be right. Nearly every Web site I visit is dynamic in some way. Images change when my mouse hovers over them. Search suggestions appear as I type. I can drag-and-drop page elements. If web pages were static, I couldn’t do any of that.”

You are correct. The answer is: none of that is done using HTML. The only thing HTML represents is the content of the page; everything else is handled using CSS or JavaScript.

The computer science term for this is the separation of concerns. This stands for the notion that different aspects of a program should be handled by different, loosely coupled systems. In this case, the “program” is a web page, and the “concerns” are content, presentation, and behavior. HTML defines the content; CSS determines its presentation; and JavaScript handles the behavior.
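In practice, the three concerns usually live in three separate files. Here is a minimal sketch of how an HTML page might pull in the other two; the file names "styles.css" and "behavior.js" are placeholders I chose for illustration.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>Separation of concerns</title>
    <!-- Presentation: kept in a separate CSS file -->
    <link rel="stylesheet" href="styles.css" />
    <!-- Behavior: kept in a separate JavaScript file -->
    <script src="behavior.js" defer="defer"></script>
  </head>
  <body>
    <!-- Content: the only thing the HTML itself describes -->
    <h1>Today's weather</h1>
    <p>Sunny, with a light breeze.</p>
  </body>
</html>
```

Note that nothing in the body says how the heading looks or what happens when it is clicked. Those decisions belong to the other two files.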

Behind the scenes, the Web browser translates the HTML document into the DOM (Document Object Model). This is an abstract data structure, where each HTML element is translated into a node object. Because of the format of an HTML document, and the fact that tags may be nested, the DOM ends up having a tree-like structure, so tree terminology is used. A parent node contains another node in its contents; a node that is contained in another element is a child of that element; and two nodes with the same parent are siblings of each other. If a node is the parent of a distinct section of a document (like a table), it is sometimes called the root of that section. The root of the entire DOM tree is called the document root.
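As a rough sketch of that terminology, here is a tiny list with the tree a browser would build from it written out in a comment. I left out the whitespace between tags on purpose, since whitespace would otherwise show up as extra text nodes.

```html
<!--
  The DOM tree built from the markup below:

  ul               parent of both li nodes, and root of this list
    li             child of ul, sibling of the other li
      "Tea"        text node, child of the first li
    li             child of ul
      "Coffee"     text node, child of the second li
-->
<ul><li>Tea</li><li>Coffee</li></ul>
```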

The DOM can be used by JavaScript, which has various built-in methods to access and manipulate the node objects. CSS also uses the DOM, but this is done behind the scenes; the CSS syntax targets HTML tags, not DOM nodes.

This is precisely why HTML tags should be purely semantic. When you look at HTML code, you should have no idea what the colors will be, or what the font is, or how text is formatted (bold, italic, underline, etc). Likewise, you should never specify what text will do when it is clicked, when the mouse hovers over it, etc. In other words, you should not put style information or JavaScript calls in an HTML document.
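Here is a small before-and-after sketch of what that means. The function name showDetails and the class name overdue-notice are hypothetical; they stand in for whatever your own JavaScript and CSS would define.

```html
<!-- Discouraged: presentation and behavior are mixed into the content -->
<p style="color: red; font-weight: bold;" onclick="showDetails()">
  Payment is overdue.
</p>

<!-- Preferred: the HTML only labels the content -->
<p class="overdue-notice">Payment is overdue.</p>
```

In the second version, a style sheet can decide what an overdue notice looks like, and a script can decide what happens when it is clicked, without either one touching the HTML.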

The combination of HTML, CSS, and JavaScript creates what used to be called DHTML (Dynamic HTML). This term seems to have fallen out of favor lately; the latest book to use the term “DHTML” was written in 2007, and a search for “DHTML” on Google Trends shows that its usage has been declining since 2004. This is not because dynamic web pages are more scarce; it’s because the term is too limiting. Client-side technology is now so powerful and stable, that it does not just create dynamic pages, but full-fledged web applications.

About Karl

I live in the Boston area, and am currently studying for a BS in Computer Science at UMass Boston. I graduated with honors with an AS in Computer Science (Transfer Option) from BHCC, acquiring a certificate in OOP along the way. I also perform experimental electronic music as Karlheinz.

4 Responses to A Guide to HTML

  1. Ben says:

    You probably know this, but the self-closing-ness of your tags is likely to be totally ignored by the browser unless you set the mime type to xhtml! For example, in most browsers this
    This is a paragraph
    will render exactly the same as this:
    This is a paragraph

    A common gotcha is thinking that you can use a self-closing script tag, e.g.

    instead of

    More on this on stackoverflow:
    http://stackoverflow.com/questions/69913/why-dont-self-closing-script-tags-work

    So there can be arguments for doing this for style or tool support, but the browser really doesn’t care.

    • Ben says:

      wow, looks like wordpress doesn’t escape tags. Sorry, but hopefully the stackoverflow article explains well enough

      • Karl says:

        Also, about the tags – nope, WordPress does not escape HTML; you can use it to mark up comments (as I did just now), so it really can’t do that. For HTML tags, you have to use the &lt; and &gt; escape sequences. I also had to do this when I wrote the article, so I know how much of a PITA it is.

    • Karl says:

      Ben: First of all, thanks for taking the time to read the article. I need all the help I can get…

      The “self-closingness” issue applies only to tags that do not represent empty elements, and that includes the <script> tag. It’s supposed to contain text data (the actual JavaScript code). Trying to make these tags self-closing is not valid XHTML, and browsers will consider it “tag soup.”

      If the tag is actually self-closing – like the <br /> or <input /> tags – then the XHTML standard demands that they be properly terminated, or they won’t validate. The HTML standard (even HTML5) does not, but since XHTML is the one that has been used for years, I think it’s better to include the terminating slash.

      However, the Stack Overflow post did show something else that I wasn’t aware of: the <p> tag cannot contain other block-level tags, like <div>. (Inline tags are fine.) If you try to do this, the browser will consider it “tag soup” and automatically treat it as a tag with empty content. In other words, <p><div>Hello, world!</div></p> will turn into <p></p><div>Hello, world!</div><p></p>.

      I’ll update the article with this info. So, thank you for pointing this out. Please let me know if you find anything else in the article that needs work.
