A Guide to HTML

HTML Headers and Structure

An HTML page is fundamentally divided into two sections: the header section, and the body. The header section specifies metadata about the HTML page itself. The body of the HTML page is the part of the HTML document that is displayed in the browser. This means that all visible markup tags go in the body section.

There are also tags that tell the browser to recognize the file as an HTML file (and not some other form of plaintext). These are technically not HTML tags, and they go at the very top of the file rather than inside the header section, but they are closely related to it, so I’ll also cover them here. I’ll go through the tags in the order that they usually appear.

XML declaration (XHTML only)
This is an XML tag, which is needed at the start of all XML documents. It is also called the XML prolog. The tag is used to specify the XML version, and the character encoding you are using. (See below for a discussion about character encoding.) You only need this tag if you are writing XHTML; HTML documents should not use this tag at all.

Here is the tag for XHTML 1.0, using the UTF-8 character set:

<?xml version="1.0" encoding="UTF-8"?>

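You can also use XML version 1.1, or another character set, but these are the most common. In fact, if you use the default UTF-8 character set, this tag isn’t necessary at all. Keep in mind that neither this tag nor the DOCTYPE tag (see below) makes the browser treat the file as XHTML. Browsers only use their XML parser when the server sends the file with an XML MIME type such as application/xhtml+xml; anything served as text/html is parsed as ordinary HTML.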

If you are not using XHTML, then you should supply the character encoding using a <meta> tag: with the charset attribute in HTML5, or with the http-equiv attribute in HTML 4.01. See below for information about the <meta> tag, and further below for a discussion about character encoding.

<!DOCTYPE>
This tag specifies the document type for the file. It is not an HTML tag, though it has a similar syntax. It tells the browser that the file is HTML, so all HTML documents must have this tag, and it must appear at the very beginning of the document (after the XML declaration, if there is one). Here are DOCTYPE tags for the most common versions of HTML:

HTML5
<!DOCTYPE html>
XHTML 1.0 Strict
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
HTML 4.01 (deprecated)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
This HTML tag encloses the entire HTML document (except for the DOCTYPE tag, and the XML prolog if applicable). All other HTML tags, including the header and body, should be nested inside this one.

Attributes:

xmlns (XHTML only)
This attribute determines the XML namespace. Unsurprisingly, it is only necessary if you are creating an XHTML document; HTML documents shouldn’t use this attribute. If you’re using XHTML, the attribute’s value should be http://www.w3.org/1999/xhtml.
lang
This attribute specifies the document’s language. It is actually a global attribute, but this is where it is most commonly used. If it is omitted, the web browser will try to determine the language by other means, but in practice it will probably default to English. On the other hand, if the attribute specifies a language that is different from the user’s, then browsers that integrate with translation services (like Google Chrome) will offer to translate the page. If your document is written in any language other than English, you should always use this attribute; even if it is in English, setting the language never hurts.
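As a rough sketch of how these attributes look in practice, here are two opening tags, one for HTML5 and one for XHTML. The language codes (en, fr) are only placeholders. The xml:lang attribute on the XHTML tag is not covered above; it is the XML counterpart of lang, and the XHTML 1.0 specification recommends using the two together.

```html
<!-- HTML5: no xmlns needed, language set to English -->
<html lang="en">

<!-- XHTML 1.0: xmlns is required; lang and xml:lang should agree -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
```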
<head>
Encloses the entire HTML header; everything else should go in the body. The rest of the tags in this section should all go in the header section, before the closing </head> tag. (The possible exception might be the <script> tag – see below.)
<title>
Specifies the title of the page. This tag is required. There can be only one <title> tag in a valid HTML document. The title is not displayed on the page, but it is displayed in the browser’s tab bar (if there is one). The contents should match the title of the page as displayed to the user (using e.g. a heading tag such as <h1> – see the next section).
<meta />
The <meta> tag specifies various types of metadata about the HTML page to non-human readers (such as search engines). Keep in mind that providing metadata does not mean that anyone will actually use it. It is an empty tag; all the metadata is provided by its attributes.

Attributes:

content
This attribute supplies the metadata associated with the name or http-equiv attribute values (see below).
name
The name attribute specifies the type of metadata that the tag supplies, when the tag is not being used with the charset or http-equiv attributes. This is some kind of metadata that you want machines to read (though there’s no guarantee they will). The type of data is the value of this attribute; the actual data is the value of the content attribute.

The standard values of the name attribute are:

application-name
The content value will be the name of the web application associated with the HTML page. This is usually put there by the application itself.
author
The content value will be the name of the author of the HTML page. If there are multiple authors, you should use a separate tag for each.
description
The content value will be a short description of the web page’s content. Search engines commonly display the description below a link to the web page.
generator
The content value will be the name and version of the software used to create the HTML file (e.g. Dreamweaver). This is usually put there by the application itself.
keywords
The content value will be a comma-separated list of keywords for the web page. In theory, if the user types some of these keywords into a search engine, your page should come out nearer to the top in search engine rankings. I say “in theory” because search engines all but ignore keywords nowadays. Unsurprisingly, this metadata was widely abused by unscrupulous websites seeking to game their search engine rankings. Even if the page author is not being abusive, the plain fact is that the user – not the website owner – determines which page is the best fit for certain keywords. Still, it doesn’t hurt to provide keywords, especially if those keywords are rare or topic-specific.
robots
This value is used to tell search engines to ignore the page in specific ways. The content value will be either noindex (to not index this page), nofollow (to not follow the links on this page), or both (separated by a comma). Reputable search engines will follow these directives; disreputable ones will not. Obviously, this only applies to the current page, so the same content will still be indexed if it is linked from another website.

This value applies to all search engines; specific search engines may use their own values. For example, Google Search uses googlebot, so you could make your website available to all search engines except Google, if for some reason you wanted to.
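To make the pairing of name and content concrete, here is a sketch of a header that uses several of the values above. All of the content values are made up for illustration.

```html
<head>
  <title>Growing Tomatoes Indoors</title>
  <!-- One tag per author -->
  <meta name="author" content="Jane Doe" />
  <meta name="author" content="John Smith" />
  <meta name="description" content="A beginner's guide to growing tomatoes in containers." />
  <meta name="keywords" content="tomatoes, container gardening, grow lights" />
  <!-- Ask all search engines not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow" />
</head>
```

Swapping robots for googlebot in the last tag would address the same directive to Google’s crawler only.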

charset (HTML5 only)
In HTML5, this attribute specifies the character encoding. If you are using XHTML, then you should specify the character encoding in the XML declaration instead. You can also specify the character encoding using the http-equiv attribute, but this is outdated. See below for a discussion about this.
http-equiv
Specifies a string of “HTTP equivalent” data. It is not very useful, since anything this attribute can do is better accomplished by other means. The http-equiv attribute can have one of these values:

content-type (discouraged)
Specifies the content type (which will always be “text/html”) and character encoding, separated by a semicolon. For example, this tag specifies that this is an HTML document with UTF-8 character encoding:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

This value is rarely used anymore. If you are using HTML5, then you should use the charset attribute instead. On the other hand, if you are using XHTML, then you should specify the character encoding in the XML declaration. See below for a discussion about this.

default-style (uncommon)
Specifies the stylesheet that will be used as the default. The value must match the title attribute of a stylesheet imported using the <link> tag (see below). It is rarely needed; multiple stylesheets will be imported in the order they appear in the code, so it’s more common to simply use the first <link> tag for the default stylesheet.
refresh (discouraged)
If this value is used, the page is refreshed at regular intervals. The content value is the refresh delay, in seconds; it may be followed by a URL to go to upon refresh (separated with a semicolon). This is discouraged by the W3C, because it takes control of the page away from the user. It was usually used as a “hack” to do URL redirection, to prevent link rot (or for more nefarious reasons). This is also discouraged because it may break the browser’s back button. It is better to use URL rewriting, or to have the server send an HTTP redirection status code.
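For reference, this is roughly what a refresh tag looks like when it is used as a redirect. The delay and the URL are placeholders, and a server-side redirect remains the better choice.

```html
<!-- After 5 seconds, load the new address instead of reloading this page -->
<meta http-equiv="refresh" content="5; url=https://example.com/new-page.html" />
```

Leaving out the URL (content="5") would simply reload the current page every five seconds.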

There are a lot more <meta> attributes and values in use today, but these are the most common. You can also use an attribute starting with data- as a custom data attribute, though the cases where this is useful are rare. If you want the details, read “Embedding custom non-visible data with the data-* attributes” from the WHATWG HTML5 living standard.

<link />
This tag defines a link between the HTML page, and an external resource. This other resource is commonly a CSS file. Multiple <link> tags are allowed in the header, so you may import more than one CSS file. These are imported in order, so if there are conflicts, the last CSS file imported is the winner.

The most common use is to import CSS, but there are other uses as well. For example, this tag could be used to link to a translated version of the document, or to the same document in a non-HTML format. It is also used to define a link to a favicon, the small image that is displayed in a browser’s favorites menu.

Attributes:

type
This attribute specifies the MIME type of the linked document. The default value is text/css, which is the MIME type for a CSS file. This is almost always what you want, so this attribute is only specified if you’re linking to something other than CSS. For example, if you’re linking to a favicon, the type might be image/x-icon.
href
“Hyperlink reference.” This is the URL of the file that you want to link to. Like any URL used in HTML documents, it can be either a relative or fully-qualified URL.
rel
This attribute is required. It defines the relationship between the HTML document and the linked file. The value will usually be stylesheet, since this is used with CSS. If you’re defining a link to a favicon, you can use icon instead.
media
This attribute can be used to only include the linked resource when the user is browsing on specific media. This is usually used to provide different CSS files for different viewing contexts, such as print or mobile versions. If this attribute isn’t specified, the default value is all (all media types). Other common values are screen and print. HTML5 and CSS3 allow you to go into even more detail, specifying such things as screen resolution or device orientation. However, this is probably handled better in the CSS files themselves. If you want details, read the W3C recommendation on CSS3 media queries.
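Here is a sketch of the most common kinds of <link> tags, using the rel, href, type, and media attributes. The file names are hypothetical, and the hreflang attribute in the last tag is not covered above; it gives the language of the linked document.

```html
<head>
  <title>YOUR_TITLE_HERE</title>
  <!-- Main stylesheet; type defaults to text/css, so it is omitted -->
  <link rel="stylesheet" href="css/main.css" />
  <!-- Imported after main.css; only applied when the page is printed -->
  <link rel="stylesheet" href="css/print.css" media="print" />
  <!-- Favicon -->
  <link rel="icon" href="/favicon.ico" type="image/x-icon" />
  <!-- A French translation of this page -->
  <link rel="alternate" href="index.fr.html" hreflang="fr" />
</head>
```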
<style> (discouraged)
This tag is used to embed CSS styles in your HTML. Since HTML is a semantic language, it should never contain information about presentation. So, don’t use this tag. Instead, put your CSS in a separate file, and import it using the <link> tag.
<script>
This tag is used to embed a script in an HTML page. The scripting language is almost always JavaScript. The same tag can be used to either import a JavaScript file, or to include JavaScript in the HTML page itself. The content of the <script> tag would be the actual JavaScript code. But remember that HTML is semantic, not behavioral, so you should never put JavaScript inside the <script> tag. Instead, use the tag’s attributes to import the JavaScript from a separate file.

Whether it’s in the tag contents or imported, the script is parsed and executed when the <script> tag is encountered. If the tag is in the HTML header, this will be before the HTML body is even encountered (and before the DOM is fully created). This can lead to some unexpected behavior in JavaScript, and it may also delay the loading of the HTML page. For this reason, some people recommend putting the <script> tag at the bottom of the body, and not in the header. This is often done, but the vast majority of people still put the tag in the header section.
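The two placements look roughly like this. It is only a sketch: myscript.js and the greeting paragraph are hypothetical, and a real page would pick one of the two options rather than loading the same file twice.

```html
<!DOCTYPE html>
<html>
<head>
  <title>YOUR_TITLE_HERE</title>
  <!-- Option 1: in the header. Runs before any of the body has been parsed. -->
  <script src="myscript.js"></script>
</head>
<body>
  <p id="greeting">Hello, world!</p>
  <!-- Option 2: at the bottom of the body. The paragraph above is already in the DOM. -->
  <script src="myscript.js"></script>
</body>
</html>
```

With option 1, a call such as document.getElementById("greeting") that runs immediately returns null, because the paragraph has not been parsed yet. With option 2, the same call finds the element.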

It is a fairly common mistake to treat the <script> tag as an empty tag when importing a JavaScript file. In other words, you should never do this:

<script src="myscript.js" />

Instead, you should always include the closing tag:

<script src="myscript.js"></script>

Attributes:

async/defer (uncommon)
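If you use the defer attribute, you are telling the browser to download the script without blocking the parser, and to defer its execution until the entire page has been parsed. Deferred scripts run in the order in which they appear in the document. This is a Boolean attribute, so in XHTML its value must also be defer; in HTML5 you can write the attribute name on its own. It only has an effect on scripts imported with the src attribute.

The reason this attribute is uncommon is that browser support is spotty. It was not supported by Internet Explorer before version 10, Firefox before version 3.6, and Opera before version 15.0.

HTML5 offers the async attribute as well. An async script is also downloaded without blocking the parser, but it is executed as soon as the download finishes, which may be before the page has been fully parsed, and in no guaranteed order relative to other scripts. It is also a Boolean attribute (so in XHTML its value should be async), and support for it is just as spotty. If neither attribute is present, the default behavior applies: the browser stops parsing, fetches the script, and runs it immediately.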

src
This is the URL to an external source file. The value can be either a relative path, or a fully-qualified URL.

If you’re curious, you might be wondering what the difference is between src, and the href attribute used in other tags. The answer is that replaced elements use the src attribute, while non-replaced elements use the href attribute. A replaced element can have intrinsic dimensions, determined by the external resource, so that resource must be parsed before the rest of the page.

type
This attribute specifies the script’s MIME type. For JavaScript, the value would be text/javascript. This is the default value in HTML5 (and was assumed by most browsers for many years), so you may omit it.
language (deprecated)
This attribute used to be used to specify the script’s programming language. The value of this attribute was never standardized, and has been deprecated for quite some time. Instead, use the type attribute (if necessary).
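As a sketch, here are the attributes above applied to a few hypothetical script files. The Boolean attributes are written in the long form (defer="defer"), which is accepted by both XHTML and HTML5.

```html
<!-- Default: parsing stops while the script is fetched and executed -->
<script src="js/main.js"></script>

<!-- Deferred: fetched in parallel, executed after the document has been parsed -->
<script src="js/menu.js" defer="defer"></script>

<!-- Async: fetched in parallel, executed as soon as it arrives -->
<script src="js/analytics.js" async="async"></script>

<!-- Explicit MIME type; optional in HTML5 -->
<script src="js/legacy.js" type="text/javascript"></script>
```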
<base /> (uncommon)
This tag specifies the base address for all relative URLs in the document, and/or the default target for all links on the page.

There are two reasons it is not widely used. First, there are not many cases where it is useful; the default linking behavior usually works fine, and can be overridden on a per-link basis. Second, it is not consistently implemented across browsers. Internet Explorer requires a closing tag, but the official specification defines it as an empty element.

Attributes:

href
A hyperlink reference (URL) to be used as the base for all relative URLs in a document.
target
The default target for all links on a page. The values of the target attribute can be:

_self
The same browser window as the current HTML page. This is the default.
_blank
A new window or tab in the browser.
_parent
The parent of an <iframe> element. (It could have also been the name of a frame in a frameset, but framesets are deprecated.)
_top
The topmost browser window. Unless the HTML page is in an <iframe>, it is the same as _self.
Frame name
The name of an <iframe> element, as given by its name attribute.
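Here is a sketch of how the two attributes of the <base /> tag affect the links on a page. The example.com address and the file names are placeholders.

```html
<head>
  <title>YOUR_TITLE_HERE</title>
  <!-- Relative URLs resolve against this address; links open in a new tab by default -->
  <base href="https://example.com/docs/" target="_blank" />
</head>
<body>
  <!-- Resolves to https://example.com/docs/intro.html and opens in a new tab -->
  <a href="intro.html">Introduction</a>
  <!-- A target on the link itself overrides the default from the base tag -->
  <a href="faq.html" target="_self">FAQ</a>
</body>
```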

After all of these header tags, you’re finally ready to present the body of the HTML to the user. This is done using the <body> tag. This tag supports the usual global attributes and event handlers, but they’re generally not specified in the HTML itself; you use JavaScript to attach event listeners, and use CSS to define the document’s presentation styles.

A Note About Character Encoding

A character encoding is a mapping of machine-readable numbers to human-readable characters. A character set is the set of characters that can be represented by a specific encoding. These may be characters from English, Arabic, Chinese, Cyrillic, or any other alphabet (depending upon the encoding). Thus, the character encoding determines which set of languages can be successfully represented by a machine. In the context of an HTML page, the “machine” is the web browser.

Character sets have been growing as the Internet evolved to cover more of the world. The first websites used ASCII, which is incredibly limited, so other character sets were adopted almost immediately. From HTML 2.0 to HTML 4.01, the default character encoding was ISO-8859-1, and it is still occasionally used. However, the default character encoding used in XHTML and HTML5 is UTF-8. If you’re creating a web page today, you should use UTF-8 unless you have a specific reason not to.

As you can see above, the character encoding can be specified in multiple places in the HTML header. In fact, it can also be specified in the HTTP header, sent by the web server before the HTML page is transmitted. If there are conflicts, which one is used? According to the W3C, the browser will determine the character encoding in this order:

  1. The HTTP Content-Type header
  2. The XML declaration (if the document is XHTML)
  3. A <meta> tag, using the charset attribute (or if HTML 4.01, the http-equiv attribute)
  4. The default character set is assumed (usually UTF-8)

If the above doesn’t work (i.e. you’re using an encoding other than UTF-8, but don’t specify what it is), then the browser will do whatever the hell it wants. This is almost certainly not what you want, so unless you’re using UTF-8, you should always specify the character encoding.
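To illustrate the order above, here is a sketch of a page whose declarations disagree. The HTTP header is shown inside a comment, since it is sent by the server and is not part of the file; the values are made up.

```html
<!DOCTYPE html>
<html>
<head>
  <!--
    Suppose the server sends this header before the page itself:
      Content-Type: text/html; charset=ISO-8859-1
    It takes precedence over the meta tag below.
  -->
  <meta charset="UTF-8" />
  <title>Café menu</title>
</head>
<body>
  <p>Café au lait</p>
</body>
</html>
```

If the file is actually saved as UTF-8, the browser follows the header and decodes it as ISO-8859-1, so “Café” shows up as “CafÃ©”. The fix is to make the server’s header agree with the file, not to add more declarations to the HTML.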

Templates

If you’re writing HTML, you’re most likely writing the same version of HTML, with roughly the same data in all of the headers. So, just for you, I’ve created a couple of standard HTML templates. Just fill in your information, and put whatever you want into the body.

HTML5 Template
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <title>YOUR_TITLE_HERE</title>
  <meta name="author" content="YOUR_NAME_HERE" />
  <meta name="description" content="YOUR_DESCRIPTION_HERE" />
</head>
<body>
<!-- Content of web page -->
</body>
</html>
XHTML 1.0 Template
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>YOUR_TITLE_HERE</title>
  <meta name="author" content="YOUR_NAME_HERE" />
  <meta name="description" content="YOUR_DESCRIPTION_HERE" />
</head>
<body>
<!-- Content of web page -->
</body>
</html>

Framesets (Deprecated)

HTML framesets allow multiple HTML documents to be displayed in the same browser window. The frameset specifies a number of resizable frames, each of which has an associated HTML document and is displayed in a different section of the screen.

Before we go any further: framesets are bad. They were officially retired in HTML5, but they were discouraged in XHTML 1.0, and have not been used on websites since the 1990s. There are many good reasons for this; if you want details, read Jakob Nielsen’s 1996 article, Why Frames Suck (Most of the Time). (I personally would have left off the part in the parentheses.) If you really need to embed one HTML page in another, you can do that using the <iframe> tag. I will talk about that when we get to the section on including media.

Unfortunately, some people may still encounter framesets in specific situations. For example, code documentation generators (like Doxygen or Javadoc) may still output HTML with framesets.

If you’re not one of those people, then you should stop reading now, and skip ahead to the next section. The sooner people stop even thinking about framesets, the better.

DOCTYPE
A website that uses frames must specify that it is a frameset, and not an HTML page. Here are the DOCTYPE tags for HTML 4.01 and XHTML 1.0 (which, again, are only for reference):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<frameset>
This tag encloses the entire frameset, in the same way that the <html> tag encloses an entire HTML document.

Attributes:

cols
Specifies the size of each column in a frameset (and, thus, the number of columns).
rows
Specifies the size of each row in a frameset (and, thus, the number of rows).

Specifying one of these attributes is required. If the number of frames does not match the number of rows times the number of columns, then either the extra frames won’t be rendered, or the leftover slots will be rendered with a blank HTML page (depending on the mismatch).

<frame />
The frame tag specifies the data about a specific frame. It is an empty tag, so it should be terminated with a slash before the closing angle bracket.

Attributes:

name
Specifies the name (that is, ID) of the frame. If the links of one frame want to target another frame, they would use this name. You could also use the standard, non-deprecated id attribute for this, but if you’re using framesets, you’re using deprecated HTML already.
src
Specifies the URL of the HTML file to be displayed in the frame.
noresize
Prevents the frame from being resized by the user. The value must be noresize.
scrolling
Determines whether the frame will have scrollbars. Values can be yes, no, or auto.

There are more attributes, but they were not widely used even before framesets were deprecated, so I won’t go into them here.

<noframes>
The contents of this tag would be displayed to the user in the event that their browser couldn’t handle frames. When frames were around, it would usually be a “helpful” note to the users telling them to update their browser. You can probably guess how that went over.
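Purely for reference, here is a sketch of a complete two-column frameset document that puts the tags above together. The frame names and file names (nav.html, content.html) are hypothetical.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Frameset example</title>
</head>
<!-- Two columns: a narrow navigation frame and a wide content frame -->
<frameset cols="25%,75%">
  <frame name="nav" src="nav.html" noresize="noresize" scrolling="auto" />
  <frame name="content" src="content.html" />
  <!-- Shown only by browsers that cannot display frames -->
  <noframes>
    <body>
      <p>This site uses frames. <a href="content.html">View the content without frames</a>.</p>
    </body>
  </noframes>
</frameset>
</html>
```

A link inside nav.html would then load its page into the other frame by naming it, e.g. with target="content".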

About Karl

I live in the Boston area, and am currently studying for a BS in Computer Science at UMass Boston. I graduated with honors with an AS in Computer Science (Transfer Option) from BHCC, acquiring a certificate in OOP along the way. I also perform experimental electronic music as Karlheinz.

4 Responses to A Guide to HTML

  1. Ben says:

    You probably know this, but the self-closing-ness of your tags are likely to be totally ignored by the browser unless you set the mime type to xhtml! For example, in most browsers this
    This is a paragraph
    will render exactly the same as this:
    This is a paragraph

    A common gotcha is thinking that you can use a self-closing script tag, e.g.

    instead of

    More on this on stackoverflow:
    http://stackoverflow.com/questions/69913/why-dont-self-closing-script-tags-work

    So there can be arguments for doing this for style or tool support, but the browser really doesn’t care.

    • Ben says:

      wow, looks like wordpress doesn’t escape tags. Sorry, but hopefully the stackoverflow article explains well enough

      • Karl says:

        Also, about the tags – nope, WordPress does not escape HTML; you can use it to mark up comments (as I did just now), so it really can’t do that. For HTML tags, you have to use the &lt; and &gt; escape sequences. I also had to do this when I wrote the article, so I know how much of a PITA it is.

    • Karl says:

      Ben: First of all, thanks for taking the time to read the article. I need all the help I can get…

      The “self-closingness” issue applies only to tags that do not represent empty elements, and that includes the <script> tag. It’s supposed to contain text data (the actual JavaScript code). Trying to make these tags self-closing is not valid XHTML, and browsers will consider it “tag soup.”

      If the tag is actually self-closing – like the <br /> or <input /> tags – then the XHTML standard demands that they be properly terminated, or they won’t validate. The HTML standard (even HTML5) does not, but since XHTML is the one that has been used for years, I think it’s better to include the terminating slash.

      However, the Stack Overflow post did show something else that I wasn’t aware of: the <p> tag cannot contain other block-level tags, like <div>. (Inline tags are fine.) If you try to do this, the browser will consider it “tag soup” and automatically treat it as a tag with empty content. In other words, <p><div>Hello, world!</div></p> will turn into <p></p><div>Hello, world!</div><p></p>.

      I’ll update the article with this info. So, thank you for pointing this out. Please let me know if you find anything else in the article that needs work.
