HTML and the World Wide Web
While these technologies are not directly relevant to writing HTML files, you should probably know a little about them, so I’ll give a brief overview. But, be forewarned: this is a deep subject, and I’ll only have a chance to scratch the surface. Moreover, web technologies change rapidly, so this section may be obsolete in a couple of years. (Whether this makes you discouraged or enthusiastic is up to you to decide, but if you want to be a programmer, you’d better get used to it.) I don’t expect you to remember everything, but hopefully I will at least be able to give you a vague notion of how everything works.
Transmitting HTML over the Internet
The most common way that HTML pages are viewed is over the World Wide Web. If this is the case, then the HTML page is delivered using HTTP (the HyperText Transfer Protocol). Nearly all websites use HTTP/1.1, though it has technically been superseded by HTTP/2 (standardized in 2015). This protocol is relatively simple: each message starts with a short header that specifies some pretty basic metadata, such as the document type, the character encoding, the type of transfer request, and (in responses) a code representing the request status. You’ve probably seen many of the HTTP status codes already, such as 404 (Not Found), 403 (Forbidden), and possibly 500 (Internal Server Error). If the status is 200 (OK), then the HTML page will be delivered as planned, and the status code is not displayed to the user.
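To make this concrete, here is a minimal sketch (in Python) of what an HTTP response header looks like on the wire, and how its status line and headers can be picked apart. The response text is invented for illustration.

```python
# A made-up (but correctly formed) HTTP/1.1 response header.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 42\r\n"
    "\r\n"
)

# The headers end at the first blank line; the body (if any) follows.
head, _, body = raw_response.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")
version, status_code, reason = status_line.split(" ", 2)
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, status_code, reason)   # HTTP/1.1 200 OK
print(headers["Content-Type"])        # text/html; charset=utf-8
```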
All HTTP transfers are simple plaintext, and are therefore not secure. For this reason, many websites are now using HTTPS (Hypertext Transfer Protocol Secure, or colloquially “secure HTTP”). HTTPS is just HTTP data, encrypted using SSL (Secure Sockets Layer) or its successor, TLS (Transport Layer Security). Most security-conscious websites (e.g. banking websites) have been using HTTPS for years, but with the revelations about NSA spying, it’s becoming common to use HTTPS for everything.
But it is not necessary to view HTML pages via the Web. HTML documents are simply plaintext files, saved with an
.html (or .htm) extension. You can create them using any plaintext editor, and view them simply by opening the file in a browser, e.g. by double-clicking on it. This means that writing HTML is staggeringly easy. It also means that HTML files are easy to distribute – for example, many programs’ “help systems” are just directories containing documentation in HTML format, which are installed when the program is installed. Naturally, there is no HTTP transfer in these situations, so any information that would be in the HTTP header is absent. This isn’t a big deal, but it does mean that you should include any relevant metadata in the HTML page itself. You do this in the HTML header section. Additionally, the type of HTTP request is relevant when talking about HTML forms. I’ll cover all of this later in the article.
All devices that are connected to the Internet (computers, mobile phones, smart TV’s, etc.) have an IP address (Internet Protocol address). This is a simple numeric address that is used for routing Internet data to the device. The most commonly used IP version is IPv4, which uses a 32-bit address. It is usually specified in quad-dotted notation: each byte (8 bits) is written individually in decimal notation, and separated by a period (the “dot”). So, a typical IPv4 address might be 192.0.2.44.
However, a 32-bit number can only handle 2³² devices worldwide – a little over 4 billion devices. Considering that there are about 7 billion people on the planet (as of this writing), and some of them have two or three Internet-enabled devices, the world is rapidly running out of IPv4 addresses. This situation is called IPv4 address exhaustion, and forward-thinking Web architects have been aware of it for many years. The newest version of IP addressing is IPv6, which has a 128-bit address space. This means it can hold more addresses than there are stars in the known universe, so it should do for a while. An IPv6 address is represented as eight groups of four hexadecimal digits, separated by colons. Leading zeros can be left off, and a single run of consecutive all-zero groups can be collapsed into a double colon (this can only be done once per address). For example, an IPv6 address might be
2e0a:00d8:b5a3:0000:0000:8a2c:0370:73d4, which would be shortened to
2e0a:d8:b5a3::8a2c:370:73d4. Right now, most people are still using IPv4, but this will surely change in the near future.
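If you’d rather not do this shortening by hand, Python’s standard ipaddress module applies the same rules. A quick sketch using the example address above:

```python
import ipaddress

full = "2e0a:00d8:b5a3:0000:0000:8a2c:0370:73d4"
addr = ipaddress.ip_address(full)

# Leading zeros are dropped, and the longest run of zero groups
# collapses into "::".
print(addr.compressed)   # 2e0a:d8:b5a3::8a2c:370:73d4

# The sizes of the two address spaces:
print(2 ** 32)    # 4294967296 (IPv4: a little over 4 billion)
print(2 ** 128)   # IPv6: about 3.4 * 10**38
```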
If you find these numbers difficult to understand, then imagine how the average Internet user would feel. Humans are much better at remembering words than numbers, so nearly all websites have a human-readable name that is translated into the relevant IP address behind the scenes. This is done using the DNS (Domain Name System). The DNS is hierarchical in nature, where each level in the hierarchy is a different “zone”; authority over each zone is delegated to different organizations. DNS names are resolved by computers running database software, located around the world, and the computers that do this are called name servers.
The “suffix” of the domain name (.com, .org, .gov, etc.) is called the TLD (Top-Level Domain); TLD’s occupy the highest level beneath the root zone of the DNS. A TLD may be a gTLD (Generic Top-Level Domain), like
.net, or a ccTLD (Country-Code Top-Level Domain), like
.uk. There are other kinds of TLD’s, but these are the most common. As the domain name goes from right to left, each string of characters (separated by a period) refers to a subdomain of the domain to its right. When a domain name can be mapped onto a specific IP address, it is called a hostname. Once a hostname is found, the job of the DNS name server is over, and the web server at that IP address takes over. Because hostnames usually resolve to the same IP address for long periods of time, the results of DNS queries are often cached at various locations (possibly including the user’s computer), so changes often take time to propagate.
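The right-to-left hierarchy is easy to see if you split a domain name into its labels. A small sketch, using a hypothetical hostname:

```python
hostname = "www.example.co.uk"
labels = hostname.split(".")

# Walk from the TLD leftward: each step adds a subdomain of the zone
# to its right.
zones = [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]
print(zones)   # ['uk', 'co.uk', 'example.co.uk', 'www.example.co.uk']
```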
The entire root zone is managed by ICANN (the Internet Corporation for Assigned Names and Numbers), under the authority of the U.S. Department of Commerce. The most widely used TLD’s (including
.com and .net) are managed by Verisign, a U.S. company. Many people are troubled that so much of the DNS falls under U.S. jurisdiction, and ICANN announced in March of 2014 that it will be moving away from U.S. control. Pretty much everyone agrees that this is long overdue.
“Internet” means more than the Web, and the same hostname (or IP address) can be used for email, file transfers, Telnet, and so forth. This is accomplished by using port numbers: 16-bit numeric values used to route traffic to a specific application. Each application listens on a specific port number, and handles whatever traffic is routed to that port. The default port number used with HTTP is port 80, while HTTPS uses port 443. Another common HTTP port number is 8080, which is often used when testing server-side applications on a local computer. Eventually, you will also encounter ports 21 and 22, which are the default ports used with FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol), respectively. Most people use these protocols to transfer files from a local computer to a remote web server, and vice versa.
Naturally, each type of information has its own transmission requirements, so HTTP is far from the only protocol in use. HTTP and its kin are application-layer protocols, and they all ride on top of the transport layer – the part of a communication suite that handles message passing. There are two major protocols at the transport layer: TCP (Transmission Control Protocol), and UDP (User Datagram Protocol). The major difference between the two is that TCP is connection-oriented, while UDP is connectionless. That is, TCP requires that a connection be established at both ends before any data is transferred, while UDP does not. UDP is therefore faster than TCP, but less reliable. The major Internet protocols (including HTTP, HTTPS, POP3, IMAP, FTP, and Telnet) run over TCP, and this suite of protocols is often referred to as TCP/IP. The primary exception is DNS resolution, which usually uses UDP over port 53.
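In the sockets API, the choice between the two transport protocols comes down to the socket type. A minimal sketch (no data is actually sent):

```python
import socket

# SOCK_STREAM requests a connection-oriented TCP socket;
# SOCK_DGRAM requests a connectionless UDP socket.
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

print(tcp_sock.type, udp_sock.type)

tcp_sock.close()
udp_sock.close()
```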
Other than HTTP, another commonly-encountered protocol is the
mailto protocol (strictly speaking, a URI scheme). It is used to send an email to a given email address. In this case, the URL will simply be
mailto: followed by the email address. This protocol used to be common in HTML pages, but it fell out of use fairly quickly, because there are a number of problems with it:
- The email address is visible in the source of the HTML page, thus visible to malicious user agents (like spambots).
- Web browsers will simply use the operating system’s email client. If the user reads email through a webmail service (like Gmail or Outlook.com), then they won’t have an email client set up on the operating system itself. They will either not be able to send the email with this method, or (more likely) the OS will attempt to set up whatever email client came with the OS. This will result in a barrage of questions to the user, whose only purpose is to set up software that they don’t want.
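For completeness, here is a sketch of what a mailto URL looks like when built by hand; the address and subject are placeholders, and special characters must be percent-encoded:

```python
from urllib.parse import quote

address = "someone@example.com"
subject = "Hello, world"

# Percent-encode the subject so spaces and punctuation survive the URL.
mailto_url = "mailto:" + address + "?subject=" + quote(subject)
print(mailto_url)   # mailto:someone@example.com?subject=Hello%2C%20world
```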
Of course, the above information is not enough to locate a specific resource on a website (an HTML page, PDF file, picture, or whatever). You also need to name the resource that you’re trying to get – for example, the HTML page’s filename. That resource may be inside (virtual) directories on the website, in which case the directories are separated by a forward slash, exactly as they would be on a Unix file system. And like a file system, the combination of folders and filename is called the path to the resource. This is not accidental; you’ll see why when I talk about web servers.
The combination of protocol, port, host, and path is called a URL (Uniform Resource Locator). It follows this format:
[protocol]://[IP address or hostname]:[port]/[path]
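Python’s urllib.parse module splits a URL into exactly these pieces, which makes the format easy to verify. A quick sketch with a made-up URL:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:8080/posts/my-post.html")

print(parts.scheme)    # http  (the protocol)
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /posts/my-post.html
```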
There are a few miscellaneous things you should know about URL’s:
- The protocol and hostname are not case-sensitive; DNS name servers ignore case. The path, however, is case-sensitive (just as file and folder names are case-sensitive on an operating system). It is standard practice to use lower case for everything.
- The port is optional; if not specified, the browser will use the default port.
- Most browsers will automatically add
http:// if you enter the rest of the URL into the location bar.
- If no specific resource is given in the URL, then the web server will look for a default page, usually called
index.html. More on this later.
When a URL has all of the above parts, it is called a fully-qualified URL. But you do not always need (or want) a fully-qualified URL. When you create a link to another document with an HTML tag, you use a URL to specify where that document is. But if all these URL’s were fully-qualified, it would be disastrous if you moved your website to a different domain; you would have to change all the URL’s in all the HTML pages of your entire website.
For this reason, a URL may also be a relative URL. A relative URL is a path to another document, using the current document’s directory as the current directory. For example, say your HTML page is at
http://www.example.com/posts/my-post.html, and inside that page, you want to create a link to
http://www.example.com/home/index.html. The relative URL would be
../home/index.html. The two dots mean “go up one directory,” exactly as if you were traversing a file system on the command line. Again, this is no accident; when you use a relative URL, this is essentially what you are doing. If you are referencing another file in the same directory, you only need the file name.
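This resolution can be sketched with urllib.parse.urljoin, which follows the same rules a browser does; the URLs are the examples from above:

```python
from urllib.parse import urljoin

base = "http://www.example.com/posts/my-post.html"

# "../" climbs out of /posts/ before descending into /home/.
print(urljoin(base, "../home/index.html"))
# http://www.example.com/home/index.html

# A file in the same directory needs only its name.
print(urljoin(base, "other-post.html"))
# http://www.example.com/posts/other-post.html
```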
There are two other parts of a URL, and they are optional. They are:
- A query string
- This is appended to the URL, and is usually used to provide data to dynamic resources. I’ll go into details when I talk about dynamic resources (below), and when I talk about HTML forms (later in the article). It can also be used with the
mailto protocol, to specify default text in the subject line or email body. But using the
mailto protocol is a bad idea, so I won’t talk about it in this article.
- A fragment identifier
- This is an identifier for a “fragment” of a resource. In an HTML file, it is the value of a tag’s
id attribute. The browser will load the page, then scroll down to the HTML tag with that ID. The fragment is preceded by a pound symbol (
#), and must be the last part of the URL (after the query string, if there is one).
A URL is actually a type of URI (Uniform Resource Identifier). The main difference, at least according to the W3C, is that “URL” is an informal term for a URI that specifies the location of a resource and the scheme used to access it. A URI may also be a URN (Uniform Resource Name), which provides a global identification for the name of a document, but not its location, nor how to access it. As an analogy, a URN might give you the ISBN of a book, but it will not give you directions to your local library, or tell you whether the book will be available when you get there.
In my experience, most people use “URL” when they refer to resources available on the Internet, and “URI” in all other situations.
Of course, simply knowing the location of a resource does not guarantee that you can access it. (It would be catastrophic if you could retrieve any file on any computer simply by typing a URL into a browser.) You need a program that will take an incoming URL, and route it to an actual file sitting in the computer’s file system.
This is the primary purpose of HTTP server software, usually called web server software. The term “web server” can refer to the HTTP server software, or the hardware on which it runs, or both. Some people differentiate between the two using capitalization: “server” is the software, “Server” is the hardware. I won’t do this; I think it should be clear from the context.
A web server listens for incoming HTTP requests, on specific ports, and routes them to a specific resource on the machine. This process is called path translation. If the resource is static, then it is simply returned as-is. If a resource is dynamic, it is processed on the server, and the results of that process are returned instead. Both types of resources are returned using the same HTTP connection as the incoming request, so the response will naturally use the HTTP protocol.
Information can be passed to dynamic content in the URL itself, using a query string. This is a string of field/value pairs, preceded by a question mark, where field names and values are paired using an equals sign, and pairs are separated by an ampersand. It is appended to the URL of the dynamic resource. Here is an example of a URL with a query string:
http://www.example.com/search.php?query=html&page=2
Query strings like this are commonly generated using HTML forms, and I’ll go into detail when I talk about those.
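The standard library can both build and parse query strings. A sketch with invented field names:

```python
from urllib.parse import parse_qs, urlencode

# Build a query string from field/value pairs...
query = urlencode({"query": "html", "page": "2"})
print(query)   # query=html&page=2

# ...and parse one back into a dictionary. Values come back as lists,
# because a field may legally appear more than once.
fields = parse_qs(query)
print(fields)  # {'query': ['html'], 'page': ['2']}
```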
A web server treats one directory on the hard drive as the document root, set up specifically to hold web content. A common directory name is
www, and on Linux systems that folder is usually located under the
/var directory (i.e. /var/www). On multi-user systems, each user might have their own document root directory.
By default, directories in the URL are translated to sub-directories of the document root. For example, this URL:
http://www.example.com/posts/my-post.html
…would be translated to this file in the document root directory:
/var/www/posts/my-post.html
If no filename for a resource is given, then the web server will search for a default page. This is usually called “index,” and web servers are set up to look for an HTML file by default (thus, the default page is usually
index.html). It may also be set up to search for a default script file (like
index.php).
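The default translation is simple enough to sketch in a few lines. The document root and default page here are assumptions, not a real server’s configuration (and a real server must also guard against “..” escaping the document root):

```python
import posixpath

DOCUMENT_ROOT = "/var/www"     # assumed document root
DEFAULT_PAGE = "index.html"    # assumed default page

def translate(url_path: str) -> str:
    # Normalize the URL path (collapse ".", "..", duplicate slashes).
    path = posixpath.normpath("/" + url_path.lstrip("/"))
    if path == "/":
        path = "/" + DEFAULT_PAGE
    return DOCUMENT_ROOT + path

print(translate("/posts/my-post.html"))  # /var/www/posts/my-post.html
print(translate("/"))                    # /var/www/index.html
```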
But this is not the only way that path translation can be accomplished. A web server can take a URL path, and route it to a resource that is not located at that path on the filesystem. It can also transform a “directory” in a URL path into a query string, and send that query string to a dynamic resource. For example, this URL:
http://www.example.com/posts/42
…may be routed to the same resource, with the same query string, as this URL:
http://www.example.com/posts.php?id=42
This process is called URL rewriting, and the software that does it is called a rewrite engine. (Though URL rewriting is often done by the web server software, it doesn’t have to be.) There are many advantages to rewriting URL’s:
- Semantic URL’s: users are more likely to understand a short, “clean” URL than a long URL or one with a query string.
- SEO: web crawlers have historically handled URL’s with query strings poorly. By rewriting query strings as URL paths, dynamic content is more likely to show up in search engines.
- Permalinks: if Web content (static or dynamic) moves to a new location, URL rewriting can make the old URL point to the new location. Otherwise, old links would point to permanently unavailable resources, a condition known as link rot.
- Security: A rewritten URL can hide the implementation used to produce dynamic content. This will help protect the site against attackers, who may target a specific implementation’s security flaws.
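A rewrite rule is essentially a pattern match on the URL path. Here is a toy sketch in Python, in the spirit of the transformation described above; the script name and field name are hypothetical:

```python
import re

def rewrite(path: str) -> str:
    # Map a "clean" path like /posts/42 onto a query-string URL.
    match = re.fullmatch(r"/posts/(\d+)", path)
    if match:
        return "/posts.php?id=" + match.group(1)
    return path   # anything else passes through unchanged

print(rewrite("/posts/42"))     # /posts.php?id=42
print(rewrite("/about.html"))   # /about.html
```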
By a wide margin, the most popular web server software is the Apache HTTP Server. (The Apache Foundation abbreviates it “httpd,” but most people just call it “Apache.”) It is open-source and cross-platform, and is often included in Linux distributions, all of which have unquestionably contributed to its widespread success. Also very popular is Microsoft’s IIS (Internet Information Services), which is not open-source, and requires Windows. Both of these have been around for many years, and are very secure and stable. What they are not, however, is lightweight, and several newcomers have stepped up to fill that gap. Of those, the most widely used are Nginx and lighttpd.
Most larger websites don’t have just one computer running web server software. Instead, they have a private network of many web servers, possibly thousands of them. Usually, one web server acts as a “master” server, with others acting as “slave” servers. The job of the master server is to route HTTP requests on one public, virtual IP address to a private, “real” address of a slave server. This is done so that no one server is overloaded with requests. This process is called load balancing, and there are a number of different algorithms to decide which slave server gets which request.
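The simplest such algorithm is round robin: hand each incoming request to the next server in the list, wrapping around at the end. A sketch with made-up private addresses:

```python
import itertools

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
next_backend = itertools.cycle(backends).__next__

# Five requests get spread across the three servers in turn.
assigned = [next_backend() for _ in range(5)]
print(assigned)
# ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1', '10.0.0.2']
```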
Of course, most people don’t want to worry about this stuff, and they certainly don’t want to worry about maintaining the necessary hardware. Instead, they pay money to private companies who worry about it for them. Companies that offer this sort of service are called web hosts, and there are dozens of them, offering many kinds of services. If users are running a personal website, or have a small business, they will pay for shared hosting. This is space on a shared web server, where websites use the same operating system, software, CPU, and RAM as everyone else. Usually, when you pay for shared hosting, you’re simply paying to be a user on a shared Linux server. Services like these usually have tiered pricing, where each tier pays for pre-determined amounts of disk space and bandwidth.
For larger businesses, or businesses that are built around web applications, shared hosting won’t be enough. In the past, they would spend the extra money for dedicated hosting, where they rent out their own private machine, usually with server software pre-installed. Today, most larger companies are taking advantage of cloud services. Cloud services work like dedicated hosting, in the sense that clients rent out their own virtual machine (and possibly pre-installed software). But these virtual machines actually span clusters of hardware machines, so have the advantage of automatically scaling hardware resources (like memory or CPU’s) as the need arises. Cloud services are priced like utilities, where the client pays for the precise amount of resources that they use per hour. Examples include Amazon EC2, Google Compute Engine, and Microsoft Azure.
Most of you won’t need to worry about cloud services for quite a while; shared hosting works perfectly fine for personal websites or smaller businesses. In fact, if you’re a CS student at a university, you probably have an account on your university’s Linux system. This is usually the same system that is used to host the professors’ web pages, so they probably have Apache running already. You may be able to create a website simply by putting HTML files inside a
public_html folder in your home directory. If that’s the case, your web pages would be at URL’s that look something like this:
http://www.example.edu/~yourusername/my-page.html
Generating Dynamic Content
Back in the days of yore, a web site was nothing more than a collection of HTML pages. But those days are long past. Because an HTML page is static, it cannot represent data that is in flux – data from a database, say, or even the current time. Unless your website consists of content that never changes, you need something better. Different technologies and techniques gradually evolved to fix this problem.
One solution was to make pages that were almost entirely HTML, but had small amounts of scripting code embedded directly into them that would be interpreted on the server. The code was separated from the HTML by language-specific tags; the server would process this code, but not the HTML around it. When this technique was widespread, common choices of server-side scripting technologies were PHP (PHP: Hypertext Preprocessor – a recursive acronym), JSP (JavaServer Pages), or ASP (Active Server Pages – not to be confused with ASP.NET, which is completely different).
Though embedded scripting is still occasionally used, it isn’t a very elegant solution, for a variety of reasons. There is no separation of concerns at all – presentation and behavior are hopelessly intertwined. The pages have to be parsed by the server every time they are accessed. The scripts are necessarily short, and usually do relatively trivial things. The pages are still fundamentally HTML, so the server can’t provide the same data in other formats. And the pages are bound to specific HTML markup; different websites (with wildly different HTML pages) couldn’t easily re-use the same code. When this technique is used nowadays, it is usually done when writing templates for a CMS or web framework (see below).
The general solution is to reverse the roles of HTML and server-side scripts. It is the scripting language that handles the HTML, and not the reverse. This means that a website’s dynamic content can be generated by a full-fledged web application, written in a specific scripting language. HTML files are no longer stored on the server as individual pages, but instead function as templates for pages. Often, these will not be templates for an entire page, but for different sections of a page (header, footer, etc). Like puzzle pieces, they are assembled on the server as needed. Different websites can use the same software, and simply change the templates.
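A minimal sketch of this assembly, using Python’s string.Template as a stand-in for a real template engine; the templates and data are invented:

```python
from string import Template

# Section templates, stored separately on the server.
header = Template("<header><h1>${title}</h1></header>")
body = Template("<main><p>${content}</p></main>")
footer = "<footer>My Example Site</footer>"

# Assemble the puzzle pieces into one page, filling in dynamic data.
page = "\n".join([
    header.substitute(title="My Post"),
    body.substitute(content="Hello from the database!"),
    footer,
])
print(page)
```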
Of course, full-fledged applications are not easy to create, so many people started creating software solutions that would enable rapid development. The general term for this is a web application framework, or just “web framework.” Web frameworks typically integrate with software development tools (debuggers, IDE’s, version control systems, etc.), commonly have their own URL rewrite engines, and are sometimes bundled with their own web servers. Nearly all of them follow the MVC (Model-View-Controller) software architectural pattern. Many will automatically create “generic” web application code, which is later customized; this is called scaffolding.
Right now, popular web frameworks include ASP.NET, Ruby on Rails, Spring, Node.js (strictly a runtime environment, usually paired with frameworks like Express), and Django. But new web frameworks are being created on what seems like a daily basis, and it’s likely that today’s most popular framework will be tomorrow’s ColdFusion.
A CMS (Content Management System) takes this approach to its logical conclusion: a complete, ready-made web application for creating and managing a site’s content, usually through a web-based admin interface. The most popular CMS’s right now are WordPress, Drupal, and Joomla, but like web application frameworks, more are being created all the time. The vast majority are open-source, and most are written in PHP. This makes them the go-to solution for personal or small business websites. If you haven’t already set one up yourself, you probably will eventually. In fact, many people first encounter HTML when they customize the template files for a CMS.
HTML is great for representing data in a web browser, but it is not great for representing large amounts of data on a server. (If it were, nobody would need dynamic websites in the first place.) Instead, data on a web server is stored and retrieved using some form of dedicated database software, often housed on a completely different computer. Dynamic websites use the database to store nearly everything: not just content like blog posts or comments, but also usernames and passwords, configuration settings, or even the website’s color scheme.
The software that handles a database is called a DBMS (Database Management System). Server-side software (e.g. a CMS or application framework) communicates with a DBMS using an API (Application Programming Interface), which is usually specific to a particular programming language. Once a connection with the DBMS is set up, the database can be sent a query, which is a command to manipulate the data in some way, and get a result (or set of results) in return. Most database queries are simple CRUD commands (Create, Read, Update, Delete), but they may also be commands to join related tables together in one query, commands to modify the database, and more that I won’t get into here. The DBMS also manages the security of the database, so web application software can be restricted from performing some of these queries.
The most widely-used form of database is a relational database, so called because it is based on the relational model. A relational database stores information in tables, where each column is a simple type of information (e.g. a number or string), and each row is the information for one specific entry in the table. The set of column types is called the schema, and all entries in a table must share the same schema. Each entry in a table has a primary key, a column (or set of columns) that uniquely identifies that entry. Entries in one table may reference entries in other tables through their primary keys.
Software that handles a relational database is called, unsurprisingly, an RDBMS (Relational Database Management System). Popular RDBMS’s include Oracle, MySQL, PostgreSQL, Microsoft SQL Server, and SQLite.
Since RDBMS’s work roughly the same way, it makes sense to have a common language that can be used to query any of them. This language is called SQL (Structured Query Language). SQL is standardized by the ISO/IEC 9075 specification, but in practice, every RDBMS has its own particular quirks. Even so, the most common SQL commands are supported by every RDBMS, so switching to a different RDBMS requires only modifications to server-side code (and not a complete re-write). Still, many CMS’s and web frameworks use objects (or occasionally functions) to abstract away the SQL entirely, handling the quirks of each RDBMS itself. An object that is an abstraction of a database table, and whose state is saved in that table, is called an active record. They are very common in MVC frameworks, where they are part of the model.
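The CRUD cycle is easy to demonstrate with SQLite, since Python ships with a driver for it. A sketch against a throwaway in-memory database; the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")

# Create
conn.execute("INSERT INTO posts (title) VALUES (?)", ("Hello, world",))
# Read
row = conn.execute("SELECT id, title FROM posts").fetchone()
print(row)   # (1, 'Hello, world')
# Update
conn.execute("UPDATE posts SET title = ? WHERE id = ?", ("Goodbye", row[0]))
updated = conn.execute("SELECT title FROM posts WHERE id = ?", (row[0],)).fetchone()[0]
print(updated)   # Goodbye
# Delete
conn.execute("DELETE FROM posts WHERE id = ?", (row[0],))
count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(count)   # 0

conn.close()
```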
Relational databases are good general-purpose databases, and most RDBMS’s have been around for many years, making them stable, secure, and well-understood by developers. However, they may not be the best solution for aggregate data structures that can’t easily be stored in tables, and are not optimized for running on clusters of computers. So in recent years, other database types have been created. These include document, key-value, graph, and wide-column stores, and are collectively called NoSQL (Not Only SQL) databases. Right now, NoSQL databases are used by a tiny minority of websites, but their use is rapidly growing, especially in cloud computing. The most widely-used are MongoDB, Cassandra, and Redis. But the field is evolving rapidly, so who knows how long that will last.
…Like I said, this is a deep subject. Take comfort in the fact that this part of the article is over. Still, if you’re serious about web programming, you will want to learn more.
But for now, let’s return to the nitty-gritty of HTML.