Structure of an HTML 4.0 Document

Elements and Tags
Attributes
Special Characters
Comments
A Complete HTML 4.0 Document
Validating your HTML

Elements and Tags

Elements are the structures that describe parts of an HTML document. For example, the P element represents a paragraph while the EM element gives emphasized content.

An element has three parts: a start tag, content, and an end tag. A tag is special text ("markup") that is delimited by "<" and ">". An end tag includes a "/" after the "<". For example, the EM element has a start tag, <EM>, and an end tag, </EM>. The start and end tags surround the content of the EM element:

<EM>This is emphasized text</EM>

Element names are always case-insensitive, so <em>, <eM>, and <EM> are all the same.

Elements cannot overlap each other. If the start tag for an EM element appears within a P, the EM's end tag must also appear within the same P element.
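
As a brief illustration (the sentence text is invented for this sketch), the first line below overlaps the P and EM elements and is invalid, while the second nests EM entirely within P:

<P>This is <EM>emphasized text.</P></EM>
<P>This is <EM>emphasized text.</EM></P>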

Some elements allow the start or end tag to be omitted. For example, the LI end tag is always optional since the element's end is implied by the next LI element or by the end of the list:

<UL>
  <LI>First list item; no end tag
  <LI>Second list item; optional end tag included</LI>
  <LI>Third list item; no end tag
</UL>

Some elements have no end tag because they have no content. These elements, such as the BR element for line breaks, are represented only by a start tag and are said to be empty.
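
A minimal sketch of an empty element in use; the address lines are made up for illustration. Each BR consists of a start tag alone, with no content and no end tag:

<P>Web Design Group<BR>
123 Example Street<BR>
Anytown</P>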

Attributes

An element's attributes define various properties for the element. For example, the IMG element takes a SRC attribute to provide the location of the image and an ALT attribute to give alternate text for those not loading images:

<IMG SRC="wdglogo.gif" ALT="Web Design Group">

An attribute is included in the start tag only--never the end tag--and takes the form Attribute-name="Attribute-value". The attribute value is delimited by single or double quotes. The quotes are optional if the attribute value consists solely of letters in the range A-Z and a-z, digits (0-9), hyphens ("-"), and periods (".").
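
The quoting rules can be sketched with a few IMG tags; the file names and alternate text here are hypothetical. The first tag may omit quotes because every value uses only letters, digits, hyphens, and periods. The second needs quotes because its values contain a slash and spaces. The third uses single quotes as delimiters so that the value can contain double quotes:

<IMG SRC=logo.gif ALT=Logo WIDTH=88 HEIGHT=31>
<IMG SRC="images/logo.gif" ALT="Web Design Group logo">
<IMG SRC="logo.gif" ALT='The "WDG" logo'>

Quoting every attribute value is the safer habit, since it is never wrong to include the quotes.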

Attribute names are case-insensitive, but attribute values may be case-sensitive.

Special Characters

Certain characters in HTML are reserved for use as markup and must be escaped to appear literally. The "<" character may be represented with an entity, &lt;. Similarly, ">" is escaped as &gt;, and "&" is escaped as &amp;. If an attribute value contains a double quotation mark and is delimited by double quotation marks, then the quote should be escaped as &quot;.
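
A short sketch of these escapes in context; the sentence, company name, and file name are illustrative only. The paragraph displays a literal tag and a literal ampersand, and the ALT value contains escaped double quotation marks because it is delimited by double quotation marks:

<P>Write &lt;EM&gt; to show the tag itself, and AT&amp;T to show an ampersand.</P>
<IMG SRC="quote.gif" ALT="She said &quot;hello&quot;">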

Other entities exist for special characters that cannot easily be entered with some keyboards. For example, the copyright symbol ("©") may be represented with the entity &copy;. See the Entities section for a complete list of HTML 4.0 entities.

As an alternative to entities, authors may also use numeric character references. Any character may be represented by a numeric character reference based on its "code position" in Unicode. For example, one could use &#169; for the copyright symbol or &#1575; for the Arabic letter ALEF.
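
To make the equivalence concrete, here is a small sketch with invented paragraph text. The numbers are the decimal Unicode code positions of the characters: 169 for the copyright symbol and 1575 for ALEF.

<P>&copy; 1998 and &#169; 1998 both display the copyright symbol.</P>
<P>&#1575; displays the Arabic letter ALEF.</P>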

Comments

Comments in HTML have a complicated syntax that can be simplified by following this rule: Begin a comment with "<!--", end it with "-->", and do not use "--" within the comment.
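
The rule can be sketched as follows, with invented comment text. The first two comments follow the rule; the third breaks it by including a double hyphen in its text and should be avoided:

<!-- This is a comment. -->
<!-- A comment may span
     several lines. -->
<!-- Risky: a double hyphen -- appears inside this comment -->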

A Complete HTML 4.0 Document

An HTML 4.0 document begins with a DOCTYPE declaration that declares the version of HTML to which the document conforms. The HTML element follows and contains the HEAD and BODY. The HEAD contains information about the document, such as its title and keywords, while the BODY contains the actual content of the document, made up of block-level elements and inline elements. A basic HTML 4.0 document takes on the following form:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
        "http://www.w3.org/TR/REC-html40/strict.dtd">
<HTML>
  <HEAD>
    <TITLE>The document title</TITLE>
  </HEAD>
  <BODY>
    <H1>Main heading</H1>
    <P>A paragraph.</P>
    <P>Another paragraph.</P>
    <UL>
      <LI>A list item.</LI>
      <LI>Another list item.</LI>
    </UL>
  </BODY>
</HTML>

In a Frameset document, the FRAMESET element replaces the BODY element.
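
As a rough sketch of that case, the document below uses the HTML 4.0 Frameset DOCTYPE and a FRAMESET in place of BODY. The file names, frame titles, and column widths are hypothetical, and the NOFRAMES content is only one possible fallback for browsers that do not display frames:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
        "http://www.w3.org/TR/REC-html40/frameset.dtd">
<HTML>
  <HEAD>
    <TITLE>The document title</TITLE>
  </HEAD>
  <FRAMESET COLS="25%,75%">
    <FRAME SRC="menu.html" TITLE="Menu">
    <FRAME SRC="content.html" TITLE="Content">
    <NOFRAMES>
      <BODY>
        <P><A HREF="menu.html">Menu</A></P>
      </BODY>
    </NOFRAMES>
  </FRAMESET>
</HTML>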

Validating your HTML

Each HTML document should be validated to check for errors such as missing quotation marks (<A HREF="oops.html>Oops</A>), misspelled element or attribute names, and invalid structures. Such errors are not always apparent when viewing a document in a browser since browsers are designed to recover from an author's errors. However, different browsers recover in different ways, sometimes resulting in invisible text on one browser but not on others.

The W3C HTML Validation Service checks the validity of HTML 4.0 documents.

Note that some programs claim to be validators but really are not. A validator checks a document against a formal document type definition (DTD) while other programs such as lints warn about valid but unsafe HTML. Both kinds of programs are useful, but validation should never be forgotten.