| HTML Guide: Text and Markup | cern archive | HTTP only |
Text and Markup
This part of the HTML reference is an explanation of SGML syntax as it applies to HTML. For lexical issues, the purpose is to take the standard and reduce it from the abstract system that is SGML to a concrete language, HTML. For structural issues, the purpose is to give you enough background to read the DTD.
Structured Text
An HTML document is a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example:
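The example itself is missing from this copy. As a stand-in, here is a minimal sketch; the title, heading text, list text, and the anchor name `first` are all invented for illustration. It is written so that the UL element contains, in order, an LI, some text, an A element, another LI, and more text (ignoring line breaks).

```html
<TITLE>A sample document</TITLE>
<H1>An example of structure</H1>
<UL>
<LI>Item one has an <A NAME="first">anchor</A>
<LI>Item two is plain text.
</UL>
```

Here TITLE, H1, UL, and A each appear as a start tag, content, and an end tag, and the A start tag carries one attribute, NAME.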
For elements that have content, the content is a sequence of data characters and nested elements. The content must match the element's model group from its declaration in the DTD. Using the example from above, the content of the UL element is the sequence "LI, #PCDATA, A, LI, #PCDATA". This matches the model group from the UL element declaration: "(#PCDATA|LI|A)+".
Parsing Content Into Data and Markup
An HTML document is like a text file, except that some of the characters are interpreted as markup rather than as document content. The following table lists the special character sequences that separate data from markup in an HTML document.
SGML delimiters
Normal Text: Parsed Character Data
In the DTD, the symbol PCDATA stands for parsed character data, the normal text characters in an HTML document.
The text consists of a stream of lines. The division into lines has no significance apart from marking the end of a word. All of the SGML delimiters listed in the table of delimiters are recognized in PCDATA.
Raw Text: Character Data
In the DTD, the symbol CDATA stands for character data, the text without markup in an SGML document. Only the end tag open delimiter is recognized in CDATA.
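To contrast the two kinds of text, here is a small sketch. It assumes an element whose content is declared as CDATA; XMP is used because the HTML 2.0 DTD declares it that way, which may not match the DTD this guide describes. The sample text is invented.

```html
<!-- P is followed by PCDATA: the references below are parsed, so the
     reader sees the characters  <B> & </B>  and nothing is made bold -->
<P>&lt;B&gt; &amp; &lt;/B&gt;

<!-- XMP content is assumed to be CDATA: the tag-like text and the
     ampersand sequence are plain characters, and only the end tag
     open delimiter that starts the XMP end tag is recognized -->
<XMP><B> &amp; is shown exactly as typed</XMP>
```

In the first case the author must escape every delimiter; in the second, the only sequence that cannot appear in the data is the one that opens an end tag.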
Tags
The characters in an SGML document are organized into a hierarchy of elements by the use of tags. Tags are set off from the data characters by angle brackets: '<' and '>'.
Names
The element name immediately follows "<". Names consist of a letter followed by up to 33 letters, digits, periods, or hyphens. Names are not case sensitive.
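As a small illustration of these rules (the text content is invented), the first two lines below use the same element, since names are not case sensitive. H1 is a letter followed by a digit, which the rule permits. The list shows a start tag and an end tag that differ only in case.

```html
<H1>Names are not case sensitive</H1>
<h1>Names are not case sensitive</h1>
<UL>
<li>This list is opened by an upper-case tag and closed by a lower-case one.
</ul>
```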
Attributes
Following the element name, whitespace and attributes are allowed. An attribute consists of a name, an equal sign, and a value. Spaces are allowed around the equal sign.
The value is either a token or a literal. A token is up to 34 letters, digits, periods, or dashes. Tokens are case sensitive. A literal is a string surrounded by single quotes or a string surrounded by double quotes. Entity references are processed inside attribute values just as they are inside PCDATA. The length of an attribute value (after entity processing) is limited to 1024 characters.
Each attribute has a type, which puts constraints on the values it can have. For example, the NAME attribute of the A element is an ID. An ID is a name that must be unique among all IDs in the document.
Entities
To include characters that would otherwise be parsed as markup, you can use entity references to refer to those characters.
An entity reference is an ampersand, followed by a name, followed by a semicolon. No spaces are allowed within an entity reference. For example:
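The original example is missing from this copy. A minimal stand-in follows; the text, the anchor name `intro`, and the file name `guide.html` are invented. It uses the references `&lt;`, `&gt;`, and `&amp;`, which HTML defines for '<', '>', and '&'. The last line also shows a token value (`NAME=intro`), a literal value in double quotes, and an entity reference inside that literal.

```html
<H1>Fish &amp; Chips</H1>
To start a list, write &lt;UL&gt;.
<A NAME=intro HREF="guide.html?part=1&amp;page=2">Introduction</A>
```

After entity processing, the heading reads "Fish & Chips", the sentence shows the literal characters of a UL start tag instead of opening a list, and the HREF value contains a single '&' between its two parameters.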
Comments
Comment declarations can be used to include information aimed at persons and tools that read the document in source form. This information will be ignored when the document is processed by an SGML parser.
Comments begin with the character sequence "<!--" and end with "--", which must be followed by '>'. (Technically, whitespace is allowed between the closing "--" and '>'.) They are only allowed in PCDATA.
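A brief sketch of both forms described above; the heading and comment text are invented. The second comment puts whitespace between the closing "--" and '>', which the rule permits, although software that is not a strict SGML parser may not accept it.

```html
<H1>Release Notes</H1>
<!-- Draft: check the dates before publishing. -->
<!-- The space before the closing angle bracket is allowed. -- >
```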