A Look at XML
by Adam Rifkin
First published January 27, 1999
HTML is the HyperText Markup Language standardized by W3C for storing and exchanging documents on the World Wide Web. HTML was designed to be simple enough to support ease of authoring Web pages, rich enough to support multimedia embedding in documents, and flexible enough to support hypertext linking.
HTML is based on SGML, the Standard Generalized Markup Language standardized by ISO for defining and using portable document formats. SGML was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support management of large information repositories.
W3C's SGML working group, whose current efforts are described on its activity page, is attempting to standardize the delivery (in Web documents) of self-describing data structures of arbitrary depth and complexity. To that end, the group is simplifying SGML for use with the Web (and with Web technologies such as Java).
XML, Extensible Markup Language, is a simplified (but strict) subset of SGML that maintains the SGML features of validation, structure, and extensibility. XML is a standardized text format designed specifically for transmitting structured data to Web applications. In addition, XML's goals of being easier to learn, use, and implement than full SGML will have clear benefits for World Wide Web users, making it easier to define and validate document types, to author and manage SGML-defined documents, and to transmit and share them across the Web.
The Extensible Markup Language specification describes XML documents, a class of data objects stored on computers, and partially describes the behavior of XML processor programs used to read XML documents and provide access to their content and structure. XML allows generic SGML to be served, received, and processed on the Web in a manner similar to what is done with HTML today. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
XML documents are composed of entities, which are storage units containing text and/or binary data. Text is composed of character streams that form both the document character data and the document markup. Markup describes the document's storage layout and logical structure. XML also provides a markup mechanism to impose constraints on the storage layout and logical structure of documents.
XML and SGML
XML, like SGML, is a meta-language for describing the markup of different types of documents. However, its specification is 26 pages (versus 500 for SGML!). The W3C hopes that offering a simplified version of SGML will make implementing SGML much more palatable to vendors of Web authoring and browsing tools.
XML is not a replacement for SGML. Many features of SGML were left out to keep XML simple. Current SGML users may choose XML for network delivery, and since XML is a valid subset of SGML, the translation from SGML to XML is straightforward. XML was developed as an easy on-ramp to SGML for people who are not yet using it.
To simplify SGML, the W3C working group dropped support for certain features that put a heavy processing burden on SGML client software. As a result, a well-formed XML document is unambiguous: a browser or editor can read the tags and build a tree of the hierarchical structure without having to read the document type definition. XML also disallows markup minimization, requires that empty elements be self-identifying, and omits several other complex SGML features.
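To make the point about DTD-free parsing concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The memo document, its tag names, and the helper functions outline and is_well_formed are invented for illustration; no document type definition is supplied anywhere.

```python
import xml.etree.ElementTree as ET

# A hypothetical document with an invented tag set; no DTD is supplied.
# <pagebreak/> is an empty element that identifies itself as empty.
DOC = """<memo priority="high">
  <to>Web team</to>
  <body>Ship it<pagebreak/>today</body>
</memo>"""


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and its descendants, in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)
tree_outline = outline(root)
for depth, tag in tree_outline:
    print("  " * depth + tag)
```

The parser recovers the hierarchy from the tags alone. A document with a missing end tag, such as "<memo><to>Web team</memo>", is rejected outright instead of being repaired by guesswork.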
XML and HTML
XML is not a replacement for HTML, either: HTML is a useful tool for storing and exchanging small hypermedia documents across the Internet. Furthermore, it is easy to generate HTML documents on the fly from XML (or SGML) documents. XML is designed to complement HTML by enabling different kinds of data to be exchanged over the Web.
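As a rough illustration of generating HTML on the fly from XML, the following Python sketch turns a small catalog into an HTML list. The catalog vocabulary (catalog, book, title, price), the placeholder ISBNs, and the function name catalog_to_html are all assumptions made up for this example; a real site would more likely use a stylesheet or templating tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source data in an invented catalog vocabulary.
CATALOG = """<catalog>
  <book isbn="0-00-000000-0"><title>XML &amp; You</title><price>19.95</price></book>
  <book isbn="0-00-000000-1"><title>SGML Basics</title><price>29.50</price></book>
</catalog>"""


def catalog_to_html(xml_text):
    """Render each <book> in the catalog as an HTML list item."""
    root = ET.fromstring(xml_text)
    items = []
    for book in root.findall("book"):
        # The parser decodes entities, so text is re-escaped for HTML output.
        title = escape(book.findtext("title", default=""))
        price = escape(book.findtext("price", default=""))
        items.append(f"<li>{title} ({price})</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


html_page = catalog_to_html(CATALOG)
print(html_page)
```

The structured data stays in XML, and HTML becomes just one of several presentations that can be derived from it.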
For example, current limitations in World Wide Web technologies do not allow the extensibility, structure, and data checking necessary for large-scale commercial Web publishing. Jon Bosak's excellent paper "XML, Java and the Future of the Web" explains how XML can enable advanced Web applications, allowing Java applets to embed powerful, automatable data manipulation facilities directly into Web clients.
Unlike HTML, which has a fixed (though ever-changing) set of tags, XML lets you define your own tags and attributes. Support for XML by the Internet community would open up vast new possibilities for Internet publishing: instead of shoehorning all documents into HTML, or having to invent a browser to handle non-HTML documents, XML would enable a wide array of documents with user-defined tagsets to be handled by generic Web application software. As Tim Bray pointed out, "[XML allows us to] finally get off the HTML treadmill."
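The idea of generic software handling user-defined tag sets can be sketched in a few lines of Python. The recipe and invoice vocabularies below are invented, and tag_census is a hypothetical helper; the point is only that the same code processes both documents without knowing either tag set in advance.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_census(xml_text):
    """Count how often each element name occurs, whatever the tag set is."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


# Two hypothetical documents using unrelated, user-defined tag sets.
RECIPE = (
    "<recipe serves='4'>"
    "<ingredient>flour</ingredient>"
    "<ingredient>water</ingredient>"
    "<step>Mix.</step>"
    "</recipe>"
)
INVOICE = (
    "<invoice>"
    "<line qty='2'>widget</line>"
    "<total currency='USD'>10.00</total>"
    "</invoice>"
)

recipe_counts = tag_census(RECIPE)
invoice_counts = tag_census(INVOICE)
print(dict(recipe_counts))
print(dict(invoice_counts))
```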
XML and Java
Presently, an author can create rich documents with an application, and then use a Java applet viewer to attach those documents to Web pages. As long as the browsers continue to provide only crude formatting, such measures are unfortunate but inevitable, much in the same way people use desktop publishing applications to get better typography than can be done with off-the-shelf word processors.
But there is no reason why the concept of a "basic Web page" needs to be limited to a single tag set! The appeal of the Web is its hypertext scheme, which provides a simple, unambiguous method of pointing to files with unique names. Although it is handy that HTML is also simple, the success of word processors has demonstrated that consumers can cope with multiple document types.
When XML becomes more widespread, Web authoring tools will become much more flexible in handling basic document constructs. WordPerfect and Word will export directly into XML, using the style names as tags instead of filtering everything into HTML's 90 (or however many currently exist) predefined tags.
In such a brave new World Wide Web, Java's role will be to do interesting things with the content, such as mediation between formats, computation and event handling, automation of tasks and dynamic content, presentation of different views to different viewers, and even intelligent filtering of content. As XML specification co-editor Tim Bray succinctly put it, "XML gives Java something to chew on."
Learning More about XML
Bert Bos's "simple XML" page is a good place to find a few examples of XML files. He also describes the XML data model, which represents the information content of an XML document as the linearization of a tree structure with several character strings at each node of the tree.
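The tree-and-linearization view can be tried out with a short Python sketch. It uses the element tree model of Python's standard library, which is similar in spirit to, but not necessarily identical with, the data model Bos describes; the note vocabulary is invented for the example.

```python
import xml.etree.ElementTree as ET

# Build a small tree in memory: each node carries a name plus character strings
# (here an attribute value and text content).
note = ET.Element("note", {"lang": "en"})
heading = ET.SubElement(note, "heading")
heading.text = "Reminder"
body = ET.SubElement(note, "body")
body.text = "Read the XML spec"

# Linearize the tree into a flat character string...
linear = ET.tostring(note, encoding="unicode")
print(linear)

# ...and parse the string back to recover the same structure.
reparsed = ET.fromstring(linear)
children = [(child.tag, child.text) for child in reparsed]
print(children)
```

Serializing and reparsing yields the same names and strings, which is what it means for the text form to be a linearization of the tree.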
For answers to frequently asked questions about XML, see the overviews at Textuality and University College Cork.