[Mirrored from: http://www.ucc.ie/xml/, May 08, 1997; text only version]
Maintained on behalf of the W3C SGML Working Group by Peter Flynn (Silmaril Consultants), with the collaboration of Terry Allen, Tom Borgman (Harlequin Ltd), Tim Bray (Textuality, Inc), Robin Cover (Summer Institute of Linguistics), Christopher Maden (Inso Corp), Eve Maler (Arbortext, Inc), Peter Murray-Rust (Nottingham University), Liam Quin (SoftQuad Inc), Michael Sperberg-McQueen (University of Illinois at Chicago), Joel Weber (MIT), and many other members of the SGML Working Group of the W3C as well as FAQ readers around the world.
Revision history
0.1 (31 January 1997) PF (First draft. Sample questions devised by participants.)
0.2 (3 February 1997) PF (Revised draft. Additional questions and answers.)
0.3 (17 February 1997) PF (Extensive revision following comments from the group. Changes to markup and organization.)
0.4 (23 February 1997) PF (Minor editorial changes)
0.5 (1 April 1997) PF (Added Multidoc Pro as SGML browser; question on XML math; fixed ambiguity in explanation of NETs; added JUMBO; ERB changes of March 26; more details of linking and tools; adding element declaration minimization to the forbidden list.)
1.0 (1 May 1997) PF (Added reference to ToC and printed URLs; added disclaimer at A6; combined old A11 with A5 to explain SGML/XML/HTML; clarified explanation of XML not replacing HTML at C1; added new course and conference at (new) A11; clarified B1, C4, C8; added FPI server at C12; removed examples in C13;)
Paragraphs which have been added since the last version are shown in magenta and prefixed with a pilcrow (¶). Paragraphs which have been changed since the last version are also shown in magenta but are prefixed with a section sign (§). Paragraphs marked for future deletion but retained at the moment for information are shown in pale gray and are prefixed with a plus/minus sign (±).
Summary
This document contains the most frequently-asked questions (with answers) about XML, the Extensible Markup Language. It is intended as a guide for users, developers, and the interested reader, and should not be regarded as a part of the XML Draft Specification.
Organization
The FAQ is divided into four parts: a) General, b) User, c) Author, and d) Developer. The questions are numbered independently within each section. As the numbering may therefore change with each version, comments and suggestions should refer to the version number (see Revision History above) as well as the part and question number.
There is a form at the end of this document which you can use to submit bug reports, suggestions for improvement, and other comments relating to this FAQ only. Comments about the XML Draft Specification itself should be sent to the W3C.
Availability
The SGML file for use with any conforming system is available at http://www.ucc.ie/xml/faq.sgml (this can also be used online with SGML browsers like Panorama or Multidoc Pro; you can also download the DTD and stylesheet installation self-extractor for faster local access with these browsers, or the DTD set as ASCII files).
The same text is available in an HTML version for use with an HTML browser (eg Netscape Navigator, Microsoft Internet Explorer, Spry Mosaic, NCSA Mosaic, Lynx, Opera, etc) at http://www.ucc.ie/xml/.
A plaintext (ASCII) version is available from the Web and (eventually) by anonymous FTP to one of several FAQ repositories. The versions above are also available by electronic mail to the WebMail server (for users with email-only access).
For printed copies there are PostScript™ versions for A4 and Letter sizes of paper. The Table of Contents with all the hypertext (Web) link URLs listed is available as separate files (A4 and Letter sizes).
The document is also available in oil-based ink on flattened dead trees by sending $10 (or equivalent) to the editor (email first to check currency and postal address).
¶ Murata Makoto has kindly offered to make this document available in Japanese: watch for an announcement.
You can download the XML logo and an icon for your files in ICO or XBM format.
A.5 Aren't XML, SGML, and HTML all the same thing?
A.6 Who is responsible for XML?
A.7 Why is XML such an important development?
A.8 How does XML make SGML simpler and still let you define your own document types?
A.9 Why not just carry on extending HTML?
A.10 Why do we need all this SGML stuff? Why not just use Word or Notes?
A.11 Where do I find more about XML?
A.12 Where can I discuss implementation and development of XML?
B.1 Do I have to do anything to use XML?
B.2 What can XML offer that current Web technology can't?
C.2 What does an XML document look like inside?
C.3 If I can define it all myself, how does XML know what it all means?
C.4 Is an HTML document legal XML?
C.5 Can I create my own XML documents without explicitly defining a document type?
C.6 Can I prettyprint my XML documents (hand-format my source)?
C.7 Which parts of an XML document are case-sensitive?
C.8 How can I make my existing HTML files work in XML?
C.9 If XML is just a `subset' of SGML, can I use XML files directly with SGML tools?
C.10 I'm used to authoring and serving HTML. Can I learn XML easily?
C.11 Will XML be able to use non-Latin characters?
C.12 What's a Document Type Definition (DTD) and where do I get one?
C.13 How will XML affect my document links?
D.2 What's this difference between `valid' and `well-formed'?
D.3 What are all these Processing Instructions at the top of XML files?
D.4 What else has changed between SGML and XML?
D.5 Do I have to change any of my server software to work with XML?
D.6 Can I still use server-side INCLUDEs?
D.7 Can I (and my authors) still use client-side INCLUDEs?
D.8 I'm trying to understand the XML Spec: why does SGML (and XML) have such difficult terminology?
XML stands for `Extensible Markup Language' (extensible because it is not a fixed format like HTML). It is designed to enable the use of SGML on the World-Wide Web.
A regular markup language defines what you can do (or what you have done) in the way of describing information for a fixed class of documents (like HTML). XML goes beyond this and allows you to define your own customized markup language. It can do this because it's an application profile of SGML. XML is thus a metalanguage, a language for describing languages.
XML is intended to be a standard `to make it easy and straightforward to use SGML on the Web: easy to define document types, easy to author and manage SGML-defined documents, and easy to transmit and share them across the Web.'
It defines `an extremely simple dialect of SGML which is completely described in the Draft XML Specification (DXS). The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.'
`For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML' [quotes from the DXS].
SGML is the Standard Generalized Markup Language (ISO 8879), the international standard system for defining, identifying and using the structure and content of documents.
HTML is the HyperText Markup Language (RFC 1866), a specific application of SGML used in the World-Wide Web.
Not quite. SGML is the `mother tongue', used for describing thousands of different types of document.
HTML is just one document type, used in the Web. It defines a fixed type of document with markup to let you describe a common class of simple office-style report, with headings, paragraphs, lists, illustrations, etc, and some provision for hypertext and multimedia.
XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits some more complex and some less-used parts in return for the benefits of being easier to write applications for, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files can be parsed and validated the same as any other SGML file (see the question on XML software).
Programmers may find it useful to think of XML as being SGML-- rather than HTML++.
XML is a project of the World-Wide Web Consortium (W3C), and the development of the specification is being supervised by their SGML Editorial Review Board (ERB). The work of definition and specification is being done by a Working Group appointed by the ERB, with co-opted contributors and experts from various fields.
XML is a public format: it is not a proprietary development of any company.
It removes two constraints which are holding back Web developments: dependence on a single document type (HTML), and the complexity of full SGML, the generic system on which HTML is based, whose syntax allows many powerful but complex options.
XML simplifies the levels of optionality in SGML, and allows the development of user-defined document types on the Web.
XML redefines some of SGML's internal qualities and quantities, and removes a large number of the more complex and sometimes less-used features which made it harder to write processing programs (see list).
It also introduces a new class of document which does not require the formal declaration of a predefined document type. See the questions about document type declarations, `valid' vs `well-formed' documents, and how to define your own document types in the Developers' Section.
HTML is already overburdened with dozens of interesting but often incompatible inventions from different manufacturers, because it provides only one way of describing your information.
XML will allow groups of people or organizations to create their own customized markup languages for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, linguistics, knitting, history, engineering, rabbit-keeping etc).
HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure.
Information on a network which connects many different types of computer has to be usable on all of them. Public information cannot afford to be restricted to one make or model or manufacturer, or to cede control of its data format to private hands. It is also helpful for such information to be in a form that can be reused in many different ways, as this can minimize wasted time and effort.
SGML is the international standard which is used for defining this kind of application, but those who need an alternative based on different software are entirely free to implement similar services using such a system, especially if it is for non-public use.
Online, there's the XML Draft Specification available from the W3C; a brief summary of XML with an extensive list of online reference material in Robin Cover's SGML pages; and a summary and condensed FAQ from Tim Bray.
± The GCA and SGML Open are conducting a three-day conference on The new publishing business case: XML, SGML, and the Internet in San Diego, California, on 10-12 March 1997. Further details from Julie Morrison Desmond or from the GCA's Web site.
± Technology Appraisals Ltd are holding a three-day conference in London, England, on XML and network publishing technologies, XML ready for prime time?, and The XML Technology Bootstrap on 21-23 April 1997. They are also running a three-day seminar on business on the Internet and the Web, including material on XML, on 27 April-1 May (also in London). Details from David Hitchcock at TAL.
± The Sixth International World Wide Web Conference (WWW '97) is being held on 7-12 April 1997 in Santa Clara, California.
± The Irish SGML Users Group are holding a meeting entitled XML and the future of the World Wide Web presented by Jon Bosak (Sun Microsystems USA) on Wednesday April 30, 2.30-5pm in the Maxwell Theatre, Hamilton Building, Westland Row, Trinity College Dublin.
¶ Peter Murray-Rust is preparing an XML/Java Virtual Course entitled Scientific Information Components using Java and XML. Details are at http://www.vsms.nottingham.ac.uk/vsms/java/advert/advert.txt. The XML will be very low-level (ie well-formed, only balanced tags, and quoted attributes; no DTDs, entities, marked sections, catalogs, links, etc). It concentrates on building element trees (including those from legacy files).
¶ The annual SGML Conference run by the Graphic Communications Association has been renamed the SGML/XML Conference. SGML/XML '97 will be held in Washington DC. The date is to be confirmed, provisionally the week of the 11th December 1997 (further details on the GCA's Web site).
¶ There are some articles on XML beginning to appear in the computing press, for example XML will take the Web to the next level and Designing Web sites for non-human audiences by Eamonn Sullivan in PC Week (April 28, 1997 v14 n17 pp 38 & 46).
There is a mailing list called xml-dev for those committed to developing components for XML. You can subscribe by sending a 1-line mail message to majordomo@ic.ac.uk saying:
subscribe xml-dev yourname@yoursite
The list is hypermailed for online reference at http://www.lists.ic.ac.uk/hypermail/xml-dev/.
Note that this list is for those people actively involved in developing resources for XML. It is not for general information about XML (see this FAQ and other sources) or for general discussion about SGML implementation and resources (see comp.text.sgml).
Not yet. XML is still being developed, but there are already some pilot browsers, so you can experiment with them. When the specification is more complete, more browsers should start to appear, and you may be able to download them and use them to browse the Web much as you do with current software now.
¶ You can use such browsers to look at some of the emerging XML material, such as Jon Bosak's Shakespeare plays and the molecular experiments of the Chemical Markup Language.
If you want to start preparations for writing your own XML, see the questions in the Authors' Section.
Because authors and providers can design their own document types using XML, browser presentation will be able to benefit from greatly improved facilities, both for graphical display and for performance.
Document types can be explicitly tailored to an audience, so the cumbersome fudging that has to take place with HTML to achieve special effects should become a thing of the past: authors and designers will be free to invent their own markup elements.
Information content can be richer and easier to use, because the hypertext linking abilities of XML are much greater than those of HTML.
Because XML removes many of the underlying complexities of SGML in favor of a more flexible model, writing programs to handle XML will be much easier than doing the same for full SGML.
Information will be more accessible and reusable, because the more flexible markup of XML can be utilized by any XML software instead of being partly restricted to specific manufacturers as has become the case with HTML.
Valid XML files remain fully conformant SGML, so they can be used outside the Web as well, in any normal SGML environment.
There are already some browsers emerging (see below), but the XML specification is still under development. As with HTML, there won't be just one browser, but many. However, because the potential number of different XML applications is not limited, no single browser should be expected to handle 100% of everything.
Expect to see the generic parts of XML (eg parsing, tree management, searching, formatting, and the use of architectural forms) combined into a general-purpose browser library or toolkit to make it easier for developers to take a consistent line when writing XML applications. Such applications could then be customized by adding semantics for specific markets, or using languages like Java to develop plugins for generic browsers and have the specialist modules delivered transparently over the Web.
JUMBO is a prototype GUI browser/editor/search/rendering tool for the output of XML parsers. It displays the abstract document tree which can be queried and edited in limited fashion. Java classes can be dynamically loaded for the current DTD and allow complex transformation and rendering. The emphasis is on the import of legacy files into structured documents, and the management of non-textual data, including common data structures (trees, tables, lists, etc). Currently JUMBO parses a subset of XML files (ie only elements and their attributes) and will be grafted onto other parsers as soon as possible. The software and a wide range of XML demo files, including Jon Bosak's PLAY, can be downloaded for any Java-enabled browser from http://www.venus.co.uk/omf/cml/
§ The DynaWeb server from Inso Corporation (formerly EBT) can serve other forms of SGML translated on-the-fly to XML (demonstrated at the GCA's XML Conference in San Diego, March 1997). Sun Microsystems are currently serving XML using this software on an experimental basis (message to the xml-dev mailing list dated Mon, 17 Mar 1997 16:49:42 -0800 from Jon Bosak).
¶ Microsoft are defining their proposed Channel Definition Format (CDF) as an application of XML.
No, existing SGML and HTML applications software will continue to work with existing files. But as with any enhanced facility, if you want to view or download and use XML files, you will need to use XML-aware software.
Authors should also read the Developers' Section, which contains further information about the internals of XML files.
No, XML itself does not replace HTML: instead, it provides an alternative by allowing you to define your own set of markup elements. HTML is expected to remain in common use for some time to come.
You could of course also use XML to redefine all or part of HTML itself for your own use, as you could with any Document Type Definition. XML is designed to make the writing of DTDs much simpler than is the case with full SGML.
XML documents can be very simple, with no formal document type declaration, and straightforward nested markup of your own design:
<?XML VERSION="1.0" RMD="NONE"?>
<conversation>
<greeting>Hello, world!</greeting>
<response>Stop the planet, I want to get off!</response>
</conversation>
Or they can be more complicated, with a DTD specified, and maybe a local subset, and a more complex structure:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
<!doctype titlepage system "typo.dtd"
[<!entity % active.links "INCLUDE">]>
<titlepage>
<whitespace type="vertical" amount="36"/>
<title font="Baskerville" size="24/30" alignment="centered">Hello, world!</title>
<whitespace type="vertical" amount="12"/>
<!--* In some copies the following decoration is hand-colored, presumably by the author *-->
<image location="http://www.foo.bar/fleuron.eps" type="URL" alignment="centered"/>
<whitespace type="vertical" amount="24"/>
<author font="Baskerville" size="18/22" style="italic">Munde Salutem</author>
</titlepage>
Or they can be anywhere between: a lot will depend on how you want to define your document type and what it will be used for. See the question on valid and well-formed files.
You can use a stylesheet, like you can with HTML, to define the appearance, but the way in which an XML processor recognizes your markup depends on whether your file is valid or well-formed.
In both cases there are some simple rules to follow, and you may need to use a Document Type Definition (DTD) which will both guide you in creating the document and guide the user's software in reading it. If there isn't a suitable DTD in existence for your type of document, you can write one of your own.
Unlike some current ways of using HTML, you can't just make it up as you go along and hope that any old tag will do: for software to make sense of it, you need to follow a pattern or plan. In the case of a valid file, this is the DTD; in the case of a well-formed file, the structure has to be implicit in your markup.
§ Yes, almost. There are two ways an HTML document can be made legal XML: it can either be valid or it can be well-formed. Many HTML authoring tools already produce almost (but not quite) well-formed XML document instances.
The differences are small but significant (see the question on XML document classes). Existing HTML browsers are tolerant of invalid markup, so they will probably display XML files which use an XML version of an HTML DTD, or which are simply well-formed HTML, even though there are some slight differences. See the question on how to make existing HTML files work in XML.
Yes, this is what the `well-formed' document class is there for:
<?XML VERSION="1.0" RMD="NONE"?>
<FAQ>
<Q><IMAGE XML-LINK="ques.gif"/>Can I create my own XML documents without a DTD?</Q>
<A>Yes. This is an example of a well-formed document, which can be parsed by any XML-compliant parser. However, it won't know how to display it unless you supply a stylesheet.</A>
</FAQ>
§ Properly balanced, nested elements; all start-tags and end-tags always present for elements which contain text data; a trailing slash on EMPTY elements (those containing no text data).
Yes, if they are unambiguous. XML processors will realize that in some applications precise white-space is critical, but in others that they need to collapse it: there are different rules for white-space handling in valid and well-formed documents, as well as ways to affect it.
<chapter>
  <section>
    <title>
      My title for Section 1.
    </title>
    <para>
      ...
    </para>
  </section>
</chapter>
In other words, do you really want those linebreaks and spaces before, after, and in the title, or are they just there to make it easier to edit or because it was machine-generated? Should it say <title>My title for Section 1.</title> with no linebreaks or space? There are requirements and recommendations for applications on the detection and handling of white-space in the Draft XML Specification. XML instances may not rely on whitespace being ignored in element content.
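As a rough illustration of the choice an application faces, the following Python sketch collapses the leading, trailing, and repeated white-space around a title. It is not taken from the Draft XML Specification, and the function name collapse_whitespace is made up for this example; whether collapsing is appropriate at all depends on the document type.

```python
def collapse_whitespace(text):
    """Strip leading/trailing white-space and squeeze internal runs to one space."""
    # str.split() with no argument splits on any run of white-space
    # (spaces, tabs, linebreaks) and drops empty strings at both ends.
    return " ".join(text.split())


# A title as it might arrive from a hand-formatted or machine-generated file.
raw_title = "\n      My title for Section 1.\n    "
print(repr(collapse_whitespace(raw_title)))
```

An application for which precise white-space is critical would simply not call such a function on that element's content.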
§ The same as for HTML and many other SGML document types.
Element names (start-tags and end-tags) are case-insensitive: you can use upper- or lower-case (or even <MiXeD>);
Attribute names are also case-insensitive (NUM="7" val="7");
Attribute values, however, may be case-sensitive, depending on context: you can specify which in a Document Type Definition (HRef="MyFile.SGML");
All entity names (&Aacute;), and your data content (the text), are case-sensitive.
XML defines two levels of conformance, valid and well-formed.
If your file already conforms to one of the HTML Document Type Definitions (DTDs), then it may be close to being valid. Three things need changing:
You need to include the Document Type Declaration:
<!DOCTYPE HTML SYSTEM "http://www.foo.com/myfiles/html3x.dtd">
There is an optional XML Declaration with a Required Markup Declaration which may precede the Document Type Declaration.
Any DTD you reference must be an XML version, as must any other entities the DTD refers to (eg files of character entities like ISO Latin-1). They must all be accessible either through the network or from the user's local disk (eg by supplying a URL or filename for each in their SYSTEM identifiers).
The file itself must be well-formed (see below).
If your file does not conform to any of the available DTDs, then you can make it well-formed. You must make sure it follows the rules for well-formed files, by editing the file and making the necessary changes. Then place an XML Declaration containing a Required Markup Declaration at the top:
<?XML VERSION="1.0" RMD="NONE"?>
<HTML><HEAD><TITLE>Test file</TITLE></HEAD>
<BODY><BLINK>Test text <IMG SRC="foo.gif" alt="A foo"/></BLINK>
</BODY></HTML>
§ This lets you omit the DTD so long as the file is well-formed. Details of validity and well-formedness are in the Developers' and Implementors' section.
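One of the necessary changes, adding the trailing slash to EMPTY elements, can be sketched in Python as below. This is only an illustration under stated assumptions: the function name close_empty_elements and the short list of element names are chosen for this example, and the pattern does not attempt the other fixes (quoting attribute values, supplying omitted end-tags) that a real conversion would need.

```python
import re

# Assumption: only these HTML elements are treated as EMPTY in this sketch.
EMPTY_NAMES = ("IMG", "HR", "BR")

_EMPTY_TAG = re.compile(
    r"<(" + "|".join(EMPTY_NAMES) + r")(?=[\s/>])"   # the element name
    r"((?:[^<>\"']|\"[^\"]*\"|'[^']*')*?)"           # attributes, quoted values kept whole
    r"/?>",                                          # existing slash (if any) and the close
    re.IGNORECASE,
)


def close_empty_elements(html):
    """Rewrite <BR> as <BR/>, <IMG ...> as <IMG .../>, and so on."""
    return _EMPTY_TAG.sub(r"<\1\2/>", html)


before = '<BODY>Test text<BR><IMG SRC="foo.gif" alt="A foo"></BODY>'
print(close_empty_elements(before))
```

Tags which already end with a slash are matched too and written back unchanged, so running the function twice is harmless.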
Yes, provided: a) the document has a valid Document Type Definition (DTD) which you can use; and b) the files are valid, not just well-formed.
But at the moment there are few tools which handle XML files unchanged because of the format of EMPTY elements. This is expected to change soon.
Yes, but at the moment there is still a need for tutorials, simple tools, and more examples of XML documents. Well-formed XML documents may look similar to HTML except for some small but very important points of syntax.
As every user community can have their own document type defined, it should be much easier to learn, because element names can be picked for relevance.
Yes, the XML Draft Specification explicitly makes reference to ISO 10646 (Unicode) and says that users `may extend the ISO 10646 character repertoire, in the rare cases where this is necessary, by making use of the private use areas[ . . . ]all XML processors must accept the UTF-8 and UCS-2 encodings of 10646; the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are[ . . . ]in the discussion of character encodings. Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string': &#dddd; or &#Xhhhh; respectively [from the DXS]. All such numeric character references must be to ISO 10646 (Unicode).
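The two reference forms and the two required encodings can be tried out with a few lines of Python. This is a sketch, not part of the DXS: the function names are invented here, and Python's "utf-16-be" codec is used as a stand-in for big-endian UCS-2, which gives the same bytes for characters that fit in 16 bits.

```python
import re


def decimal_ref(ch):
    """Return the &#dddd; form for one character."""
    return f"&#{ord(ch)};"


def hex_ref(ch):
    """Return the &#Xhhhh; form for one character."""
    return f"&#X{ord(ch):X};"


# Matches either a hexadecimal (&#X...;) or a decimal (&#...;) reference.
_REF = re.compile(r"&#(?:[Xx]([0-9A-Fa-f]+)|([0-9]+));")


def expand_refs(text):
    """Replace every numeric character reference with the character it names."""
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _REF.sub(_sub, text)


ch = "\u00c1"  # LATIN CAPITAL LETTER A WITH ACUTE
print(decimal_ref(ch), hex_ref(ch))
print(ch.encode("utf-8"), ch.encode("utf-16-be"))
```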
A DTD is usually a file (or several files together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur (for example, <ITEM> might only be meaningful inside <LIST>), and how they all fit together. It lets processors parse a document and identify where each element comes, so that stylesheets, navigators, search engines, and other applications can be used.
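To give a feel for the kind of rule a DTD expresses, here is a deliberately tiny Python sketch of the <ITEM>-inside-<LIST> example. It is an assumption-laden toy, not DTD syntax: the table ALLOWED_CHILDREN, the tuple representation of elements, and the function name find_violations are all invented for illustration, and a real DTD also controls order, repetition, and attributes.

```python
# Each element name maps to the set of element names allowed directly inside it.
ALLOWED_CHILDREN = {
    "LIST": {"ITEM"},
    "ITEM": set(),
}


def find_violations(element, allowed=ALLOWED_CHILDREN):
    """Walk a (name, children) tree and report children found in the wrong place."""
    name, children = element
    if name not in allowed:
        return [f"undeclared element {name}"]
    problems = []
    for child in children:
        child_name = child[0]
        if child_name not in allowed[name]:
            problems.append(f"{child_name} not allowed inside {name}")
        # Check the child's own content as well.
        problems.extend(find_violations(child, allowed))
    return problems


good = ("LIST", [("ITEM", []), ("ITEM", [])])
bad = ("ITEM", [("LIST", [])])
print(find_violations(good))
print(find_violations(bad))
```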
There are thousands of DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely; or you can write your own. As with any language, you need to learn some of it [SGML] to do this: but XML is much simpler, see the list of restrictions which shows what has been cut out.
DTDs specifically for use on the Web may become commonplace, and people in different areas of interest may write their own for their own purposes: this is what XML is for.
¶ Examples of DTDs (note they are still SGML ones, not XML ones) can be retrieved at http://www.ucc.ie/cgi-bin/PUBLIC.
The linking abilities of XML systems are much more powerful than those of HTML. Existing HREF-style links will remain usable, but new linking technology is based on the lessons learned in the development of other standards involving hypertext, such as TEI and HyTime, which will let you manage bidirectional and multi-way links, as well as links to a span of text (within your own or other documents) rather than to a single point. This is already implemented for SGML in browsers like Panorama and Multidoc Pro.
The current proposal is that an XML link can be either a URL or a TEI-style Extended Pointer (`Xptr'), or both. A URL on its own is assumed to be a resource (as with HTML); if an Xptr follows it, it is assumed to be a sub-resource of the URL; an Xptr on its own is assumed to apply to the current document.
An Xptr is always preceded by one of #, ?, or |. The # and ? do the same as in HTML applications; the | means the sub-resource can be found by applying the Xptr to the resource, but the method of doing this is left to the implementation.
TEI Extended Pointer Notation (EPN) is much more powerful than a simple reference to an ID. This sentence, for example, marked as a link, could be referred to within this document as ID(tei-link)CHILD(3), meaning the third object within the element labeled tei-link (this paragraph). Count the objects: a) the link to `TEI Extended Pointer Notation', b) the remainder of the first sentence, and c) the second sentence. If you view this file with Panorama you can click on the highlighted sentence above, which links to the start of this question, and then click on the cross-reference button beside the question title, and it will display the locations in Extended Pointer Notation of all the links to it, including the previous sentence. (Doing this in an HTML browser is not meaningful, as they do not support bidirectional linking or EPN.)
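The proposed link form (a URL, a connector, and an Xptr) can be taken apart with a short Python sketch. Everything here is an assumption for illustration: the function names, the file name doc.xml, and the very simplified reading of an Xptr as a sequence of KEYWORD(argument) steps, which covers ID(tei-link)CHILD(3) but not the full TEI notation.

```python
import re

# The three characters which, under the current proposal, may precede an Xptr.
CONNECTORS = "#?|"


def split_link(link):
    """Return (url, connector, xptr); connector and xptr are None for a bare URL."""
    for i, ch in enumerate(link):
        if ch in CONNECTORS:
            # An empty url part means the Xptr applies to the current document.
            return link[:i], ch, link[i + 1:]
    return link, None, None


# Simplified: one step is a keyword followed by a parenthesized argument.
_STEP = re.compile(r"([A-Za-z]+)\(([^)]*)\)")


def xptr_steps(xptr):
    """Break an Xptr such as ID(tei-link)CHILD(3) into (keyword, argument) pairs."""
    return _STEP.findall(xptr)


url, connector, xptr = split_link("http://www.foo.bar/doc.xml|ID(tei-link)CHILD(3)")
print(url, connector, xptr_steps(xptr))
```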
There are already three XML parsers (written in Java) which can be used to check that your files conform to the Draft XML Specification:
Norbert Mikula's NXP at http://www.edu.uni-klu.ac.at/~nmikula/NXP/
Tim Bray's Lark at http://www.textuality.com/Lark/
Sean Russell's kernel at http://jersey.uoregon.edu/ser/software/XML.tar.gz
See also the question on XML Browsers and the details of the xml-dev mailing list for software developers.
Yes, if the document type you use provides for math. The long-expired HTML3 could be used, or HTML Pro, or ISO 12083 Math, or the developments of the OpenMath or HTML-Math projects, or one of your own making. Browsers to display rudimentary math embedded in SGML already exist (eg Panorama, Multidoc Pro), and the mathematics-using communities may develop their own software for XML.
The sophistication could vary from math expressions, through simple inline equations, to display equations [example equations omitted in this text-only version].
(If you are using an HTML browser to read this, the above equations may not be rendered correctly unless you have a math plugin for Netscape like IBM's TechExplorer which reads the embedded TeX equivalent [use the source, Luke!].)
Right here (http://www.w3.org/pub/WWW/TR/). Includes the EBNF. There's also a version in Japanese.
Valid XML files are those which have a Document Type Definition (DTD) like all other SGML applications, and which adhere to it. They must also be well-formed (see below).
A valid file begins like any normal SGML file with a Document Type Declaration, but may have an optional XML Declaration prepended:
<?XML VERSION="1.0"?>
<!doctype foo system "http://www.foo.org/bar.dtd">
<foo>
<bar>...<blort/>...</bar>
</foo>
The XML Specification defines an SGML Declaration which is fixed for all XML instances. An XML version of the specified DTD must be accessible to the XML processor, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network (with the SYSTEM identifier set to a URL).
If there is an internal DTD subset, this may be referenced by an `INTERNAL' Required Markup Declaration (RMD) in the XML Declaration:
<?XML VERSION="1.0" RMD="INTERNAL"?>
<!doctype foo [
<!element guff (#PCDATA)>
]>
<foo>
<bar>...<blort/>...</bar>
<guff>...</guff>
</foo>
The default, when no XML Declaration is present, is VERSION="1.0" RMD="ALL" ENCODING="UTF-8".
Well-formed XML files can be used without a DTD, but they must follow some simple rules to enable a browser to parse the file correctly (so that it can apply your stylesheet, enable linking, etc).
the file must start with a Required Markup Declaration, saying that there is no DTD (a different rule applies for valid files which do have a DTD):
<?XML VERSION="1.0" RMD="NONE"?>
<foo>
<bar>...<blort/>...</bar>
</foo>
all tags must be balanced: that is, all elements must have both start- and end-tags present (omission is not allowed, with one exception, see `Empty elements' below)
all attribute values must be in quotes (the single-quote character [the apostrophe] may be used if the value contains a double-quote character, and vice versa)
any EMPTY elements (eg those with no end-tag like HTML's <IMG>, <HR>, and <BR> and others) must either end with `/>' or you have to make them non-EMPTY by adding a real end-tag. Example: <BR> would become either <BR/> or <BR></BR>.
there must not be any markup characters (< or &) in the character data (ie they must be escaped as &lt; and &amp;)
elements must nest inside each other properly (no overlapping markup, same rule as for regular SGML).
¶ Well-formed files may use attributes on any element, but they must all be of type CDATA, as without a DTD there is no way of defining them otherwise.
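The balancing and nesting rules can be checked mechanically with a stack. The Python sketch below is a simplification written for this FAQ's examples, not a conforming XML processor: the name is_balanced is made up, comments are stripped crudely, declarations and processing instructions are skipped, and element names are compared without regard to case as described in the Authors' Section.

```python
import re

_TAG = re.compile(
    r"<(/?)([A-Za-z][A-Za-z0-9._-]*)"        # optional end-tag slash, then the name
    r"((?:[^<>\"']|\"[^\"]*\"|'[^']*')*?)"   # attributes; quoted values may hold / or >
    r"(/?)>"                                 # trailing slash marks an EMPTY element
)


def is_balanced(text):
    """Return True if every start-tag has a matching, properly nested end-tag."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)  # ignore comments
    open_elements = []
    for closing, name, _attrs, empty in _TAG.findall(text):
        name = name.upper()
        if empty:
            continue                      # <BR/> needs no end-tag
        if closing:
            if not open_elements or open_elements.pop() != name:
                return False              # stray or overlapping end-tag
        else:
            open_elements.append(name)
    return not open_elements              # anything left open is an error


print(is_balanced("<foo><bar>...<blort/>...</bar></foo>"))
print(is_balanced("<P>one<P>two"))
```

The first call reports a balanced file; the second fails because the HTML habit of omitting end-tags leaves two elements open.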
Well-formed XML files are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use even without a DTD. Valid XML files must declare them explicitly if they use them. A revised version of the XML Specification will give a precise definition of what that declaration must be.
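As an illustration of the five predefined references (for the less-than sign, greater-than sign, apostrophe, double quote, and ampersand), here is a small Python sketch using the standard library's `xml.sax.saxutils`. The helper names `to_references` and `from_references` and the sample string are assumptions made for the example, not part of any specification.

```python
from xml.sax.saxutils import escape, unescape

# escape() and unescape() handle &, < and > on their own; the two quote
# characters are passed in as extra replacements.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}


def to_references(text):
    """Replace the five special characters with their entity references."""
    return escape(text, QUOTES)


def from_references(text):
    """Turn the five entity references back into literal characters."""
    return unescape(text, UNQUOTES)


raw = 'AT&T says "1 < 2" isn\'t > 3'
encoded = to_references(raw)
print(encoded)
print(from_references(encoded) == raw)
```

The ampersand is replaced first when escaping and restored last when unescaping, so that references already produced are not converted a second time.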
Processing Instructions are SGML's way of adding `what to do' and `how to do it' details to a file. In XML every file can begin with an XML Declaration Processing Instruction which starts with `<?' and the keyword XML, and ends with `?>' (slightly different from plain SGML, which omits the final question-mark).
The XML Declaration must include the version number of XML being followed, and may include a Required Markup Declaration and an Encoding Declaration:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
The XML Declaration is optional, and defaults to the values given here: if other values are needed, the declaration must be included at the top of the file.
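The way the defaults apply can be sketched in a few lines of Python. This is only an illustration of the behaviour described above for the draft-style declaration, not a conforming processor: the function name `read_xml_declaration` and the regular expressions are assumptions made for the example.

```python
import re

# Defaults stated above for a file with no XML Declaration.
DEFAULTS = {"VERSION": "1.0", "RMD": "ALL", "ENCODING": "UTF-8"}

# The declaration is only recognised at the very start of the text.
DECLARATION = re.compile(r"<\?XML\b(.*?)\?>", re.DOTALL)
PSEUDO_ATTRIBUTE = re.compile(r"""(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def read_xml_declaration(text):
    """Return the VERSION, RMD and ENCODING in effect for `text`."""
    settings = dict(DEFAULTS)
    match = DECLARATION.match(text)
    if match is None:
        return settings  # no declaration: every default applies
    found = {}
    for name, double_quoted, single_quoted in PSEUDO_ATTRIBUTE.findall(match.group(1)):
        found[name] = double_quoted or single_quoted
    if "VERSION" not in found:
        raise ValueError("an XML Declaration must state the VERSION")
    settings.update(found)
    return settings


print(read_xml_declaration("<foo/>"))
print(read_xml_declaration('<?XML VERSION="1.0" RMD="NONE"?><foo/>'))
```

A file with no declaration gets all three defaults; a declaration that gives only some values keeps the defaults for the rest.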
The principal changes are in what you can do in writing a Document Type Definition (DTD). To simplify the syntax and make it easier to write processing software, the following markup declaration restrictions have been picked for XML:
No comments (-- . . . --) inside other markup declarations
Comment declarations can't have spaces within the markup of <!-- or -->
Declarations can't jump in and out of comments with -- --
No name groups for declaring multiple elements or element attlists
No CDATA or RCDATA declared content
No exclusions or inclusions on content models
No minimization parameters on element declarations
Mixed content models must be optional-repeatable ORs, with #PCDATA first
No AND (&) content model groups
No NAME[S], NUMBER[S], or NUTOKEN[S] declared values
No #CURRENT or #CONREF declared values
Attribute default values must be quoted
Marked sections can't have spaces within the markup of <![keyword[ or ]]>
No RCDATA, TEMP, IGNORE, or INCLUDE marked sections in instance
Marked sections in instance must have CDATA keyword, not parameter entity
No SDATA, CDATA, or bracketed internal entities
No SUBDOC, CDATA, or SDATA external entities
No public identifiers in entity and notation declarations (this may be changed to permit them at a later stage once a resolution mechanism has been identified)
No data attributes on NOTATIONs or attribute value specifications on ENTITY declarations
No SHORTREF declarations
No USEMAP declarations
No LINKTYPE declarations
No LINK declarations
No USELINK declarations
No IDLINK declarations
No SGML declarations
§ The `double-dash' sequence is illegal in comment text as it is the terminator. As noted above, spaces are not allowed between the angle brackets, the exclamation mark, and the dashes of the comment delimiters: they are only valid within the comment text. The proposal for adding an asterisk to the inside of the comment delimiters has been withdrawn.
As noted in the question on source formatting, instances may not rely on whitespace being ignored in element content.
If you want to use existing SGML DTDs and entity files for XML, they will need to be edited to conform to the above requirements, but this only has to be done once. When the list has been finalized, it is likely that suitably-modified versions of the popular DTDs and character entity sets (eg the `ISO' files like ISOlat1) will be made available for use with XML.
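When editing an existing SGML DTD by hand, a rough scan for some of the listed constructs can help find the places that need attention. The sketch below is a crude pattern-based check in Python, not a DTD parser: it covers only four of the restrictions, can be fooled by comments or parameter entities, and the name `lint_dtd`, the patterns, and the sample DTD are all assumptions made for illustration.

```python
import re

# Each entry pairs a message with a pattern for one restriction from the list.
CHECKS = [
    ("declaration type not allowed in XML",
     re.compile(r"<!(SHORTREF|USEMAP|LINKTYPE|LINK|USELINK|IDLINK|SGML)\b", re.I)),
    ("minimization parameters on element declaration",
     re.compile(r"<!ELEMENT\s+\S+\s+[-O]\s+[-O]\s", re.I)),
    ("name group in element or attlist declaration",
     re.compile(r"<!(ELEMENT|ATTLIST)\s+\(", re.I)),
    ("#CURRENT or #CONREF declared value",
     re.compile(r"#(CURRENT|CONREF)\b", re.I)),
]


def lint_dtd(dtd_text):
    """Return sorted (line number, message) pairs for suspect constructs."""
    findings = []
    for message, pattern in CHECKS:
        for match in pattern.finditer(dtd_text):
            line = dtd_text.count("\n", 0, match.start()) + 1
            findings.append((line, message))
    return sorted(findings)


# A made-up DTD fragment; only the third line passes all four checks.
sample = "\n".join([
    "<!ELEMENT (H1|H2) - - (#PCDATA)>",
    "<!ELEMENT P - O (#PCDATA)>",
    "<!ELEMENT para (#PCDATA)>",
    '<!SHORTREF map1 "&#TAB;" tab>',
    "<!ATTLIST P align CDATA #CURRENT>",
])

for line, message in lint_dtd(sample):
    print(f"line {line}: {message}")
```

A report like this only points at candidate lines; the edits themselves still have to be made and checked by someone who knows the DTD.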
Only to serve up .xml files as the correct MIME type. The XML project is submitting a MIME type of text/xml for approval, so for serving XML documents all that is needed is to edit the mime-types file (or its equivalent) and add the line
text/xml xml XML
However, more sophisticated applications may require HTTP content negotiation to determine what tools the client has for display. Also, since XML is designed to support stylesheets and sophisticated hyperlinking, XML documents may be accompanied by ancillary files such as DTDs, entity files, catalogs, stylesheets, etc, which may need their own MIME entry, and which require placing in the appropriate directories.
If you run scripts generating HTML, which you wish to work with XML, they will need to be modified to produce the relevant document type.
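Where documents are served by a script rather than from a server's mime-types file, the same extension mapping has to be made in code. A minimal sketch in Python, using the standard library's `mimetypes` module; the helper name `content_type_for` and the fallback type are assumptions for the example.

```python
import mimetypes

# A private table, so the interpreter's global mapping is left alone.
table = mimetypes.MimeTypes()

# Mirror the mime-types line above: both spellings of the extension.
table.add_type("text/xml", ".xml")
table.add_type("text/xml", ".XML")


def content_type_for(filename):
    """Return the MIME type to send for `filename`."""
    mime_type, _encoding = table.guess_type(filename)
    return mime_type or "application/octet-stream"


print(content_type_for("faq.xml"))
print(content_type_for("FAQ.XML"))
```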
INCLUDEs?
Yes, so long as what they generate ends up as part of an XML-conformant file (ie either valid or well-formed).
However, some files containing embedded calls to external procedures which get invoked before transmission, such as the NCSA's `special' HTML, need to be checked carefully, to make sure that they do not contain raw markup characters (ie angle brackets and ampersands) which might confuse editors and other processing software. For example,
<!-- #exec cmd="tr '\012' '\040' <foo.bar"-->
which in a .shtml file embeds the contents of foo.bar in the output stream, with all newlines changed to spaces, needs to be written as
<!-- #exec cmd="tr '\012' '\040' &lt;foo.bar"-->
INCLUDEs?
The same rule applies as for server-side INCLUDEs, so you need to ensure that any embedded code which gets passed to a third-party engine (eg SDQL enquiries, Java writes, LiveWire requests, etc) does not contain any characters which might be misinterpreted as XML markup (ie no angle brackets or ampersands): either use a CDATA marked section to avoid your XML application parsing the embedded code, or use the standard &lt;, &gt;, and &amp; character entity references instead.
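The two options can be compared in a short Python sketch. It assumes a present-day XML parser (`xml.etree.ElementTree`) to read the result back; the helper name `as_cdata`, the `script` element, and the embedded code are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_cdata(code):
    """Wrap `code` in a CDATA marked section, splitting any `]]>` inside it."""
    return "<![CDATA[" + code.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Made-up embedded code containing angle brackets, ampersands and `]]>`.
embedded = 'if (a < b && list[index[0]]> 2) write("<b>ok</b>");'

for protected in (as_cdata(embedded), escape(embedded)):
    document = "<script>" + protected + "</script>"
    # Both forms parse, and both give back exactly the original code.
    print(ET.fromstring(document).text == embedded)
```

The CDATA form leaves the code readable in the source file, but the code must not contain the closing delimiter unless it is split as shown; the entity-reference form has no such exception but is harder to read and edit.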
For implementation to succeed, the terminology needs to be precise (for example `element' and `tag' are not synonymous: an element is a whole unit of markup, and may consist of a start-tag alone (as in HTML's <BR>) or a start-tag and an end-tag and the content which goes between them; tags are simply the markers at the start and end of elements). Sloppy terminology causes misunderstandings.
Those new to SGML may want to read something like the Gentle Introduction to SGML chapter of the TEI.
Not yet, although many aspects of development software are being worked on: see the question on XML software.