Copyright © 2000 International Business Machines.
The omission of [NEL], the newline character defined in Unicode 3.0, from the End-of-Line Handling section in the XML 1.0 specification causes significant difficulty when processing XML documents and DTDs in IBM mainframe systems. Problem areas include:
XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.
We urge the W3C to include [NEL] as a legal line ending in XML, and hence as a legal white space character, in accordance with Unicode 3.0.
This document is a submission to the World Wide Web Consortium from IBM (see Submission Request, W3C Staff Comment). For a full list of all acknowledged Submissions, please see Acknowledged Submissions to W3C.
This document is a Note made available by W3C for discussion only. This work does not imply endorsement by, or the consensus of the W3C membership, nor that W3C has, is, or will be allocating any resources to the issues addressed by the Note. This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.
A list of current W3C technical documents can be found at the Technical Reports page.
1. Overview
2. Problem Scenarios
3. Supporting Documentation
4. Suggested Modifications
Appendix A. Line Ending Summary for OS/390
Appendix B. Line Ending Encodings
The Unicode 3.0 newline character [NEL], which corresponds to (#x0085), does not appear in the XML 1.0 list of line ending characters, nor in the list of white space characters. We have observed difficulties processing XML documents with the typical software and tools found on IBM's OS/390 mainframe system because of the omission. In particular:
Often, OS/390 users who adopt XML store XML documents with [NEL] line endings in the file system. Using [NEL] line endings means that documents are processed correctly by native tools, and by FTP when documents are transmitted to other platforms. However, the [NEL] line endings must be transformed to [LF] line endings just before XML documents and DTDs are presented to an XML 1.0 compliant parser. The [LF] line endings must also be transformed back to [NEL] line endings, e.g., after extracting document fragments, so that the documents are processed correctly by native software.
The absence of [NEL] in the list of XML 1.0 white space characters inhibits OS/390 users from using XML 1.0 compliant parsers. XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.
Examples:
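As an illustration of the two points above, the following minimal sketch (not part of the original submission) uses Python's standard xml.etree.ElementTree parser as a stand-in for an XML 1.0 compliant parser. The helper names nel_to_lf, lf_to_nel, and is_well_formed and the sample document are hypothetical. The sketch assumes the native document contains no [LF] characters of its own, so that the round trip is lossless.

```python
import xml.etree.ElementTree as ET

NEL = "\u0085"  # Unicode NEL, written [NEL] in this document
LF = "\n"       # line feed, written [LF]


def nel_to_lf(text: str) -> str:
    """Replace [NEL] line endings with [LF] before handing text to a parser."""
    return text.replace(NEL, LF)


def lf_to_nel(text: str) -> str:
    """Restore [NEL] line endings so that native tools see the expected format."""
    return text.replace(LF, NEL)


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical document: a start tag whose attributes are separated by [NEL].
native_doc = "<doc" + NEL + 'id="1"' + NEL + 'lang="en"/>'

# [NEL] is not white space in XML 1.0, so the tag cannot be tokenized.
print(is_well_formed(native_doc))
# After the transformation the same document is accepted.
print(is_well_formed(nel_to_lf(native_doc)))
# Transforming back recovers the native form (no [LF] in the original).
print(lf_to_nel(nel_to_lf(native_doc)) == native_doc)
```

The need to apply both transformations around every parser call is the burden that the proposed change would remove.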
We urge the W3C to include [NEL] as a legal line ending in XML, and hence as a legal white space character, in accordance with Unicode 3.0. XML processors should treat [NEL] precisely as they treat [LF]. The two-character sequence [CR][NEL] in combination, and any [NEL] alone, i.e., not preceded by [CR], should be normalized into a single [LF].
Scenarios where the problem arises include:
The use of [NEL] as a line ending character on mainframe systems is well documented over a number of years. For example:
A note on terminology: The character referred to as [NEL] in this document and in Unicode Technical Report 13 is referred to as [NL] in OS/390 documentation and in the FTP RFC.
This section provides suggested text for the necessary change to incorporate [NEL] in the Extensible Markup Language (XML) 1.0 (Second Edition) specification. Note that both the Unicode Consortium and the W3C Internationalization Working Group recommend the inclusion of the line separator (#x2028) and paragraph separator (#x2029) as well as [NEL].
In Section 2.3, Common Syntactic Constructs:
S (white space) consists of one or more space characters, carriage returns,
line feeds, or tabs.
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA)+
Change to:
S (white space) consists of one or more space characters, tabs, and linebreak
characters (carriage return, line feed, NEL).
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA | #x85 )+
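The effect of this change to production [3] can be sketched as a pair of regular expressions. This is only an illustration using Python's re module; the names S_XML10, S_PROPOSED, and is_white_space are hypothetical and do not appear in any specification.

```python
import re

# Production [3] as it stands in XML 1.0.
S_XML10 = re.compile(r"[\x20\x09\x0D\x0A]+")
# Production [3] with the proposed addition of NEL (#x85).
S_PROPOSED = re.compile(r"[\x20\x09\x0D\x0A\x85]+")


def is_white_space(text: str, production: re.Pattern) -> bool:
    """Return True if the whole string matches the given S production."""
    return production.fullmatch(text) is not None


sample = " \t\u0085"  # space, tab, NEL

print(is_white_space(sample, S_XML10))     # NEL is rejected by the current rule
print(is_white_space(sample, S_PROPOSED))  # NEL is accepted by the proposed rule
```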
In Section 2.11, End-of-Line Handling:
XML parsed entities are often stored in computer files which, for editing
convenience, are organized into lines. These lines are typically separated by
some combination of the characters carriage-return (#xD) and line-feed
(#xA).
To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains either the literal two-character sequence "#xD#xA" or a standalone literal #xD, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line breaks to #xA on input, before parsing.)
Change to:
XML parsed entities are often stored in computer files which, for editing
convenience, are organized into lines. These lines are typically separated by
line break sequences: carriage-return (#xD), line-feed (#xA),
carriage-return/line-feed (#xD#xA), NEL (#x85), and carriage-return/NEL
(#xD#x85).
To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains any of the above sequences, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line break sequences to #xA on input, before parsing.)
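The normalization described in the proposed text can be sketched as a single substitution applied to the input before parsing. This is a minimal illustration in Python, not a reference implementation; the function name normalize_line_breaks is hypothetical.

```python
import re

# The five line break sequences named in the proposed text. The two-character
# sequences come first so that, for example, #xD#x85 yields one #xA, not two.
_LINE_BREAK = re.compile("\r\n|\r\u0085|\r|\u0085")


def normalize_line_breaks(text: str) -> str:
    """Map every line break sequence to the single character #xA."""
    return _LINE_BREAK.sub("\n", text)


mixed = "a\r\nb\r\u0085c\rd\u0085e\nf"
print(repr(normalize_line_breaks(mixed)))
```

A standalone #xA needs no entry in the pattern, because it is already in the normalized form.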
File Creation Method | Line Ending Generated on OS/390 |
FTP to OS/390 servers | [NEL] |
JDBC/ODBC through DRDA database protocol inserting data from DOS-based systems to OS/390 server | [CR][LF] |
JDBC/ODBC through DRDA database protocol inserting data from UNIX-based systems to OS/390 server | [LF] |
JDBC/ODBC through DRDA database protocol inserting data from local or remote OS/390 to OS/390 server | [NEL] |
vi Editor on UNIX System Services on OS/390 | [NEL] |
C iconv conversion of [LF] on OS/390 | [NEL] |
\n printf output: OS/390 C or Java program | [NEL] |
\n printf output: OS/390 UNIX System Services(*) C program | [NEL] |
(*) "UNIX System Services" is also known as "MVS Open Edition". Note that \n printf output for the same C program on non-OS/390 UNIX produces [LF].
Platform | Line Ending | Unicode |
Apple Macintosh | [CR] | (#x000D) |
UNIX Based Systems | [LF] | (#x000A) |
DOS Based Systems | [CR][LF] | (#x000D)(#x000A) |
OS/390 | [NEL] | (#x0085) |
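As a small illustration of how the encodings in this table might be used, the following Python sketch counts the line endings found in a piece of text, labeled in the notation used above. The names LINE_ENDINGS and count_line_endings are hypothetical. The sketch assumes that only the four conventions in the table are of interest, so a [CR] followed by [NEL] is counted as two separate line endings.

```python
import re
from collections import Counter

# Line ending sequences from the table, keyed to this document's notation.
LINE_ENDINGS = {
    "\r\n": "[CR][LF]",  # DOS-based systems
    "\r": "[CR]",        # Apple Macintosh
    "\n": "[LF]",        # UNIX-based systems
    "\u0085": "[NEL]",   # OS/390
}

# [CR][LF] is tried first so that it is not counted as [CR] plus [LF].
_PATTERN = re.compile("\r\n|\r|\n|\u0085")


def count_line_endings(text: str) -> Counter:
    """Count each line ending convention that occurs in the text."""
    return Counter(LINE_ENDINGS[m.group()] for m in _PATTERN.finditer(text))


sample = "one\u0085two\u0085three\r\nfour\n"
print(count_line_endings(sample))
```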