entity en/decoding different between parser and serializer #421
Closed
@sbresin

Description

According to the XML spec, <, > and & have to be encoded in attributes and text nodes.
In attributes, ' and " additionally have to be encoded.

The XMLSerializer does this encoding according to the spec, except for &apos; in attributes, which is a bug but very easy to fix.

The parser, on the other hand, decodes all 5 entities in both attributes and text nodes.

I have to process XML files in which all 5 entities are also encoded in text nodes. Parsing, modifying and then serializing these files changes every text node that contains &apos; or &quot;.

How to replicate

// test.mjs
import { DOMParser, XMLSerializer } from '@xmldom/xmldom';

const testxml =
`<?xml version="1.0" encoding="UTF-8"?>
<rootel xmlns="http://soap.sforce.com/2006/04/metadata">
    <textnode testattribute="&amp; &lt; &gt; &apos; &quot;">
      &amp;
      &lt;
      &gt;
      &apos;
      &quot;
    </textnode>
</rootel>
`;

const xmldoc = new DOMParser().parseFromString(testxml, 'text/xml');

const serializedXml = new XMLSerializer().serializeToString(xmldoc);

console.log(serializedXml);

outputs this:

<?xml version="1.0" encoding="UTF-8"?>
<rootel xmlns="http://soap.sforce.com/2006/04/metadata">
    <textnode testattribute="&amp; &lt; &gt; ' &quot;">
      &amp;
      &lt;
      &gt;
      '
      "
    </textnode>
</rootel>

Solution

I am happy to open a PR for this, but first wanted to clarify the approach:

  1. Simplest one: change the serializer to encode all 5 entities in both text nodes and attributes
    • It's a very simple two-line change, but it then encodes more characters than the spec requires
  2. OR: change the parser to only decode &amp;, &lt; and &gt; in text nodes (here)
    • should only be limited in XML mode and would need to stay the same for HTML
    • would be spec compliant
    • could be a breaking change for people who rely on all 5 entities being decoded
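
To make option 1 concrete, here is a minimal sketch of an escaping step that encodes all five predefined entities in a single pass. The names `XML_ESCAPES` and `escapeAllPredefinedEntities` are hypothetical and are not part of the xmldom API; the actual serializer code may be structured differently.

```javascript
// Hypothetical helper, not taken from xmldom: maps each character that has a
// predefined XML entity to its entity reference.
const XML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  "'": '&apos;',
  '"': '&quot;',
};

function escapeAllPredefinedEntities(value) {
  // A single regex pass, so an '&' produced by one replacement is never
  // escaped a second time.
  return String(value).replace(/[&<>'"]/g, (ch) => XML_ESCAPES[ch]);
}

// Applied to the decoded values from the reproduction above:
console.log(escapeAllPredefinedEntities(`& < > ' "`));
// prints: &amp; &lt; &gt; &apos; &quot;
```

Using the same function for text nodes and attribute values would make the round trip stable for the test document, at the cost of encoding ' and " in text nodes where the spec does not require it.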
