Description
Context
We use Saxy to parse XML files from several APIs.
Changes were introduced to emit whitespace correctly, in connection with [this issue and its pull request](#51).
Concerns
Our first concern is a reduction in performance with the latest version of Saxy (v1.4.0) compared to a fork based on v0.9.1.
We did some digging to check whether this was caused by changes on our end or by a performance regression in this library.
We observed that the SimpleForm data generated from a pretty-printed XML file with the latest version of Saxy is double the size of the original. This is very likely caused by, or at least correlated with, the emitted / parsed whitespace.
Benchmarking the SimpleForm results from both versions, we found that the parsed / emitted whitespace causes a small regression: about 1.03x slower, with a 1.19x higher reduction count.
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.13.4
Erlang 25.0.4
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 5 s
parallel: 5
inputs: none specified
Estimated total run time: 24 s
Benchmarking new_saxy_version ...
Benchmarking old_saxy_version ...

| Name | ips | average | deviation | median | 99th % |
| --- | --- | --- | --- | --- | --- |
| old_saxy_version | 4.33 | 231.03 ms | ±3.94% | 229.68 ms | 257.67 ms |
| new_saxy_version | 4.22 | 237.09 ms | ±4.67% | 234.80 ms | 265.52 ms |

Comparison:
old_saxy_version 4.33
new_saxy_version 4.22 - 1.03x slower +6.05 ms
Reduction count statistics:

| Name | average | deviation | median | 99th % |
| --- | --- | --- | --- | --- |
| old_saxy_version | 20.86 M | ±0.04% | 20.86 M | 20.87 M |
| new_saxy_version | 24.84 M | ±0.03% | 24.84 M | 24.85 M |

Comparison:
old_saxy_version 20.86 M
new_saxy_version 24.84 M - 1.19x reduction count +3.98 M
The second concern is about whitespace values within tags. Previously, XML in the shape of:
`<Body> <Value> </Value> </Body>`
would result in the contents of `Value` being parsed as `nil`. In the current build, they are parsed as a string containing whitespace.
According to the issue mentioned at the beginning, this should only happen when certain attributes are provided, like so:
`<Body> <Value xml:space="preserve"> </Value> </Body>`
Proposed solutions
- We should only emit whitespace when `xml:space="preserve"` is provided. By default, whitespace should not be emitted.
- Alternatively, an option to ignore whitespace might solve it, so that we can make calls similar to this one:
  `Saxy.parse_stream(stream, Saxy.SimpleForm.Handler, [emit_whitespaces?: false])`
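
To make the requested behaviour concrete, here is a minimal sketch of a post-processing step that could serve as a stopgap on the caller's side. It is not part of Saxy: the module name `SimpleFormWhitespace` and its `strip/1` function are hypothetical. It assumes SimpleForm elements are `{name, attributes, children}` tuples, that attributes are `{name, value}` tuples with binary names, and that the current build yields whitespace-only strings among the children. It drops those strings unless an enclosing element carries `xml:space="preserve"`.

```elixir
defmodule SimpleFormWhitespace do
  @moduledoc """
  Hypothetical helper (not part of Saxy): drops whitespace-only text nodes
  from SimpleForm data unless an enclosing element sets xml:space="preserve".
  """

  # Entry point: the root starts in the default (non-preserving) mode.
  def strip(node), do: strip(node, false)

  defp strip({name, attributes, children}, inherited) do
    keep = preserve?(attributes, inherited)

    children =
      children
      |> Enum.reject(fn child -> not keep and blank?(child) end)
      |> Enum.map(fn child -> strip(child, keep) end)

    {name, attributes, children}
  end

  # Text and any other non-element node is returned unchanged.
  defp strip(other, _inherited), do: other

  # xml:space on the element overrides the value inherited from its parent.
  defp preserve?(attributes, inherited) do
    case List.keyfind(attributes, "xml:space", 0) do
      {_, "preserve"} -> true
      {_, "default"} -> false
      _ -> inherited
    end
  end

  # True for binaries made up only of XML whitespace (space, tab, CR, LF).
  defp blank?(<<char, rest::binary>>) when char in [?\s, ?\t, ?\n, ?\r], do: blank?(rest)
  defp blank?(<<>>), do: true
  defp blank?(_other), do: false
end
```

With the shapes assumed above, `SimpleFormWhitespace.strip({"Body", [], [" ", {"Value", [], [" "]}, " "]})` returns `{"Body", [], [{"Value", [], []}]}`, while a `Value` element carrying `{"xml:space", "preserve"}` keeps its `" "` child. This only trims the resulting data after parsing, so it does not recover the extra reductions spent inside the parser, and it also removes whitespace-only text nodes in mixed content, which may not be wanted everywhere.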