ARM Neon optimization of oj_dump_cstr
#967
Merged
Overview
This pull request optimizes `oj_dump_cstr` using ARM Neon instructions to locate characters that need to be escaped. It does so by implementing a state machine within the current structure of the code.

Note: It may be nice in the future to refactor this code into two distinct functions: one that searches for characters that need to be escaped and another that does the escaping.
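To make the search step concrete, here is a minimal sketch of scanning 16 bytes at a time for characters that need escaping. It is not the code in this pull request: the helper names `needs_escape` and `find_first_escape` are hypothetical, and the escape set (bytes below 0x20, `"` and `\`) is an assumption based on plain JSON string rules rather than on Oj's dump modes. The Neon path is only compiled on AArch64; elsewhere the scalar loop does all the work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical helper: assumed escape set (control bytes, quote, backslash). */
static int needs_escape(uint8_t c) {
    return c < 0x20 || c == '"' || c == '\\';
}

/* Hypothetical helper: index of the first byte that needs escaping,
 * or len if no byte in s[0..len) needs it. */
static size_t find_first_escape(const uint8_t *s, size_t len) {
    size_t i = 0;
    assert(s != NULL || len == 0);
#ifdef SKETCH_HAVE_NEON
    const uint8x16_t space  = vdupq_n_u8(0x20);
    const uint8x16_t quote  = vdupq_n_u8('"');
    const uint8x16_t bslash = vdupq_n_u8('\\');
    for (; i + 16 <= len; i += 16) {
        uint8x16_t chunk = vld1q_u8(s + i);
        /* Each lane becomes 0xFF where the byte needs escaping, else 0x00. */
        uint8x16_t hit = vorrq_u8(vcltq_u8(chunk, space),
                                  vorrq_u8(vceqq_u8(chunk, quote),
                                           vceqq_u8(chunk, bslash)));
        if (vmaxvq_u8(hit) != 0) {
            break; /* something in this chunk; the scalar loop pinpoints it */
        }
    }
#endif
    /* Scalar tail: also handles the chunk in which Neon reported a hit. */
    for (; i < len; i++) {
        if (needs_escape(s[i])) {
            return i;
        }
    }
    return len;
}
```

A caller could copy `s[0..i)` to the output unchanged, escape `s[i]`, and resume the search at `i + 1`. A high-byte check for Unicode validation would be one more comparison OR-ed into `hit`; it is left out here to keep the sketch short.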
Additionally, I hope this doesn't break any of the Unicode validation. If `do_unicode_validation == true` and this code detects a high byte, it forces all of the characters within that chunk of text through the current code.

Benchmarks
Both benchmarks compare `master` (commit: `4332ae9f81db7caed9411be445a145276c69d4fb`) vs this `arm-neon-optimize-oj-dump-cstr` branch (commit `5e8a65304633d87d0f13ce192ce3404de74bbb5b`).

The benchmarks are a mix of macro benchmarks testing real-world data sets and synthetic "happy path" tests. Real-world tests show a 0-25% speedup with `clang` and a 6-42% speedup with `gcc-14`.

The best-case synthetic test shows a 264% speedup on `clang` and 320% on `gcc`. This test is a 128-byte ASCII string with a single `\n` at the end.

NOTE: Please scroll down to see the notes about synthetic worst-case tests.
compiler: Apple clang version 17.0.0 (clang-1700.0.13.3)
compiler: gcc version 14.2.0 (Homebrew GCC 14.2.0_1)
Synthetic worst case tests
Unfortunately, for escape-heavy workloads this code may be significantly slower. I'm including these tests for transparency; it's not all roses.

I don't know how realistic these are... I would hope these sorts of use cases are quite uncommon. However, I'd like to focus a bit on reducing the performance hit for these types of benchmarks/use cases. One quick idea I had was that if the output length is greater than some multiple of the input length, just fall back to the scalar loop.

I did include one "happy path" test as a control.
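The fallback idea above could be sketched roughly as follows. This is only an illustration of the heuristic, not something implemented in this pull request: the function name `should_fall_back_to_scalar` is hypothetical, and the multiple is left as a parameter because no value has been chosen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when the bytes written so far exceed
 * `multiple` times the bytes consumed so far, i.e. the input looks
 * escape-heavy and the plain scalar loop is likely the better choice.
 * `multiple` is an assumed tuning knob, not a value from the PR. */
static bool should_fall_back_to_scalar(size_t in_bytes, size_t out_bytes,
                                       size_t multiple) {
    if (in_bytes == 0 || multiple == 0) {
        return false; /* nothing consumed yet, or heuristic disabled */
    }
    if (in_bytes > SIZE_MAX / multiple) {
        return false; /* product would overflow; out_bytes cannot exceed it */
    }
    return out_bytes > in_bytes * multiple;
}
```

For example, 16 control characters each expanding to a six-byte `\u00XX` sequence give 96 output bytes for 16 input bytes, which exceeds a multiple of 2, while a 128-byte ASCII string with one escaped `\n` (129 output bytes) stays well below it. The check would need to run only once per chunk, so its cost on the happy path should be small, though that would have to be measured.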
The benchmarks
compiler: Apple clang version 17.0.0 (clang-1700.0.13.3)
compiler: gcc version 14.2.0 (Homebrew GCC 14.2.0_1)