ARM Neon optimization of `oj_dump_cstr` #967

samyron · 2025-05-09T02:56:11Z

Overview

This pull request optimizes oj_dump_cstr using ARM Neon instructions to locate characters that need to be escaped. It does so by implementing a state machine within the current structure of the code.
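For readers unfamiliar with the approach, here is a minimal sketch of the "locate characters that need escaping" step. It is not the code in this PR: the function names and the escape set (control bytes, `"` and `\`) are assumptions for illustration, and the real set in oj depends on the dump mode. The Neon path only answers "does this 16-byte chunk contain anything to escape?" and leaves pinpointing the exact byte to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Assumed escape set for illustration: control bytes, '"' and '\\'. */
int sketch_needs_escape(uint8_t c) {
    return c < 0x20 || c == '"' || c == '\\';
}

/* Returns the index of the first byte that needs escaping, or len if none. */
size_t sketch_find_first_escape(const uint8_t *s, size_t len) {
    size_t i = 0;
    assert(s != NULL || len == 0);
#if defined(__ARM_NEON) && defined(__aarch64__)
    const uint8x16_t space  = vdupq_n_u8(0x20);
    const uint8x16_t quote  = vdupq_n_u8('"');
    const uint8x16_t bslash = vdupq_n_u8('\\');
    /* Skip whole 16-byte chunks that contain nothing to escape. */
    for (; i + 16 <= len; i += 16) {
        uint8x16_t chunk = vld1q_u8(s + i);
        uint8x16_t hit   = vorrq_u8(vcltq_u8(chunk, space),
                                    vorrq_u8(vceqq_u8(chunk, quote),
                                             vceqq_u8(chunk, bslash)));
        if (vmaxvq_u8(hit) != 0) {
            break; /* at least one lane matched; the scalar loop finds it */
        }
    }
#endif
    /* Scalar path: handles the tail, the matching chunk, and non-Neon builds. */
    for (; i < len; i++) {
        if (sketch_needs_escape(s[i])) {
            return i;
        }
    }
    return len;
}
```

On a build without Neon the same function degrades to the plain byte-at-a-time loop, which makes it easy to compare the two paths on identical input.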

Note: It may be nice in the future to refactor this code into two distinct functions: one that searches for characters that need to be escaped and another that does the escaping.

Additionally, I hope this doesn't break any of the Unicode validation. If do_unicode_validation == true and this code detects a high byte, it forces all of the characters within that chunk of text through the current code.
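A rough sketch of the high-byte check described above, assuming a 16-byte chunk. The helper names are hypothetical and this is not the PR's code; it only illustrates the policy that a chunk containing any byte >= 0x80 is sent through the existing scalar path when validation is enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Nonzero if any of the 16 bytes at p has its top bit set (non-ASCII). */
int sketch_chunk_has_high_byte(const uint8_t *p) {
    assert(p != NULL);
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* The largest lane is >= 0x80 exactly when some byte is non-ASCII. */
    return vmaxvq_u8(vld1q_u8(p)) >= 0x80;
#else
    /* Portable fallback: OR all bytes together and test the top bit. */
    uint8_t acc = 0;
    for (size_t i = 0; i < 16; i++) {
        acc |= p[i];
    }
    return (acc & 0x80) != 0;
#endif
}

/* Hypothetical policy helper: may this chunk take the SIMD path? */
int sketch_chunk_can_use_simd(const uint8_t *p, int do_unicode_validation) {
    return !(do_unicode_validation && sketch_chunk_has_high_byte(p));
}
```

With validation off, the high-byte check is skipped entirely, so pure-ASCII and UTF-8 input are treated the same by this helper.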

Benchmarks

Both benchmarks compare master (commit 4332ae9f81db7caed9411be445a145276c69d4fb) against the arm-neon-optimize-oj-dump-cstr branch (commit 5e8a65304633d87d0f13ce192ce3404de74bbb5b).

The benchmarks are a mix of macro benchmarks testing real-world data sets and synthetic "happy path" tests. Real-world tests show a 0-25% speedup with clang and a 6-42% speedup with gcc-14.

The best-case synthetic test runs 2.64x faster with clang and 3.20x faster with gcc (a speedup of roughly 164% and 220%, respectively). This test is a 128-byte ASCII string with a single \n at the end.

NOTE: Please scroll down to see the notes about synthetic worst-case tests.

compiler: Apple clang version 17.0.0 (clang-1700.0.13.3)

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.210k i/100ms
Calculating -------------------------------------
               after     22.296k (± 1.3%) i/s   (44.85 μs/i) -    112.710k in   5.055962s

Comparison:
              before:    17833.7 i/s
               after:    22296.3 i/s - 1.25x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    90.000 i/100ms
Calculating -------------------------------------
               after    902.113 (± 1.4%) i/s    (1.11 ms/i) -      4.590k in   5.089184s

Comparison:
              before:      909.1 i/s
               after:      902.1 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   220.000 i/100ms
Calculating -------------------------------------
               after      2.193k (± 1.1%) i/s  (455.89 μs/i) -     11.000k in   5.015504s

Comparison:
              before:     2018.5 i/s
               after:     2193.5 i/s - 1.09x  faster


== Encoding bytes.16.singlematch-start (200001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   396.000 i/100ms
Calculating -------------------------------------
               after      3.973k (± 1.5%) i/s  (251.67 μs/i) -     20.196k in   5.083963s

Comparison:
              before:     4082.4 i/s
               after:     3973.4 i/s - 1.03x  slower


== Encoding bytes.16.singlematch-middle (200001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   400.000 i/100ms
Calculating -------------------------------------
               after      4.029k (± 1.3%) i/s  (248.22 μs/i) -     20.400k in   5.064644s

Comparison:
              before:     4138.4 i/s
               after:     4028.6 i/s - same-ish: difference falls within error


== Encoding bytes.16.singlematch-end (200001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   421.000 i/100ms
Calculating -------------------------------------
               after      4.189k (± 0.9%) i/s  (238.72 μs/i) -     21.050k in   5.025500s

Comparison:
              before:     4241.0 i/s
               after:     4189.0 i/s - same-ish: difference falls within error


== Encoding bytes.128.single-escape-at-end (1330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   178.000 i/100ms
Calculating -------------------------------------
               after      1.761k (± 2.4%) i/s  (567.84 μs/i) -      8.900k in   5.056808s

Comparison:
              before:      668.1 i/s
               after:     1761.1 i/s - 2.64x  faster

compiler: gcc version 14.2.0 (Homebrew GCC 14.2.0_1)

== Encoding activitypub.json (52595 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after     2.388k i/100ms
Calculating -------------------------------------
               after     24.243k (± 1.1%) i/s   (41.25 μs/i) -    121.788k in   5.024204s

Comparison:
              before:    17104.2 i/s
               after:    24243.5 i/s - 1.42x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   108.000 i/100ms
Calculating -------------------------------------
               after      1.097k (± 1.1%) i/s  (911.73 μs/i) -      5.508k in   5.022404s

Comparison:
              before:     1035.4 i/s
               after:     1096.8 i/s - 1.06x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   247.000 i/100ms
Calculating -------------------------------------
               after      2.482k (± 2.0%) i/s  (402.95 μs/i) -     12.597k in   5.078149s

Comparison:
              before:     2183.4 i/s
               after:     2481.7 i/s - 1.14x  faster


== Encoding bytes.16.singlematch-start (200001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   424.000 i/100ms
Calculating -------------------------------------
               after      4.205k (± 2.5%) i/s  (237.80 μs/i) -     21.200k in   5.044784s

Comparison:
              before:     3423.3 i/s
               after:     4205.2 i/s - 1.23x  faster


== Encoding bytes.16.singlematch-middle (200001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   422.000 i/100ms
Calculating -------------------------------------
               after      4.301k (± 1.5%) i/s  (232.48 μs/i) -     21.522k in   5.004466s

Comparison:
              before:     3439.2 i/s
               after:     4301.5 i/s - 1.25x  faster


== Encoding bytes.16.singlematch-end (200001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   447.000 i/100ms
Calculating -------------------------------------
               after      4.419k (± 1.6%) i/s  (226.32 μs/i) -     22.350k in   5.059584s

Comparison:
              before:     3424.5 i/s
               after:     4418.5 i/s - 1.29x  faster


== Encoding bytes.128.single-escape-at-end (1330001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   168.000 i/100ms
Calculating -------------------------------------
               after      1.763k (± 2.8%) i/s  (567.13 μs/i) -      8.904k in   5.053473s

Comparison:
              before:      551.6 i/s
               after:     1763.3 i/s - 3.20x  faster

Synthetic worst case tests

Unfortunately, it's not all roses: for escape-heavy workloads this code may be significantly slower. I'm including these tests for transparency.

I don't know how realistic these are... I would hope these sorts of use cases are quite uncommon. However, I'll likely focus a bit on trying to reduce the performance hit for these types of benchmarks/use cases. One quick idea I had: if the output length is greater than some multiple of the input length, just fall back to the scalar loop.
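To make the fallback idea concrete, here is a small sketch of what such a check could look like. Nothing like this is in the PR; the function name, the 1.5x threshold, and the 16-byte minimum sample are all assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed minimum input seen before the ratio is trusted (one Neon chunk). */
#define SKETCH_MIN_SAMPLE 16

/*
 * Returns nonzero when the escaped output has grown past 1.5x the input
 * consumed so far, which suggests an escape-heavy string where the scalar
 * loop is likely the better choice. The 1.5x factor is an assumption.
 */
int sketch_should_use_scalar(size_t consumed, size_t written) {
    assert(written >= consumed); /* escaping never shrinks the output */
    if (consumed < SKETCH_MIN_SAMPLE) {
        return 0; /* too little data to judge */
    }
    return written > consumed + consumed / 2;
}
```

A string made entirely of `"` doubles in size when escaped, so it would trip this check after the first chunk, while a mostly-ASCII string with an occasional `\n` would not.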

I did include one "happy path" test as a control.

The benchmarks

benchmark_encoding "bytes.15.bestcase", ([("a" * 15)] * 10000)
benchmark_encoding "bytes.15.worstcase", ([('"' * 15)] * 10000)
benchmark_encoding "bytes.15.worstcase-2", (["\u0001a\u0002b\u0001a\u0002b\u0001a\u0002b\u0001a\u0002"] * 10000)
benchmark_encoding "bytes.32.worstcase", ([("\"\t\\\n" * 8)] * 10000)
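As a sanity check on the byte counts reported for these inputs, the encoded size can be worked out by hand: each string costs its escaped length plus two quotes, the elements are joined by commas, and the array adds two brackets. The sketch below does that arithmetic. The escape widths (2 bytes for `"`, `\` and the short escapes, 6 bytes for other control characters) are assumptions, but they agree with the sizes printed in the results (180001, 330001, 580001 and 670001 bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Escaped width of one byte under the assumed JSON escaping rules. */
size_t sketch_escaped_width(unsigned char c) {
    switch (c) {
    case '"': case '\\':
    case '\b': case '\f': case '\n': case '\r': case '\t':
        return 2; /* e.g. \" or \n */
    default:
        return c < 0x20 ? 6 : 1; /* \u00XX for other control bytes */
    }
}

/* Size in bytes of a JSON array holding `count` copies of the string s. */
size_t sketch_encoded_array_size(const char *s, size_t len, size_t count) {
    size_t escaped = 0;
    assert(s != NULL && count > 0);
    for (size_t i = 0; i < len; i++) {
        escaped += sketch_escaped_width((unsigned char)s[i]);
    }
    /* count quoted strings, count - 1 commas, 2 brackets */
    return count * (escaped + 2) + (count - 1) + 2;
}
```

For example, the all-quotes input is 15 bytes raw and 30 bytes escaped, so 10000 copies come to 10000 * 32 + 9999 + 2 = 330001 bytes.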

compiler: Apple clang version 17.0.0 (clang-1700.0.13.3)

== Encoding bytes.15.bestcase (180001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   556.000 i/100ms
Calculating -------------------------------------
               after      5.650k (± 1.8%) i/s  (176.99 μs/i) -     28.356k in   5.020415s

Comparison:
              before:     5748.1 i/s
               after:     5650.0 i/s - same-ish: difference falls within error


== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   172.000 i/100ms
Calculating -------------------------------------
               after      1.737k (± 4.4%) i/s  (575.76 μs/i) -      8.772k in   5.064093s

Comparison:
              before:     2766.3 i/s
               after:     1736.8 i/s - 1.59x  slower


== Encoding bytes.15.worstcase-2 (580001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   202.000 i/100ms
Calculating -------------------------------------
               after      2.085k (± 5.1%) i/s  (479.64 μs/i) -     10.504k in   5.055611s

Comparison:
              before:     2589.5 i/s
               after:     2084.9 i/s - 1.24x  slower


== Encoding bytes.32.worstcase (670001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    79.000 i/100ms
Calculating -------------------------------------
               after    824.656 (± 1.9%) i/s    (1.21 ms/i) -      4.187k in   5.079329s

Comparison:
              before:     1446.3 i/s
               after:      824.7 i/s - 1.75x  slower

compiler: gcc version 14.2.0 (Homebrew GCC 14.2.0_1)

== Encoding bytes.15.bestcase (180001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   605.000 i/100ms
Calculating -------------------------------------
               after      6.239k (± 2.1%) i/s  (160.28 μs/i) -     31.460k in   5.044701s

Comparison:
              before:     6504.8 i/s
               after:     6238.9 i/s - same-ish: difference falls within error


== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   158.000 i/100ms
Calculating -------------------------------------
               after      1.583k (± 1.6%) i/s  (631.72 μs/i) -      8.058k in   5.091748s

Comparison:
              before:     2182.4 i/s
               after:     1583.0 i/s - 1.38x  slower


== Encoding bytes.15.worstcase-2 (580001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   227.000 i/100ms
Calculating -------------------------------------
               after      2.255k (± 2.2%) i/s  (443.38 μs/i) -     11.350k in   5.034758s

Comparison:
              before:     3010.0 i/s
               after:     2255.4 i/s - 1.33x  slower


== Encoding bytes.32.worstcase (670001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after    78.000 i/100ms
Calculating -------------------------------------
               after    799.802 (± 7.1%) i/s    (1.25 ms/i) -      3.978k in   5.020809s

Comparison:
              before:     1410.0 i/s
               after:      799.8 i/s - 1.76x  slower


samyron · 2025-05-11T02:17:33Z

The performance is pretty consistent after the simplifications.

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.230k i/100ms
Calculating -------------------------------------
               after     22.419k (± 1.6%) i/s   (44.60 μs/i) -    113.730k in   5.074151s

Comparison:
              before:    17756.7 i/s
               after:    22419.4 i/s - 1.26x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    90.000 i/100ms
Calculating -------------------------------------
               after    905.970 (± 1.5%) i/s    (1.10 ms/i) -      4.590k in   5.067532s

Comparison:
              before:      916.5 i/s
               after:      906.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   220.000 i/100ms
Calculating -------------------------------------
               after      2.204k (± 1.3%) i/s  (453.63 μs/i) -     11.220k in   5.090643s

Comparison:
              before:     1999.9 i/s
               after:     2204.4 i/s - 1.10x  faster

First pass at optimizing oj_dump_cstr with ARM Neon SIMD instructions.

5e8a653

samyron marked this pull request as ready for review May 9, 2025 03:04

samyron added 2 commits May 10, 2025 21:13

Simplifications of the ARM Neon code.

62f917e

Merge pull request #1 from samyron/arm-neon-optimize-oj-dump-cstr-sim…

93dd85a

…plification Simplifications of the ARM Neon code.

ohler55 approved these changes May 11, 2025


ohler55 merged commit b8ecebd into ohler55:develop May 11, 2025
50 of 53 checks passed
