ARM Neon optimization of `oj_dump_cstr` by samyron · Pull Request #967 · ohler55/oj · GitHub

ARM Neon optimization of oj_dump_cstr #967

Merged
ohler55 merged 3 commits into ohler55:develop on May 11, 2025

Conversation

samyron (Contributor) commented May 9, 2025

Overview

This pull request optimizes oj_dump_cstr using ARM Neon instructions to locate characters that need to be escaped. It does so by implementing a state machine within the current structure of the code.
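As a rough illustration of the search step described above, the sketch below scans 16 bytes at a time for bytes that need escaping. It is not the code from this pull request: the function name `first_escape_index` is hypothetical, and it assumes the plain JSON rules (control bytes below 0x20, `"` and `\`), whereas oj's real rules depend on the dump mode. The Neon path is only compiled on AArch64; elsewhere the scalar loop does all the work.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Assumed escape rule for this sketch: control bytes, '"' and '\\'. */
static int needs_escape(uint8_t c) {
    return c < 0x20 || c == '"' || c == '\\';
}

/* Returns the index of the first byte that needs escaping, or len if none. */
size_t first_escape_index(const uint8_t *s, size_t len) {
    size_t i = 0;
#if defined(__aarch64__)
    const uint8x16_t quote  = vdupq_n_u8('"');
    const uint8x16_t bslash = vdupq_n_u8('\\');
    const uint8x16_t space  = vdupq_n_u8(0x20);
    for (; i + 16 <= len; i += 16) {
        uint8x16_t chunk = vld1q_u8(s + i);
        /* Each lane becomes 0xFF when that byte needs escaping. */
        uint8x16_t hit = vorrq_u8(vcltq_u8(chunk, space),
                                  vorrq_u8(vceqq_u8(chunk, quote),
                                           vceqq_u8(chunk, bslash)));
        if (vmaxvq_u8(hit) != 0) {
            break; /* let the scalar loop find the exact position */
        }
    }
#endif
    for (; i < len; i++) {
        if (needs_escape(s[i])) {
            return i;
        }
    }
    return len;
}
```

A caller could copy the clean prefix `s[0..index)` to the output in one step and only then handle the byte at `index` with the escaping logic.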

Note: It may be nice in the future to refactor this code into two distinct functions. One function for searching for characters that need to be escaped and another to do the escaping.
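One possible shape for that split is sketched here, purely as an assumption about how it could look. The names `find_escape`, `write_escape` and `escape_json` are hypothetical and do not exist in oj; the search function is scalar for brevity and is the piece a vectorized version would replace. The caller must supply an output buffer of at least `6 * len + 1` bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Search: index of the first byte needing an escape, or len if none. */
static size_t find_escape(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len && s[i] >= 0x20 && s[i] != '"' && s[i] != '\\') {
        i++;
    }
    return i;
}

/* Escape: writes the escape sequence for c, returns bytes written. */
static size_t write_escape(uint8_t c, char *out) {
    switch (c) {
    case '"':  memcpy(out, "\\\"", 2); return 2;
    case '\\': memcpy(out, "\\\\", 2); return 2;
    case '\n': memcpy(out, "\\n", 2);  return 2;
    case '\t': memcpy(out, "\\t", 2);  return 2;
    case '\r': memcpy(out, "\\r", 2);  return 2;
    default:
        /* Other control bytes use the 6-byte \u00XX form. */
        snprintf(out, 7, "\\u%04x", (unsigned)c);
        return 6;
    }
}

/* Driver: alternates between bulk-copying clean runs and escaping one byte. */
size_t escape_json(const uint8_t *s, size_t len, char *out) {
    size_t o = 0;
    while (len > 0) {
        size_t n = find_escape(s, len);
        memcpy(out + o, s, n);
        o += n;
        s += n;
        len -= n;
        if (len > 0) {
            o += write_escape(*s, out + o);
            s++;
            len--;
        }
    }
    out[o] = '\0';
    return o;
}
```

Keeping the search behind its own function means the scalar and vectorized versions can be swapped without touching the escaping code.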

Additionally, I hope this doesn't break any of the Unicode validation. If do_unicode_validation == true and this code detects a high byte (a byte with the top bit set), it forces all of the characters within that chunk of text through the existing scalar code path.
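The high-byte check can be illustrated with a small portable sketch. This is a swapped-in technique (an 8-bytes-at-a-time word test rather than Neon) and the function name `chunk_has_high_byte` is hypothetical; it only shows the idea of deciding per chunk whether the slower validating path is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if any byte in s[0..len) has its top bit set (non-ASCII). */
int chunk_has_high_byte(const uint8_t *s, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w); /* safe unaligned load */
        if (w & UINT64_C(0x8080808080808080)) {
            return 1;
        }
    }
    for (; i < len; i++) { /* tail shorter than 8 bytes */
        if (s[i] & 0x80) {
            return 1;
        }
    }
    return 0;
}
```

When this returns 1 and validation is enabled, the chunk would be handed to the byte-by-byte code instead of the fast path.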

Benchmarks

Both benchmarks compare master (commit: 4332ae9f81db7caed9411be445a145276c69d4fb) vs. this arm-neon-optimize-oj-dump-cstr branch (commit: 5e8a65304633d87d0f13ce192ce3404de74bbb5b).

The benchmarks are a mix of macro benchmarks testing real-world data sets and synthetic "happy path" tests. Real-world tests show a 0-25% speedup with clang and a 6-42% speedup with gcc-14.

The best-case synthetic test runs 2.64x faster with clang and 3.20x faster with gcc. This test is a 128-byte ASCII string with a single \n at the end.
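For reference, the "x faster" figures in the benchmark output are the ratio of iterations per second after the change to iterations per second before it. A minimal sketch of that arithmetic follows; the helper names are hypothetical.

```c
/* Ratio of throughputs: a value of 2.0 means twice as many iterations/s. */
double speedup_ratio(double before_ips, double after_ips) {
    return after_ips / before_ips;
}

/* Relative gain in percent: a ratio of 2.0 is a 100% gain. */
double percent_gain(double before_ips, double after_ips) {
    return (speedup_ratio(before_ips, after_ips) - 1.0) * 100.0;
}
```

With the clang numbers for the 128-byte test (668.1 i/s before, 1761.1 i/s after), the ratio is about 2.64, which corresponds to roughly 164% more throughput.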

NOTE: Please scroll down to see the notes about synthetic worst-case tests.

compiler: Apple clang version 17.0.0 (clang-1700.0.13.3)

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.210k i/100ms
Calculating -------------------------------------
               after     22.296k (± 1.3%) i/s   (44.85 μs/i) -    112.710k in   5.055962s

Comparison:
              before:    17833.7 i/s
               after:    22296.3 i/s - 1.25x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    90.000 i/100ms
Calculating -------------------------------------
               after    902.113 (± 1.4%) i/s    (1.11 ms/i) -      4.590k in   5.089184s

Comparison:
              before:      909.1 i/s
               after:      902.1 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   220.000 i/100ms
Calculating -------------------------------------
               after      2.193k (± 1.1%) i/s  (455.89 μs/i) -     11.000k in   5.015504s

Comparison:
              before:     2018.5 i/s
               after:     2193.5 i/s - 1.09x  faster


== Encoding bytes.16.singlematch-start (200001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   396.000 i/100ms
Calculating -------------------------------------
               after      3.973k (± 1.5%) i/s  (251.67 μs/i) -     20.196k in   5.083963s

Comparison:
              before:     4082.4 i/s
               after:     3973.4 i/s - 1.03x  slower


== Encoding bytes.16.singlematch-middle (200001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   400.000 i/100ms
Calculating -------------------------------------
               after      4.029k (± 1.3%) i/s  (248.22 μs/i) -     20.400k in   5.064644s

Comparison:
              before:     4138.4 i/s
               after:     4028.6 i/s - same-ish: difference falls within error


== Encoding bytes.16.singlematch-end (200001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   421.000 i/100ms
Calculating -------------------------------------
               after      4.189k (± 0.9%) i/s  (238.72 μs/i) -     21.050k in   5.025500s

Comparison:
              before:     4241.0 i/s
               after:     4189.0 i/s - same-ish: difference falls within error


== Encoding bytes.128.single-escape-at-end (1330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   178.000 i/100ms
Calculating -------------------------------------
               after      1.761k (± 2.4%) i/s  (567.84 μs/i) -      8.900k in   5.056808s

Comparison:
              before:      668.1 i/s
               after:     1761.1 i/s - 2.64x  faster

compiler: gcc version 14.2.0 (Homebrew GCC 14.2.0_1)

== Encoding activitypub.json (52595 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after     2.388k i/100ms
Calculating -------------------------------------
               after     24.243k (± 1.1%) i/s   (41.25 μs/i) -    121.788k in   5.024204s

Comparison:
              before:    17104.2 i/s
               after:    24243.5 i/s - 1.42x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   108.000 i/100ms
Calculating -------------------------------------
               after      1.097k (± 1.1%) i/s  (911.73 μs/i) -      5.508k in   5.022404s

Comparison:
              before:     1035.4 i/s
               after:     1096.8 i/s - 1.06x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   247.000 i/100ms
Calculating -------------------------------------
               after      2.482k (± 2.0%) i/s  (402.95 μs/i) -     12.597k in   5.078149s

Comparison:
              before:     2183.4 i/s
               after:     2481.7 i/s - 1.14x  faster


== Encoding bytes.16.singlematch-start (200001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   424.000 i/100ms
Calculating -------------------------------------
               after      4.205k (± 2.5%) i/s  (237.80 μs/i) -     21.200k in   5.044784s

Comparison:
              before:     3423.3 i/s
               after:     4205.2 i/s - 1.23x  faster


== Encoding bytes.16.singlematch-middle (200001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   422.000 i/100ms
Calculating -------------------------------------
               after      4.301k (± 1.5%) i/s  (232.48 μs/i) -     21.522k in   5.004466s

Comparison:
              before:     3439.2 i/s
               after:     4301.5 i/s - 1.25x  faster


== Encoding bytes.16.singlematch-end (200001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   447.000 i/100ms
Calculating -------------------------------------
               after      4.419k (± 1.6%) i/s  (226.32 μs/i) -     22.350k in   5.059584s

Comparison:
              before:     3424.5 i/s
               after:     4418.5 i/s - 1.29x  faster


== Encoding bytes.128.single-escape-at-end (1330001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   168.000 i/100ms
Calculating -------------------------------------
               after      1.763k (± 2.8%) i/s  (567.13 μs/i) -      8.904k in   5.053473s

Comparison:
              before:      551.6 i/s
               after:     1763.3 i/s - 3.20x  faster

Synthetic worst case tests

Unfortunately, for escape-heavy workloads this code may be significantly slower. It's not all roses, so I'm including these tests for transparency.

I don't know how realistic these are... I would hope these sorts of use cases are quite uncommon. However, I'll likely focus a bit on trying to reduce the performance hit for these types of benchmarks/use cases. One quick idea I had was that if the output length is greater than some multiple of the input length, just fall back to the scalar loop.
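The fallback idea could be expressed as a simple check like the one below. This is only a sketch of the heuristic as described; the function name, its parameters and the choice of multiple are all hypothetical and were not part of this pull request.

```c
#include <stddef.h>

/*
 * Returns 1 when the bytes written so far exceed `multiple` times the bytes
 * consumed so far, which suggests an escape-heavy string where the scalar
 * loop may be the better choice. Assumes sizes small enough that the
 * multiplication cannot overflow.
 */
int should_use_scalar(size_t in_consumed, size_t out_written, size_t multiple) {
    if (in_consumed == 0) {
        return 0; /* nothing processed yet, no evidence either way */
    }
    return out_written > in_consumed * multiple;
}
```

For example, 15 control characters expand to 90 output bytes (six per `\u00XX` escape), so a multiple of 2 would trigger the fallback, while a plain ASCII string never would.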

I did include one "happy path" test as a control.

The benchmarks

benchmark_encoding "bytes.15.bestcase", ([("a" * 15)] * 10000)
benchmark_encoding "bytes.15.worstcase", ([('"' * 15)] * 10000)
benchmark_encoding "bytes.15.worstcase-2", (["\u0001a\u0002b\u0001a\u0002b\u0001a\u0002b\u0001a\u0002"] * 10000)
benchmark_encoding "bytes.32.worstcase", ([("\"\t\\\n" * 8)] * 10000)

compiler: Apple clang version 17.0.0 (clang-1700.0.13.3)

== Encoding bytes.15.bestcase (180001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   556.000 i/100ms
Calculating -------------------------------------
               after      5.650k (± 1.8%) i/s  (176.99 μs/i) -     28.356k in   5.020415s

Comparison:
              before:     5748.1 i/s
               after:     5650.0 i/s - same-ish: difference falls within error


== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   172.000 i/100ms
Calculating -------------------------------------
               after      1.737k (± 4.4%) i/s  (575.76 μs/i) -      8.772k in   5.064093s

Comparison:
              before:     2766.3 i/s
               after:     1736.8 i/s - 1.59x  slower


== Encoding bytes.15.worstcase-2 (580001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   202.000 i/100ms
Calculating -------------------------------------
               after      2.085k (± 5.1%) i/s  (479.64 μs/i) -     10.504k in   5.055611s

Comparison:
              before:     2589.5 i/s
               after:     2084.9 i/s - 1.24x  slower


== Encoding bytes.32.worstcase (670001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    79.000 i/100ms
Calculating -------------------------------------
               after    824.656 (± 1.9%) i/s    (1.21 ms/i) -      4.187k in   5.079329s

Comparison:
              before:     1446.3 i/s
               after:      824.7 i/s - 1.75x  slower

compiler: gcc version 14.2.0 (Homebrew GCC 14.2.0_1)

== Encoding bytes.15.bestcase (180001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   605.000 i/100ms
Calculating -------------------------------------
               after      6.239k (± 2.1%) i/s  (160.28 μs/i) -     31.460k in   5.044701s

Comparison:
              before:     6504.8 i/s
               after:     6238.9 i/s - same-ish: difference falls within error


== Encoding bytes.15.worstcase (330001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   158.000 i/100ms
Calculating -------------------------------------
               after      1.583k (± 1.6%) i/s  (631.72 μs/i) -      8.058k in   5.091748s

Comparison:
              before:     2182.4 i/s
               after:     1583.0 i/s - 1.38x  slower


== Encoding bytes.15.worstcase-2 (580001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after   227.000 i/100ms
Calculating -------------------------------------
               after      2.255k (± 2.2%) i/s  (443.38 μs/i) -     11.350k in   5.034758s

Comparison:
              before:     3010.0 i/s
               after:     2255.4 i/s - 1.33x  slower


== Encoding bytes.32.worstcase (670001 bytes)
ruby 3.3.8 (2025-04-09 revision b200bad6cd) +YJIT [arm64-darwin24]
Warming up --------------------------------------
               after    78.000 i/100ms
Calculating -------------------------------------
               after    799.802 (± 7.1%) i/s    (1.25 ms/i) -      3.978k in   5.020809s

Comparison:
              before:     1410.0 i/s
               after:      799.8 i/s - 1.76x  slower

samyron marked this pull request as ready for review May 9, 2025 03:04

samyron (Contributor, Author) commented May 11, 2025

The performance is pretty consistent after the simplifications.

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.230k i/100ms
Calculating -------------------------------------
               after     22.419k (± 1.6%) i/s   (44.60 μs/i) -    113.730k in   5.074151s

Comparison:
              before:    17756.7 i/s
               after:    22419.4 i/s - 1.26x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after    90.000 i/100ms
Calculating -------------------------------------
               after    905.970 (± 1.5%) i/s    (1.10 ms/i) -      4.590k in   5.067532s

Comparison:
              before:      916.5 i/s
               after:      906.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   220.000 i/100ms
Calculating -------------------------------------
               after      2.204k (± 1.3%) i/s  (453.63 μs/i) -     11.220k in   5.090643s

Comparison:
              before:     1999.9 i/s
               after:     2204.4 i/s - 1.10x  faster

ohler55 merged commit b8ecebd into ohler55:develop on May 11, 2025
50 of 53 checks passed