core/clist: use insertion sort #21398

maribu · 2025-04-11T19:06:16Z

Contribution description

This replaces the previous clist sort implementation, which used a merge sort, with a simple insertion sort implementation.

The motivation is as follow:

The insertion sort is smaller (code size)
The performance benefit of the merge sort would only become noticeable on data sets that would be to large to comfortably handle on the MCUs RIOT targets
The previous implementation looked like a good candidate for the IOCCC, while the insertion sort is trivial to understand
Static analysis could provide for two statements an example control flow path that would trigger a NULL pointer dereferencation. The current code has no such defects.

Testing procedure

There is a unit test that should still pass.
Adding -fanalyzer should now be happy with the code
The code should now be understandable without spending hours on a white board

Issues/PRs references

None

maribu · 2025-04-11T19:21:31Z

I also improved the unit test a little bit to cover a few edge cases and to verify that the sorted list is actually sorted.

riot-ci · 2025-04-11T19:29:43Z

Murdock results

❌ FAILED

66f4a66 fixup! core/clist: use insertion sort

Success	Failures	Total	Runtime
5943	0	9842	05m:49s

Artifacts

This adds a benchmark for clist_sort() and calls that on three different list sizes.

maribu · 2025-04-12T09:35:49Z

Depends on and includes:

Add a few cycles before starting the stopwatch for fancy CPUs that do have a cache and branch predictor that may warm up (e.g. Cortex M7).

maribu · 2025-04-12T19:37:27Z

Some benchmarks:

`native64`

`master` (merge sort, IOCCC variant)

main(): This is RIOT! (Version: 2025.04-devel-520-g98b01-tests/bench/runtime_coreapis)
Runtime of Selected Core API functions

                 nop loop:       430us  ---   0.000us per call  ---  2325581395 calls per sec

             mutex_init():       893us  ---   0.000us per call  ---  1119820828 calls per sec
        mutex lock/unlock:    909438us  ---   0.909us per call  ---    1099580 calls per sec

       thread_flags_set():    319468us  ---   0.319us per call  ---    3130203 calls per sec
     thread_flags_clear():    325462us  ---   0.325us per call  ---    3072555 calls per sec
thread flags set/wait any:    978442us  ---   0.978us per call  ---    1022032 calls per sec
thread flags set/wait all:    972636us  ---   0.972us per call  ---    1028133 calls per sec
thread flags set/wait one:    979552us  ---   0.979us per call  ---    1020874 calls per sec

        msg_try_receive():    334414us  ---   0.334us per call  ---    2990305 calls per sec
              msg_avail():      4767us  ---   0.004us per call  ---  209775540 calls per sec

   clist_sort (#4, worst):     19603us  ---   0.019us per call  ---   51012600 calls per sec
     clist_sort (#4, avg):     20029us  ---   0.020us per call  ---   49927604 calls per sec
   clist_sort (#8, worst):     47127us  ---   0.047us per call  ---   21219258 calls per sec
     clist_sort (#8, avg):     49394us  ---   0.049us per call  ---   20245373 calls per sec
  clist_sort (#16, worst):    119396us  ---   0.119us per call  ---    8375489 calls per sec
    clist_sort (#16, avg):    115471us  ---   0.115us per call  ---    8660183 calls per sec
  clist_sort (#32, worst):    278381us  ---   0.278us per call  ---    3592199 calls per sec
    clist_sort (#32, avg):    312818us  ---   0.312us per call  ---    3196746 calls per sec
  clist_sort (#64, worst):    636261us  ---   0.636us per call  ---    1571682 calls per sec
    clist_sort (#64, avg):    780777us  ---   0.780us per call  ---    1280775 calls per sec
 clist_sort (#128, worst):   1424094us  ---   1.424us per call  ---     702200 calls per sec
   clist_sort (#128, avg):   1874443us  ---   1.874us per call  ---     533491 calls per sec
 clist_sort (#256, worst):   3235114us  ---   3.235us per call  ---     309108 calls per sec
   clist_sort (#256, avg):   4239094us  ---   4.239us per call  ---     235899 calls per sec

[SUCCESS]
{ "threads": [{ "name": "idle", "stack_size": 16384, "stack_used": 1080 }]}
{ "threads": [{ "name": "main", "stack_size": 20480, "stack_used": 4760 }]}

This PR (simple insertion sort)

main(): This is RIOT! (Version: 2025.04-devel-522-gc33cb94-core/clist_sort_insertion_sort)
Runtime of Selected Core API functions

                 nop loop:       593us  ---   0.000us per call  ---  1686340640 calls per sec

             mutex_init():      1268us  ---   0.001us per call  ---  788643533 calls per sec
        mutex lock/unlock:   1038304us  ---   1.038us per call  ---     963109 calls per sec

       thread_flags_set():    327235us  ---   0.327us per call  ---    3055907 calls per sec
     thread_flags_clear():    331057us  ---   0.331us per call  ---    3020627 calls per sec
thread flags set/wait any:    979440us  ---   0.979us per call  ---    1020991 calls per sec
thread flags set/wait all:    976936us  ---   0.976us per call  ---    1023608 calls per sec
thread flags set/wait one:    979527us  ---   0.979us per call  ---    1020900 calls per sec

        msg_try_receive():    330618us  ---   0.330us per call  ---    3024638 calls per sec
              msg_avail():      2910us  ---   0.002us per call  ---  343642611 calls per sec

   clist_sort (#4, worst):     16905us  ---   0.016us per call  ---   59154096 calls per sec
     clist_sort (#4, avg):     15056us  ---   0.015us per call  ---   66418703 calls per sec
   clist_sort (#8, worst):     33741us  ---   0.033us per call  ---   29637532 calls per sec
     clist_sort (#8, avg):     22386us  ---   0.022us per call  ---   44670776 calls per sec
  clist_sort (#16, worst):     73995us  ---   0.073us per call  ---   13514426 calls per sec
    clist_sort (#16, avg):     56454us  ---   0.056us per call  ---   17713536 calls per sec
  clist_sort (#32, worst):    194442us  ---   0.194us per call  ---    5142921 calls per sec
    clist_sort (#32, avg):    106041us  ---   0.106us per call  ---    9430314 calls per sec
  clist_sort (#64, worst):    394818us  ---   0.394us per call  ---    2532812 calls per sec
    clist_sort (#64, avg):    253255us  ---   0.253us per call  ---    3948589 calls per sec
 clist_sort (#128, worst):    753123us  ---   0.753us per call  ---    1327804 calls per sec
   clist_sort (#128, avg):    601839us  ---   0.601us per call  ---    1661573 calls per sec
 clist_sort (#256, worst):   1514474us  ---   1.514us per call  ---     660295 calls per sec
   clist_sort (#256, avg):   1330050us  ---   1.330us per call  ---     751851 calls per sec

[SUCCESS]
{ "threads": [{ "name": "idle", "stack_size": 16384, "stack_used": 1080 }]}
{ "threads": [{ "name": "main", "stack_size": 20480, "stack_used": 4760 }]}

`same54-xpro`

`master` (merge sort, IOCCC variant)

main(): This is RIOT! (Version: 2025.04-devel-520-g98b01-tests/bench/runtime_coreapis)
Runtime of Selected Core API functions

                 nop loop:     33335us  ---   0.033us per call  ---   29998500 calls per sec

             mutex_init():         2us  ---   0.000us per call  ---  1783793664 calls per sec
        mutex lock/unlock:    541669us  ---   0.541us per call  ---    1846145 calls per sec

       thread_flags_set():    275002us  ---   0.275us per call  ---    3636337 calls per sec
     thread_flags_clear():    266669us  ---   0.266us per call  ---    3749967 calls per sec
thread flags set/wait any:    941669us  ---   0.941us per call  ---    1061944 calls per sec
thread flags set/wait all:    800002us  ---   0.800us per call  ---    1249996 calls per sec
thread flags set/wait one:    933335us  ---   0.933us per call  ---    1071426 calls per sec

        msg_try_receive():    416669us  ---   0.416us per call  ---    2399986 calls per sec
              msg_avail():    133335us  ---   0.133us per call  ---    7499906 calls per sec

   clist_sort (#4, worst):      4127us  ---   4.127us per call  ---     242306 calls per sec
     clist_sort (#4, avg):      4202us  ---   4.202us per call  ---     237981 calls per sec
   clist_sort (#8, worst):     10352us  ---  10.352us per call  ---      96599 calls per sec
     clist_sort (#8, avg):     10358us  ---  
8000
10.358us per call  ---      96543 calls per sec
  clist_sort (#16, worst):     25110us  ---  25.110us per call  ---      39824 calls per sec
    clist_sort (#16, avg):     25002us  ---  25.002us per call  ---      39996 calls per sec
  clist_sort (#32, worst):     59235us  ---  59.235us per call  ---      16881 calls per sec
    clist_sort (#32, avg):     62160us  ---  62.160us per call  ---      16087 calls per sec
  clist_sort (#64, worst):    136694us  ---  136.694us per call  ---       7315 calls per sec
    clist_sort (#64, avg):    147785us  ---  147.785us per call  ---       6766 calls per sec
 clist_sort (#128, worst):    310018us  ---  310.018us per call  ---       3225 calls per sec
   clist_sort (#128, avg):    341994us  ---  341.994us per call  ---       2924 calls per sec
 clist_sort (#256, worst):    693477us  ---  693.477us per call  ---       1442 calls per sec
   clist_sort (#256, avg):    777652us  ---  777.652us per call  ---       1285 calls per sec

[SUCCESS]

This PR (simple insertion sort)

main(): This is RIOT! (Version: 2025.04-devel-522-gc33cb94-core/clist_sort_insertion_sort)
Runtime of Selected Core API functions

                 nop loop:     33335us  ---   0.033us per call  ---   29998500 calls per sec

             mutex_init():         2us  ---   0.000us per call  ---  1783793664 calls per sec
        mutex lock/unlock:    541669us  ---   0.541us per call  ---    1846145 calls per sec

       thread_flags_set():    275001us  ---   0.275us per call  ---    3636350 calls per sec
     thread_flags_clear():    266668us  ---   0.266us per call  ---    3749981 calls per sec
thread flags set/wait any:    941669us  ---   0.941us per call  ---    1061944 calls per sec
thread flags set/wait all:    800002us  ---   0.800us per call  ---    1249996 calls per sec
thread flags set/wait one:    933336us  ---   0.933us per call  ---    1071425 calls per sec

        msg_try_receive():    416668us  ---   0.416us per call  ---    2399992 calls per sec
              msg_avail():    133335us  ---   0.133us per call  ---    7499906 calls per sec

   clist_sort (#4, worst):      2252us  ---   2.252us per call  ---     444049 calls per sec
     clist_sort (#4, avg):      2443us  ---   2.443us per call  ---     409332 calls per sec
   clist_sort (#8, worst):      4852us  ---   4.852us per call  ---     206100 calls per sec
     clist_sort (#8, avg):      3918us  ---   3.918us per call  ---     255232 calls per sec
  clist_sort (#16, worst):     10052us  ---  10.052us per call  ---      99482 calls per sec
    clist_sort (#16, avg):      8485us  ---   8.485us per call  ---     117855 calls per sec
  clist_sort (#32, worst):     20452us  ---  20.452us per call  ---      48894 calls per sec
    clist_sort (#32, avg):     15069us  ---  15.069us per call  ---      66361 calls per sec
  clist_sort (#64, worst):     41252us  ---  41.252us per call  ---      24241 calls per sec
    clist_sort (#64, avg):     31985us  ---  31.985us per call  ---      31264 calls per sec
 clist_sort (#128, worst):     82852us  ---  82.852us per call  ---      12069 calls per sec
   clist_sort (#128, avg):     68819us  ---  68.819us per call  ---      14530 calls per sec
 clist_sort (#256, worst):    166052us  ---  166.052us per call  ---       6022 calls per sec
   clist_sort (#256, avg):    140085us  ---  140.085us per call  ---       7138 calls per sec

[SUCCESS]

kaspar030 · 2025-04-12T21:35:36Z

tests/bench/runtime_coreapis/main.c

+        /* This is the worst case for most sorting algorithms: having the
+         * list perfectly sorted but in inverse direction. For real time, we
+         * only care about worst case and not about best case / average case */
+        n->value = val--;


This is not the worst case for this insertion sort implementation, as using val++ makes sorting much slower.

Suggested change

n->value = val--;

n->value = val++;

E.g., on nrf52840dk, for n=256:

before:

2025-04-12 23:18:21,904 # clist_sort (#256, worst): 311423us --- 311.423us per call --- 3211 calls per sec

after:

2025-04-12 23:19:11,954 # clist_sort (#256, worst): 9233345us --- 9233.345us per call --- 108 calls per sec

That's a factor 30 slower than the claimed "worst case".

The cut-off point where this insertion sort becomes slower than the previous implementation (which does not come from IOCCC btw) is below n=32.

kaspar030 · 2025-04-12T22:11:53Z

I'm all for a fixed clist sort, but the benchmark seems misleading, the worst case is actually not the worst case for this insertion sort.

The performance benefit of the merge sort would only become noticeable on data sets that would be to large to comfortably handle on the MCUs RIOT targets

With the "fixed worst case", it becomes noticable with less than 32 list entries. At 32 list entries, the old merge sort is already 50% faster then this insertion sort (worst case). The nrf's memory fits 1000 times 32 list entries.

With 16384 list entries, merge sort can sort the list five times per second. (insertion sort is still running and will take a while, I will update this post when it is done.). This might be a size that is unlikely to come up on our little devices.

This insertion sort is still much faster in the "avg case" of this benchmark. (Which it shouldn't be actually. 🙂)

- Iterate of different lengths of unsorted data to also cover corner case one-item list - Actually check that the list is sorted afterwards (and not just that the maximum got sorted in last) - Actually check that the list was stable-sorted (as the API doc promises a stable sort)

This replaces the previous clist sort implementation, which used a merge sort, with a simple insertion sort implementation. The motivation is as follow: 1. The insertion sort is smaller (code size) 2. The performance benefit of the merge sort would only become noticeable on data sets that would be to large to comfortably handle on the MCUs RIOT targets 3. The previous implementation looked like a good candidate for the IOCCC, while the insertion sort is trivial to understand 4. Static analysis could provide for two statements an example control flow path that would trigger a `NULL` pointer dereferencation. The current code has no such defects.

maribu requested a review from kaspar030 as a code owner April 11, 2025 19:06

github-actions bot added the Process: missing approvals Integration Process: PR needs more ACKS (handled by action) label Apr 11, 2025

maribu force-pushed the core/clist_sort_insertion_sort branch 2 times, most recently from 99199f5 to cd93aa7 Compare April 11, 2025 19:12

maribu requested a review from miri64 as a code owner April 11, 2025 19:20

github-actions bot added the Area: tests Area: tests and testing framework label Apr 11, 2025

maribu force-pushed the core/clist_sort_insertion_sort branch from f8a1c38 to 689b4cd Compare April 12, 2025 09:00

maribu added State: waiting for other PR State: The PR requires another PR to be merged first and removed Area: tests Area: tests and testing framework labels Apr 12, 2025

maribu marked this pull request as draft April 12, 2025 09:01

tests/bench/runtime_coreapis: add clist_sort() benchmark

8502272

This adds a benchmark for clist_sort() and calls that on three different list sizes.

maribu force-pushed the core/clist_sort_insertion_sort branch from 689b4cd to 9842631 Compare April 12, 2025 09:34

github-actions bot added the Area: tests Area: tests and testing framework label Apr 12, 2025

maribu added 2 commits April 12, 2025 21:15

fixup! tests/bench/runtime_coreapis: add clist_sort() benchmark

138f164

sys/benchmark: warm up cache

98b0136

Add a few cycles before starting the stopwatch for fancy CPUs that do have a cache and branch predictor that may warm up (e.g. Cortex M7).

maribu force-pushed the core/clist_sort_insertion_sort branch from 9842631 to c33cb94 Compare April 12, 2025 19:31

github-actions bot added the Area: sys Area: System label Apr 12, 2025

kaspar030 reviewed Apr 12, 2025

View reviewed changes

maribu added 3 commits April 13, 2025 09:12

fixup! tests/bench/runtime_coreapis: add clist_sort() benchmark

048daad

fixup! tests/bench/runtime_coreapis: add clist_sort() benchmark

49c3833

fixup! fixup! tests/bench/runtime_coreapis: add clist_sort() benchmark

c55e768

maribu added 3 commits April 13, 2025 11:58

fixup! core/clist: use insertion sort

66f4a66

maribu force-pushed the core/clist_sort_insertion_sort branch from c33cb94 to 66f4a66 Compare April 13, 2025 10:00

kaspar030 mentioned this pull request Apr 13, 2025

tests/bench/runtime_coreapis: add clist_sort() benchmark #21403

Merged

maribu mentioned this pull request Apr 13, 2025

core/clist_sort: rewrite #21406

Draft

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core/clist: use insertion sort #21398

core/clist: use insertion sort #21398

core/clist: use insertion sort #21398

Are you sure you want to change the base?

core/clist: use insertion sort #21398

Conversation

Contribution description

Testing procedure

Issues/PRs references

Murdock results

Artifacts

native64

master (merge sort, IOCCC variant)

This PR (simple insertion sort)

same54-xpro

master (merge sort, IOCCC variant)

This PR (simple insertion sort)

Choose a reason for hiding this comment

`native64`

`master` (merge sort, IOCCC variant)

`same54-xpro`

`master` (merge sort, IOCCC variant)