snabbmark: Add first step towards porting basic1 to asm [sketch] #603
Super early sketch of the code! Shared just for fun.

The idea is simple: clone the 'basic1' benchmark in assembler to see what performance is possible. Then we could use this as a benchmark for deciding how optimized our Lua code is. Could be that in writing the assembler code we get some bright ideas for improving the Lua code too.

This is very rough code that implements:

    source -> sink

and recycles packets via a freelist (sketched below). The full implementation would have to implement the real basic1 topology, i.e. include a "Tee" app that duplicates packets.

This is not optimized, etc., so don't draw any sweeping conclusions yet :-). Note also that I did most of my assembler programming 20+ years ago on the MC68000, so it will take a little practice to find a decent style :).
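To give a flavor, here is a minimal sketch of the freelist idea, assuming a counted stack of packet pointers with its base address in rbx (register choice and memory layout are illustrative, not the actual code):

```asm
; hypothetical layout: [rbx] = number of free packets,
; [rbx+8+i*8] = pointer to free packet i.

; pop a free packet into rax:
mov rcx, [rbx]          ; load current count
sub rcx, 1              ; one fewer packet on the list
mov rax, [rbx+8+rcx*8]  ; fetch the top packet pointer
mov [rbx], rcx          ; store the updated count

; push the packet in rax back onto the freelist:
mov rcx, [rbx]          ; load current count
mov [rbx+8+rcx*8], rax  ; store the packet pointer on top
add rcx, 1              ; one more packet on the list
mov [rbx], rcx          ; store the updated count
```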
I have been reading CSAPP3 and I can see some obviously bad patterns in my coding style, like a series of instructions where each depends on the result of the previous one, which probably sucks for ILP.
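Something of roughly this shape, as a hypothetical stand-in for the real snippet (registers are illustrative):

```asm
; each instruction needs the previous one's result, so the chain
; serializes in the pipeline and defeats instruction-level parallelism:
mov rax, [rbx]         ; load the list base pointer
mov rcx, [rax]         ; load the count        (waits on rax)
sub rcx, 1             ; decrement it          (waits on rcx)
mov rdx, [rax+8+rcx*8] ; load a packet pointer (waits on rcx)
mov [rax], rcx         ; store the count back  (waits on rcx)
```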
CPUs have changed a lot since the last time I wrote assembler code in anger (Amiga 500...) and so for now I am giving myself permission to suck :-).
... I will try to implement the Tee app to make the program equivalent to 'snabbmark basic1'.
See also The Unofficial Dynasm Documentation, which I only discovered recently. Really great! Note that it is based on the standard C-based dynasm, while Snabb Switch is now using the extended Lua-based dynasm.
The code is still really rough - and surely buggy - but now it implements the same topology as 'snabbmark basic1', with a Tee app duplicating packets.
The code is still extremely rough but I added a Tee app so that it implements the same topology as 'snabbmark basic1'. Each packet is allocated by a Source from a freelist and lightly initialized, transmitted to a Tee which makes a full copy and transmits on two output links, and on to a Sink that puts the packets back onto the freelist. This is roughly equivalent to the work done by the 'basic1' benchmark.

Current performance is around 23 cycles per packet. On Interlaken this is 107 Mpps, which is about 5x faster than the original 'basic1'.

Note: The reason we have not dived into optimizing these basic operations in the past is that in real applications the performance bottlenecks have been elsewhere, i.e. inside the main loops of the apps and not in the machinery that connects the apps together. So why is this interesting at all? Well: I do think that sooner or later we will be in the situation of wanting to squeeze maximum performance out of the app network. There are already some apps that bypass the app network for performance (packetblaster and firehose) and this may actually turn out to be unnecessary. Likely we will also want to write an app that dispatches a 100G traffic stream across different links in an app network, and it would be valuable to understand the maximum performance we could expect from that (and maybe this is an app that we really would want to write in JIT'd assembler). I also have a basic feeling that our machinery link …

Here is a cute feature of the code: copying a 64-byte packet using SIMD. This version uses 128-bit registers:
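```asm
; sketch, not the literal code: rsi = source packet data, rdi =
; destination (register choice assumed). Four 128-bit loads, four stores:
movdqu xmm0, [rsi]
movdqu xmm1, [rsi+16]
movdqu xmm2, [rsi+32]
movdqu xmm3, [rsi+48]
movdqu [rdi], xmm0
movdqu [rdi+16], xmm1
movdqu [rdi+32], xmm2
movdqu [rdi+48], xmm3
```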
and on AVX2 (Haswell) you would only need half as many instructions (pardon my fudged opcodes):
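```asm
; same copy with 256-bit registers: two loads, two stores.
vmovdqu ymm0, [rsi]
vmovdqu ymm1, [rsi+32]
vmovdqu [rdi], ymm0
vmovdqu [rdi+32], ymm1
```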
and on AVX512 (Skylake) you can actually fit a whole 64-byte packet in a register:
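```asm
; the whole 64-byte packet fits in one 512-bit register: one load, one store.
vmovdqu64 zmm0, [rsi]
vmovdqu64 [rdi], zmm0
```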
Funny to think about "packet copies" that will take only 2 instructions and execute in 1 cycle! (Of course not all packets are 64 bytes, but small packets are usually where performance hurts.)
Oh, and if you want to actually run it: the benchmark prints the link statistics and CPU performance counters after running 1 billion packets from the source through the app network.
Port the asm version of 'snabbmark basic1' back to Lua.

The assembler version was simplified in various ways, e.g. it does not have a real breathe loop and it assumes that links/freelists never overflow or underflow. This is a fairly faithful port of that simplified assembler version back to Lua. The purpose is to see, apples-to-apples, how fast the Lua code runs when you 'cheat' in the same ways as the assembler code.

Loosely the results seem to be (on Interlaken):

- original: 27 Mpps
- asm: 140 Mpps
- lua-asm: 62 Mpps
Just having a little more fun here: I ported the assembler code back to Lua with the same simplifications (i.e. no real breathe loop, etc). The notion is that the exercise of writing the code in assembler can help to write an efficient Lua version due to fresh mechanical sympathy. Here is how it runs: roughly 62 Mpps, versus 27 Mpps for the original and 140 Mpps for the asm version.
So more-or-less the Lua port of the asm code is twice as fast as the original basic1 benchmark, and then the actual asm code is twice as fast again. I added "per step" values that take the average number of cycles for each of the five "steps" in processing a packet. The app network looks like this:

    Source -> Tee -> Sink
                  -> Sink

and so the five processing "steps" are:

1. Source allocates a packet from the freelist and lightly initializes it.
2. Source transmits the packet to the Tee.
3. Tee makes a full copy of the packet.
4. Tee transmits on its two output links.
5. Sink returns the packets to the freelist.
4.6 cycles per step (i.e. 23 cycles per packet across the five steps) must be really approaching the limit of what is possible, particularly since one of the processing steps is making a copy of the packet (albeit with only 64 bytes of payload). Overall the impression I am forming is that: …
Nice way to get big, impressive numbers 😄. One thing that does stand out is that there's not a single …

As a comparison, the current SnS can get over 50 Mpps when doing a simple Source/Sink network. Of course, that's far fewer steps than this demo, but it includes a fully working link, a sizeable freelist, and copying of a 120-byte packet (yes, the same one over and over, so it's probably in cache all the time). I don't see the 'original' code that's being compared to the …
This is a problem and it could be that we need to be more disciplined in our use of metrics. Generally the term "Mpps" can refer to any code that is operating on packets, but we might want to consider reserving it for benchmarks that are actually performing I/O. Then we would need to find a different metric for tests like this one, and also for the multiprocessor tests that move packets between rings in memory rather than over physical ethernet links.

I also think we should move towards metrics based on cycles rather than seconds for synthetic benchmarks. Most benchmark scores are directly proportional to CPU frequency, and so if you tested with (say) both a 2 GHz and a 4 GHz CPU you should expect the same cycles/packet on both but a factor of 2 difference in Mpps (e.g. 23 cycles/packet is 2e9/23 ≈ 87 Mpps at 2 GHz but ≈ 174 Mpps at 4 GHz). To me this says that cycles/packet is the more useful metric for talking about the performance of a given bit of code. This is the thinking also that led me to …

I actually have a fantasy of making a benchmarking environment that will take any app and generate a "data sheet" that says how it performs under various configurations and workloads, and also how dependent it is on cache behavior, e.g. test separately with packets that fit in L2 cache, in L3 cache, or in DRAM. Just a notion at this stage though.
The reference code is running 'snabbmark basic1'. This is a standard benchmark of the app network with simple Source/Sink/Tee apps. SnabbBot CI runs this benchmark on every PR and will report a test failure if the score regresses. This means that when we optimize the app network we should see an improvement in the snabbmark score, and the CI will defend that improvement in the future.

Ideally I would like to have performance regression tests in place for all applications, and to only merge optimizations that can be shown to improve the benchmark scores. This is the optimization equivalent of requiring bug fixes to be submitted with a test case that shows the fix really works and ensures it keeps working in the future.
Closing PR: I am not actively developing this branch.