snabbmark: Add first step towards porting basic1 to asm [sketch] by lukego · Pull Request #603 · snabbco/snabb · GitHub


Closed
wants to merge 3 commits

Conversation

@lukego (Member) commented Aug 23, 2015

Super early sketch of the code! Shared just for fun.

The idea is simple: clone the 'basic1' benchmark in assembler to see what performance is possible. Then we could use this as a benchmark for deciding how optimized our Lua code is. Could be that in writing the assembler code we get some bright ideas for improving the Lua code too.

This is very rough code that implements:

source -> sink

and recycles packets via a freelist.
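The recycling scheme can be modeled in a few lines of C (a hypothetical sketch; the real code is dynasm and the names here are invented). A freelist is just a stack of packet pointers, so "allocate" and "free" are a pointer pop and push with no work in the happy path:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of the freelist -- the real code is dynasm and
   these names are invented for illustration. */
struct packet { unsigned char data[64]; };

struct freelist {
    size_t nfree;                  /* number of packets on the list */
    struct packet *slots[256];
};

static struct packet *freelist_get(struct freelist *fl) {
    return fl->slots[--fl->nfree]; /* assumes the list never underflows */
}

static void freelist_put(struct freelist *fl, struct packet *p) {
    fl->slots[fl->nfree++] = p;    /* assumes the list never overflows */
}
```

In this model a Source calls freelist_get() and a Sink calls freelist_put(); the never-overflow/underflow assumption is one of the corners the sketch cuts.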

The full implementation would have to implement the real basic1 topology i.e. include a "Tee" app that duplicates packets.

This is not optimized, etc, so don't draw any sweeping conclusions yet :-).

Note also that I did most of my assembler programming 20+ years ago on the MC68000 so it will take a little practice to find a decent style :).

@lukego (Member Author) commented Aug 23, 2015

Currently the performance I see with perf is 11 cycles per packet to run a Source and Sink app and recycle packets between them via freelist.

I have been reading CSAPP3, so I can now see some obviously bad patterns in my coding style, like this series of instructions where each one depends on the result of the previous one and probably sucks for ILP:

-- Write onto output link                                                                                                             
|  mov rax, output->write
|  add rax, rcx
|  and rax, 255
|  mov [output + rax * 8], packet
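In C terms the sequence above computes a masked ring index and then stores through it (a hypothetical model with invented names; the ring here has 256 slots, hence the mask with 255):

```c
#include <assert.h>
#include <stdint.h>

struct packet;  /* payload layout not needed for this sketch */

/* Hypothetical C model of the four instructions above. */
struct link {
    uint64_t write;
    struct packet *ring[256];
};

static void link_store(struct link *out, uint64_t offset, struct packet *p) {
    uint64_t i = (out->write + offset) & 255;  /* the mov/add/and chain */
    out->ring[i] = p;                          /* the final store */
}
```

The chain is serial because each step needs the previous result; hoisting the index arithmetic out of the store path (e.g. keeping the cursor in a register across iterations) is one way to give the CPU independent work to overlap.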

CPUs have changed a lot since the last time I wrote assembler code in anger (Amiga 500...) and so for now I am giving myself permission to suck :-).

@lukego (Member Author) commented Aug 23, 2015

... I will try to implement the Tee app to make the program equivalent to basic1 before getting carried away with tweaking the instructions.

@lukego changed the title from "snabbmark: Add first step towards porting basic1 to asm" to "snabbmark: Add first step towards porting basic1 to asm [sketch]" on Aug 23, 2015
@lukego (Member Author) commented Aug 23, 2015

See also The Unofficial DynASM Documentation, which I only discovered recently. Really great! Note that it is based on the standard C-based DynASM while Snabb Switch is now using the extended Lua-based DynASM.

The code is still really rough - and surely buggy - but now it
implements the same topology as the 'snabbmark basic1' with a Tee app
duplicating packets.
@lukego (Member Author) commented Aug 23, 2015

The code is still extremely rough but I added a Tee app so that it implements the same topology as snabbmark basic1.

Each packet is allocated by a Source from a freelist and lightly initialized, transmitted to a Tee that makes a full copy and transmits on two output links, and finally received by a Sink that puts the packets back onto the freelist. This is roughly equivalent to the work done by the snabbmark basic1 benchmark. I have cut some corners but hopefully not important ones.

Current performance is around 23 cycles per packet. On Interlaken this is 107 Mpps, which is about 5x faster than snabbmark basic1. I suspect that both the asm code and the Lua code could be optimized substantially.

Note: The reason we have not dived into optimizing these basic operations in the past is that in real applications the performance bottlenecks have been elsewhere i.e. inside the main loops of the apps and not in the machinery that connects the apps together.

So why is this interesting at all? Well: I do think that sooner or later we will want to squeeze maximum performance out of the app network. There are already some apps that bypass the app network for performance (packetblaster and firehose), and this may turn out to be unnecessary. Likely we will also want to write an app that dispatches a 100G traffic stream across different links in an app network, and it would be valuable to understand the maximum performance we could expect from that (maybe this is an app that we really would want to write in JIT'd assembler).

I also have a basic feeling that our machinery like link.transmit is not sufficiently minimalist, but I have not been able to quantify this either by measuring performance or by examining all of the instructions. However, a 5x potential performance gain would make it easier to motivate work like #570 and speeding up the basic1 benchmark.

Here is a cute feature of the code: copying a 64-byte packet using SIMD. This version uses 128-bit registers:

|  movdqu xmm0, [packet];    movdqu [packet2], xmm0
|  movdqu xmm0, [packet+16]; movdqu [packet2+16], xmm0
|  movdqu xmm0, [packet+32]; movdqu [packet2+32], xmm0
|  movdqu xmm0, [packet+48]; movdqu [packet2+48], xmm0

and on AVX2 (Haswell) you would only need half as many instructions:

|  vmovdqu ymm0, [packet];    vmovdqu [packet2], ymm0
|  vmovdqu ymm0, [packet+32]; vmovdqu [packet2+32], ymm0

and on AVX-512 (Skylake) you can actually fit a whole 64-byte packet in a single register:

|  vmovdqu64 zmm0, [packet];  vmovdqu64 [packet2], zmm0

Funny to think about "packet copies" that take only 2 instructions and execute in 1 cycle! (Of course not all packets are 64 bytes, but small packets are usually where performance hurts.)
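For reference, the 128-bit variant can also be written with C intrinsics (SSE2 is baseline on x86-64; the AVX2 and AVX-512 forms would use _mm256_loadu_si256 and _mm512_loadu_si512 respectively):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; baseline on x86-64 */

/* Copy a 64-byte packet with four unaligned 16-byte loads and stores,
   mirroring the movdqu sequence above. */
static void copy64(void *dst, const void *src) {
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    __m128i a = _mm_loadu_si128(s + 0);
    __m128i b = _mm_loadu_si128(s + 1);
    __m128i c = _mm_loadu_si128(s + 2);
    __m128i e = _mm_loadu_si128(s + 3);
    _mm_storeu_si128(d + 0, a);
    _mm_storeu_si128(d + 1, b);
    _mm_storeu_si128(d + 2, c);
    _mm_storeu_si128(d + 3, e);
}
```

Doing all four loads before the four stores gives the CPU independent work to overlap instead of a strict load/store ping-pong.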

@lukego (Member Author) commented Aug 23, 2015

Oh and if you want to actually run it then you can do this:

sudo perf stat ./snabb snsh -l program.snabbmark.snabbmark_asm

and it will print the link statistics and CPU performance counters after running 1 billion packets from the source through the app network:

tee1.read       1000000000ULL   tee1.write      1000000000ULL
tee2.read       1000000000ULL   tee2.write      1000000000ULL
fl  .read       2000000000ULL   fl  .write      2000000255ULL
s2t .read       0ULL    s2t .write      1000000000ULL
done

 Performance counter stats for './snabb snsh -l program.snabbmark.snabbmark_asm':

    64,488,027,290 instructions              #    2.90  insns per cycle
    22,243,300,039 cycles

       9.276217798 seconds time elapsed

Port the asm version of 'snabbmark basic1' back to Lua.

The assembler version was simplified in various ways e.g. it does not
have a real breathe loop and it assumes that links/freelists never
overflow or underflow.

This is a fairly faithful port of that simplified assembler version
back to Lua. The purpose is to see apples-to-apples how fast the Lua
code runs when you 'cheat' in the same ways as the assembler code.

Loosely the results seem to be (on Interlaken):

    original:  27 Mpps
    asm:      140 Mpps
    lua-asm:   62 Mpps
@lukego (Member Author) commented Sep 10, 2015

Just having a little more fun here: I ported the assembler code back to Lua with the same simplifications (i.e. no real breathe loop, etc). The notion is that the exercise of writing the code in assembler can help to write an efficient Lua version due to fresh mechanical sympathy.

Here is how it runs:

version   cycles/packet   cycles/step
asm            23              4.6
asmlua         51             10.2
basic1        116             23

So more-or-less the Lua port of the asm code is twice as fast as the original basic1 benchmark and then the actual asm code is twice as fast again.

I added "per step" values that take the average number of cycles for each of the five "steps" in processing a packet. The app network looks like this:

                  .------.     
    Source ---> Tee      Sink    
                  `------'

and so the five processing "steps" are:

  1. Source app allocates and transmits a packet.
  2. Tee app receives the packet and transmits a duplicate.
  3. Tee app transmits the original.
  4. Sink app receives the duplicate and frees.
  5. Sink app receives the original and frees.

4.6 cycles per step must be really approaching the limit of what is possible, particularly since one of the processing steps is making a copy of the packet (albeit with only 64 bytes of payload).

Overall the impression I am forming is that:

  1. The app network abstraction is fundamentally pretty cheap.
  2. The implementation we have can possibly be tuned to double or quadruple speed.
  3. We should remember this if/when we have an app that is limited by basic app network operations e.g. an app that splits up a 100Gbps traffic stream and has to perform link.transmit() at a ferocious rate.

@javierguerragiraldez (Contributor) commented

Nice way to get big, impressive numbers 😄.

One thing that does stand out is that there's not a single if..then in the whole Lua file! Of course, as the comment says, this code assumes that links never overflow nor underflow.

As a comparison, the current SnS can get over 50 Mpps when doing a simple Source/Sink network. Of course, that's many fewer steps than this demo, but it includes a fully working link, a sizeable freelist, and copying of a 120-byte packet (yes, the same one over and over, so it's probably in cache all the time).

I don't see the 'original' code that's being compared to the asm and lua-asm, but I guess it does two packet clones (one at the source and one at the tee), three link transmissions (source->tee and 2*tee->sink) and two allocations/frees (the original and the copy) for each source->tee packet. That is, slightly more than twice the steps, which correlates neatly with the 27Mpps you mention.

@lukego (Member Author) commented Sep 11, 2015

Nice way to get big, impressive numbers

This is a problem and it could be that we need to be more disciplined in our use of metrics.

Generally the term "Mpps" can refer to any code that is operating on packets but we might want to consider reserving this for benchmarks that are actually performing I/O. Then we would need to find a different metric for tests like this one and also the multiprocessor tests that are moving packets between rings in memory rather than physical ethernet links.

I also think we should move towards metrics based on cycles rather than seconds for synthetic benchmarks. Most benchmark scores are directly proportional to CPU frequency, so if you tested with (say) both a 2GHz and a 4GHz CPU then you should expect the same cycles/packet on both but a factor of 2 difference in Mpps. To me this says that cycles/packet is the more useful metric for talking about the performance of a given bit of code.
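The conversion behind this argument is just cycles/packet = clock frequency / packet rate. A toy sketch (the frequencies are illustrative, not measured):

```c
#include <assert.h>
#include <stdio.h>

/* cycles/packet = core clock (Hz) / packet rate (pps).  At a fixed
   cycles/packet cost the Mpps score scales with clock frequency. */
static double cycles_per_packet(double hz, double pps) {
    return hz / pps;
}

static double mpps(double hz, double cycles_per_pkt) {
    return hz / cycles_per_pkt / 1e6;
}
```

For example, code costing 116 cycles/packet (the basic1 figure above) would score about 17 Mpps at 2 GHz and about 34 Mpps at 4 GHz: double the Mpps, same cycles/packet.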

This is also the thinking that led me to cycles/step in the description above. That is supposed to give the average cost per app of processing a packet and be somewhat independent of how complex the app network is.

I actually have a fantasy of making a benchmarking environment that will take any app and generate a "data sheet" that says how it performs under various configurations and workloads and also how dependent it is on cache behavior e.g. test separately with packets that fit in L2 cache, in L3 cache, or in DRAM. Just a notion at this stage though.

@lukego (Member Author) commented Sep 11, 2015

I don't see the 'original' code that's being compared

The reference code is snabb snabbmark basic1 1e9:

$ sudo ./snabb snabbmark basic1 1e9

Processed 1000.1 million packets in 36.75 seconds (rate: 27.2 Mpps).

This is a standard benchmark of the app network with simple Source/Sink/Tee apps. SnabbBot CI is running this benchmark on every PR and will report a test failure if the score regresses. This means that when we optimize the app network we should see an improvement in the snabbmark score and the CI will defend this improvement in the future.

Ideally I would like to have performance regression tests in place for all applications and only merge optimizations that can be shown to improve the benchmark scores. This is the optimization equivalent to requiring bug fixes to be submitted with a test case to show that it really works and to ensure that it will keep working in the future.

@lukego (Member Author) commented Feb 7, 2016

Closing PR: I am not actively developing this branch.
