Experimental support for pico HTTP parser by kellabyte · Pull Request #71 · haywire/haywire · GitHub

Experimental support for pico HTTP parser #71

Open · kellabyte wants to merge 38 commits into master

Conversation

kellabyte (Collaborator)

@nmdguerreiro I would love your code review feedback on this because there are some flaws in my implementation.

pico HTTP parser is an SSE-accelerated HTTP parser. I believe it can gain us quite a lot of performance in Haywire. It's not a streaming parser like the http_parser Haywire already uses, but I've made it configurable so you can choose which one Haywire uses. There's also a CloudFlare fork of this parser that is AVX optimized. I don't want to require AVX or SSE, but since performance is Haywire's main goal, I want to get these integrated and see how fast we can go!
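
(For contrast with http_parser's callback-driven streaming API, here is a rough sketch of what a single pico parse call looks like; the wrapper function and the fixed header array size are illustrative, not the branch's actual code.)

#include <stddef.h>
#include <stdio.h>
#include "picohttpparser.h"

/* Parse one request from buf. Returns the number of bytes consumed,
 * -1 on a parse error, or -2 if the request is still incomplete. */
int parse_one_request(const char* buf, size_t len, size_t last_len)
{
    const char *method, *path;
    size_t method_len, path_len, num_headers = 32;
    int minor_version;
    struct phr_header headers[32];

    int nparsed = phr_parse_request(buf, len, &method, &method_len,
                                    &path, &path_len, &minor_version,
                                    headers, &num_headers, last_len);
    if (nparsed > 0)
        printf("%.*s %.*s HTTP/1.%d (%zu headers)\n",
               (int)method_len, method, (int)path_len, path,
               minor_version, num_headers);
    return nparsed;
}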

Keep in mind there are some hard-coded values and untidy bits in this branch so far because I'm experimenting. There's a major bug in my implementation, though, which is killing performance; I would love a code review to get it fixed so we can see the potential of pico in Haywire.

To run with pico, use: ./build/hello_world --parser pico

The big problem with my pico implementation is that it's losing requests over time. The longer the benchmark runs, the lower the requests/second throughput gets. If you run dstat and watch network throughput, you'll see it slow down over time until almost no responses get sent. I suspect a loop/buffer management issue is causing requests to stall and never get processed.
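
(For reference, a minimal sketch of what a correct consume loop over a buffer of pipelined requests might look like, building on the parse_one_request wrapper above; all names are illustrative, not the branch's actual code.)

#include <string.h>

/* Parse as many complete requests as the buffer holds and return how many
 * bytes remain unconsumed (a trailing partial request). The key invariants:
 * advance by exactly the bytes pico reports as consumed, and carry any
 * partial tail over to the next read instead of dropping it. */
size_t consume_pipelined(char* buffer, size_t buffer_len)
{
    size_t offset = 0;
    while (offset < buffer_len) {
        int nparsed = parse_one_request(buffer + offset, buffer_len - offset, 0);
        if (nparsed == -2)
            break;               /* incomplete: keep the tail for the next read */
        if (nparsed == -1)
            return 0;            /* parse error: respond 400 and close upstream */
        offset += (size_t)nparsed;
        /* dispatch the parsed request / queue its response here */
    }
    /* Slide any unconsumed tail to the front so the next read appends after it. */
    if (offset > 0 && offset < buffer_len)
        memmove(buffer, buffer + offset, buffer_len - offset);
    return buffer_len - offset;
}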

If you do a really short benchmark, you can see some amazing potential.

HTTP_PARSER

./build/hello_world --threads 20 --parser http_parser --batch 64

../bin/wrk/wrk --script ./pipelined_get.lua --latency -d 10s -t 20 -c 256 http://104.130.19.205:8000 -- 64
Running 10s test @ http://104.130.19.205:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.77ms   12.04ms  63.20ms   75.96%
    Req/Sec   147.27k    49.42k  348.40k    75.32%
  Latency Distribution
     50%    6.78ms
     75%   23.08ms
     90%   34.49ms
     99%    0.00us
  29456134 requests in 10.10s, 4.33GB read
Requests/sec: 2,916,682.82
Transfer/sec:    439.49MB

PICO

2 second run shows some great potential.

./build/hello_world --threads 20 --parser pico --batch 64

./bin/wrk/wrk --script ./pipelined_get.lua --latency -d 2s -t 20 -c 256 http://104.130.19.205:8000 -- 64
Running 2s test @ http://104.130.19.205:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   661.65us  488.02us   9.51ms   71.29%
    Req/Sec   386.06k    88.75k  486.79k    41.35%
  Latency Distribution
     50%  740.00us
     75%    1.27ms
     90%    0.00us
     99%    0.00us
  15980992 requests in 2.10s, 2.35GB read
Requests/sec: 7,610,895.00
Transfer/sec:      1.12GB

2 minute run shows how performance tanked over time.

./build/hello_world --threads 20 --parser pico --batch 64

../bin/wrk/wrk --script ./pipelined_get.lua --latency -d 2m -t 20 -c 256 http://104.130.19.205:8000 -- 64
Running 2m test @ http://104.130.19.205:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   401.47us  341.74us  16.19ms   88.10%
    Req/Sec   233.44k    71.54k  561.32k    73.52%
  Latency Distribution
     50%  262.00us
     75%  448.00us
     90%  812.00us
     99%    1.77ms
  141840704 requests in 2.00m, 20.87GB read
Requests/sec: 1,181,341.04
Transfer/sec:    178.01MB

dstat output during the pico run

dstat -n --output output.csv

-net/total-
 recv  send
5870B  274B
 120M  415M
 332M 1145M
 254M  877M
 210M  724M
 180M  622M
 150M  519M
 142M  489M
 127M  438M
 100M  343M
  90M  312M
  80M  277M
  77M  265M
  77M  266M
  77M  266M
  77M  266M
  76M  262M
  76M  263M
  79M  272M
  77M  265M
  71M  244M
  70M  242M
  70M  242M
  64M  221M
  60M  207M
  55M  191M
  52M  180M
  52M  181M
  52M  180M
  52M  180M
  52M  180M
-net/total-
 recv  send
  52M  180M
  52M  180M
  52M  180M
  52M  180M
  52M  180M
  52M  180M
  51M  175M
  52M  178M
  52M  180M
  52M  179M
  52M  180M
  52M  179M
  52M  180M
  52M  180M
  52M  180M
  52M  179M
  50M  174M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  173M
  50M  171M
  51M  175M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  47M  161M
  47M  164M
  48M  165M
  48M  166M
-net/total-
 recv  send
  48M  167M
  48M  166M
  48M  166M
  48M  166M
  48M  167M
  48M  167M
  48M  165M
  47M  164M
  47M  164M
  48M  165M
  48M  164M
  49M  168M
  48M  165M
  47M  163M
  48M  165M
  40M  139M
  30M  103M
  30M  103M
  30M  103M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  29M  102M
  29M  102M
  29M  102M
  29M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  103M
  29M  101M
  18M   61M
-net/total-
 recv  send
8112k   27M
8124k   27M
8084k   27M
9200k   31M
9855k   33M
9654k   33M
3973k   13M

@kellabyte (Collaborator, Author)

The problem may also exist in http_server.c:376. I'm not sure yet.

@nmdguerreiro (Contributor)

EDIT: The memory leak actually happens when you use --parser http_parser. Having read through the code a bit, it looks to me like http_request_on_url doesn't get called when the pico parser is used, which explains why that chunk of memory doesn't need to be freed with the pico parser.

I wonder if you're experiencing a lot of swapping towards the end of the test, due to a possible memory leak.

~~Could you re-run the test and keep an eye on the RSS used by the program? Without --batch, I see that the RSS is about 10MB, and it stays pretty constant throughout the test.~~

If I set --batch 64, I see memory usage growing continuously throughout the test.

I ran the benchmark for 2 minutes under Valgrind, without the pico parser, but with --batch 64.

Here's the most significant chunk of memory unfreed:

==6042== 956,531,520 bytes in 238 blocks are possibly lost in loss record 64 of 64
==6042==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==6042==    by 0x40AC3D: create_http_connection (http_server.c:116)
==6042==    by 0x40CEDF: connection_consumer_new_connection (connection_consumer.c:59)
==6042==    by 0x417ECF: uv__server_io (stream.c:551)
==6042==    by 0x41BA8C: uv__io_poll (linux-core.c:345)
==6042==    by 0x412B87: uv_run (core.c:351)
==6042==    by 0x40D062: connection_consumer_start (connection_consumer.c:114)
==6042==    by 0x4191C6: uv__thread_start (thread.c:49)
==6042==    by 0x4E3CDF4: start_thread (pthread_create.c:308)
==6042==    by 0x534F1AC: clone (clone.S:113)

@@ -84,7 +85,7 @@ void free_http_request(http_request* request)
free((char*)v);
});
kh_destroy(string_hashmap, request->headers);
free(request->url);
//free(request->url);
Contributor:
I think this is what's causing the memory leak I'm seeing. Uncommenting this line works while using http_parser (it causes a segfault when using pico though, as is to be expected).
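
(A minimal sketch of one way to make that cleanup safe for both parsers, assuming the request struct is zeroed on creation so the pico path simply never sets request->url: only free a non-NULL pointer and clear it afterwards. Fragment only; the rest of the function stays as in the diff above.)

void free_http_request(http_request* request)
{
    /* ... header hashmap cleanup as in the existing code ... */
    if (request->url != NULL) {   /* set by http_parser's on_url callback; left NULL by the pico path */
        free(request->url);
        request->url = NULL;
    }
    /* ... rest of the function unchanged ... */
}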

@nmdguerreiro (Contributor)

Using pico, I can't actually replicate the slowdown:

2 minute run

[root@localhost Haywire]# ../wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2m -t 20 -c 256 http://localhost:8000 -- 64
Running 2m test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.93ms   40.29ms 319.71ms   87.81%
    Req/Sec    18.56k     6.82k   69.12k    76.83%
  Latency Distribution
     50%   28.01ms
     75%   50.06ms
     90%   90.60ms
     99%  196.02ms
  43167552 requests in 2.00m, 6.35GB read
Requests/sec: 359429.62
Transfer/sec:     54.16MB

2 second run

[root@localhost Haywire]# ../wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2s -t 20 -c 256 http://localhost:8000 -- 64
Running 2s test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    43.86ms   41.78ms 272.30ms   83.20%
    Req/Sec    18.69k    10.19k   84.48k    67.64%
  Latency Distribution
     50%   31.05ms
     75%   59.69ms
     90%   99.43ms
     99%  194.97ms
  732800 requests in 2.10s, 110.42MB read
Requests/sec: 349596.47
Transfer/sec:     52.68MB

I'm running this on a VM with just one core dedicated to it, so I'm not sure if that influences the results. I'll try on a server machine with multiple cores.

@nmdguerreiro (Contributor)

And I'm getting similar results on EC2 as well:

2 second run

[ec2-user@ip-172-31-16-12 Haywire]$ ~/wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2s -t 20 -c 256 http://localhost:8000 -- 64
Running 2s test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.27ms    4.13ms  34.51ms   80.61%
    Req/Sec   163.11k    64.34k  339.39k    66.25%
  Latency Distribution
     50%    1.86ms
     75%    5.11ms
     90%   14.09ms
     99%    0.00us
  6629248 requests in 2.07s, 0.98GB read
Requests/sec: 3196717.86
Transfer/sec:    481.68MB

2 minute run

[ec2-user@ip-172-31-16-12 Haywire]$ ~/wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2m -t 20 -c 256 http://localhost:8000 -- 64
Running 2m test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.22ms    4.20ms  56.36ms   83.17%
    Req/Sec   162.47k    57.43k  438.30k    65.80%
  Latency Distribution
     50%    1.66ms
     75%    4.26ms
     90%   12.41ms
     99%    0.00us
  387537792 requests in 2.00m, 57.02GB read
Requests/sec: 3226901.50
Transfer/sec:    486.22MB

@kellabyte (Collaborator, Author)

So if I use a batch of -- 64 I see the problem; if I use 32 I don't, and it goes MUCH faster. I wonder if there's a buffer size issue somewhere that's DoS'ing the pipeline?

../bin/wrk/wrk --script ./pipelined_get.lua --latency -d 5m -t 40 -c 760 http://server:8000 -- 32

Running 5m test @ http://server:8000
  40 threads and 760 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.04ms    5.38ms 389.00ms   92.75%
    Req/Sec   233.38k    48.72k  458.99k    86.19%
  Latency Distribution
     50%    1.26ms
     75%    1.96ms
     90%    4.09ms
     99%    0.00us
  2781077938 requests in 5.00m, 409.23GB read
Requests/sec: 9,267,161.41
Transfer/sec:      1.36GB

@kellabyte (Collaborator, Author)

Yeah, that's the problem: if you have a single request coming in, its response time is bound to that timer. Another option is to push responses into a queue and have a thread pulling off the queue. We get into some complicated things there though, with spin locks and contention, etc.

@tobz (Contributor) commented Dec 8, 2015

Batching is an inherent trade-off of throughput over latency, right? Working from that, could it make sense to define a throughput mode vs latency mode?

I would think of it roughly as:

Throughput mode:

  • background timer that adjusts the batch size
  • metrics that track request rate / bandwidth / whatever is important
  • timer tracks previous batch size vs current batch size
    -- if current batch size improves metric of interest over previous, increase batch size
    -- otherwise, decrease batch size

Throughput mode might need some sort of simple math or a hard cap on the batch flush time to prevent runaway latency.

Latency mode:

  • background timer that flushes at a configurable interval

Latency mode still batches, but there's an upper bound on how long it waits (rough sketch below).

Dunno, just spitballing.
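
(A minimal sketch of the "latency mode" flush timer described above, using a libuv timer on the connection's event loop; send_batch, the batch struct, and the 1ms interval are placeholders, not Haywire code.)

#include <uv.h>

#define FLUSH_INTERVAL_MS 1          /* upper bound on how long a response sits in the batch */

typedef struct {
    uv_buf_t bufs[64];
    unsigned int count;
} response_batch;

static response_batch pending;       /* one per event loop in a real implementation */

static void send_batch(response_batch* batch);   /* writes the buffers and resets count (placeholder) */

static void on_flush_timer(uv_timer_t* handle)
{
    if (pending.count > 0)
        send_batch(&pending);        /* a lone request never waits longer than the interval */
}

static void start_flush_timer(uv_loop_t* loop, uv_timer_t* timer)
{
    uv_timer_init(loop, timer);
    uv_timer_start(timer, on_flush_timer, FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS);
}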

@kellabyte (Collaborator, Author)

It's a throughput vs latency tradeoff, yes, but how bad a tradeoff depends on your implementation. Consider a thread spin-waiting on the queue: as soon as an entry is pushed, it'll pick it up VERY quickly. But this isn't going to be easy to build. I'm wondering if there is an alternative.
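
(For the spin-waiting consumer idea, a minimal single-producer/single-consumer ring sketch using C11 atomics; purely illustrative, since a real version would need per-loop queues and some back-pressure and CPU-yielding strategy.)

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024  /* must be a power of two */

typedef struct {
    void* slots[RING_SIZE];
    atomic_size_t head;   /* advanced only by the producer (event loop thread) */
    atomic_size_t tail;   /* advanced only by the consumer (writer thread) */
} spsc_ring;

static int ring_push(spsc_ring* r, void* item)            /* producer side */
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return 0;                                          /* full */
    r->slots[head & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

static void* ring_pop_spin(spsc_ring* r)                   /* consumer side */
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    while (atomic_load_explicit(&r->head, memory_order_acquire) == tail)
        ;                                                  /* spin: picks up new work almost immediately */
    void* item = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}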

@@ -111,6 +129,7 @@ void free_http_connection(http_connection* connection)
free_http_request(connection->request);
}

free(connection->response_buffers);
Contributor:

@kellabyte unless I'm missing something, response buffers are allocated statically and not through malloc, so do they really need to be freed here?

Contributor:

Actually the buffer array doesn't need to be freed, but the base elements do.
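
(A sketch of what that would look like, assuming response_buffers is an array of uv_buf_t with heap-allocated bases, and assuming a count field name; neither is confirmed against Haywire's actual structs.)

/* Free each heap-allocated base, but not the statically allocated array itself.
 * response_buffer_count is an assumed field name for this sketch. */
for (unsigned int i = 0; i < connection->response_buffer_count; i++) {
    free(connection->response_buffers[i].base);
}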

@kellabyte (Collaborator, Author)

I actually think our keep-alive logic is wrong. We probably need to decide if we are going to be an HTTP/1.1 server (I think that is best), since 1.1 assumes keep-alive by default, I believe.

I don't think me setting keep_alive = 1 in the pico parser code makes sense. I probably threw that in there during prototyping.

@nmdguerreiro (Contributor)

Yeah, I think HTTP/1.1 is the way to go, considering it's been around for about 16 years now and that there are already quite a few HTTP/2.0 implementations. Would you want to open an issue for tracking HTTP/1.1 compliance? On the persistent connections topic, we probably need to support the Keep-Alive header too.
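
(A minimal sketch of that keep-alive decision on top of pico's phr_header, assuming a helper like this rather than anything already in Haywire: HTTP/1.1 stays persistent unless the client sends Connection: close, while HTTP/1.0 only persists on Connection: keep-alive.)

#include <string.h>
#include <strings.h>
#include "picohttpparser.h"

/* Case-insensitive, length-aware match of one header name/value pair. */
static int header_is(const struct phr_header* h, const char* name, const char* value)
{
    return h->name_len == strlen(name) && strncasecmp(h->name, name, h->name_len) == 0
        && h->value_len == strlen(value) && strncasecmp(h->value, value, h->value_len) == 0;
}

static int should_keep_alive(int minor_version, const struct phr_header* headers, size_t num_headers)
{
    for (size_t i = 0; i < num_headers; i++) {
        if (header_is(&headers[i], "Connection", "close"))
            return 0;
        if (header_is(&headers[i], "Connection", "keep-alive"))
            return 1;
    }
    return minor_version >= 1;   /* HTTP/1.1 defaults to persistent connections */
}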

@kellabyte (Collaborator, Author)

I've committed my changes so far but when I run a benchmark I'm still getting:

hello_world: src/unix/stream.c:1499: uv_read_start: Assertion `((stream)->io_watcher.fd) >= 0' failed.

Any thoughts @nmdguerreiro?

@@ -261,11 +286,93 @@ void http_stream_on_read_http_parser(uv_stream_t* tcp, ssize_t nread, const uv_b
/* uv_close((uv_handle_t*) &client->handle, http_stream_on_close); */
}
}
else if (nread == UV_EOF)
Contributor:

I'm a bit confused here. We're closing the connection in both else branches.

@nmdguerreiro (Contributor)

@kellabyte my only guess is that nread < 0 but there is some other error code being returned in that situation. We could close on nread < 0, but it would be interesting to know what's happening (if my guess is correct).
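
(A minimal sketch of that handling in a libuv read callback: any nread < 0, including UV_EOF, stops reading and closes, and unexpected codes get logged by name so EOF and a reset connection are distinguishable. on_close and the surrounding plumbing are placeholders, not Haywire's actual callbacks.)

#include <stdio.h>
#include <uv.h>

static void on_close(uv_handle_t* handle);   /* frees the connection (placeholder) */

static void on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf)
{
    if (nread < 0) {
        if (nread != UV_EOF)
            fprintf(stderr, "read error: %s\n", uv_err_name((int)nread));
        uv_read_stop(stream);
        uv_close((uv_handle_t*)stream, on_close);
        return;
    }
    /* nread == 0 is a transient "no data" case; nread > 0 is data to parse */
}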

@kellabyte (Collaborator, Author)

So I added printf("%d ", nread); and saw the following: lots of -4095 followed by a couple of -104.

4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -104 -4095 -4095 -4095 -104 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095
hello_world: src/unix/stream.c:1499: uv_read_start: Assertion `((stream)->io_watcher.fd) >= 0' failed.
hello_world: src/unix/stream.c:1499: uv_read_start: Assertion `((stream)->io_watcher.fd) >= 0' failed.

I've narrowed it down to https://github.com/kellabyte/Haywire/blob/picohttpparser/src/haywire/connection_consumer.c#L67 which sometimes returns rc=-11 instead of 0. Why does this happen when pico is enabled but not when http_parser is enabled? This is very strange. What change of mine caused this?

If I only start reading when rc==0, the fd assertions stop happening, but there are 10 million bad HTTP responses. Something weird is going on.
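
(For this kind of debugging, a tiny helper that prints libuv result codes by name instead of raw integers can save some guesswork; the function name here is made up.)

#include <stdio.h>
#include <uv.h>

static void log_uv_result(const char* where, int rc)
{
    if (rc < 0)
        fprintf(stderr, "%s: %s (%s)\n", where, uv_err_name(rc), uv_strerror(rc));
}

/* e.g. log_uv_result("uv_read_start", rc); */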

@nmdguerreiro (Contributor)

I can't find that -104 return code in uv-errno.h, but errno.h defines ECONNRESET as 104, so maybe that's what's happening, after a few attempted reads from a connection that has been closed by the remote end (which I suppose in this case is actually the connection dispatcher, no?).

Should we just close on rc < 0?

@kellabyte (Collaborator, Author)

Hmm, perhaps. The weird thing is that this only happens with pico, not http_parser, it seems. I may have surfaced 2 issues:

  1. We don't handle rc < 0 properly.
  2. Somehow we are causing rc < 0 with the pico parser (potentially).

@kellabyte (Collaborator, Author)

I changed the url variable to be a hw_string and I think I've fucked something up somewhere. Notice all the non-200 errors.

Running 10s test @ http://server:8000
  40 threads and 760 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.38ms    8.69ms 225.45ms   88.63%
    Req/Sec   152.44k    56.00k  387.70k    74.30%
  Latency Distribution
     50%    2.85ms
     75%    8.00ms
     90%   16.22ms
     99%   40.87ms
  9838320 requests in 10.08s, 1.44GB read
  Socket errors: connect 0, read 27299, write 0, timeout 0
  Non-2xx or 3xx responses: 9517609
Requests/sec: 976424.55
Transfer/sec:    146.04MB

@nmdguerreiro (Contributor)

I can't replicate the non-2xx codes you got above 😕

I just ran the benchmark repeatedly without an issue (other than a couple of timeouts):

[ec2-user@ip-172-31-31-0 ~]$ ~/wrk/wrk --script pipelined_get.lua --latency -d 30s -t 200 -c 512 http://172.31.16.12:8000 -- 64
Running 30s test @ http://172.31.16.12:8000
  200 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    78.50ms  157.88ms   1.91s    86.36%
    Req/Sec     8.51k     4.13k   72.32k    87.69%
  Latency Distribution
     50%    8.30ms
     75%   65.23ms
     90%  306.69ms
     99%  885.95ms
  48717856 requests in 30.10s, 7.17GB read
  Socket errors: connect 0, read 0, write 0, timeout 13
Requests/sec: 1618658.94
Transfer/sec:    243.90MB

@kellabyte (Collaborator, Author)

Are you running on a Mac or Linux? I seem to find the problem on Linux.

Now that I've switched the url from char* to hw_string, I actually think the issue is here: https://github.com/kellabyte/Haywire/blob/picohttpparser/src/haywire/route_compare_method.c
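
(A likely pitfall with that switch: hw_string values aren't NUL-terminated, so strcmp-style route matching reads past the path. A length-aware compare is needed; this sketch assumes value/length field names, which may not match Haywire's actual definition.)

#include <string.h>

static int hw_route_matches(const hw_string* url, const char* route)
{
    size_t route_len = strlen(route);
    return url->length == route_len && memcmp(url->value, route, route_len) == 0;
}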

@nmdguerreiro (Contributor)

I tried both and those results are from Amazon Linux.

@nmdguerreiro (Contributor)

@kellabyte still having issues?

We do static matching currently so might as well simplify this for now until `pathfinder` is implemented.