Experimental support for pico HTTP parser by kellabyte · Pull Request #71 · haywire/haywire · GitHub

Experimental support for pico HTTP parser #71

Open · kellabyte wants to merge 38 commits into master

Conversation

kellabyte (Collaborator)

@nmdguerreiro I would love your code review feedback on this because there are some flaws in my implementation.

pico HTTP parser is an SSE-accelerated HTTP parser. I believe it can gain us quite a lot of performance in Haywire. It's not a streaming parser like the http_parser Haywire already uses, but I've made it configurable so you can choose which one Haywire uses. There's also a CloudFlare fork of this parser that is AVX optimized. I don't want to require AVX or SSE, but since performance is Haywire's main goal, I want to get these integrated and see how fast we can go!
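
(For contrast with http_parser's callback-driven streaming API, here is a rough sketch of what a single pico parse call looks like; the wrapper function and the fixed header array size are illustrative, not the branch's actual code.)

#include <stddef.h>
#include <stdio.h>
#include "picohttpparser.h"

/* Parse one request from buf. Returns the number of bytes consumed,
 * -1 on a parse error, or -2 if the request is still incomplete. */
int parse_one_request(const char* buf, size_t len, size_t last_len)
{
    const char *method, *path;
    size_t method_len, path_len, num_headers = 32;
    int minor_version;
    struct phr_header headers[32];

    int nparsed = phr_parse_request(buf, len, &method, &method_len,
                                    &path, &path_len, &minor_version,
                                    headers, &num_headers, last_len);
    if (nparsed > 0)
        printf("%.*s %.*s HTTP/1.%d (%zu headers)\n",
               (int)method_len, method, (int)path_len, path,
               minor_version, num_headers);
    return nparsed;
}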

Keep in mind there are some hard-coded values and untidy bits in this branch so far because I'm experimenting. There's a major bug in my implementation, though, which is killing performance; I would love a code review to get it fixed so we can see the potential of pico in Haywire.

To run with pico, use: ./build/hello_world --parser pico

The big problem with my pico implementation is that it's losing requests over time. The longer the benchmark runs, the lower the requests/second throughput gets. If you run dstat and watch network throughput, you'll see it slow down over time until almost no responses get sent. I suspect a loop/buffer management issue is causing requests to stall and never get processed.
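
(For reference, a minimal sketch of what a correct consume loop over a buffer of pipelined requests might look like, building on the parse_one_request wrapper above; all names are illustrative, not the branch's actual code.)

#include <string.h>

/* Parse as many complete requests as the buffer holds and return how many
 * bytes remain unconsumed (a trailing partial request). The key invariants:
 * advance by exactly the bytes pico reports as consumed, and carry any
 * partial tail over to the next read instead of dropping it. */
size_t consume_pipelined(char* buffer, size_t buffer_len)
{
    size_t offset = 0;
    while (offset < buffer_len) {
        int nparsed = parse_one_request(buffer + offset, buffer_len - offset, 0);
        if (nparsed == -2)
            break;               /* incomplete: keep the tail for the next read */
        if (nparsed == -1)
            return 0;            /* parse error: respond 400 and close upstream */
        offset += (size_t)nparsed;
        /* dispatch the parsed request / queue its response here */
    }
    /* Slide any unconsumed tail to the front so the next read appends after it. */
    if (offset > 0 && offset < buffer_len)
        memmove(buffer, buffer + offset, buffer_len - offset);
    return buffer_len - offset;
}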

If you do a really short benchmark, you can see some amazing potential.

HTTP_PARSER

./build/hello_world --threads 20 --parser http_parser --batch 64

../bin/wrk/wrk --script ./pipelined_get.lua --latency -d 10s -t 20 -c 256 http://104.130.19.205:8000 -- 64
Running 10s test @ http://104.130.19.205:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.77ms   12.04ms  63.20ms   75.96%
    Req/Sec   147.27k    49.42k  348.40k    75.32%
  Latency Distribution
     50%    6.78ms
     75%   23.08ms
     90%   34.49ms
     99%    0.00us
  29456134 requests in 10.10s, 4.33GB read
Requests/sec: 2,916,682.82
Transfer/sec:    439.49MB

PICO

2 second run shows some great potential.

./build/hello_world --threads 20 --parser pico --batch 64

./bin/wrk/wrk --script ./pipelined_get.lua --latency -d 2s -t 20 -c 256 http://104.130.19.205:8000 -- 64
Running 2s test @ http://104.130.19.205:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   661.65us  488.02us   9.51ms   71.29%
    Req/Sec   386.06k    88.75k  486.79k    41.35%
  Latency Distribution
     50%  740.00us
     75%    1.27ms
     90%    0.00us
     99%    0.00us
  15980992 requests in 2.10s, 2.35GB read
Requests/sec: 7,610,895.00
Transfer/sec:      1.12GB

2 minute run shows how performance tanked over time.

./build/hello_world --threads 20 --parser pico --batch 64

../bin/wrk/wrk --script ./pipelined_get.lua --latency -d 2m -t 20 -c 256 http://104.130.19.205:8000 -- 64
Running 2m test @ http://104.130.19.205:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   401.47us  341.74us  16.19ms   88.10%
    Req/Sec   233.44k    71.54k  561.32k    73.52%
  Latency Distribution
     50%  262.00us
     75%  448.00us
     90%  812.00us
     99%    1.77ms
  141840704 requests in 2.00m, 20.87GB read
Requests/sec: 1,181,341.04
Transfer/sec:    178.01MB

dstat output during the pico run

dstat -n --output output.csv

-net/total-
 recv  send
5870B  274B
 120M  415M
 332M 1145M
 254M  877M
 210M  724M
 180M  622M
 150M  519M
 142M  489M
 127M  438M
 100M  343M
  90M  312M
  80M  277M
  77M  265M
  77M  266M
  77M  266M
  77M  266M
  76M  262M
  76M  263M
  79M  272M
  77M  265M
  71M  244M
  70M  242M
  70M  242M
  64M  221M
  60M  207M
  55M  191M
  52M  180M
  52M  181M
  52M  180M
  52M  180M
  52M  180M
-net/total-
 recv  send
  52M  180M
  52M  180M
  52M  180M
  52M  180M
  52M  180M
  52M  180M
  51M  175M
  52M  178M
  52M  180M
  52M  179M
  52M  180M
  52M  179M
  52M  180M
  52M  180M
  52M  180M
  52M  179M
  50M  174M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  172M
  50M  173M
  50M  171M
  51M  175M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  51M  176M
  47M  161M
  47M  164M
  48M  165M
  48M  166M
-net/total-
 recv  send
  48M  167M
  48M  166M
  48M  166M
  48M  166M
  48M  167M
  48M  167M
  48M  165M
  47M  164M
  47M  164M
  48M  165M
  48M  164M
  49M  168M
  48M  165M
  47M  163M
  48M  165M
  40M  139M
  30M  103M
  30M  103M
  30M  103M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  29M  102M
  29M  102M
  29M  102M
  29M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  102M
  30M  103M
  29M  101M
  18M   61M
-net/total-
 recv  send
8112k   27M
8124k   27M
8084k   27M
9200k   31M
9855k   33M
9654k   33M
3973k   13M

@kellabyte (Collaborator, Author)

The problem may also exist in http_server.c:376. I'm not sure yet.

@nmdguerreiro (Contributor)

EDIT: The memory leak actually happens when you use --parser http_parser. Having read through the code a bit, it looks to me like http_request_on_url doesn't get called when the pico parser is used, which explains why that chunk of memory doesn't need to be freed with the pico parser.

I wonder if you're experiencing a lot of swapping towards the end of the test, due to a possible memory leak.

~~Could you re-run the test and keep an eye on the RSS used by the program? Without --batch, I see that the RSS is about 10MB, and it stays pretty constant throughout the test.~~

If I set --batch 64, I see memory usage growing continuously throughout the test.

I ran the benchmark for 2 minutes under Valgrind, without the pico parser, but with --batch 64.

Here's the most significant chunk of memory unfreed:

==6042== 956,531,520 bytes in 238 blocks are possibly lost in loss record 64 of 64
==6042==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==6042==    by 0x40AC3D: create_http_connection (http_server.c:116)
==6042==    by 0x40CEDF: connection_consumer_new_connection (connection_consumer.c:59)
==6042==    by 0x417ECF: uv__server_io (stream.c:551)
==6042==    by 0x41BA8C: uv__io_poll (linux-core.c:345)
==6042==    by 0x412B87: uv_run (core.c:351)
==6042==    by 0x40D062: connection_consumer_start (connection_consumer.c:114)
==6042==    by 0x4191C6: uv__thread_start (thread.c:49)
==6042==    by 0x4E3CDF4: start_thread (pthread_create.c:308)
==6042==    by 0x534F1AC: clone (clone.S:113)

@@ -84,7 +85,7 @@ void free_http_request(http_request* request)
free((char*)v);
});
kh_destroy(string_hashmap, request->headers);
free(request->url);
//free(request->url);
Contributor:
I think this is what's causing the memory leak I'm seeing. Uncommenting this line works while using http_parser (it causes a segfault when using pico though, as is to be expected).
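
(A minimal sketch of one way to make that cleanup safe for both parsers, assuming the request struct is zeroed on creation so the pico path simply never sets request->url: only free a non-NULL pointer and clear it afterwards. Fragment only; the rest of the function stays as in the diff above.)

void free_http_request(http_request* request)
{
    /* ... header hashmap cleanup as in the existing code ... */
    if (request->url != NULL) {   /* set by http_parser's on_url callback; left NULL by the pico path */
        free(request->url);
        request->url = NULL;
    }
    /* ... rest of the function unchanged ... */
}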

@nmdguerreiro (Contributor)

Using pico, I can't actually replicate the slowdown:

2 minute run

[root@localhost Haywire]# ../wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2m -t 20 -c 256 http://localhost:8000 -- 64
Running 2m test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.93ms   40.29ms 319.71ms   87.81%
    Req/Sec    18.56k     6.82k   69.12k    76.83%
  Latency Distribution
     50%   28.01ms
     75%   50.06ms
     90%   90.60ms
     99%  196.02ms
  43167552 requests in 2.00m, 6.35GB read
Requests/sec: 359429.62
Transfer/sec:     54.16MB

2 second run

[root@localhost Haywire]# ../wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2s -t 20 -c 256 http://localhost:8000 -- 64
Running 2s test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    43.86ms   41.78ms 272.30ms   83.20%
    Req/Sec    18.69k    10.19k   84.48k    67.64%
  Latency Distribution
     50%   31.05ms
     75%   59.69ms
     90%   99.43ms
     99%  194.97ms
  732800 requests in 2.10s, 110.42MB read
Requests/sec: 349596.47
Transfer/sec:     52.68MB

I'm running this on a VM with just one core dedicated to it, so I'm not sure if that influences the results. I'll try on a server machine with multiple cores.

@nmdguerreiro (Contributor)

And I'm getting similar results on EC2 as well:

2 second run

[ec2-user@ip-172-31-16-12 Haywire]$ ~/wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2s -t 20 -c 256 http://localhost:8000 -- 64
Running 2s test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.27ms    4.13ms  34.51ms   80.61%
    Req/Sec   163.11k    64.34k  339.39k    66.25%
  Latency Distribution
     50%    1.86ms
     75%    5.11ms
     90%   14.09ms
     99%    0.00us
  6629248 requests in 2.07s, 0.98GB read
Requests/sec: 3196717.86
Transfer/sec:    481.68MB

2 minute run

[ec2-user@ip-172-31-16-12 Haywire]$ ~/wrk/wrk --script ./benchmark/pipelined_get.lua --latency -d 2m -t 20 -c 256 http://localhost:8000 -- 64
Running 2m test @ http://localhost:8000
  20 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.22ms    4.20ms  56.36ms   83.17%
    Req/Sec   162.47k    57.43k  438.30k    65.80%
  Latency Distribution
     50%    1.66ms
     75%    4.26ms
     90%   12.41ms
     99%    0.00us
  387537792 requests in 2.00m, 57.02GB read
Requests/sec: 3226901.50
Transfer/sec:    486.22MB

@kellabyte (Collaborator, Author)

So if I use a batch of -- 64 I see the problem; if I use 32 I don't, and it goes MUCH faster. I wonder if there's a buffer size issue somewhere that's DoS'ing the pipeline?

../bin/wrk/wrk --script ./pipelined_get.lua --latency -d 5m -t 40 -c 760 http://server:8000 -- 32

Running 5m test @ http://server:8000
  40 threads and 760 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.04ms    5.38ms 389.00ms   92.75%
    Req/Sec   233.38k    48.72k  458.99k    86.19%
  Latency Distribution
     50%    1.26ms
     75%    1.96ms
     90%    4.09ms
     99%    0.00us
  2781077938 requests in 5.00m, 409.23GB read
Requests/sec: 9,267,161.41
Transfer/sec:      1.36GB

@kellabyte (Collaborator, Author)

Yeah, that's the problem: if you have a single request coming in, its response time is bound to that timer. Another option is to push responses into a queue and have a thread pulling off the queue. We get into some complicated things there though, with spin locks and contention, etc.

@tobz (Contributor) commented Dec 8, 2015

Batching is an inherent trade-off of throughput over latency, right? Working from that, could it make sense to define a throughput mode vs latency mode?

I would think of it roughly as:

Throughput mode:

  • background timer that adjusts the batch size
  • metrics that track request rate / bandwidth / whatever is important
  • timer tracks previous batch size vs current batch size
    -- if current batch size improves metric of interest over previous, increase batch size
    -- otherwise, decrease batch size

Throughput mode might need some sort of simple math or a hard cap on the batch flush time to prevent runaway latency.

Latency mode:

  • background timer that flushes at a configurable interval

Latency mode still batches, but there's an upper bound on how long it waits (rough sketch below).

Dunno, just spitballing.
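
(A minimal sketch of the "latency mode" flush timer described above, using a libuv timer on the connection's event loop; send_batch, the batch struct, and the 1ms interval are placeholders, not Haywire code.)

#include <uv.h>

#define FLUSH_INTERVAL_MS 1          /* upper bound on how long a response sits in the batch */

typedef struct {
    uv_buf_t bufs[64];
    unsigned int count;
} response_batch;

static response_batch pending;       /* one per event loop in a real implementation */

static void send_batch(response_batch* batch);   /* writes the buffers and resets count (placeholder) */

static void on_flush_timer(uv_timer_t* handle)
{
    if (pending.count > 0)
        send_batch(&pending);        /* a lone request never waits longer than the interval */
}

static void start_flush_timer(uv_loop_t* loop, uv_timer_t* timer)
{
    uv_timer_init(loop, timer);
    uv_timer_start(timer, on_flush_timer, FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS);
}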

@kellabyte (Collaborator, Author)

It's a throughput vs latency tradeoff, yes, but how bad a tradeoff depends on your implementation. Consider a thread spin-waiting on the queue: as soon as an entry is pushed, it'll pick it up VERY quickly. But this isn't going to be easy to build. I'm wondering if there is an alternative.
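
(For the spin-waiting consumer idea, a minimal single-producer/single-consumer ring sketch using C11 atomics; purely illustrative, since a real version would need per-loop queues and some back-pressure and CPU-yielding strategy.)

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024  /* must be a power of two */

typedef struct {
    void* slots[RING_SIZE];
    atomic_size_t head;   /* advanced only by the producer (event loop thread) */
    atomic_size_t tail;   /* advanced only by the consumer (writer thread) */
} spsc_ring;

static int ring_push(spsc_ring* r, void* item)            /* producer side */
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return 0;                                          /* full */
    r->slots[head & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

static void* ring_pop_spin(spsc_ring* r)                   /* consumer side */
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    while (atomic_load_explicit(&r->head, memory_order_acquire) == tail)
        ;                                                  /* spin: picks up new work almost immediately */
    void* item = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}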

@@ -111,6 +129,7 @@ void free_http_connection(http_connection* connection)
free_http_request(connection->request);
}

free(connection->response_buffers);
Contributor:

@kellabyte unless I'm missing something, response buffers are allocated statically and not through malloc, so do they really need to be freed here?

Contributor:

Actually the buffer array doesn't need to be freed, but the base elements do.
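
(A sketch of what that would look like, assuming response_buffers is an array of uv_buf_t with heap-allocated bases, and assuming a count field name; neither is confirmed against Haywire's actual structs.)

/* Free each heap-allocated base, but not the statically allocated array itself.
 * response_buffer_count is an assumed field name for this sketch. */
for (unsigned int i = 0; i < connection->response_buffer_count; i++) {
    free(connection->response_buffers[i].base);
}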

@kellabyte (Collaborator, Author)

I actually think our keep-alive logic is wrong. We probably need to decide if we are going to be an HTTP/1.1 server (I think that is best), since 1.1 assumes keep-alive by default, I believe.

I don't think me setting keep_alive = 1 in the pico parser code makes sense. I probably threw that in there during prototyping.

@nmdguerreiro (Contributor)

Yeah, I think HTTP/1.1 is the way to go, considering it's been around for about 16 years now and that there are already quite a few HTTP/2.0 implementations. Would you want to open an issue for tracking HTTP/1.1 compliance? On the persistent connections topic, we probably need to support the Keep-Alive header too.
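
(A minimal sketch of that keep-alive decision on top of pico's phr_header, assuming a helper like this rather than anything already in Haywire: HTTP/1.1 stays persistent unless the client sends Connection: close, while HTTP/1.0 only persists on Connection: keep-alive.)

#include <string.h>
#include <strings.h>
#include "picohttpparser.h"

/* Case-insensitive, length-aware match of one header name/value pair. */
static int header_is(const struct phr_header* h, const char* name, const char* value)
{
    return h->name_len == strlen(name) && strncasecmp(h->name, name, h->name_len) == 0
        && h->value_len == strlen(value) && strncasecmp(h->value, value, h->value_len) == 0;
}

static int should_keep_alive(int minor_version, const struct phr_header* headers, size_t num_headers)
{
    for (size_t i = 0; i < num_headers; i++) {
        if (header_is(&headers[i], "Connection", "close"))
            return 0;
        if (header_is(&headers[i], "Connection", "keep-alive"))
            return 1;
    }
    return minor_version >= 1;   /* HTTP/1.1 defaults to persistent connections */
}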

@kellabyte (Collaborator, Author)

I've committed my changes so far but when I run a benchmark I'm still getting:

hello_world: src/unix/stream.c:1499: uv_read_start: Assertion `((stream)->io_watcher.fd) >= 0' failed.

Any thoughts @nmdguerreiro?

@@ -261,11 +286,93 @@ void http_stream_on_read_http_parser(uv_stream_t* tcp, ssize_t nread, const uv_b
/* uv_close((uv_handle_t*) &client->handle, http_stream_on_close); */
}
}
else if (nread == UV_EOF)
Contributor:

I'm a bit confused here. We're closing the connection in both else branches.

@nmdguerreiro (Contributor)

@kellabyte my only guess is that nread < 0 but there is some other error code being returned in that situation. We could close on nread < 0, but it would be interesting to know what's happening (if my guess is correct).
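
(A minimal sketch of that handling in a libuv read callback: any nread < 0, including UV_EOF, stops reading and closes, and unexpected codes get logged by name so EOF and a reset connection are distinguishable. on_close and the surrounding plumbing are placeholders, not Haywire's actual callbacks.)

#include <stdio.h>
#include <uv.h>

static void on_close(uv_handle_t* handle);   /* frees the connection (placeholder) */

static void on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf)
{
    if (nread < 0) {
        if (nread != UV_EOF)
            fprintf(stderr, "read error: %s\n", uv_err_name((int)nread));
        uv_read_stop(stream);
        uv_close((uv_handle_t*)stream, on_close);
        return;
    }
    /* nread == 0 is a transient "no data" case; nread > 0 is data to parse */
}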

@kellabyte (Collaborator, Author)

So I added printf("%d ", nread); and saw the following: lots of -4095 followed by a couple of -104.

4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -104 -4095 -4095 -4095 -104 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095 -4095
hello_world: src/unix/stream.c:1499: uv_read_start: Assertion `((stream)->io_watcher.fd) >= 0' failed.
hello_world: src/unix/stream.c:1499: uv_read_start: Assertion `((stream)->io_watcher.fd) >= 0' failed.

I've narrowed it down to https://github.com/kellabyte/Haywire/blob/picohttpparser/src/haywire/connection_consumer.c#L67 which sometimes returns rc=-11 instead of 0. Why does this happen when pico is enabled but not when http_parser is enabled? This is very strange. What change of mine caused this?

If I only start reading when rc==0, the fd assertions stop happening, but there are 10 million bad HTTP responses. Something weird is going on.
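
(For this kind of debugging, a tiny helper that prints libuv result codes by name instead of raw integers can save some guesswork; the function name here is made up.)

#include <stdio.h>
#include <uv.h>

static void log_uv_result(const char* where, int rc)
{
    if (rc < 0)
        fprintf(stderr, "%s: %s (%s)\n", where, uv_err_name(rc), uv_strerror(rc));
}

/* e.g. log_uv_result("uv_read_start", rc); */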

@nmdguerreiro (Contributor)

I can't find that -104 return code in uv-errno.h, but errno.h defines ECONNRESET as 104, so maybe that's what's happening, after a few attempted reads from a connection that has been closed by the remote end (which I suppose in this case is actually the connection dispatcher, no?).

Should we just close on rc < 0?

@kellabyte (Collaborator, Author)

Hmm, perhaps. The weird thing is that this only happens with pico, not http_parser, it seems. I may have surfaced 2 issues:

  1. We don't handle rc < 0 properly.
  2. Somehow we are causing rc < 0 with the pico parser (potentially).

@kellabyte (Collaborator, Author)

I changed the url variable to be a hw_string and I think I've fucked something up somewhere. Notice all the non-200 errors.

Running 10s test @ http://server:8000
  40 threads and 760 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.38ms    8.69ms 225.45ms   88.63%
    Req/Sec   152.44k    56.00k  387.70k    74.30%
  Latency Distribution
     50%    2.85ms
     75%    8.00ms
     90%   16.22ms
     99%   40.87ms
  9838320 requests in 10.08s, 1.44GB read
  Socket errors: connect 0, read 27299, write 0, timeout 0
  Non-2xx or 3xx responses: 9517609
Requests/sec: 976424.55
Transfer/sec:    146.04MB

@nmdguerreiro (Contributor)

I can't replicate the non-2xx codes you got above 😕

I just ran the benchmark repeatedly without an issue (other than a couple of timeouts):

[ec2-user@ip-172-31-31-0 ~]$ ~/wrk/wrk --script pipelined_get.lua --latency -d 30s -t 200 -c 512 http://172.31.16.12:8000 -- 64
Running 30s test @ http://172.31.16.12:8000
  200 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    78.50ms  157.88ms   1.91s    86.36%
    Req/Sec     8.51k     4.13k   72.32k    87.69%
  Latency Distribution
     50%    8.30ms
     75%   65.23ms
     90%  306.69ms
     99%  885.95ms
  48717856 requests in 30.10s, 7.17GB read
  Socket errors: connect 0, read 0, write 0, timeout 13
Requests/sec: 1618658.94
Transfer/sec:    243.90MB

@kellabyte (Collaborator, Author)

Are you running on a Mac or Linux? I seem to find the problem on Linux.

Now that I've switched the url from char* to hw_string, I actually think the issue is here: https://github.com/kellabyte/Haywire/blob/picohttpparser/src/haywire/route_compare_method.c
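
(A likely pitfall with that switch: hw_string values aren't NUL-terminated, so strcmp-style route matching reads past the path. A length-aware compare is needed; this sketch assumes value/length field names, which may not match Haywire's actual definition.)

#include <string.h>

static int hw_route_matches(const hw_string* url, const char* route)
{
    size_t route_len = strlen(route);
    return url->length == route_len && memcmp(url->value, route, route_len) == 0;
}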

@nmdguerreiro (Contributor)

I tried both and those results are from Amazon Linux.

@nmdguerreiro (Contributor)

@kellabyte still having issues?

We do static matching currently so might as well simplify this for now until `pathfinder` is implemented.