
gRPC call error data propagation broken with HAProxy as a reverse-proxy. #1219


Open
mgetka opened this issue Apr 9, 2021 · 16 comments
Labels
2.6 This issue affects the HAProxy 2.6 stable branch. 2.8 This issue affects the HAProxy 2.8 stable branch. 3.0 This issue affects the HAProxy 3.0 stable branch. status: fixed This issue is a now-fixed bug. type: bug This issue describes a bug.

Comments

@mgetka
mgetka commented Apr 9, 2021

Detailed description of the problem

If the gRPC server application (served from behind HAProxy) immediately (i.e. before sending any response messages) aborts a response-streaming call with a specific error, then the client application will receive a CANCELLED error instead of the one that was sent. The error details will be Received RST_STREAM with error code 8.

Expected behavior

The error data produced by the gRPC server reaches the client unchanged in terms of the gRPC protocol.

Steps to reproduce the behavior

I have prepared a repository with a minimal example containing a simple Python client & server. It also contains a docker-compose setup with a network environment including a configured HAProxy.

Do you have any idea what may have caused this?

To terminate the call, the gRPC server sends a RST_STREAM with error code NO_ERROR, and the client application expects that error code. HAProxy rewrites the error code to CANCEL, which breaks the client logic. The rewriting is done in the h2_do_shutr function of mux_h2.c.
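For illustration, the server side of the minimal example does roughly the following (a sketch only, not the exact code from the repository; the demo_pb2_grpc module and the service/method names are placeholders):

from concurrent import futures

import grpc

import demo_pb2_grpc  # placeholder for the generated stubs


class DemoServicer(demo_pb2_grpc.DemoServicer):
    def StreamStream(self, request_iterator, context):
        # Abort before yielding any response message. The call then ends with
        # a response carrying the gRPC status, followed (as described above)
        # by a RST_STREAM(NO_ERROR) because the client is still uploading.
        context.abort(grpc.StatusCode.PERMISSION_DENIED,
                      "Details sent by the server")
        yield  # never reached; only makes this handler a generator


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    demo_pb2_grpc.add_DemoServicer_to_server(DemoServicer(), server)
    server.add_insecure_port("[::]:6000")
    server.start()
    server.wait_for_termination()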

What is your configuration?

global
    log stdout local0

listen grpc-server-fe
    bind *:6000 proto h2

    mode http
    log global
    option logasap
    option httplog

    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

    server grpc-server-be server:6000 proto h2

Output of haproxy -vv and uname -a

# haproxy -vv
HA-Proxy version 2.3.2-d522db7 2020/11/28 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2022.
Known bugs: http://www.haproxy.org/bugs/bugs-2.3.2.html
Running on: Linux 5.4.0-70-generic #78~18.04.1-Ubuntu SMP Sat Mar 20 14:10:07 UTC 2021 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1
  DEBUG   = 

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1d  10 Sep 2019
Running on OpenSSL version : OpenSSL 1.1.1d  10 Sep 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with the Prometheus exporter as a service
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.32 2018-09-10
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 8.3.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services :
	prometheus-exporter

Available filters :
	[SPOE] spoe
	[CACHE] cache
	[FCGI] fcgi-app
	[COMP] compression
	[TRACE] trace

# uname -a
Linux 1fccc9cf174c 5.4.0-70-generic #78~18.04.1-Ubuntu SMP Sat Mar 20 14:10:07 UTC 2021 x86_64 GNU/Linux

In my repository containing the minimal issue example I've received a report that the issue is also present with HAProxy 2.4 and gRPC 1.37.0.

Additional information (if helpful)

@mgetka mgetka added status: needs-triage This issue needs to be triaged. type: bug This issue describes a bug. labels Apr 9, 2021
@santigl
santigl commented Apr 9, 2021

We are seeing this exact scenario in an implementation of the ByteStream API, where a client can upload a file by calling ByteStream.Write() with a stream of WriteRequests containing chunks of the file.

A server can, after reading the first message and detecting that the Write() is for a file already in storage, return a successful WriteResponse without reading subsequent messages in the client's stream.

In that case the client detects (at the gRPC level) a failed Write() operation; but, when closing its side of the stream, receives the WriteResponse returned early by the server, which indicates that the file was successfully written.

This however is not the case when going through HAProxy: gRPC interprets the result of the transaction as a 1: CANCELLED error.
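For reference, the early-return pattern looks roughly like this (a sketch only; the bytestream_pb2 / bytestream_pb2_grpc module names are assumed, and the in-memory dictionary stands in for real storage):

import bytestream_pb2        # assumed generated ByteStream stubs
import bytestream_pb2_grpc

_STORED = {}  # stand-in storage: resource_name -> committed size


class ByteStreamServicer(bytestream_pb2_grpc.ByteStreamServicer):
    def Write(self, request_iterator, context):
        first = next(request_iterator)
        if first.resource_name in _STORED:
            # The blob is already in storage: return a complete WriteResponse
            # without reading the remaining WriteRequests, while the client is
            # still streaming chunks.
            return bytestream_pb2.WriteResponse(
                committed_size=_STORED[first.resource_name])
        size = len(first.data)
        for request in request_iterator:  # otherwise drain and store the upload
            size += len(request.data)
        _STORED[first.resource_name] = size
        return bytestream_pb2.WriteResponse(committed_size=size)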

@wtarreau
Member

Unfortunately, at this point we're reaching the limits of what can be done by gRPC trying to reuse HTTP infrastructure with more detailed semantics than HTTP itself provides :-/ The reality here is that we have a stream abort on one side that we must signal on the other side, and, in order to best accommodate the various combinations of H1/H2 on each side, an empty response is signalled as a cancel (i.e. the client may retry), while an error or an active abort is signalled as a stream error.

There's some ongoing design work to try to improve the various abort situations and to pass finer information between the two sides; we should definitely include this one in the mix. But I have no solution to propose until 2.5 at least, given that such information cannot even be represented internally yet. CCing @capflam.

@wtarreau
Member

By the way we've worked quite a bit on this and have clearer ideas about what needs to be changed for this to work better. However these are definitely long-term changes I'm afraid.

@wtarreau wtarreau added status: future fix This issue cannot be fixed without a major rework and thus is postponed. and removed status: needs-triage This issue needs to be triaged. labels Apr 14, 2022
@capflam
Member
capflam commented May 6, 2024

I've tested your repro on several haproxy versions (from 2.2 to 3.0-DEV), and everything seems to work properly. The client always returns:

Received gRPC error:
Status:         StatusCode.PERMISSION_DENIED
Details:        Details sent by the server

The issue is quite old now, and I guess the bug was in fact not well identified and was solved in the meantime.

I'm closing now. Thanks !

@capflam capflam closed this as completed May 6, 2024
@capflam capflam added status: fixed This issue is a now-fixed bug. and removed status: future fix This issue cannot be fixed without a major rework and thus is postponed. labels May 6, 2024
@psavalle
Contributor
psavalle commented Mar 7, 2025

@capflam I believe I have been observing this issue on real workloads, and I can still reproduce it on 3.1.5 or 3.2-dev6 with the minimal example above when running it for multiple requests.

After starting the proxy/test server, the first request seems to always succeed; after that we see the CANCELLED:

# seq 1 3 | xargs -n 1 -I "A" docker run --network haproxy_grpc_issue client --host proxy stream-stream
Received gRPC error:
Status:		StatusCode.PERMISSION_DENIED
Details:	Details sent by the server
Received gRPC error:
Status:		StatusCode.CANCELLED
Details:	Received RST_STREAM with error code 8
Received gRPC error:
Status:		StatusCode.CANCELLED
Details:	Received RST_STREAM with error code 8

Running it 200 times with haproxy:3.1.5, I get:

 193 StatusCode.CANCELLED
   7 StatusCode.PERMISSION_DENIED
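The same tally can also be produced from a single Python process (sketch only; demo_pb2 / demo_pb2_grpc, the RPC name and the request message are placeholders for the stubs generated in the repro repository):

from collections import Counter

import grpc

import demo_pb2        # placeholder for the generated stubs
import demo_pb2_grpc


def run_once(channel):
    stub = demo_pb2_grpc.DemoStub(channel)
    try:
        # Consume the response stream so any error surfaces here.
        for _ in stub.StreamStream(iter([demo_pb2.Request()])):
            pass
        return grpc.StatusCode.OK
    except grpc.RpcError as exc:
        return exc.code()


def main(runs=200):
    counts = Counter()
    # "proxy:6000" matches the HAProxy bind address from the configuration above.
    with grpc.insecure_channel("proxy:6000") as channel:
        for _ in range(runs):
            counts[run_once(channel)] += 1
    for code, n in counts.most_common():
        print(f"{n:4d} {code}")


if __name__ == "__main__":
    main()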

I believe this might be expected based on what h2s_is_forwardable_abort does:

For now, only CANCEL from the client is forwardable to the server.

I've made a custom build where h2s_is_forwardable_abort always returns true, and I always see StatusCode.PERMISSION_DENIED then. This seems to fix the issue, although I am unsure about any dire consequences this might have, and it probably doesn't make much sense as a general fix.

I understand from above that a lot of the complexity here might stem from cases with both HTTP/1 and HTTP/2. However, would there be a way to support forwarding error codes from the server at least in cases where only HTTP/2 is being used?

@capflam
Member
capflam commented Mar 11, 2025

Well, I performed another test, capturing the traffic with wireshark.

And this works as expected. Both the server and HAProxy do the same thing. The permission denied is returned in the 200 OK response, in the grpc-status header. However, if the request is unfinished when the response is returned, a RST_STREAM(CANCEL) frame is also emitted to interrupt the request. So, on the client side, depending on the timing, I guess the RST_STREAM code is returned instead of the gRPC status from the response. It may be an issue with the gRPC Python module, I don't know. But on the HAProxy side, there is no issue for me.

@yangchi
yangchi commented Mar 11, 2025

@capflam From what I can see in wireshark when we reproduce this:

1. The gRPC server ends the stream with the END_STREAM flag set on a HEADERS frame, followed by a RST_STREAM with NO_ERROR (sketched at the wire level below).
2. It is then HAProxy that turns the RST_STREAM(NO_ERROR) into a RST_STREAM(CANCEL) and sends it to the client. That is what causes the problem in the client.

Sending HEADERS(ES flag) + RST(NO_ERROR) to a Python gRPC client causes no problem; turning NO_ERROR into CANCEL is where the problem is. I don't see why HAProxy must change the error code when it forwards the RST_STREAM. But I guess the issue here is that this really isn't "forwarding": HAProxy sees that its stream state in the backend direction is closed and decides to also close the stream in the frontend direction, except that it should have saved the error code from the backend direction and used that same error code when resetting the frontend stream.

It may be an issue with the gRPC Python module, I don't know.

In the same setup where we reproduce this issue, once we remove HAProxy and let the client talk directly to the server, the client has no problem handling HEADERS(ES flag) + RST(NO_ERROR). For an HTTP/2 server that wants to close the stream, the server either has to wait for the client to also ES the stream, or wait for the client to RST the stream, or, if the server doesn't want to wait, then of course its only choice is to RST the stream itself. I really do not see any wrong behavior from gRPC in this case.
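To make the frame sequence concrete, here is a wire-level sketch using the Python hyper-h2 library (an illustration, not the repro's code; the path and header values are made up, and the client connection only exists to open the stream so the server-side state machine accepts the response):

import h2.config
import h2.connection
import h2.errors

# A client connection, used only to open a request stream so that the
# server-side state machine below accepts our response frames.
client = h2.connection.H2Connection(
    config=h2.config.H2Configuration(client_side=True))
client.initiate_connection()
stream_id = client.get_next_available_stream_id()
client.send_headers(stream_id, [
    (":method", "POST"),
    (":scheme", "http"),
    (":authority", "server"),
    (":path", "/demo.Demo/StreamStream"),
    ("content-type", "application/grpc"),
])  # no END_STREAM: the client is still uploading

server = h2.connection.H2Connection(
    config=h2.config.H2Configuration(client_side=False))
server.initiate_connection()
server.receive_data(client.data_to_send())

# 1. Trailers-only gRPC response: one HEADERS frame with END_STREAM set,
#    carrying the gRPC status.
server.send_headers(stream_id, [
    (":status", "200"),
    ("content-type", "application/grpc"),
    ("grpc-status", "7"),            # PERMISSION_DENIED
    ("grpc-message", "Details sent by the server"),
], end_stream=True)

# 2. The request side is still open, so the stream is reset with NO_ERROR,
#    as RFC 9113 allows after a complete response.
server.reset_stream(stream_id, error_code=h2.errors.ErrorCodes.NO_ERROR)

# Contains, besides the settings frames, HEADERS(END_STREAM) + RST_STREAM(NO_ERROR).
wire_bytes = server.data_to_send()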

@capflam
Member
capflam commented Mar 11, 2025

You are right about the server behavior. It does not send a RST_STREAM(CANCEL) but a RST_STREAM(NO_ERROR).

But you are wrong about the HAProxy behavior. It does not turn the NO_ERROR into a CANCEL; it sends a CANCEL on its own because the stream was considered finished while the request was not. That is the real purpose of the CANCEL in this case: it is the way to inform the client that it can stop uploading data. At least, that is the H2 meaning of the CANCEL error code after a response. For instance, in the HTTP world, it could be the case for a server that returns a redirect immediately after receiving a POST request, to notify the client that it can stop uploading data.

The code is not the same between the server and the client because each side is largely unaware of what happens on the other side. The protocol may be different, the connection management too, the content may be altered... What happens on one side is not necessarily true on the other side. In this case, I guess a NO_ERROR error code is used by the server because the CANCEL error code has a specific meaning for gRPC, hijacking the H2 meaning this way. But, in H2, it is totally valid to use the CANCEL error code to interrupt the request processing.

The way the client handles the result of the request is the problem here. If the HEADERS frame is received alone, the permission denied is properly reported; most of the time, that is what happens in my case. But if the RST_STREAM(CANCEL) is received at the same time, the gRPC status in the response is ignored. At least, that is what I suspect. I'm not a gRPC expert of course, but for me the response must always be handled before the RST_STREAM frame, independently of the timing. A gRPC client must be, first and foremost, an H2 client. A CANCEL received before a response with a gRPC status is totally different from one received after it: the first can be interpreted as a real gRPC cancel, the second is just an H2 cancel because the gRPC result was already delivered.

@yangchi
yangchi commented Mar 14, 2025

@capflam Shouldn't clients stop uploading data after a RST_STREAM regardless of the error code? The stream state is driven by the frame, not the error code. The server means to tell the client NO_ERROR, but because of HAProxy the client sees CANCEL. From the perspective of ending the stream, the RST is enough; the problem here is the choice of error code.

@capflam
Member
capflam commented Mar 14, 2025

Indeed, a RST_STREAM is enough. But the reason is useful to understand what happened. And the CANCEL is here to say the stream is no longer needed, which is the case here. Another way would be to drain the request payload, just like in HTTP/1.1. The NO_ERROR error code is there to say the stream finished without error, but here that is not accurate. There is no reason to change the HTTP/2 behavior on this point. And I'm sorry, but I disagree on the origin of the problem: it is a client or library issue. Just think about it. A valid response was delivered to the client with a gRPC status (the permission denied here), and then the H2 stream is cancelled. The response must be handled first in all cases. If the reset is received later, there is no issue and the permission denied is properly reported. But if the reset is received at the same time, the cancel is reported. There is absolutely no reason to do so. I don't know how it is implemented, but H2 frame order is important. And I doubt the gRPC specs say that a cancel must be handled before all other frames. There is no problem with the NO_ERROR code, but for me, it is just hiding the true bug.

@yangchi
yangchi commented Mar 14, 2025

A valid response was delivered to the client with a gRPC status (the permission denied here), and then the H2 stream is cancelled. The response must be handled first in all cases. If the reset is received later, there is no issue and the permission denied is properly reported. But if the reset is received at the same time, the cancel is reported. There is absolutely no reason to do so.

I don't agree with this, but I don't know if my disagreeing makes any difference at this point.

The response was handled. But after the ES, the client's stream is in the half-closed state. Then a RST_STREAM shows up, and the client should still process it. All of this is within the H2 spec.

If you were building an H2 client library today, would you just ignore the RST during the half-closed state? I think that's wrong. If the client library delivers a finish callback and then an error to the application, should the application also ignore the error? I think that's also wrong.

The server never cancelled the stream; it merely terminated it. The proxy cancelled it. The whole client behavior shouldn't even be a concern here, but even if it is, I think the client should process the final RST and its error code and notify the application. Are any of these illegal in H2?

@psavalle
Contributor

A source of confusion on my end is that 'HTTP Request/Response Exchange' in the HTTP/2 specification indicates:

   An HTTP response is complete after the server sends -- or the client
   receives -- a frame with the END_STREAM flag set (including any
   CONTINUATION frames needed to complete a header block).  A server can
   send a complete response prior to the client sending an entire
   request if the response does not depend on any portion of the request
   that has not been sent and received.  When this is true, a server MAY
   request that the client abort transmission of a request without error
   by sending a RST_STREAM with an error code of NO_ERROR after sending
   a complete response (i.e., a frame with the END_STREAM flag).

   Clients MUST NOT discard responses as a result of receiving such a
   RST_STREAM, though clients can always discard responses at their
   discretion for other reasons.

This is the situation here -- the client is streaming messages to the server, which then decides to send a complete response (with the grpc-status header set and the END_STREAM flag).

It specifies that clients must not discard responses when getting a RST_STREAM(NO_ERROR), which is what the gRPC server is sending -- but it says nothing about not discarding responses when getting a RST_STREAM(CANCEL).

I would also agree that the client probably should still not discard a complete response in that case (and the Python gRPC client unfortunately seems to behave non-deterministically there at the moment), but as far as I understand this is undefined in the HTTP/2 specification, which would instead suggest sticking to RST_STREAM(NO_ERROR). Is that the right way to look at it?
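For illustration, a sketch of that client-side reading, written against the hyper-h2 event API (the events would come from receive_data() on an established client connection; the report_* helpers are simple placeholders):

import h2.events

response_headers = {}     # stream id -> merged response + trailer headers
complete_streams = set()  # stream ids that have seen END_STREAM


def report_grpc_status(headers):
    # Placeholder: a real client would map grpc-status / grpc-message here.
    print("grpc-status:", headers.get("grpc-status"))


def report_stream_error(error_code):
    # Placeholder for surfacing a failed call to the application.
    print("stream reset with error code", error_code)


def handle(events):
    for event in events:
        if isinstance(event, h2.events.ResponseReceived):
            response_headers[event.stream_id] = dict(event.headers)
        elif isinstance(event, h2.events.TrailersReceived):
            response_headers.setdefault(event.stream_id, {}).update(
                dict(event.headers))
        elif isinstance(event, h2.events.StreamEnded):
            # A complete response has been received: report its gRPC status.
            complete_streams.add(event.stream_id)
            report_grpc_status(response_headers.get(event.stream_id, {}))
        elif isinstance(event, h2.events.StreamReset):
            if event.stream_id in complete_streams:
                # Reset after a complete response: per the RFC text quoted
                # above, the response must not be discarded, so treat this
                # purely as H2-level stream cleanup.
                pass
            else:
                # Reset before the response completed: surface it as an error.
                report_stream_error(event.error_code)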

@capflam
Member
capflam commented Mar 14, 2025

Hum, thanks @psavalle. I was pretty sure a CANCEL should be used in that case; it seems I'm wrong. So I guess HAProxy should indeed send a NO_ERROR rather than a CANCEL in that case. I will review that with Willy to be sure, but it seems there is indeed an issue here.

However, it remains pretty strange that the client does not return the gRPC status in that case. gRPC is a protocol on top of H2; not everything that happens at the H2 level necessarily has an equivalent at the gRPC level. Here the reset is about the stream, not the gRPC request. The gRPC response was provided. The reset must of course be handled at the H2 level, but at the gRPC level it is meaningless at this stage. So indeed, the request upload must be stopped from the H2 point of view, but I doubt there is anything to do with this reset if you have already handled the permission denied.

BUT, it remains that the wrong error code seems to be used here. So I'm reopening the issue.

@capflam capflam reopened this Mar 14, 2025
@wtarreau
Member

Yes we can have a look at this, it's indeed possible that we need to differentiate data interrupted after the end of the response from data interrupted before.

@capflam capflam added status: reviewed This issue was reviewed. A fix is required. and removed status: fixed This issue is a now-fixed bug. labels Mar 17, 2025
haproxy-mirror pushed a commit that referenced this issue Mar 20, 2025
…was already sent

On frontend side, when a stream is shut while the response was already fully
sent, it was cancelled by sending a RST_STREAM(CANCEL) frame. However, it is
not accurate. CANCEL error code must only be used if the response headers
were sent, but not the full response. As stated in the RFC 9113, when the
response was fully sent, to stop the request sending, a RST_STREAM with an
error code of NO_ERROR must be sent.

This patch should solve the issue #1219. It must be backported to all stable
versions.
@capflam
Member
capflam commented Mar 20, 2025

Should be fixed in 3.2-dev now. Thanks !

@capflam capflam added status: fixed This issue is a now-fixed bug. 2.6 This issue affects the HAProxy 2.6 stable branch. and removed status: reviewed This issue was reviewed. A fix is required. labels Mar 20, 2025
@capflam capflam added 2.8 This issue affects the HAProxy 2.8 stable branch. 2.9 This issue affects the HAProxy 2.9 stable branch. 3.0 This issue affects the HAProxy 3.0 stable branch. 3.1 This issue affects the HAProxy 3.1 stable branch. labels Mar 20, 2025
@psavalle
Contributor

I have tested 3.2-dev with both the minimal example from above and with real traffic, and can confirm the original issue is gone. Thanks @capflam!

FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Mar 26, 2025
…was already sent

On frontend side, when a stream is shut while the response was already fully
sent, it was cancelled by sending a RST_STREAM(CANCEL) frame. However, it is
not accurate. CANCEL error code must only be used if the response headers
were sent, but not the full response. As stated in the RFC 9113, when the
response was fully sent, to stop the request sending, a RST_STREAM with an
error code of NO_ERROR must be sent.

This patch should solve the issue haproxy#1219. It must be backported to all stable
versions.

(cherry picked from commit e87397b)
Signed-off-by: Willy Tarreau <w@1wt.eu>
@capflam capflam removed the 3.1 This issue affects the HAProxy 3.1 stable branch. label Mar 31, 2025
@capflam capflam removed the 2.9 This issue affects the HAProxy 2.9 stable branch. label Apr 15, 2025