Proposal: FINAL_DATA by bemasc · Pull Request #2949 · httpwg/http-extensions · GitHub

Proposal: FINAL_DATA #2949

Open · bemasc wants to merge 10 commits into main from bemasc-final-data

Conversation

bemasc
Contributor
@bemasc bemasc commented Nov 14, 2024

This adds a FINAL_DATA capsule type to make clean shutdown explicit.
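For readers unfamiliar with the Capsule Protocol: each capsule is framed as Type (varint), Length (varint), Value (RFC 9297). A minimal sketch of emitting the final payload inside a FINAL_DATA capsule follows; the type codepoint used here is a placeholder, not the value registered by the draft.

```python
def encode_varint(v: int) -> bytes:
    """QUIC variable-length integer (RFC 9000, Section 16)."""
    if v < 1 << 6:
        return v.to_bytes(1, "big")
    if v < 1 << 14:
        return ((1 << 14) | v).to_bytes(2, "big")
    if v < 1 << 30:
        return ((2 << 30) | v).to_bytes(4, "big")
    if v < 1 << 62:
        return ((3 << 62) | v).to_bytes(8, "big")
    raise ValueError("value does not fit in a varint")

FINAL_DATA_TYPE = 0x2ECF  # placeholder codepoint; the real value is assigned in the draft/IANA

def encode_capsule(capsule_type: int, value: bytes) -> bytes:
    """Capsule framing per RFC 9297: Capsule Type (i), Capsule Length (i), Capsule Value (..)."""
    return encode_varint(capsule_type) + encode_varint(len(value)) + value

# The proxy sends its last chunk of tunneled TCP data inside FINAL_DATA,
# making the graceful shutdown explicit to the peer.
wire = encode_capsule(FINAL_DATA_TYPE, b"last bytes received before the TCP FIN")
```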

@bemasc bemasc added the connect-tcp draft-ietf-httpbis-connect-tcp label Nov 14, 2024
Contributor
@DavidSchinazi DavidSchinazi left a comment

I do think we should add a FINAL_DATA. It allows, for example, sending metadata after the stream is gracefully closed. I can imagine using that to send capsules that contain TCP performance info.

Base automatically changed from bemasc-capsule-only to main November 18, 2024 23:22
@bemasc bemasc marked this pull request as ready for review November 18, 2024 23:22
@kazuho
Contributor
kazuho commented Nov 19, 2024

Thank you for opening the new issue dedicated to FINAL_DATA.

@DavidSchinazi

I do think we should add a FINAL_DATA. It allows, for example, sending metadata after the stream is gracefully closed. I can imagine using that to send capsules that contain TCP performance info.

I think the question is whether we would benefit from being able to observe the ordering between close and metadata; if there is a benefit, then it makes sense to have a frame indicating close. Otherwise, it is a waste, considering that most underlying layers (HTTP/2, HTTP/3, TLS) already provide ways to distinguish between graceful shutdown and abrupt close.

For things like performance information, I don't think there would be a material difference between sending them right before or after signalling the closure. I'd assume that we would allow such information to be sent at any moment during the lifetime of the tunnel, and that receivers would record whatever they receive. Senders that want to emit the most recent information can send performance information at the moment they notice the TCP tunnel getting closed, and then close the tunnel.

Separately, if there is interest in sending metadata after closure, I am not sure that such interest applies only to graceful close and not to resets. In other words, it might make more sense to define a frame that conveys closure and how it is closed (i.e., graceful or reset), rather than FINAL_DATA that only signals graceful shutdown.

@DavidSchinazi
Contributor

Otherwise, it is a waste

What is wasted? This just requires registering a capsule type, which is a 2^62 registry. In most cases, the proxy will send the last bit of data inside a FINAL_DATA capsule. In the unusual scenario where it learns about the FIN after it sent its last DATA capsule, then it does need to send 2 bytes.
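A quick sanity check on that 2-byte figure, assuming (purely hypothetically) a FINAL_DATA codepoint small enough to fit in a single-byte varint: an empty capsule is then one type byte plus one zero-length byte.

```python
PLACEHOLDER_FINAL_DATA = 0x21  # hypothetical codepoint below 64, so the type varint is one byte
empty_final_data = bytes([PLACEHOLDER_FINAL_DATA, 0x00])  # Type + Length = 0, no value
assert len(empty_final_data) == 2                          # the 2 extra bytes mentioned above
```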

allow such information to be sent at any moment during the lifetime of the tunnel

FWIW, the lifetime of the tunnel is already somewhat disconnected from the TCP FIN, since a FIN only closes one side of the connection. I see value in telling the peer that TCP closed gracefully while reserving the right to send future capsules in response to something the peer sends.

it might make more sense to define a frame that conveys closure and how it is closed (i.e., graceful or reset), rather than FINAL_DATA that only signals graceful shutdown.

Having a third capsule like DATA_ABORTED doesn't seem unreasonable to me.

@PiotrSikora
Contributor

it might make more sense to define a frame that conveys closure and how it is closed (i.e., graceful or reset), rather than FINAL_DATA that only signals graceful shutdown.

+1

@bemasc
Contributor Author
bemasc commented Feb 27, 2025

@PiotrSikora @kazuho Please comment at #3000 with whether you want to move forward with this approach or a different one. Otherwise I will merge this and publish a new release to get in before the draft deadline.

bemasc added a commit that referenced this pull request Apr 9, 2025
bemasc added a commit that referenced this pull request Apr 9, 2025
Contributor
@martinthomson martinthomson left a comment

This seems like a better approach, but I'm not finding it particularly clear about who does what (and in reaction to what).

@bemasc
Contributor Author
bemasc commented Apr 15, 2025

This seems like a better approach, but I'm not finding it particularly clear about who does what (and in reaction to what).

I've added a lot more detail (and diagrams!) to this PR to try to pin down precisely what is proposed here.

@wtarreau

Nice! It looks clean to me. I just have one nit here:

- HTTP/1.1 over TLS: a TLS Error Alert

I'd instead say:

- HTTP/1.1 over TLS: a TLS Error Alert or TCP RST

The rationale is the following: sending a TCP RST costs the sender nothing, so a client could easily send hundreds of thousands of them per second. However, the gateway performing the protocol translation would face a risk of trivial denial of service if it had to send a TLS alert, because that means being the last one to send before closing, with the connection ending in TIME_WAIT on its outgoing side. That must never happen, otherwise it depletes its source ports in a fraction of a second and fails to connect to the servers for a minute or more. The only way to avoid this is to close the TCP connection abruptly as well, but that results in the loss of the TLS alert for the other end (data queued in network buffers is destroyed by the system as soon as the connection is reset). As such, the next hop cannot reliably expect to see the TLS alert, and must be prepared to just see the TCP connection being reset.

We could suggest that the TLS alert be a SHOULD that eases debugging and is more polite, but that the TCP RST is the final way of closing in any case. That is, depending on timing, the endpoint must be prepared to receive a TCP RST for all protocols ultimately transported over TCP. Some might have the chance to see an RST_STREAM or TLS alert before that.

Contributor
@kazuho kazuho left a comment

Thank you very much for pushing the pull request forward. Looks good overall.

@bemasc
Contributor Author
bemasc commented Apr 16, 2025

@wtarreau It sounds like you're concerned about a malicious client and destination executing an asymmetric resource exhaustion attack against the proxy. The more I thought about this class of attacks, the more I realized there seemed to be a lot of them. I've added a section discussing these attacks and how to mitigate them.

I don't think we need to change the guidance on TLS Alerts. Sending a TCP RST in that case would not fix all the "WAIT Abuse" port exhaustion attacks, and it would violate the confidentiality promise of the "https://" URI scheme. Instead, I think these attacks must be addressed by doing resource accounting more carefully, as noted in the new text.

@wtarreau

@wtarreau It sounds like you're concerned about a malicious client and destination executing an asymmetric resource exhaustion attack against the proxy. The more I thought about this class of attacks, the more I realized there seemed to be a lot of them. I've added a section discussing these attacks and how to mitigate them.

Not just attacks, actually; even regular usage. The most common cases where I've seen port exhaustion were on pretty valid traffic relying on misdesigned protocols (e.g. SQL stuff where the client closes first). Once you have many clients doing just a few short connections each, as soon as this happens more often than the number of ephemeral ports over the TIME_WAIT duration, you're stuck. On a default Linux setup, you have 28232 ports and a 60s TIME_WAIT, so an abysmally low rate of 470 conn/s suffices to block them all. It doesn't require many clients; actually, a single firewall reboot or edge proxy restart in production can be sufficient to cause this. Thus for me the mitigation doesn't address accidents; it only addresses the most trivial case of a bad actor.
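For concreteness, here is where those numbers come from, as a sketch assuming the stock Linux defaults (ephemeral port range 32768-60999 and the fixed 60 s TIME_WAIT):

```python
# Default Linux ephemeral port range: net.ipv4.ip_local_port_range = 32768 60999
ephemeral_ports = 60999 - 32768 + 1   # 28232 ports
time_wait_seconds = 60                # TIME_WAIT duration hardcoded in the Linux TCP stack

# Sustained rate of connections closed on this side toward a single destination
# that keeps every ephemeral port stuck in TIME_WAIT:
print(ephemeral_ports // time_wait_seconds)   # 470 conn/s
```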

I don't think we need to change the guidance on TLS Alerts. Sending a TCP RST in that case would not fix all the "WAIT Abuse" port exhaustion attacks,

Yes it does because it avoids the TIME_WAIT.

and it would violate the confidentiality promise of the "https://" URI scheme.

I don't see how. Otherwise TCP would violate TLS, since an RST can happen for many reasons, the first one being just the gateway crashing or being restarted in the middle of transfers.

Instead, I think these attacks must be addressed by doing resource accounting more carefully, as noted in the new text.

Attacks do have to be mitigated, but it's also our responsibility to make sure that protocols and their extensions are designed in a way that doesn't cause domino effects on production equipment or amplify the consequences of minor accidents (e.g. a first layer of proxy being restarted before the gateway that performs the conversion).

@bemasc
Contributor Author
bemasc commented Apr 16, 2025

@wtarreau It sounds like you're concerned about a malicious client and destination executing an asymmetric resource exhaustion attack against the proxy. The more I thought about this class of attacks, the more I realized there seemed to be a lot of them. I've added a section discussing these attacks and how to mitigate them.

Not just attacks, actually; even regular usage. The most common cases where I've seen port exhaustion were on pretty valid traffic relying on misdesigned protocols (e.g. SQL stuff where the client closes first). Once you have many clients doing just a few short connections each, as soon as this happens more often than the number of ephemeral ports over the TIME_WAIT duration, you're stuck. On a default Linux setup, you have 28232 ports and a 60s TIME_WAIT, so an abysmally low rate of 470 conn/s suffices to block them all. It doesn't require many clients; actually, a single firewall reboot or edge proxy restart in production can be sufficient to cause this. Thus for me the mitigation doesn't address accidents; it only addresses the most trivial case of a bad actor.

I'm not sure I understand this problem. TIME-WAIT is per-4-tuple, so the 470 conn/s limit is at each client, not at the proxy. (Or maybe you're thinking of 470 conn/s from the proxy to any single destination? But that would not be affected by how we structure the client<->proxy protocol.)

I don't think we need to change the guidance on TLS Alerts. Sending a TCP RST in that case would not fix all the "WAIT Abuse" port exhaustion attacks,

Yes it does because it avoids the TIME_WAIT.

The TCP RFC says that sending RST means you enter TIME-WAIT. Regardless, TIME-WAIT is by 4-tuple, so it primarily applies when the client closes.

and it would violate the confidentiality promise of the "https://" URI scheme.

I don't see how. Otherwise TCP would violate TLS, since an RST can happen for many reasons, the first one being just the gateway crashing or being restarted in the middle of transfers.

These RSTs are HTTP response content, and cannot leak outside the TLS envelope. If a standard HTTP gateway receives a TCP RST from upstream, it forwards it as an HTTP 5XX or stream error inside TLS.

BTW, using a TCP RST here also violates the "https://" integrity guarantees: a middlebox can (and often does) convert the RST to a FIN, losing the error signal.

Instead, I think these attacks must be addressed by doing resource accounting more carefully, as noted in the new text.

Attacks do have to be mitigated, but it's also our responsibility to make sure that protocols and their extensions are designed in a way that doesn't cause domino effects on production equipment or amplify the consequences of minor accidents (e.g. a first layer of proxy being restarted before the gateway that performs the conversion).

Definitely! If we have a convincing way to make this protocol safer to deploy, I'm all for it. Otherwise, perhaps you can contribute some text to the Operational Considerations.

@wtarreau

The TCP RFC says that sending RST means you enter TIME-WAIT.

In fact not really, that's a non-normative "should". In practice, none of the stacks I've dealt with so far do that, and for a good reason: the only ways for a userland application to trigger an RST are 1) breaking the association by connecting to AF_UNSPEC, leaving no trace of the connection in the TCP table, or 2) disabling lingering and closing, in which case the RST is more a consequence of the destruction of pending unsent data. The only case where an application needs to force an RST is closing an outgoing connection, precisely to avoid monopolizing an ephemeral port to a destination.
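To make option 2 concrete, here is a minimal sketch of the lingering trick using plain BSD sockets (behavior as commonly implemented on Linux and BSD stacks):

```python
import socket
import struct

def abortive_close(sock: socket.socket) -> None:
    """Close with SO_LINGER set to (on, 0 s): pending unsent data is discarded,
    the peer sees a TCP RST instead of a FIN, and this host skips TIME_WAIT."""
    # struct linger { int l_onoff; int l_linger; } -> l_onoff = 1, l_linger = 0
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()

# Usage (hypothetical): abortive_close(upstream_sock) when the client side was reset.
```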

Regarding the direction of the RST, I think we were not talking about the same thing, since you're speaking about responses. I agree that with responses an intermediary will generally send a 5xx (if the RST was received before the intermediary started to send headers), or break TLS to the client. I was speaking about the other direction: a TCP client reaches a gateway that encapsulates the connection over HTTP. In this case the client can close using RST at no cost, and it must not incur a cost for the gateway. If the gateway is forced to emit a TLS alert, it will have to keep the ephemeral port open to the next hop. My point is that in such directions the intermediary must be allowed to close using RST (and the next hop to detect that as well, just as if the gateway had crashed after all).

If we have a convincing way to make this protocol safer to deploy, I'm all for it. Otherwise, perhaps you can contribute some text to the Operational Considerations.

I think that just using this does the job:

HTTP/1.1 over TLS: a TLS Error Alert, or TCP RST as a last resort

@bemasc
Contributor Author
bemasc commented Apr 21, 2025

@wtarreau OK, it sounds like your main concern is about shared clients being left in the TIME-WAIT state on the client<->proxy connection. After thinking about this a bit, I believe there are many ways that this can happen, even when connections are closed cleanly.

Ultimately, this is just one of several major downsides of using HTTP/1.1 for this purpose. I've added Operational Considerations text recommending against use of HTTP/1.1 for this kind of deployment and enumerating several concerns, including this one.

@wtarreau

OK, I think there was a misunderstanding between us, because by "client" I meant the original TCP one for which the gateway would encapsulate the TCP connection, while in your case the client is the client of the protocol acting on behalf of another one. Then we're in line. And I generally agree with your approach recommending against HTTP/1.1 in this case. Thanks!

Comment on lines +169 to +170
- HTTP/3: RESET_STREAM with H3_CONNECT_ERROR
- HTTP/2: RST_STREAM with CONNECT_ERROR
Contributor

I think these are the first mentions of these frames; you may want to cite where they come from (i.e. link to the QUIC frame and HTTP/3 error code, the HTTP/2 frame and code, etc).

Contributor Author

I haven't been able to figure out how to make kramdown-rfc do this. :-/ If you know how, please do share.

@@ -212,8 +273,32 @@ Template-driven TCP proxying is largely subject to the same security risks as cl

A small additional risk is posed by the use of a URI Template parser on the client side. The template input string could be crafted to exploit any vulnerabilities in the parser implementation. Client implementers should apply their usual precautions for code that processes untrusted inputs.

## Resource Exhaustion attacks

Proxy implementors should take special care to avoid resource exhaustion attacks when the client is not trusted. A malicious client can achieve highly asymmetric resource usage by colluding with a destination server and violating the ordinary rules of TCP or HTTP. Some example attacks and mitigations:
Contributor

Do you need the "is not trusted" caveat? Maybe even deleting the entire sentence leaves the paragraph clear enough that there are client-initiated resource exhaustion attacks a proxy needs to watch out for.

NB: I don't tend to trust anything at this layer of the stack, since it can be controlled by untrusted code running at a higher layer.

Contributor Author

OK, I shortened this considerably.

- Mitigation: Limit the number of concurrent connections per client.
* **Window Bloat**: An attacker can grow the receive window size by simulating a "long, fat network" {{?RFC7323}}, then fill the window (from the sender) and stop acknowledging it (at the receiver). This leaves the proxy buffering up to 1 GiB of TCP data until some timeout, while the attacker does not have to retain a large buffer.
- Mitigation: Limit the maximum receive window for TCP and HTTP connections, and the size of userspace buffers used for proxying. Alternatively, monitor the connections' send queues and limit the total buffered data per client.
* **WAIT Abuse**: An attacker can force the proxy into a TIME-WAIT, CLOSE-WAIT, or FIN-WAIT state until the timer expires, tying up a proxy<->destination 4-tuple for up to four minutes after the client's connection is closed.
Contributor

Where does four minutes come from?

Contributor Author

The TCP RFC defines the TIME-WAIT timeout as 2 * MSL, and MSL is defined as 2 minutes, so the timer is 4 minutes.

Contributor
@LPardue LPardue left a comment

Thanks for working on this. I raised a few nits/suggestions but this LGTM whatever you decide.

Co-authored-by: Kazuho Oku <kazuhooku@gmail.com>
@bemasc bemasc force-pushed the bemasc-final-data branch from 7f66609 to 4441310 Compare April 24, 2025 01:54