Description
- Initially raised as discussion Percent encoding `|` in paths #3479
I got no response, so I'm opening this issue for more visibility.
OS: Windows 11
`python --version`: Python 3.12.8
`httpx` version: 0.28.1
I believe the `|` character should be percent-encoded in paths, which is not currently the case. If I'm understanding RFC 3986 correctly, path characters are `pchar`, which can be `unreserved`, `pct-encoded`, `sub-delims`, `":"`, or `"@"`. `unreserved` can be composed of `ALPHA`, `DIGIT`, `"-"`, `"."`, `"_"`, or `"~"`. `pct-encoded` covers the percent-encoding sequences. `sub-delims` can be `"!"`, `"$"`, `"&"`, `"'"`, `"("`, `")"`, `"*"`, `"+"`, `","`, `";"`, or `"="`. Nowhere in this set is the `|` character present, meaning it has to be percent-encoded.
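The grammar above can be checked mechanically. This is a minimal sketch using only the standard library; the names `UNRESERVED`, `SUB_DELIMS`, and `PCHAR_LITERALS` are illustrative and are not taken from httpx.

```python
import string

# Character classes from RFC 3986 (names here are illustrative, not httpx's).
UNRESERVED = string.ascii_letters + string.digits + "-._~"
SUB_DELIMS = "!$&'()*+,;="

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# pct-encoded is a "%XX" sequence rather than a single literal character,
# so it is not part of this set of literals.
PCHAR_LITERALS = set(UNRESERVED + SUB_DELIMS + ":@")

print("|" in PCHAR_LITERALS)  # False
print("~" in PCHAR_LITERALS)  # True
```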
Simplifying my problem, `httpx` seems to call its internal `urlparse` function to process URLs. So, here's an example using that function. This function normally percent-encodes characters as needed, like spaces:
httpx._urlparse.urlparse('http://example.com/ ')
will return
ParseResult(scheme='http', userinfo='', host='example.com', port=None, path='/%20', query=None, fragment=None)
However, this does not happen for `|`:
httpx._urlparse.urlparse('http://example.com/|')
will return
ParseResult(scheme='http', userinfo='', host='example.com', port=None, path='/|', query=None, fragment=None)
In the Firefox and Google Chrome JavaScript consoles, `|` is percent-encoded:
encodeURI('http://example.com/|')
will return
"http://example.com/%7C"
In the `requests` library, `|` is also percent-encoded:
requests.utils.requote_uri('http://example.com/|')
will return
'http://example.com/%7C'
The `rfc3986` library also percent-encodes `|`:
rfc3986.urlparse('http://example.com/|')
will return
ParseResult(scheme='http', userinfo=None, host='example.com', port=None, path='/%7C', query=None, fragment=None)
Using `urllib` itself, `|` also seems to be percent-encoded for path components:
urllib.parse.quote('/|')
will return
'/%7C'
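One caveat on the `urllib` comparison: `urllib.parse.quote` defaults to `safe="/"`, which is stricter than RFC 3986 and also encodes `sub-delims`. The sketch below passes the remaining `pchar` literals as `safe` to show that `|` is still encoded under the RFC rules. The name `RFC3986_PATH_SAFE` and the sample path are made up for illustration.

```python
from urllib.parse import quote

# Literal pchar characters beyond the unreserved set, plus "/" as the
# path separator. quote() never encodes unreserved characters.
RFC3986_PATH_SAFE = "/" + "!$&'()*+,;=" + ":@"  # illustrative name

path = "/a;b|c"
default_quoted = quote(path)  # default safe="/"
rfc_quoted = quote(path, safe=RFC3986_PATH_SAFE)

print(default_quoted)  # /a%3Bb%7Cc
print(rfc_quoted)      # /a;b%7Cc
```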
I'm fairly certain that I've interpreted this RFC right, and I think that `|` should be excluded from the `PATH_SAFE` set here. Here is its current value: `"!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_abcdefghijklmnopqrstuvwxyz|~"`.
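To make the comparison concrete, here is a rough sketch that diffs the `PATH_SAFE` value quoted above against the characters RFC 3986 allows in a path. The string is copied from this issue rather than imported from httpx, and `RFC3986_PATH_CHARS` is an illustrative name.

```python
import string

# Value of PATH_SAFE as quoted in this issue (copied, not imported from httpx).
PATH_SAFE = (
    "!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "[\\]^_abcdefghijklmnopqrstuvwxyz|~"
)

# Literal characters RFC 3986 allows in a path: the pchar literals, "/" as
# the segment separator, and "%" as the lead byte of a pct-encoded sequence.
RFC3986_PATH_CHARS = set(
    string.ascii_letters + string.digits + "-._~" + "!$&'()*+,;=" + ":@" + "/%"
)

# Characters treated as safe in the quoted value but absent from the RFC set.
extra = sorted(set(PATH_SAFE) - RFC3986_PATH_CHARS)
print(extra)  # ['[', '\\', ']', '^', '|']
```

Under this reading, `|` is one of the characters in the quoted value that fall outside the RFC 3986 path grammar.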
Potential Fix: nathaniel-daniel@a2f327f