urlapi: normalize retrieved URLs #16841
base: master
Conversation
On Mastodon I was given an example of a bug that occurred in a different project due to URL normalization, which suggests that we need a way to opt out.
Force-pushed d5debbb to 23a3843
As someone who (completely coincidentally) has been doing some related research into URL normalization (but on the server side):

Different implementations normalize paths differently. Notably:

- NGINX percent-decodes before path-splitting or evaluating `.` and `..` components.
- Recent Apache httpd decodes `%2e` and `%2E` to `.` before normalizing, but does not decode `%2F` yet. The default is to 404 if `%2F` is present.
- HAProxy does not treat `%2F` specially, but requires an experimental flag to normalize URLs at all.
- Web browsers do not normalize `%2F`.
I wonder if the safest option is to fail the request by default if `%2F` is present, as it has server-dependent behavior. Decoding it is definitely not safe, as it could change how the request is routed, and I believe that the Wikimedia Foundation uses `%2F` in paths in places. The worst offender is where `%2F` decoding introduces new path components that are empty, `.`, or `..`, as Apache httpd and NGINX disagree on what to do here.
It would also be useful to document why some characters are kept unchanged.
Thanks @DemiMarie for taking a look and for your comments. Note that curl actually does not decode either of those, nor `%2e`, when doing its "../" dedotdotification. (I filed #16869 for that; it is a bug.) I opted to leave a whole bunch of letters non-percent-encoded in the query part, basically because they are letters that are commonly used unencoded in query parts, and I thought doing it this way would leave them a little more readable and a little closer to what users would write if they composed them manually. I agree that we should document the normalization rules once we have them settled, so that users can learn exactly what to expect from this.
Force-pushed f875245 to 7323ebe
The API documentation for curl_url_get() mentions normalizing when extracting a URL (only), so this code does not affect retrieving the individual components.

Normalizes: user, password, options, (lowercase) hostnames, path, query and fragment.

Added CURLU_NO_NORMALIZE flag for retrieving the URL without normalization. Uses this option internally for redirect handling.

Extend test 1560 to verify

Fixes #16829
Reported-by: Alberto Leiva Popper
Closes #16841
Force-pushed beb5a15 to 617accf
The API documentation for curl_url_get() mentions normalizing when extracting a URL (only), so this code does not affect retrieving the individual components.

Normalizes: user, password, options, (lowercase) hostnames, path, query and fragment.

Added CURLU_NO_NORMALIZE flag for retrieving the URL without normalization. Uses this option internally for redirect handling.

Extend test 1560 to verify

Assisted-by: Demi Marie Obenour
Fixes #16829
Reported-by: Alberto Leiva Popper
Closes #16841
Force-pushed 617accf to 6c3526b
I think this normalization is a little too intrusive to switch on by default like this (partly based on how it caused problems even for us when using this API internally), in spite of the existing wording in the documentation. As I suspect the normalization can still be useful for selected users, I intend to revert the work here and make the URL normalization opt-in instead of opt-out.
No longer claim curl_url_get() does URL normalizing by default. With this option used, curl_url_get() normalizes the user, password, options, (lowercase) hostnames, path, query and fragment parts.

Extend test 1560 to verify

Assisted-by: Demi Marie Obenour
Fixes #16829 (sort of)
Reported-by: Alberto Leiva Popper
Closes #16841
Force-pushed 722461d to 56fbbd4
Force-pushed 56fbbd4 to 113f2a5
Can you elaborate?
Which I interpret as "if a normalizer finds a reserved character, or a percent-encoded reserved character, it must leave it alone (neither decode nor encode it)." This is because the semantics of that character are defined by whoever generated the URI, not by the normalizer.
Pedantic note: If the host name lowercasing is part of the normalization (which appears to be implied by the next paragraph), it appears "normalized" means more than just persistent percent encoding.
Seems to need more coffee :-) "is already" should probably be "if already"; "left left" should probably be "are left." Even then, there's a bit of redundancy. Suggestion: "In the query component,
Does not match my tests; see below. Ok, so I realize this is not necessarily meant to be a perfect RFC 3986 normalizer, but I'd like to point out some quirks I found in my 3986 unit tests, just in case:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

static char *
url_normalize(char const *url)
{
  CURLU *curlu;
  char *normal;
  CURLUcode err;

  curlu = curl_url();
  if(!curlu) {
    printf("Out of memory.\n");
    exit(ENOMEM);
  }

  err = curl_url_set(curlu, CURLUPART_URL, url,
                     CURLU_NON_SUPPORT_SCHEME | CURLU_NORMALIZE);
  if(err)
    goto einval;
  err = curl_url_get(curlu, CURLUPART_URL, &normal, 0);
  if(err)
    goto einval;

  curl_url_cleanup(curlu);
  return normal;

einval: /* printf("Error parsing URL: %s\n", curl_url_strerror(err)); */
  curl_url_cleanup(curlu);
  return NULL;
}
```
```c
static void
check_normal(char const *query, char const *expected)
{
  char *actual;
  bool equal;

  actual = url_normalize(query);
  if(expected == NULL)
    equal = (actual == NULL);
  else
    equal = (actual == NULL) ? false : !strcmp(expected, actual);

  if(!equal)
    printf("Oops!\n"
           "- query          : %s\n"
           "  expected result: %s\n"
           "  actual result  : %s\n",
           query,
           expected ? expected : "(null)",
           actual ? actual : "(null)");

  curl_free(actual);
}
```
```c
void
test_normalization(void)
{
  printf("Invalid URIs\n");
  check_normal("", NULL);
  check_normal("h", NULL);
  check_normal("http", NULL);
  check_normal("https", NULL);
  check_normal("https:", NULL);
  check_normal("https:/", NULL);
  check_normal("rsync://", NULL);
  check_normal("rsync://a.b.c/β", NULL);

  printf("3986#6.2.2.1: Uppercase percent-encoding\n");
  check_normal("https://a.b.c/%3a", "https://a.b.c/%3A");

  printf("3986#6.2.2.1, 9110#4.2.3c: Lowercase scheme and host\n");
  check_normal("http://a.b.c/d", "http://a.b.c/d");
  check_normal("abcde://a.b.c/d", "abcde://a.b.c/d");
  check_normal("HTTPS://a.b.c/d", "https://a.b.c/d");
  check_normal("rSyNc://a.b.c/d", "rsync://a.b.c/d");
  check_normal("HTTPS://A.B.C/d", "https://a.b.c/d");
  check_normal("HTTP://WWW.EXAMPLE.COM/aBc/dEf?gHi#jKl",
               "http://www.example.com/aBc/dEf?gHi#jKl");

  printf("3986#6.2.2.2, 9110#4.2.3d: Decode unreserved characters\n");
  check_normal("http://%61%7A.%41%5A.%30%39/%61%7A%41%5A%30%39",
               "http://az.AZ.09/azAZ09");
  check_normal("http://%2D%2E%5F%7E/%2D%2E%5F%7E", "http://-._~/-._~");

  printf("3986#6.2.2.3: Path segment normalization\n");
  check_normal("rsync://a.b.c", "rsync://a.b.c/");
  check_normal("rsync://a.b.c/", "rsync://a.b.c/");
  check_normal("rsync://a.b.c/d", "rsync://a.b.c/d");
  check_normal("rsync://a.b.c//////", "rsync://a.b.c//////");
  check_normal("rsync://a.b.c/d/e", "rsync://a.b.c/d/e");
  check_normal("rsync://a.b.c/d/e/.", "rsync://a.b.c/d/e/");
  check_normal("rsync://a.b.c/d/e/.", "rsync://a.b.c/d/e/");
  check_normal("rsync://a.b.c/././d/././e/./.", "rsync://a.b.c/d/e/");
  check_normal("rsync://a.b.c/d/..", "rsync://a.b.c/");
  check_normal("rsync://a.b.c/x/../x/y/z", "rsync://a.b.c/x/y/z");
  check_normal("rsync://a.b.c/d/../d/../d/e/", "rsync://a.b.c/d/e/");
  check_normal("rsync://x//y/z/../../m/./n/o", "rsync://x//m/n/o");
  check_normal("rsync://.", "rsync://./");
  check_normal("https://./.", "https://./");
  check_normal("https://./d", "https://./d");
  check_normal("rsync://..", "rsync://../");
  check_normal("rsync://../..", "rsync://../");
  check_normal("rsync://../d", "rsync://../d");
  check_normal("rsync://a.b.c/..", "rsync://a.b.c/");
  check_normal("rsync://a.b.c/../..", "rsync://a.b.c/");
  check_normal("rsync://a.b.c/../x", "rsync://a.b.c/x");
  check_normal("rsync://a.b.c/../x/y/z", "rsync://a.b.c/x/y/z");
  check_normal("rsync://a.b.c/d/e/../../..", "rsync://a.b.c/");

  printf("3986#6.2.2: All the above, combined\n");
  check_normal("example://a/b/c/%7Bfoo%7D", "example://a/b/c/%7Bfoo%7D");
  check_normal("eXAMPLE://a/./b/../b/%63/%7bfoo%7d", "example://a/b/c/%7Bfoo%7D");

  printf("3986#6.2.3: Scheme-based normalization\n");
  check_normal("http://example.com/?", "http://example.com/?");
  check_normal("http://example.com/#", "http://example.com/#");

  printf("9110#4.2.3a: Default port\n");
  check_normal("http://example.com:/", "http://example.com/");
  check_normal("http://example.com:80/", "http://example.com/");
  check_normal("http://example.com:443/", "http://example.com:443/");
  check_normal("https://example.com:443/", "https://example.com/");
  check_normal("https://example.com:80/", "https://example.com:80/");

  printf("9110#4.2.3b: Provide default path\n");
  check_normal("http://example.com", "http://example.com/");
  check_normal("http://example.com/", "http://example.com/");

  printf("9110#4.2.3: All 9110, combined\n");
  check_normal("http://example.com:80/~smith/home.html",
               "http://example.com/~smith/home.html");
  check_normal("http://EXAMPLE.com/%7Esmith/home.html",
               "http://example.com/~smith/home.html");
  check_normal("http://EXAMPLE.com:/%7esmith/home.html",
               "http://example.com/~smith/home.html");

  /*
   * Additional, tricky: RFC 3986 never states that `//` should be
   * normalized as `/`, which seemingly implies that `/d//..` equals
   * `/d/`, not `/` (as Unix would lead one to believe).
   */
  printf("Extra\n");
  check_normal("http://a.b.c/d//..", "http://a.b.c/d/");
}
```

This is the output (each URI trio is meant to be a failed test):
(I compiled commit 113f2a5.)
I'm having second thoughts about this entire PR. I'm not merging it this cycle.
The API documentation for curl_url_get() mentions normalizing when extracting a URL (only), so this code does not affect retrieving the individual components.
"Normalizing" implies URL-decoding and then URL-encoding every URL part independently, and then putting the cleansed versions together into a full URL. This process cannot, however, be an unconditional decode + encode for each part, since a general URL encode would encode characters that users might prefer, or even need, to have as-is in the URL. An example is `=` in the query part, which is typically treated differently than `%3D`. This makes normalizing opinionated and not strictly a spec thing.

todo: CURLU_NORMALIZE
Extend test 1560 to verify
Fixes #16829
Reported-by: Alberto Leiva Popper