urlapi: normalize retrieved URLs by bagder · Pull Request #16841 · curl/curl · GitHub

urlapi: normalize retrieved URLs #16841


Draft · wants to merge 1 commit into master from bagder/normalize-urls

Conversation

@bagder (Member) commented Mar 26, 2025

The API documentation for curl_url_get() mentions normalizing when extracting a URL (only), so this code does not affect retrieving the individual components.

normalizing

It implies URL decoding and then URL encoding every URL part independently, and then putting the cleansed versions back together into a full URL. This process cannot, however, be an unconditional decode + encode for each part, since a blanket URL encode would encode characters that users might prefer, or even need, to keep as-is in the URL. One example is = in the query part, which is typically treated differently than %3D. This makes normalizing opinionated and not strictly a spec matter.
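
To make the = versus %3D point concrete, here is a minimal sketch. It assumes the CURLU_NORMALIZE flag proposed in the todo below ends up applied at curl_url_get() time, as the later commit messages in this thread describe; the comments state the intent, not verified output:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
	CURLU *u = curl_url();
	char *out = NULL;

	/* "=" here is the key/value delimiter; "%3D" is a literal '=' inside
	 * the value. An unconditional decode + encode pass would erase that
	 * difference. */
	curl_url_set(u, CURLUPART_URL, "https://example.com/?key=a%3Db", 0);
	if (!curl_url_get(u, CURLUPART_URL, &out, CURLU_NORMALIZE)) {
		printf("%s\n", out); /* intent: %3D stays encoded, bare = stays bare */
		curl_free(out);
	}
	curl_url_cleanup(u);
	return 0;
}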

todo:

  • normalize (lowercase) hostnames
  • normalize path
  • normalize query
  • normalize fragment
  • normalize user
  • normalize password
  • normalize option
  • add option to select normalizing: CURLU_NORMALIZE
  • document new option
  • document exactly how the normalization is done and works

Extend test 1560 to verify

Fixes #16829
Reported-by: Alberto Leiva Popper

@bagder added the URL label Mar 26, 2025
@github-actions bot added the tests label Mar 26, 2025

@bagder (Member, Author) commented Mar 27, 2025

On Mastodon I was shown an example of a bug that occurred in a different project due to URL normalization, which suggests we need a way to opt out.

@bagder force-pushed the bagder/normalize-urls branch 2 times, most recently from d5debbb to 23a3843 on March 28, 2025 11:37
@bagder added the feature-window label (a merge of this requires an open feature window) and removed the libcurl API label Mar 28, 2025
@DemiMarie left a comment


As someone who (completely coincidentally) has been doing some related research into URL normalization (but on the server side):

Different implementations normalize paths differently. Notably:

  • NGINX percent-decodes before path-splitting or evaluating . and .. components.
  • Recent Apache httpd decodes %2e and %2E to . before normalizing, but does not decode %2F yet. The default is to 404 if %2F is present.
  • HAProxy does not treat %2F specially, but requires an experimental flag to normalize URLs at all.
  • Web browsers do not normalize %2F.

I wonder if the safest option is to fail the request by default if %2F is present, as it has server-dependent behavior. Decoding it is definitely not safe as it could change how the request is routed, and I believe that the Wikimedia Foundation uses %2F in paths in places. The worst offender is when %2F decoding introduces new path components that are empty, ., or .., as Apache httpd and NGINX disagree on what to do here.

It would also be useful to document why some characters are kept unchanged.
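
To make the ordering problem concrete, here is a small self-contained sketch (not curl code; the decode and dot-removal helpers are simplified stand-ins written only for this illustration) showing how pre-decoding %2F can introduce new . / .. components and change where a path points:

#include <stdio.h>
#include <string.h>

/* Decode only %2F / %2f into '/', which is roughly what a server that
 * percent-decodes before path-splitting ends up doing. */
static void decode_2f(const char *in, char *out)
{
	size_t o = 0;
	for (size_t i = 0; in[i]; i++) {
		if (in[i] == '%' && in[i + 1] == '2' &&
		    (in[i + 2] == 'F' || in[i + 2] == 'f')) {
			out[o++] = '/';
			i += 2;
		} else {
			out[o++] = in[i];
		}
	}
	out[o] = '\0';
}

/* Tiny dot-segment remover: splits on '/', drops "." and resolves "..".
 * Buffers are sized generously for this example only. */
static void remove_dots(const char *in, char *out)
{
	char buf[256], *seg[32], *p;
	int n = 0;

	strcpy(buf, in);
	out[0] = '\0';
	for (p = strtok(buf, "/"); p; p = strtok(NULL, "/")) {
		if (!strcmp(p, "."))
			continue;
		if (!strcmp(p, "..")) {
			if (n)
				n--;
			continue;
		}
		seg[n++] = p;
	}
	for (int i = 0; i < n; i++) {
		strcat(out, "/");
		strcat(out, seg[i]);
	}
	if (!n)
		strcpy(out, "/");
}

int main(void)
{
	const char *raw = "/a/b%2F..%2Fc";
	char dec[256], early[256], late[256];

	decode_2f(raw, dec);      /* decode first ...               */
	remove_dots(dec, early);  /* ... then remove dot segments   */
	remove_dots(raw, late);   /* or remove dots with %2F intact */

	printf("raw path         : %s\n", raw);   /* /a/b%2F..%2Fc */
	printf("decode, then dots: %s\n", early); /* /a/c          */
	printf("dots, keep %%2F   : %s\n", late); /* /a/b%2F..%2Fc */
	return 0;
}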

@bagder (Member, Author) commented Mar 29, 2025

Thanks @DemiMarie for taking a look and your comments.

I actually did not intend for %2f in paths to get decoded in the normalization, since I would presume that a user would use that sequence to separate them from proper slashes for a reason. I think I should stick to that intent and tweak this PR accordingly.

curl actually does not decode either %2f or %2e when doing its "../" dedotdotification. (I filed #16869 for that; it is a bug.)

I opted to leave a whole bunch of letters non-percent-encoded in the query part, basically because they are commonly used unencoded in query parts and I thought this would keep them a little more readable and a little closer to how users would write them manually. I agree that we should document the normalization rules once we have settled them, so that users can learn exactly what to expect.
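
As a quick illustration of the dedotdotification point above, a minimal check using the existing URL API; the expected behavior in the comment is taken from this comment and #16869, not verified here:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
	CURLU *u = curl_url();
	char *out = NULL;

	curl_url_set(u, CURLUPART_URL, "https://example.com/a/%2e%2e/b", 0);
	if (!curl_url_get(u, CURLUPART_URL, &out, 0)) {
		/* per the comment above: the %2e%2e segment is not treated as
		 * ".." and the path comes back unchanged (see #16869) */
		printf("%s\n", out);
		curl_free(out);
	}
	curl_url_cleanup(u);
	return 0;
}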

@bagder force-pushed the bagder/normalize-urls branch from f875245 to 7323ebe on March 29, 2025 16:37
bagder added a commit that referenced this pull request Mar 30, 2025
The API documentation for curl_url_get() mentions normalizing when
extracting a URL (only), so this code does not affect retrieving the
individual components.

Normalizes: user, password, options, (lowercase) hostnames, path, query
and fragment

Added CURLU_NO_NORMALIZE flag for retrieving the URL without
normalization.

Uses this option internally for redirect handling.

Extend test 1560 to verify

Fixes #16829
Reported-by: Alberto Leiva Popper
Closes #16841
@bagder force-pushed the bagder/normalize-urls branch from beb5a15 to 617accf on March 30, 2025 12:33
bagder added a commit that referenced this pull request Mar 30, 2025
The API documentation for curl_url_get() mentions normalizing when
extracting a URL (only), so this code does not affect retrieving the
individual components.

Normalizes: user, password, options, (lowercase) hostnames, path, query
and fragment

Added CURLU_NO_NORMALIZE flag for retrieving the URL without
normalization.

Uses this option internally for redirect handling.

Extend test 1560 to verify

Assisted-by: Demi Marie Obenour

Fixes #16829
Reported-by: Alberto Leiva Popper
Closes #16841
@bagder force-pushed the bagder/normalize-urls branch from 617accf to 6c3526b on March 30, 2025 21:52
@bagder (Member, Author) commented Apr 1, 2025

I think this normalization is a little too intrusive to switch on by default like this (partly based on how it caused problems even to us when using this API internally), in spite of the existing wording in the documentation.

As I suspect the normalization can still be useful for selected users, I intend to revert the work here and make the URL-normalization opt-in instead of opt-out.

bagder added a commit that referenced this pull request Apr 1, 2025
No longer claim curl_url_get() does URL normalizing by default.

With this option used, curl_url_get() normalizes the user, password, options,
(lowercase) hostnames, path, query and fragment parts.

Extend test 1560 to verify

Assisted-by: Demi Marie Obenour

Fixes #16829 (sort of)
Reported-by: Alberto Leiva Popper
Closes #16841
@bagder force-pushed the bagder/normalize-urls branch from 722461d to 56fbbd4 on April 1, 2025 14:52
@ydahhrk commented Apr 8, 2025

This process cannot, however, be an unconditional decode + encode for each part, since a blanket URL encode would encode characters that users might prefer, or even need, to keep as-is in the URL. One example is = in the query part, which is typically treated differently than %3D. This makes normalizing opinionated and not strictly a spec matter.

Can you elaborate? = is a reserved character. RFC 3986 says

If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.

Which I interpret as "if a normalizer finds a reserved character, or a percent-encoded reserved character, it must leave it alone. (Not decode nor encode it.)"

This is because the semantics of that character are defined by whoever generated the URI, not the normalizer.

Normalized here means using consistent percent
encoding and decoding for the result.

Pedantic note: If the host name lowercasing is part of the normalization (which appears to be implied by the next paragraph), it appears "normalized" means more than just consistent percent encoding.

In the query component, %3D, %2B and %26 are left encoded is already
present and =, + and & left left as-is when provided.

Seems to need more coffee :-)

"is already" should probably be "if already," "left left" should prolly be "are left." Even then, there's a bit of redundancy. Suggestion:

"In the query component, %3D, %2B and %26 are left encoded; =, + and & are left decoded."

The host name is lowercased. All percent encodings are done using uppercase.

Does not match my tests; see below.


Ok, so I realize this is not necessarily meant to be a perfect RFC 3986 normalizer, but I'd like to point out some quirks I found in my 3986 unit tests, just in case:

/* Includes needed to build these excerpts. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

static char *
url_normalize(char const *url)
{
	CURLU *curlu;
	char *normal;
	CURLUcode err;

	curlu = curl_url();
	if (!curlu) {
		printf("Out of memory.\n");
		exit(ENOMEM);
	}

	err = curl_url_set(curlu, CURLUPART_URL, url,
	    CURLU_NON_SUPPORT_SCHEME | CURLU_NORMALIZE);
	if (err)
		goto einval;
	err = curl_url_get(curlu, CURLUPART_URL, &normal, 0);
	if (err)
		goto einval;

	curl_url_cleanup(curlu);

	return normal;

einval:	/* printf("Error parsing URL: %s\n", curl_url_strerror(err)); */
	curl_url_cleanup(curlu);
	return NULL;
}

static void
check_normal(char const *query, char const *expected)
{
	char *actual;
	bool equal;

	actual = url_normalize(query);
	if (expected == NULL)
		equal = (actual == NULL);
	else
		equal = (actual == NULL) ? false : !strcmp(expected, actual);
	if (!equal)
		/* report the failure even when url_normalize() returned NULL */
		printf("Oops!\n"
		       "- query          : %s\n"
		       "  expected result: %s\n"
		       "  actual result  : %s\n",
		       query, expected ? expected : "(nil)",
		       actual ? actual : "(nil)");

	curl_free(actual);
}

void
test_normalization(void)
{
	printf("Invalid URIs\n");
	check_normal("", NULL);
	check_normal("h", NULL);
	check_normal("http", NULL);
	check_normal("https", NULL);
	check_normal("https:", NULL);
	check_normal("https:/", NULL);
	check_normal("rsync://", NULL);
	check_normal("rsync://a.b.c/β", NULL);

	printf("3986#6.2.2.1: Uppercase percent-encoding\n");
	check_normal("https://a.b.c/%3a", "https://a.b.c/%3A");

	printf("3986#6.2.2.1, 9110#4.2.3c: Lowercase scheme and host\n");
	check_normal("http://a.b.c/d", "http://a.b.c/d");
	check_normal("abcde://a.b.c/d", "abcde://a.b.c/d");
	check_normal("HTTPS://a.b.c/d", "https://a.b.c/d");
	check_normal("rSyNc://a.b.c/d", "rsync://a.b.c/d");
	check_normal("HTTPS://A.B.C/d", "https://a.b.c/d");
	check_normal("HTTP://WWW.EXAMPLE.COM/aBc/dEf?gHi#jKl", "http://www.example.com/aBc/dEf?gHi#jKl");

	printf("3986#6.2.2.2, 9110#4.2.3d: Decode unreserved characters\n");
	check_normal("http://%61%7A.%41%5A.%30%39/%61%7A%41%5A%30%39", "http://az.AZ.09/azAZ09");
	check_normal("http://%2D%2E%5F%7E/%2D%2E%5F%7E", "http://-._~/-._~");

	printf("3986#6.2.2.3: Path segment normalization\n");
	check_normal("rsync://a.b.c", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/d", "rsync://a.b.c/d");
	check_normal("rsync://a.b.c//////", "rsync://a.b.c//////");
	check_normal("rsync://a.b.c/d/e", "rsync://a.b.c/d/e");
	check_normal("rsync://a.b.c/d/e/.", "rsync://a.b.c/d/e/");
	check_normal("rsync://a.b.c/d/e/.", "rsync://a.b.c/d/e/");
	check_normal("rsync://a.b.c/././d/././e/./.", "rsync://a.b.c/d/e/");
	check_normal("rsync://a.b.c/d/..", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/x/../x/y/z", "rsync://a.b.c/x/y/z");
	check_normal("rsync://a.b.c/d/../d/../d/e/", "rsync://a.b.c/d/e/");
	check_normal("rsync://x//y/z/../../m/./n/o", "rsync://x//m/n/o");
	check_normal("rsync://.", "rsync://./");
	check_normal("https://./.", "https://./");
	check_normal("https://./d", "https://./d");
	check_normal("rsync://..", "rsync://../");
	check_normal("rsync://../..", "rsync://../");
	check_normal("rsync://../d", "rsync://../d");
	check_normal("rsync://a.b.c/..", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/../..", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/../x", "rsync://a.b.c/x");
	check_normal("rsync://a.b.c/../x/y/z", "rsync://a.b.c/x/y/z");
	check_normal("rsync://a.b.c/d/e/../../..", "rsync://a.b.c/");

	printf("3986#6.2.2: All the above, combined\n");
	check_normal("example://a/b/c/%7Bfoo%7D", "example://a/b/c/%7Bfoo%7D");
	check_normal("eXAMPLE://a/./b/../b/%63/%7bfoo%7d", "example://a/b/c/%7Bfoo%7D");

	printf("3986#6.2.3: Scheme-based normalization\n");
	check_normal("http://example.com/?", "http://example.com/?");
	check_normal("http://example.com/#", "http://example.com/#");

	printf("9110#4.2.3a: Default port\n");
	check_normal("http://example.com:/", "http://example.com/");
	check_normal("http://example.com:80/", "http://example.com/");
	check_normal("http://example.com:443/", "http://example.com:443/");
	check_normal("https://example.com:443/", "https://example.com/");
	check_normal("https://example.com:80/", "https://example.com:80/");

	printf("9110#4.2.3b: Provide default path\n");
	check_normal("http://example.com", "http://example.com/");
	check_normal("http://example.com/", "http://example.com/");

	printf("9110#4.2.3: All 9110, combined\n");
	check_normal("http://example.com:80/~smith/home.html", "http://example.com/~smith/home.html");
	check_normal("http://EXAMPLE.com/%7Esmith/home.html", "http://example.com/~smith/home.html");
	check_normal("http://EXAMPLE.com:/%7esmith/home.html", "http://example.com/~smith/home.html");

	/*
	 * Additional, tricky: RFC 3986 never states that `//` should be
	 * normalized to `/`, which seemingly implies that `/d//..` equals
	 * `/d/`, not `/` (as Unix would lead one to believe).
	 */
	printf("Extra\n");
	check_normal("http://a.b.c/d//..", "http://a.b.c/d/");
}

This is the output (each URI trio is meant to be a failed test):

Invalid URIs
Oops!
- query          : rsync://a.b.c/β
  expected result: (nil)
  actual result  : rsync://a.b.c/β
3986#6.2.2.1: Uppercase percent-encoding
Oops!
- query          : https://a.b.c/%3a
  expected result: https://a.b.c/%3A
  actual result  : https://a.b.c/%3a
3986#6.2.2.1, 9110#4.2.3c: Lowercase scheme and host
Oops!
- query          : HTTPS://A.B.C/d
  expected result: https://a.b.c/d
  actual result  : https://A.B.C/d
Oops!
- query          : HTTP://WWW.EXAMPLE.COM/aBc/dEf?gHi#jKl
  expected result: http://www.example.com/aBc/dEf?gHi#jKl
  actual result  : http://WWW.EXAMPLE.COM/aBc/dEf?gHi#jKl
3986#6.2.2.2, 9110#4.2.3d: Decode unreserved characters
Oops!
- query          : http://%61%7A.%41%5A.%30%39/%61%7A%41%5A%30%39
  expected result: http://az.AZ.09/azAZ09
  actual result  : http://az.AZ.09/%61%7A%41%5A%30%39
Oops!
- query          : http://%2D%2E%5F%7E/%2D%2E%5F%7E
  expected result: http://-._~/-._~
  actual result  : http://-._~/%2D%2E%5F%7E
3986#6.2.2.3: Path segment normalization
3986#6.2.2: All the above, combined
Oops!
- query          : eXAMPLE://a/./b/../b/%63/%7bfoo%7d
  expected result: example://a/b/c/%7Bfoo%7D
  actual result  : example://a/b/%63/%7bfoo%7d
3986#6.2.3: Scheme-based normalization
Oops!
- query          : http://example.com/?
  expected result: http://example.com/?
  actual result  : http://example.com/
Oops!
- query          : http://example.com/#
  expected result: http://example.com/#
  actual result  : http://example.com/
9110#4.2.3a: Default port
Oops!
- query          : http://example.com:80/
  expected result: http://example.com/
  actual result  : http://example.com:80/
Oops!
- query          : https://example.com:443/
  expected result: https://example.com/
  actual result  : https://example.com:443/
9110#4.2.3b: Provide default path
9110#4.2.3: All 9110, combined
Oops!
- query          : http://example.com:80/~smith/home.html
  expected result: http://example.com/~smith/home.html
  actual result  : http://example.com:80/~smith/home.html
Oops!
- query          : http://EXAMPLE.com/%7Esmith/home.html
  expected result: http://example.com/~smith/home.html
  actual result  : http://EXAMPLE.com/%7Esmith/home.html
Oops!
- query          : http://EXAMPLE.com:/%7esmith/home.html
  expected result: http://example.com/~smith/home.html
  actual result  : http://EXAMPLE.com/%7esmith/home.html
Extra

@ydahhrk commented Apr 8, 2025

(I compiled commit 113f2a5.)

@bagder (Member, Author) commented Apr 28, 2025

I'm having second thoughts about this entire PR. I'm not merging this cycle.

@bagder added the on-hold label Apr 28, 2025
Labels: feature-window (a merge of this requires an open feature window), libcurl API, on-hold, tests, URL

Successfully merging this pull request may close: libcurl does not normalize percent-encoding in URL paths

4 participants