urlapi: normalize retrieved URLs by bagder · Pull Request #16841 · curl/curl · GitHub

urlapi: normalize retrieved URLs #16841


Draft · wants to merge 1 commit into master from bagder/normalize-urls

Conversation

@bagder (Member) commented Mar 26, 2025

The API documentation for curl_url_get() mentions normalizing when extracting a URL (only), so this code does not affect retrieving the individual components.

normalizing

It implies URL decoding and then URL encoding every URL part independently, and then putting the cleansed versions back together into a full URL. This process cannot, however, be an unconditional decode + encode for each part, since a blanket URL encode would encode characters that users might prefer, or even need, to keep as-is in the URL. One example is = in the query part, which is typically treated differently than %3D. This makes normalizing opinionated and not strictly a spec matter.
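
To make the = versus %3D point concrete, here is a minimal sketch. It assumes the CURLU_NORMALIZE flag proposed in the todo below ends up applied at curl_url_get() time, as the later commit messages in this thread describe; the comments state the intent, not verified output:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
	CURLU *u = curl_url();
	char *out = NULL;

	/* "=" here is the key/value delimiter; "%3D" is a literal '=' inside
	 * the value. An unconditional decode + encode pass would erase that
	 * difference. */
	curl_url_set(u, CURLUPART_URL, "https://example.com/?key=a%3Db", 0);
	if (!curl_url_get(u, CURLUPART_URL, &out, CURLU_NORMALIZE)) {
		printf("%s\n", out); /* intent: %3D stays encoded, bare = stays bare */
		curl_free(out);
	}
	curl_url_cleanup(u);
	return 0;
}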

todo:

  • normalize (lowercase) hostnames
  • normalize path
  • normalize query
  • normalize fragment
  • normalize user
  • normalize password
  • normalize option
  • add option to select normalizing: CURLU_NORMALIZE
  • document new option
  • document exactly how the normalization is done and works

Extend test 1560 to verify

Fixes #16829
Reported-by: Alberto Leiva Popper

@bagder added the URL label Mar 26, 2025
@github-actions bot added the tests label Mar 26, 2025

@bagder (Member, Author) commented Mar 27, 2025

On Mastodon I was shown an example of a bug that occurred in a different project due to URL normalization, which suggests we need a way to opt out.

@bagder force-pushed the bagder/normalize-urls branch 2 times, most recently from d5debbb to 23a3843 on March 28, 2025 11:37
@bagder added the feature-window label (a merge of this requires an open feature window) and removed the libcurl API label Mar 28, 2025
@DemiMarie left a comment


As someone who (completely coincidentally) has been doing some related research into URL normalization (but on the server side):

Different implementations normalize paths differently. Notably:

  • NGINX percent-decodes before path-splitting or evaluating . and .. components.
  • Recent Apache httpd decodes %2e and %2E to . before normalizing, but does not decode %2F yet. The default is to 404 if %2F is present.
  • HAProxy does not treat %2F specially, but requires an experimental flag to normalize URLs at all.
  • Web browsers do not normalize %2F.

I wonder if the safest option is to fail the request by default if %2F is present, as it has server-dependent behavior. Decoding it is definitely not safe as it could change how the request is routed, and I believe that the Wikimedia Foundation uses %2F in paths in places. The worst offender is when %2F decoding introduces new path components that are empty, ., or .., as Apache httpd and NGINX disagree on what to do here.

It would also be useful to document why some characters are kept unchanged.
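
To make the ordering problem concrete, here is a small self-contained sketch (not curl code; the decode and dot-removal helpers are simplified stand-ins written only for this illustration) showing how pre-decoding %2F can introduce new . / .. components and change where a path points:

#include <stdio.h>
#include <string.h>

/* Decode only %2F / %2f into '/', which is roughly what a server that
 * percent-decodes before path-splitting ends up doing. */
static void decode_2f(const char *in, char *out)
{
	size_t o = 0;
	for (size_t i = 0; in[i]; i++) {
		if (in[i] == '%' && in[i + 1] == '2' &&
		    (in[i + 2] == 'F' || in[i + 2] == 'f')) {
			out[o++] = '/';
			i += 2;
		} else {
			out[o++] = in[i];
		}
	}
	out[o] = '\0';
}

/* Tiny dot-segment remover: splits on '/', drops "." and resolves "..".
 * Buffers are sized generously for this example only. */
static void remove_dots(const char *in, char *out)
{
	char buf[256], *seg[32], *p;
	int n = 0;

	strcpy(buf, in);
	out[0] = '\0';
	for (p = strtok(buf, "/"); p; p = strtok(NULL, "/")) {
		if (!strcmp(p, "."))
			continue;
		if (!strcmp(p, "..")) {
			if (n)
				n--;
			continue;
		}
		seg[n++] = p;
	}
	for (int i = 0; i < n; i++) {
		strcat(out, "/");
		strcat(out, seg[i]);
	}
	if (!n)
		strcpy(out, "/");
}

int main(void)
{
	const char *raw = "/a/b%2F..%2Fc";
	char dec[256], early[256], late[256];

	decode_2f(raw, dec);      /* decode first ...               */
	remove_dots(dec, early);  /* ... then remove dot segments   */
	remove_dots(raw, late);   /* or remove dots with %2F intact */

	printf("raw path         : %s\n", raw);   /* /a/b%2F..%2Fc */
	printf("decode, then dots: %s\n", early); /* /a/c          */
	printf("dots, keep %%2F   : %s\n", late); /* /a/b%2F..%2Fc */
	return 0;
}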

@bagder (Member, Author) commented Mar 29, 2025

Thanks @DemiMarie for taking a look and your comments.

I actually did not intend for %2f in paths to get decoded in the normalization, since I would presume that a user would use that sequence to separate them from proper slashes for a reason. I think I should stick to that intent and tweak this PR accordingly.

curl actually does not decode either %2f or %2e when doing its "../" dedotdotification. (I filed #16869 for that; it is a bug.)

I opted to leave a whole bunch of letters non-percent-encoded in the query part, basically because they are commonly used unencoded in query parts and I thought this would keep them a little more readable and a little closer to how users would write them manually. I agree that we should document the normalization rules once we have settled them, so that users can learn exactly what to expect.
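
As a quick illustration of the dedotdotification point above, a minimal check using the existing URL API; the expected behavior in the comment is taken from this comment and #16869, not verified here:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
	CURLU *u = curl_url();
	char *out = NULL;

	curl_url_set(u, CURLUPART_URL, "https://example.com/a/%2e%2e/b", 0);
	if (!curl_url_get(u, CURLUPART_URL, &out, 0)) {
		/* per the comment above: the %2e%2e segment is not treated as
		 * ".." and the path comes back unchanged (see #16869) */
		printf("%s\n", out);
		curl_free(out);
	}
	curl_url_cleanup(u);
	return 0;
}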

@bagder force-pushed the bagder/normalize-urls branch from f875245 to 7323ebe on March 29, 2025 16:37
bagder added a commit that referenced this pull request Mar 30, 2025
The API documentation for curl_url_get() mentions normalizing when
extracting a URL (only), so this code does not affect retrieving the
individual components.

Normalizes: user, password, options, (lowercase) hostnames, path, query
and fragment

Added CURLU_NO_NORMALIZE flag for retrieving the URL without
normalization.

Uses this option internally for redirect handling.

Extend test 1560 to verify

Fixes #16829
Reported-by: Alberto Leiva Popper
Closes #16841
@bagder force-pushed the bagder/normalize-urls branch from beb5a15 to 617accf on March 30, 2025 12:33
bagder added a commit that referenced this pull request Mar 30, 2025
The API documentation for curl_url_get() mentions normalizing when
extracting a URL (only), so this code does not affect retrieving the
individual components.

Normalizes: user, password, options, (lowercase) hostnames, path, query
and fragment

Added CURLU_NO_NORMALIZE flag for retrieving the URL without
normalization.

Uses this option internally for redirect handling.

Extend test 1560 to verify

Assisted-by: Demi Marie Obenour

Fixes #16829
Reported-by: Alberto Leiva Popper
Closes #16841
@bagder force-pushed the bagder/normalize-urls branch from 617accf to 6c3526b on March 30, 2025 21:52
@bagder (Member, Author) commented Apr 1, 2025

I think this normalization is a little too intrusive to switch on by default like this (partly based on how it caused problems even to us when using this API internally), in spite of the existing wording in the documentation.

As I suspect the normalization can still be useful for selected users, I intend to revert the work here and make the URL-normalization opt-in instead of opt-out.

bagder added a commit that referenced this pull request Apr 1, 2025
No longer claim curl_url_get() does URL normalizing by default.

With this option used, curl_url_get() normalizes the user, password, options,
(lowercase) hostnames, path, query and fragment parts.

Extend test 1560 to verify

Assisted-by: Demi Marie Obenour

Fixes #16829 (sort of)
Reported-by: Alberto Leiva Popper
Closes #16841
@bagder force-pushed the bagder/normalize-urls branch from 722461d to 56fbbd4 on April 1, 2025 14:52
@ydahhrk commented Apr 8, 2025

This process cannot, however, be an unconditional decode + encode for each part, since a blanket URL encode would encode characters that users might prefer, or even need, to keep as-is in the URL. One example is = in the query part, which is typically treated differently than %3D. This makes normalizing opinionated and not strictly a spec matter.

Can you elaborate? = is a reserved character. RFC 3986 says

If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.

Which I interpret as "if a normalizer finds a reserved character, or a percent-encoded reserved character, it must leave it alone. (Not decode nor encode it.)"

This is because the semantics of that character are defined by whoever generated the URI, not the normalizer.

Normalized here means using consistent percent
encoding and decoding for the result.

Pedantic note: If the host name lowercasing is part of the normalization (which appears to be implied by the next paragraph), it appears "normalized" means more than just consistent percent encoding.

In the query component, %3D, %2B and %26 are left encoded is already
present and =, + and & left left as-is when provided.

Seems to need more coffee :-)

"is already" should probably be "if already," "left left" should prolly be "are left." Even then, there's a bit of redundancy. Suggestion:

"In the query component, %3D, %2B and %26 are left encoded; =, + and & are left decoded."

The host name is lowercased. All percent encodings are done using uppercase.

Does not match my tests; see below.


Ok, so I realize this is not necessarily meant to be a perfect RFC 3986 normalizer, but I'd like to point out some quirks I found in my 3986 unit tests, just in case:

/* Includes needed to build these excerpts. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

static char *
url_normalize(char const *url)
{
	CURLU *curlu;
	char *normal;
	CURLUcode err;

	curlu = curl_url();
	if (!curlu) {
		printf("Out of memory.\n");
		exit(ENOMEM);
	}

	err = curl_url_set(curlu, CURLUPART_URL, url,
	    CURLU_NON_SUPPORT_SCHEME | CURLU_NORMALIZE);
	if (err)
		goto einval;
	err = curl_url_get(curlu, CURLUPART_URL, &normal, 0);
	if (err)
		goto einval;

	curl_url_cleanup(curlu);

	return normal;

einval:	/* printf("Error parsing URL: %s\n", curl_url_strerror(err)); */
	curl_url_cleanup(curlu);
	return NULL;
}

static void
check_normal(char const *query, char const *expected)
{
	char *actual;
	bool equal;

	actual = url_normalize(query);
	if (expected == NULL)
		equal = (actual == NULL);
	else
		equal = (actual == NULL) ? false : !strcmp(expected, actual);
	if (!equal)
		/* report the failure even when url_normalize() returned NULL */
		printf("Oops!\n"
		       "- query          : %s\n"
		       "  expected result: %s\n"
		       "  actual result  : %s\n",
		       query, expected ? expected : "(nil)",
		       actual ? actual : "(nil)");

	curl_free(actual);
}

void
test_normalization(void)
{
	printf("Invalid URIs\n");
	check_normal("", NULL);
	check_normal("h", NULL);
	check_normal("http", NULL);
	check_normal("https", NULL);
	check_normal("https:", NULL);
	check_normal("https:/", NULL);
	check_normal("rsync://", NULL);
	check_normal("rsync://a.b.c/β", NULL);

	printf("3986#6.2.2.1: Uppercase percent-encoding\n");
	check_normal("https://a.b.c/%3a", "https://a.b.c/%3A");

	printf("3986#6.2.2.1, 9110#4.2.3c: Lowercase scheme and host\n");
	check_normal("http://a.b.c/d", "http://a.b.c/d");
	check_normal("abcde://a.b.c/d", "abcde://a.b.c/d");
	check_normal("HTTPS://a.b.c/d", "https://a.b.c/d");
	check_normal("rSyNc://a.b.c/d", "rsync://a.b.c/d");
	check_normal("HTTPS://A.B.C/d", "https://a.b.c/d");
	check_normal("HTTP://WWW.EXAMPLE.COM/aBc/dEf?gHi#jKl", "http://www.example.com/aBc/dEf?gHi#jKl");

	printf("3986#6.2.2.2, 9110#4.2.3d: Decode unreserved characters\n");
	check_normal("http://%61%7A.%41%5A.%30%39/%61%7A%41%5A%30%39", "http://az.AZ.09/azAZ09");
	check_normal("http://%2D%2E%5F%7E/%2D%2E%5F%7E", "http://-._~/-._~");

	printf("3986#6.2.2.3: Path segment normalization\n");
	check_normal("rsync://a.b.c", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/d", "rsync://a.b.c/d");
	check_normal("rsync://a.b.c//////", "rsync://a.b.c//////");
	check_normal("rsync://a.b.c/d/e", "rsync://a.b.c/d/e");
	check_normal("rsync://a.b.c/d/e/.", "rsync://a.b.c/d/e/");
	check_normal("rsync://a.b.c/d/e/.", "rsync://a.b.c/d/e/");
	check_normal("rsync://a.b.c/././d/././e/./.", "rsync://a.b.c/d/e/");
	check_normal("rsync://a.b.c/d/..", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/x/../x/y/z", "rsync://a.b.c/x/y/z");
	check_normal("rsync://a.b.c/d/../d/../d/e/", "rsync://a.b.c/d/e/");
	check_normal("rsync://x//y/z/../../m/./n/o", "rsync://x//m/n/o");
	check_normal("rsync://.", "rsync://./");
	check_normal("https://./.", "https://./");
	check_normal("https://./d", "https://./d");
	check_normal("rsync://..", "rsync://../");
	check_normal("rsync://../..", "rsync://../");
	check_normal("rsync://../d", "rsync://../d");
	check_normal("rsync://a.b.c/..", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/../..", "rsync://a.b.c/");
	check_normal("rsync://a.b.c/../x", "rsync://a.b.c/x");
	check_normal("rsync://a.b.c/../x/y/z", "rsync://a.b.c/x/y/z");
	check_normal("rsync://a.b.c/d/e/../../..", "rsync://a.b.c/");

	printf("3986#6.2.2: All the above, combined\n");
	check_normal("example://a/b/c/%7Bfoo%7D", "example://a/b/c/%7Bfoo%7D");
	check_normal("eXAMPLE://a/./b/../b/%63/%7bfoo%7d", "example://a/b/c/%7Bfoo%7D");

	printf("3986#6.2.3: Scheme-based normalization\n");
	check_normal("http://example.com/?", "http://example.com/?");
	check_normal("http://example.com/#", "http://example.com/#");

	printf("9110#4.2.3a: Default port\n");
	check_normal("http://example.com:/", "http://example.com/");
	check_normal("http://example.com:80/", "http://example.com/");
	check_normal("http://example.com:443/", "http://example.com:443/");
	check_normal("https://example.com:443/", "https://example.com/");
	check_normal("https://example.com:80/", "https://example.com:80/");

	printf("9110#4.2.3b: Provide default path\n");
	check_normal("http://example.com", "http://example.com/");
	check_normal("http://example.com/", "http://example.com/");

	printf("9110#4.2.3: All 9110, combined\n");
	check_normal("http://example.com:80/~smith/home.html", "http://example.com/~smith/home.html");
	check_normal("http://EXAMPLE.com/%7Esmith/home.html", "http://example.com/~smith/home.html");
	check_normal("http://EXAMPLE.com:/%7esmith/home.html", "http://example.com/~smith/home.html");

	/*
	 * Additional, tricky: RFC 3986 never states that `//` should be
	 * normalized to `/`, which seemingly implies that `/d//..` equals
	 * `/d/`, not `/` (as Unix would lead one to believe).
	 */
	printf("Extra\n");
	check_normal("http://a.b.c/d//..", "http://a.b.c/d/");
}

This is the output (each URI trio is meant to be a failed test):

Invalid URIs
Oops!
- query          : rsync://a.b.c/β
  expected result: (nil)
  actual result  : rsync://a.b.c/β
3986#6.2.2.1: Uppercase percent-encoding
Oops!
- query          : https://a.b.c/%3a
  expected result: https://a.b.c/%3A
  actual result  : https://a.b.c/%3a
3986#6.2.2.1, 9110#4.2.3c: Lowercase scheme and host
Oops!
- query          : HTTPS://A.B.C/d
  expected result: https://a.b.c/d
  actual result  : https://A.B.C/d
Oops!
- query          : HTTP://WWW.EXAMPLE.COM/aBc/dEf?gHi#jKl
  expected result: http://www.example.com/aBc/dEf?gHi#jKl
  actual result  : http://WWW.EXAMPLE.COM/aBc/dEf?gHi#jKl
3986#6.2.2.2, 9110#4.2.3d: Decode unreserved characters
Oops!
- query          : http://%61%7A.%41%5A.%30%39/%61%7A%41%5A%30%39
  expected result: http://az.AZ.09/azAZ09
  actual result  : http://az.AZ.09/%61%7A%41%5A%30%39
Oops!
- query          : http://%2D%2E%5F%7E/%2D%2E%5F%7E
  expected result: http://-._~/-._~
  actual result  : http://-._~/%2D%2E%5F%7E
3986#6.2.2.3: Path segment normalization
3986#6.2.2: All the above, combined
Oops!
- query          : eXAMPLE://a/./b/../b/%63/%7bfoo%7d
  expected result: example://a/b/c/%7Bfoo%7D
  actual result  : example://a/b/%63/%7bfoo%7d
3986#6.2.3: Scheme-based normalization
Oops!
- query          : http://example.com/?
  expected result: http://example.com/?
  actual result  : http://example.com/
Oops!
- query          : http://example.com/#
  expected result: http://example.com/#
  actual result  : http://example.com/
9110#4.2.3a: Default port
Oops!
- query          : http://example.com:80/
  expected result: http://example.com/
  actual result  : http://example.com:80/
Oops!
- query          : https://example.com:443/
  expected result: https://example.com/
  actual result  : https://example.com:443/
9110#4.2.3b: Provide default path
9110#4.2.3: All 9110, combined
Oops!
- query          : http://example.com:80/~smith/home.html
  expected result: http://example.com/~smith/home.html
  actual result  : http://example.com:80/~smith/home.html
Oops!
- query          : http://EXAMPLE.com/%7Esmith/home.html
  expected result: http://example.com/~smith/home.html
  actual result  : http://EXAMPLE.com/%7Esmith/home.html
Oops!
- query          : http://EXAMPLE.com:/%7esmith/home.html
  expected result: http://example.com/~smith/home.html
  actual result  : http://EXAMPLE.com/%7esmith/home.html
Extra

@ydahhrk commented Apr 8, 2025

(I compiled commit 113f2a5.)

@bagder (Member, Author) commented Apr 28, 2025

I'm having second thoughts about this entire PR. I'm not merging this cycle.

@bagder added the on-hold label Apr 28, 2025
Labels: feature-window (a merge of this requires an open feature window), libcurl API, on-hold, tests, URL

Successfully merging this pull request may close: libcurl does not normalize percent-encoding in URL paths

4 participants