ZnUrl replaces percent-encoded octets for reserved characters #97

Open
Rinzwind opened this issue Jun 27, 2022 · 3 comments

@Rinzwind

Rinzwind commented Jun 27, 2022

ZnUrl seems to go against RFC 3986, in that it replaces percent-encoded octets for some reserved characters by those characters. Take the following block:

[ :url | (ZnUrl fromString: url) asString ]

Examples of how this block transforms URLs:

  • https://example.com/?a=b%3Dc ⇒ https://example.com/?a=b%3Dc
    The two URLs are exactly the same.

  • https://example.com/?a~b%7Ec ⇒ https://example.com/?a~b~c
    The two URLs differ (%7E versus ~), but per section ‘2.3. Unreserved Characters’ in RFC 3986 they are equivalent: “URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent”.

The problem is in the third example:

  • https://example.com/?a;b%3Bc ⇒ https://example.com/?a;b;c
    The two URLs differ (%3B versus ;), and per section ‘2.2. Reserved Characters’ in RFC 3986, they are not equivalent: “URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent”.

Note that the equals sign, used in the first example, is also a reserved character and is used as a delimiter in the URL-encoding of forms in HTML. As far as I understand, the intent of section 2.2 in RFC 3986 is that one could define a similar encoding that uses other reserved characters as delimiters: the queries of the URLs in the third example could be encodings of arrays of strings, in which the array #('a' 'b;c') is encoded as a;b%3Bc and the array #('a' 'b' 'c') as a;b;c.
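
To make the array example concrete, here is a minimal sketch in Python (chosen only for illustration; it is not Pharo and not part of Zinc). The names encode_array and decode_array are hypothetical, and the encoding itself is the made-up one described above: a literal semicolon separates items, while %3B stands for a semicolon inside an item.

```python
from urllib.parse import quote, unquote


def encode_array(items):
    # Percent-encode every item completely (safe="" also encodes ';'),
    # then join the items with a literal ';' acting as the delimiter.
    return ";".join(quote(item, safe="") for item in items)


def decode_array(query):
    # Split on the literal ';' delimiters first, and only then decode
    # each piece, so that %3B inside an item survives as data.
    return [unquote(piece) for piece in query.split(";")]


print(encode_array(["a", "b;c"]))     # a;b%3Bc
print(encode_array(["a", "b", "c"]))  # a;b;c
print(decode_array("a;b%3Bc"))        # ['a', 'b;c']
print(decode_array("a;b;c"))          # ['a', 'b', 'c']
```

If a URL library rewrites a;b%3Bc to a;b;c before the application sees it, the two-element array can no longer be told apart from the three-element one.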

Section ‘4.2.3. http(s) Normalization and Comparison’ in RFC 9110 states the following, for which it refers back to RFC 3986: “characters other than those in the "reserved" set are equivalent to their percent-encoded octets”.
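
A rough sketch of a normalization that would be consistent with these two RFCs, written in Python purely as an illustration (it is not Pharo and not Zinc code; normalize_pct is a hypothetical name). It decodes a percent-encoded octet only when that octet is an unreserved character, and otherwise leaves the triplet in place, uppercasing its hex digits:

```python
import re

# The "unreserved" set from RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def normalize_pct(text):
    def replace_triplet(match):
        hex_digits = match.group(1)
        char = chr(int(hex_digits, 16))
        if char in UNRESERVED:
            # Equivalent either way, so the decoded form may be used.
            return char
        # Reserved (or any other) octet: keep it percent-encoded.
        return "%" + hex_digits.upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", replace_triplet, text)


print(normalize_pct("a~b%7Ec"))  # a~b~c
print(normalize_pct("a;b%3Bc"))  # a;b%3Bc
print(normalize_pct("a=b%3Dc"))  # a=b%3Dc
```

Under this rule the first two examples come out as ZnUrl produces them today, while the third keeps its %3B.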

See my comment in issue #89 for how this is related to that issue.

@Rinzwind
Author

Additional example:

  • https://example.com/?a+b%2Bc ⇒ https://example.com/?a%20b%2Bc
    The two URLs differ (+ versus %20) and are not equivalent as the plus sign is a reserved character. Note that the plus sign is replaced by %20 (space in ASCII) rather than %2B (plus sign in ASCII). Plus signs are used to encode spaces in the URL-encoding of HTML forms, but the query of a URL is not necessarily an encoded HTML form.
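
The difference between generic percent-decoding and HTML-form decoding can be seen with Python's standard urllib.parse module, used here only as an illustration (this is not Pharo, and it says nothing definite about how Zinc is implemented):

```python
from urllib.parse import quote, unquote, unquote_plus

query = "a+b%2Bc"

# Generic percent-decoding: '+' is an ordinary character and stays as is.
generic = unquote(query)
print(generic)  # a+b+c

# HTML-form decoding: '+' is first turned into a space, then %XX is decoded.
form = unquote_plus(query)
print(form)  # a b+c

# Re-encoding the form-decoded text with nothing marked as safe.
reencoded = quote(form, safe="")
print(reencoded)  # a%20b%2Bc
```

The last line matches the output shown in the example above, which would fit a form-style decode followed by a generic re-encode. That is an assumption about the mechanism, not something confirmed from the Zinc source.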

@svenvc
Owner

svenvc commented Jun 27, 2022

Hi Kris,

Thanks a lot for your input, you are certainly on to something.

But basically, you are saying that parsing/printing is not symmetrical, right?

The question remains: what are we going to do, and how are we going to implement it?

I believe there might be room to improve on the current situation, but I am not yet seeing it clearly.

Sven

@Rinzwind
Author

I’m not sure either. The easiest aspect of ZnUrl to look at first w.r.t. this issue is likely the #fragment: method though. Examples using x := ZnUrl fromString: 'https://example.com/' as a starting point:

  • x copy fragment: 'a;b'; asString ⇒ 'https://example.com/#a;b'
  • x copy fragment: 'a%3Bb'; asString ⇒ 'https://example.com/#a%253Bb'
  • x copy fragment: 'a^b'; asString ⇒ 'https://example.com/#a%5Eb'

The problem here is that it’s not possible to get 'https://example.com/#a%3Bb' (which, due to the semicolon being a reserved character, is not equivalent to 'https://example.com/#a;b'). A method #basicFragment: could allow that:

  • x copy basicFragment: 'a;b'; asString ⇒ 'https://example.com/#a;b'
  • x copy basicFragment: 'a%3Bb'; asString ⇒ 'https://example.com/#a%3Bb'
  • x copy basicFragment: 'a^b' ⇒ an error is signaled (as a caret cannot occur in the fragment per the ABNF in RFC 3986)
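
The behaviour proposed for #basicFragment: can be sketched as follows, in Python for illustration only (not Pharo, not an existing Zinc method; basic_fragment and FRAGMENT are hypothetical names). The pattern follows the ABNF in RFC 3986, where fragment = *( pchar / "/" / "?" ) and pchar = unreserved / pct-encoded / sub-delims / ":" / "@":

```python
import re

# One allowed character (unreserved, sub-delims, ':', '@', '/', '?')
# or one percent-encoded octet, repeated zero or more times.
FRAGMENT = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")


def basic_fragment(base, fragment):
    # Accept the fragment verbatim if it is already valid per the ABNF;
    # never percent-encode or percent-decode anything.
    if FRAGMENT.fullmatch(fragment) is None:
        raise ValueError("not a valid RFC 3986 fragment: %r" % fragment)
    return base + "#" + fragment


print(basic_fragment("https://example.com/", "a;b"))    # https://example.com/#a;b
print(basic_fragment("https://example.com/", "a%3Bb"))  # https://example.com/#a%3Bb

try:
    basic_fragment("https://example.com/", "a^b")
except ValueError as error:
    print(error)
```

Because nothing is encoded or decoded, the caller decides whether a semicolon appears literally or as %3B.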

One possible question regarding #basicFragment: is whether it should make a distinction between these two examples or not:

  • x copy basicFragment: 'a~b'; asString
  • x copy basicFragment: 'a%7Eb'; asString

As ~ is unreserved, 'https://example.com/#a~b' and 'https://example.com/#a%7Eb' are equivalent. But there might be cases in which one wishes to distinguish between URLs that are equivalent but not equal (for example, to deal with a server which incorrectly does not treat them as equivalent).
