Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#pathSegments returns segments that are either Character or ByteString #63

Open
chisandrei opened this issue Jun 1, 2021 · 8 comments

Comments

@chisandrei
Copy link
Contributor

In the current version of Pharo 9 on MacOS (Pharo9.0.0
Build information: Pharo-9.0.0+build.1444.sha.a22a97ca8a50d8c0d470c3d1dde29af0e49a85f3 (64 Bit)), when pathSegments is called on a url with an empty path a ByteSymbol is returned.

znUrl := 'https://github.com/feenkcom/gtoolkit/issues/' asZnUrl.
znUrl pathSegments fourth

Screenshot 2021-06-01 at 17 49 14

When updating to the latest version of Zinc from this repository, this behaviour changes. A character is returned instead.
Screenshot 2021-06-01 at 17 49 54

We are not sure if this change was on purpose?

Also if there is a path segment then we get a ByteString:
Screenshot 2021-06-01 at 17 54 08

So one cannot assume pathSegments returns a collection of Strings.

@svenvc
Copy link
Owner

svenvc commented Jun 1, 2021

It was intentional

d962746#diff-47063f74b25c99af0ddf4f0840769d07aca3728dabb14c78d1f741f685d75235

but I can't immediately remember why we did this, maybe @noha can ...

@noha
Copy link
Contributor

noha commented Jun 1, 2021

Yes, I'm a mail archiver. The point was that / is valid as trailing slash and that had the problem that

'http://foo.bar.com/p/' asZnUrl / #otherSegment

gave

http://foo.bar.com/p//otherSegment

(not the double slash). You can see this up to pharo7. Now when you do it you get only a single slash how it should be

georgeganea pushed a commit to feenkcom/gtoolkit that referenced this issue Jun 1, 2021
Metacello new
    baseline: 'GToolkit';
    repository: 'github://feenkcom/gtoolkit:v0.8.679/src';
    load

All commits (including upstream repositories) since last build:
feenkcom/gtoolkit-releaser@939849 by Andrei Chis
fix merge conflict

feenkcom/gtoolkit-releaser@bd3405 by Andrei Chis
Merge c4ca89f1bbd68cc90a15803f177386cccd5986d4

feenkcom/gtoolkit-releaser@8c8f4d by Andrei Chis
Add Zinc and Zodiac with a release strategy of type GtRlBaselineTagReleaseStrategy

feenkcom/Pharo-Chrome@a57c7e by Andrei Chis
Do not load explicitly the WebSocket package

feenkcom/Pharo-Chrome@f67ab5 by Andrei Chis
Merge f2e52c468bed3e4e363cd6d94efad9cfceeeeb3e

feenkcom/Pharo-Chrome@0c07ad by Andrei Chis
Depend on ZincHTTPComponentsWebSocket

feenkcom/lepiter@a668cd by Andrei Chis
Merge 92fcf1d6b0c53948360273cdb1ce1aedb4e32f6f

feenkcom/lepiter@e6aea1 by Andrei Chis
Explicitly convert a path segment to a string to avoid the issue in [svenvc/zinc#63]

feenkcom/lepiter@92fcf1 by Alistair Grant
LeDatabase: construct attachment paths in a platform independent way

Issue: feenkcom/lepiter#1

6da451 by Andrei Chiș
Enable local loading using GtRlProjectsLoader
@svenvc
Copy link
Owner

svenvc commented Jun 1, 2021

Also, Andrei, a URL that ends with a slash is fundamentally different from one that does not. See ZnUrl>>#isDirectoryPath

Maybe #pathSegments exposes a bit too much about the internal implementation. From this perspective, #path might be a better accessor.

What is it that you want to do ?

@chisandrei
Copy link
Contributor Author

thanks @svenvc and @noha for the details. I though there was a reason but I did not find that commit.

The code that failed in our case was something like below. On a url that has four segments we were checking if the fourth segment was a number, to validate a url like https://github.com/user/repo/issues/1234 and invalidate https://github.com/user/repo/issues/.

(NumberParser on: anUrl pathSegments fourth) 
		failBlock: [ isValid := false. ^ self ];
		nextNumber

Now I'll check explicitly for / since I see it can be a valid path segment.

@syrel
Copy link

syrel commented Jun 1, 2021

Hello @svenvc and @noha.
I was just randomly reading a URI RFC and accidentally stumbled across a path segment grammar.
https://datatracker.ietf.org/doc/html/rfc3986#section-3.3

According to the grammar, a segment can not contain slash / but can be an empty string.

path-absolute = "/" [ segment-nz *( "/" segment ) ]
...
segment       = *pchar

According to RFC, the url from your example should be parsed so that it contains two path segments: p and a trailing empty segment.

Here are test cases that exemplify this behavior:
https://github.com/uriparser/uriparser/blob/1762d5ff025fb07b4b8ccd1a8a9635009b2e9e34/test/test.cpp#L112

Have you considered implementing a separate RFC conform URL parser (with AST nodes), having that there is a grammar available?

### References
ABNF for URI:
https://datatracker.ietf.org/doc/html/rfc3986#appendix-A

@svenvc
Copy link
Owner

svenvc commented Jun 2, 2021

Hi Andrei,

I would like to keep this issue open.

I am not saying that I want to change the current implementation, but you do have a point. I reread the RFC you mentioned.

It strikes me that maybe we could replace the current $/ marker by an empty segment: this would keep the current design, but might result in a less surprising output of the path segments accessor. But I have to think about this more and hopefully I find the time to test it out.

Thanks for the feedback.

Sven

PS: Yes, a parser based on the official RFC ABNF might also be a good idea, but that you be a separate project first.

@chisandrei
Copy link
Contributor Author

Hi Sven. I opened the issue again. Thanks for looking into this!

@chisandrei chisandrei reopened this Jun 2, 2021
@Rinzwind
Copy link

Rinzwind commented May 5, 2023

About the idea of “a parser based on the official RFC ABNF”: the PetitParser grammar I gave in issue 100 might serve as a starting point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants