-
Notifications
You must be signed in to change notification settings - Fork 179
Update "Character encoding" and related provisions #438 #461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reference: #438 Signed-off-by: John M. Horan <[email protected]>
@pombredanne @mprpic @jkowalleck @ppkarwasz @matt-phylum (and the rest of the PURL community ;-) This is an update to the encoding clarification made via PR #439. I've tried to provide a few additional clear, concise statements addressing the various related issues/PRs as well as recent comments. Please do not hesitate to correct, clarify and question as needed -- and please provide alternative language proposals whenever you can. |
PURL-SPECIFICATION.rst
Outdated
(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any | ||
conflict between this specification and RFC 3986 section 2, this specification | ||
governs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going back to how it was before? The way I understood the current version was that it was a breaking change to intentionally break from RFC3986 section 2 because the RFC3986 encoding rules are more complicated to implement and most PURL implementations did not try to implement them, instead applying mostly the same encoding rules to all components. I've already had somebody ask why phylum-dev/purl doesn't encode plus signs in version numbers, which isn't required by RFC 3968 and is non-canonical according to the old PURL but is required by the rules in the current version of PURL, or at least we both thought so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matt-phylum can you elaborate? I am not sure I understand fully your point. I thing that what we are trying to convey is that:
- this spec defines WHICH characters to encode and WHERE/WHEN (e.g., with specifics for separators and components)
- we defer to RFC3986 to define HOW to encode characters we want encoded.
Would this be clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By saying "In the event of any conflict between this specification and RFC 3986 section 2" it sounds to me like the implementer is supposed to combine RFC 3986 and PURL rules, eg merging RFC 3986 pchar with the PURL character rules when outputting a package name. If the PURL spec is clearly specifying WHICH characters WHERE/WHEN and RFC 3986 is specifying HOW then it's much easier to implement and there shouldn't be conflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matt-phylum That's my goal -- each component should specify clearly what characters are permitted or prohibited as well as what needs to be percent-encoded and when. (scheme
, type
and qualifiers
already do exactly that, and namespace
, name
, version
and subpath
should do the same to eliminate the possibility of ambiguity.)
With respect to RFC 3986 and the HOW, I'm adopting your earlier suggestion that the RFC 3986 reference be changed from section 2 to section 2.1, which addresses the mechanics of percent-encoding, i.e., the HOW. Please take a look once I push an update and let me know if more fine-tuning is needed and if so I'll take care of it.
PURL-SPECIFICATION.rst
Outdated
When percent-encoding is required, all Permitted Characters MUST be encoded as | ||
UTF-8 and then percent-encoded except for the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems at odds with the first paragraph. "When percent-encoding" should be redundant because percent-encoding is always required (PURL has components that must not be encoded, but I would say it's more accurate that those components do not allow any characters that require encoding, especially for qualifier keys), but then because it says to use RFC 3986 rules and then "when percent-encoding is required, [...] characters must be [...] percent-encoded except," this could be taken to mean these are exceptions to the RFC 3986 rules instead of a replacement of the RFC 3986 rules (which don't map one-to-one with PURL).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matt-phylum how would you phrase this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that the process of applying "percent-encoding" to a string does not necessarily change the string because it complicates talking about which characters need to be changed vs which characters do not need to be changed. RFC 3986 talks about percent-encoding a (byte) string, a process which may or may not alter the string, and percent-encoding an octet, a process which consistently converts one octet into three. WHATWG URL is similar, talking about percent-encoding a byte sequence, a process which may or may not alter the byte sequence, and percent-encoding a byte, a process which consistently converts one octet into three. PURL, at least in this section, is less specific about what "percent-encoding" means.
Merging in @ppkarwasz's comment, it could say something like this:
When serializing a string, unless excluded by the following rules, every code point must be replaced by the percent-encoded bytes of the code point's UTF-8 encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here is to refer to those component definitions that require percent-encoding, e.g., something like
When percent-encoding is required by a component definition, each
codepoint MUST be replaced by the percent-encoded bytes of the codepoint's
UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986
section 2.1 . . . .
I think that the main problems of the encoding sections are:
Section 2 of RFC 3986 is all about encoding octets and it only defines an operation that takes one octet and represents it as three octets of the form Section 2.4 doesn't give hard rules on when this single-byte operation must be performed. It only says that:
|
As I just noted in a comment in @mprcic's encoding discussion, on reflection I'm leaning towards @matt-phylum's suggestion that our RFC 3986 section 2 reference be limited to just section 2.1's percent-encoding process, which could reduce ambiguity and ease implementation. Would we lose anything we can't afford (or don't want) to lose? |
PURL-SPECIFICATION.rst
Outdated
In the "Rules for each ``purl`` component" section above, each component | ||
defines when and how to apply percent-encoding and decoding to its content. | ||
|
||
When percent-encoding is required, all Permitted Characters MUST be encoded as | ||
UTF-8 and then percent-encoded except for the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here on the other hand we are talking about the decoded form of a component (e.g. a package named Pan語
).
The domain of the percent-encode
function is not restricted to the "Permitted Characters", but any Unicode character can be present (components can restrict this set).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnmhoran Thanks! See my comments for your consideration.
PURL-SPECIFICATION.rst
Outdated
(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any | ||
conflict between this specification and RFC 3986 section 2, this specification | ||
governs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matt-phylum can you elaborate? I am not sure I understand fully your point. I thing that what we are trying to convey is that:
- this spec defines WHICH characters to encode and WHERE/WHEN (e.g., with specifics for separators and components)
- we defer to RFC3986 to define HOW to encode characters we want encoded.
Would this be clearer?
@pombredanne @matt-phylum @ppkarwasz @mjherzog Thank you for your thoughtful comments. I'll give them a closer read a bit later today and then draft an update (if I miss any of your points in that update, please advise and I'll correct). Thanks as well for the language proposals -- super helpful -- keep them coming! 👍 @jkowalleck and @mprpic (and others) -- please share your thoughts as well. |
Reference: #438 Signed-off-by: John M. Horan <[email protected]>
Signed-off-by: John M. Horan <[email protected]>
I've just pushed a proposed update that addresses many of the points and comments so far, though not all, which still need to be agreed upon. That includes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very nice! 💯
My 2 cents.
PURL-SPECIFICATION.rst
Outdated
- the percent sign '%' when used to represent a percent-encoded character, | ||
- a ``purl`` separator when being used as a ``purl`` separator, and | ||
- the colon ':', whether used as a ``purl`` separator or otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since each component specifies the arguments of the percent-encode
method, I think this is not necessary.
The argument of percent-encode
will never contain "characters used as purl
separators"; those characters will be added afterwards.
For example:
- percent-encode each segment of the namespace.
- join the encoded segments with the
/
character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this comment @ppkarwasz . Not sure if you're addressing all 3 lines you excerpt or just line 278, so just in case:
- line 277: that is definitely more than needed -- was just trying to be thorough ;-) no objections at all to deleting if others agree
- line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded? See, e.g., Percent encoding spec and : and /; imho this is needed to avoid such issues in the future and make the use of a PURL as clear as possible.
- line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded?
Lines 242-243 say:
purl-spec/PURL-SPECIFICATION.rst
Lines 242 to 243 in e391329
These ``purl`` separator characters MUST NOT be percent-encoded when used as | |
``purl`` separators: |
Do we need to repeat it here too?
- line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?
Sure we need to say that colon :
does not need to be percent-encoded, but I think we don't need to repeat that it also does not need to be encoded when used as a separator.
Maybe we could make this paragraph less descriptive and more imperative like:
To percent-encode a string of characters:
1. encode it using UTF-8,
2. for each byte of the encoded string:
- if the byte corresponds to:
- an alphanumeric ASCII character (``A to Z``, ``a to z``, ``0 to 9``)
- or one of the ASCII characters `.`, `-`, `_`, `~` and `:`.
copy the byte to the output.
- otherwise, append the percent-encoding of the byte to the output, as defined in RFC 3986
section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
Reference: #438 Signed-off-by: John M. Horan <[email protected]>
Reference: #438 Signed-off-by: John M. Horan <[email protected]>
@pombredanne I've committed and pushed the latest set of changes to the character-encoding section. Looking forward to comments and suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Reference: #438