Skip to content

Update "Character encoding" and related provisions #438 #461

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

johnmhoran
Copy link
Member

Reference: #438

@johnmhoran johnmhoran added PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL encoding Ecma specification Work on the core specification labels Apr 22, 2025
@johnmhoran
Copy link
Member Author

@pombredanne @mprpic @jkowalleck @ppkarwasz @matt-phylum (and the rest of the PURL community ;-) This is an update to the encoding clarification made via PR #439. I've tried to provide a few additional clear, concise statements addressing the various related issues/PRs as well as recent comments. Please do not hesitate to correct, clarify and question as needed -- and please provide alternative language proposals whenever you can.

Comment on lines 262 to 264
(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any
conflict between this specification and RFC 3986 section 2, this specification
governs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going back to how it was before? The way I understood the current version was that it was a breaking change to intentionally break from RFC3986 section 2 because the RFC3986 encoding rules are more complicated to implement and most PURL implementations did not try to implement them, instead applying mostly the same encoding rules to all components. I've already had somebody ask why phylum-dev/purl doesn't encode plus signs in version numbers, which isn't required by RFC 3968 and is non-canonical according to the old PURL but is required by the rules in the current version of PURL, or at least we both thought so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matt-phylum can you elaborate? I am not sure I understand fully your point. I thing that what we are trying to convey is that:

  1. this spec defines WHICH characters to encode and WHERE/WHEN (e.g., with specifics for separators and components)
  2. we defer to RFC3986 to define HOW to encode characters we want encoded.

Would this be clearer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By saying "In the event of any conflict between this specification and RFC 3986 section 2" it sounds to me like the implementer is supposed to combine RFC 3986 and PURL rules, eg merging RFC 3986 pchar with the PURL character rules when outputting a package name. If the PURL spec is clearly specifying WHICH characters WHERE/WHEN and RFC 3986 is specifying HOW then it's much easier to implement and there shouldn't be conflicts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matt-phylum That's my goal -- each component should specify clearly what characters are permitted or prohibited as well as what needs to be percent-encoded and when. (scheme, type and qualifiers already do exactly that, and namespace, name, version and subpath should do the same to eliminate the possibility of ambiguity.)

With respect to RFC 3986 and the HOW, I'm adopting your earlier suggestion that the RFC 3986 reference be changed from section 2 to section 2.1, which addresses the mechanics of percent-encoding, i.e., the HOW. Please take a look once I push an update and let me know if more fine-tuning is needed and if so I'll take care of it.

Comment on lines 269 to 270
When percent-encoding is required, all Permitted Characters MUST be encoded as
UTF-8 and then percent-encoded except for the following:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems at odds with the first paragraph. "When percent-encoding" should be redundant because percent-encoding is always required (PURL has components that must not be encoded, but I would say it's more accurate that those components do not allow any characters that require encoding, especially for qualifier keys), but then because it says to use RFC 3986 rules and then "when percent-encoding is required, [...] characters must be [...] percent-encoded except," this could be taken to mean these are exceptions to the RFC 3986 rules instead of a replacement of the RFC 3986 rules (which don't map one-to-one with PURL).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matt-phylum how would you phrase this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate that the process of applying "percent-encoding" to a string does not necessarily change the string because it complicates talking about which characters need to be changed vs which characters do not need to be changed. RFC 3986 talks about percent-encoding a (byte) string, a process which may or may not alter the string, and percent-encoding an octet, a process which consistently converts one octet into three. WHATWG URL is similar, talking about percent-encoding a byte sequence, a process which may or may not alter the byte sequence, and percent-encoding a byte, a process which consistently converts one octet into three. PURL, at least in this section, is less specific about what "percent-encoding" means.

Merging in @ppkarwasz's comment, it could say something like this:

When serializing a string, unless excluded by the following rules, every code point must be replaced by the percent-encoded bytes of the code point's UTF-8 encoding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here is to refer to those component definitions that require percent-encoding, e.g., something like

When percent-encoding is required by a component definition, each
codepoint MUST be replaced by the percent-encoded bytes of the codepoint's
UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986
section 2.1 . . . .

@ppkarwasz
Copy link

I think that the main problems of the encoding sections are:

  • the ambiguity between "character" used in the sense of byte/octet and "character" used to designate a Unicode codepoint.
  • the ambiguity between "percent-encoding" as operation that transforms one octet into three octets and operation that encodes a sequence of Unicode characters into a sequence of octets.

Section 2 of RFC 3986 is all about encoding octets and it only defines an operation that takes one octet and represents it as three octets of the form %[a-fA-F0-9]{2} (using the mapping between characters and octets provided by US-ASCII).

Section 2.4 doesn't give hard rules on when this single-byte operation must be performed. It only says that:

  • the octets in the unreserved set SHOULD be represented as single octets (i.e. no "percent-encoding"). We want to change this to MUST, at least for the canonical form.
  • each implementation should determine whether an octet in the reserved set, that is not used as a delimiter, should be "percent-encoded" or can be safely copied directly to the output. Here we want a simple rule:
    • : can be used directly.
    • the other PURL separators MUST be "percent-encoded"
    • we didn't say anything about the remaining reserved characters that are not used in PURL (e.g. [). IMHO these can be used directly.

@johnmhoran
Copy link
Member Author

As I just noted in a comment in @mprcic's encoding discussion, on reflection I'm leaning towards @matt-phylum's suggestion that our RFC 3986 section 2 reference be limited to just section 2.1's percent-encoding process, which could reduce ambiguity and ease implementation. Would we lose anything we can't afford (or don't want) to lose?

Comment on lines 266 to 270
In the "Rules for each ``purl`` component" section above, each component
defines when and how to apply percent-encoding and decoding to its content.

When percent-encoding is required, all Permitted Characters MUST be encoded as
UTF-8 and then percent-encoded except for the following:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here on the other hand we are talking about the decoded form of a component (e.g. a package named Pan語).

The domain of the percent-encode function is not restricted to the "Permitted Characters", but any Unicode character can be present (components can restrict this set).

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnmhoran Thanks! See my comments for your consideration.

Comment on lines 262 to 264
(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any
conflict between this specification and RFC 3986 section 2, this specification
governs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matt-phylum can you elaborate? I am not sure I understand fully your point. I thing that what we are trying to convey is that:

  1. this spec defines WHICH characters to encode and WHERE/WHEN (e.g., with specifics for separators and components)
  2. we defer to RFC3986 to define HOW to encode characters we want encoded.

Would this be clearer?

@johnmhoran
Copy link
Member Author

@pombredanne @matt-phylum @ppkarwasz @mjherzog Thank you for your thoughtful comments. I'll give them a closer read a bit later today and then draft an update (if I miss any of your points in that update, please advise and I'll correct). Thanks as well for the language proposals -- super helpful -- keep them coming! 👍

@jkowalleck and @mprpic (and others) -- please share your thoughts as well.

@johnmhoran
Copy link
Member Author

johnmhoran commented Apr 27, 2025

I've just pushed a proposed update that addresses many of the points and comments so far, though not all, which still need to be agreed upon. That includes

  • referring to serialization when defining the percent-encoding process -- this latest update takes a different approach, but perhaps serialization is better -- whatever we agree on works for me
  • the proposed change from Permitted Characters to Permitted Bytes and related changes, including questions about the way we currently use the term "character"

Copy link

@ppkarwasz ppkarwasz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very nice! 💯

My 2 cents.

Comment on lines 277 to 279
- the percent sign '%' when used to represent a percent-encoded character,
- a ``purl`` separator when being used as a ``purl`` separator, and
- the colon ':', whether used as a ``purl`` separator or otherwise.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since each component specifies the arguments of the percent-encode method, I think this is not necessary.
The argument of percent-encode will never contain "characters used as purl separators"; those characters will be added afterwards.

For example:

  • percent-encode each segment of the namespace.
  • join the encoded segments with the / character.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comment @ppkarwasz . Not sure if you're addressing all 3 lines you excerpt or just line 278, so just in case:

  • line 277: that is definitely more than needed -- was just trying to be thorough ;-) no objections at all to deleting if others agree
  • line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded? See, e.g., Percent encoding spec and : and /; imho this is needed to avoid such issues in the future and make the use of a PURL as clear as possible.
  • line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?

Copy link

@ppkarwasz ppkarwasz Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded?

Lines 242-243 say:

These ``purl`` separator characters MUST NOT be percent-encoded when used as
``purl`` separators:

Do we need to repeat it here too?

  • line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?

Sure we need to say that colon : does not need to be percent-encoded, but I think we don't need to repeat that it also does not need to be encoded when used as a separator.

Maybe we could make this paragraph less descriptive and more imperative like:

To percent-encode a string of characters:

1. encode it using UTF-8,
2. for each byte of the encoded string:
    - if the byte corresponds to:
       - an alphanumeric ASCII character (``A to Z``, ``a to z``, ``0 to 9``)
       - or one of the ASCII characters `.`, `-`, `_`, `~` and `:`.
       
      copy the byte to the output.
    - otherwise, append the percent-encoding of the byte to the output, as defined in RFC 3986
      section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).

@johnmhoran
Copy link
Member Author

@pombredanne I've committed and pushed the latest set of changes to the character-encoding section. Looking forward to comments and suggestions.

Copy link

@ppkarwasz ppkarwasz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ecma specification Work on the core specification PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL encoding
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

5 participants