Update "Character encoding" and related provisions #438 #461

johnmhoran · 2025-04-22T04:43:21Z

Reference: #438

Reference: #438 Signed-off-by: John M. Horan <[email protected]>

johnmhoran · 2025-04-22T05:00:47Z

@pombredanne @mprpic @jkowalleck @ppkarwasz @matt-phylum (and the rest of the PURL community ;-) This is an update to the encoding clarification made via PR #439. I've tried to provide a few additional clear, concise statements addressing the various related issues/PRs as well as recent comments. Please do not hesitate to correct, clarify and question as needed -- and please provide alternative language proposals whenever you can.

matt-phylum · 2025-04-22T12:20:54Z

PURL-SPECIFICATION.rst

+(https://datatracker.ietf.org/doc/html/rfc3986#section-2).  In the event of any
+conflict between this specification and RFC 3986 section 2, this specification
+governs.


Are we going back to how it was before? The way I understood the current version was that it was a breaking change to intentionally break from RFC 3986 section 2, because the RFC 3986 encoding rules are more complicated to implement and most PURL implementations did not try to implement them, instead applying mostly the same encoding rules to all components. I've already had somebody ask why phylum-dev/purl doesn't encode plus signs in version numbers, which isn't required by RFC 3986 and is non-canonical according to the old PURL but is required by the rules in the current version of PURL, or at least we both thought so.

@matt-phylum can you elaborate? I am not sure I fully understand your point. I think that what we are trying to convey is that:

- this spec defines WHICH characters to encode and WHERE/WHEN (e.g., with specifics for separators and components)
- we defer to RFC 3986 to define HOW to encode characters we want encoded.

Would this be clearer?

By saying "In the event of any conflict between this specification and RFC 3986 section 2" it sounds to me like the implementer is supposed to combine RFC 3986 and PURL rules, e.g., merging RFC 3986 pchar with the PURL character rules when outputting a package name. If the PURL spec clearly specifies WHICH characters WHERE/WHEN and RFC 3986 specifies HOW, then it's much easier to implement and there shouldn't be conflicts.

@matt-phylum That's my goal -- each component should specify clearly what characters are permitted or prohibited as well as what needs to be percent-encoded and when. (scheme, type and qualifiers already do exactly that, and namespace, name, version and subpath should do the same to eliminate the possibility of ambiguity.)

With respect to RFC 3986 and the HOW, I'm adopting your earlier suggestion that the RFC 3986 reference be changed from section 2 to section 2.1, which addresses the mechanics of percent-encoding, i.e., the HOW. Please take a look once I push an update and let me know if more fine-tuning is needed and if so I'll take care of it.

matt-phylum · 2025-04-22T12:23:37Z

PURL-SPECIFICATION.rst

+When percent-encoding is required, all Permitted Characters MUST be encoded as
+UTF-8 and then percent-encoded except for the following:


This seems at odds with the first paragraph. "When percent-encoding" should be redundant, because percent-encoding is always required. (PURL has components that must not be encoded, but I would say it's more accurate that those components do not allow any characters that require encoding, especially for qualifier keys.) And because the text says to use RFC 3986 rules and then "when percent-encoding is required, [...] characters must be [...] percent-encoded except," this could be taken to mean these are exceptions to the RFC 3986 rules instead of a replacement of the RFC 3986 rules (which don't map one-to-one with PURL).

@matt-phylum how would you phrase this?

It's unfortunate that the process of applying "percent-encoding" to a string does not necessarily change the string because it complicates talking about which characters need to be changed vs which characters do not need to be changed. RFC 3986 talks about percent-encoding a (byte) string, a process which may or may not alter the string, and percent-encoding an octet, a process which consistently converts one octet into three. WHATWG URL is similar, talking about percent-encoding a byte sequence, a process which may or may not alter the byte sequence, and percent-encoding a byte, a process which consistently converts one octet into three. PURL, at least in this section, is less specific about what "percent-encoding" means.
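The distinction drawn above, between percent-encoding an octet and percent-encoding a string, can be illustrated with a minimal Python sketch. The helper `percent_encode_octet` is a hypothetical name used only for illustration; `quote_from_bytes` is the standard-library function, called here with an empty `safe` set so that only its always-safe characters (ASCII letters, digits and `_.-~`) pass through.

```python
from urllib.parse import quote_from_bytes


def percent_encode_octet(octet: int) -> str:
    """Octet-level operation (hypothetical helper): one octet always
    becomes three characters, '%' followed by two hex digits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return f"%{octet:02X}"


# Octet level: the output always differs from the input.
print(percent_encode_octet(ord("a")))  # %61
print(percent_encode_octet(0x2B))      # %2B

# String level: bytes in the safe set are copied through, so the result
# may be identical to the input.
print(quote_from_bytes(b"abc", safe=""))  # abc
print(quote_from_bytes(b"a+b", safe=""))  # a%2Bb
```

The string-level call leaves `abc` untouched, which is exactly why "percent-encoding is required" is ambiguous unless the text says which of the two operations it means.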

Merging in @ppkarwasz's comment, it could say something like this:

When serializing a string, unless excluded by the following rules, every code point must be replaced by the percent-encoded bytes of the code point's UTF-8 encoding.

The intent here is to refer to those component definitions that require percent-encoding, e.g., something like

When percent-encoding is required by a component definition, each codepoint MUST be replaced by the percent-encoded bytes of the codepoint's UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986 section 2.1 . . . .


ppkarwasz · 2025-04-22T13:58:50Z

I think that the main problems of the encoding sections are:

- the ambiguity between "character" used in the sense of byte/octet and "character" used to designate a Unicode codepoint.
- the ambiguity between "percent-encoding" as an operation that transforms one octet into three octets and as an operation that encodes a sequence of Unicode characters into a sequence of octets.

Section 2 of RFC 3986 is all about encoding octets and it only defines an operation that takes one octet and represents it as three octets of the form %[a-fA-F0-9]{2} (using the mapping between characters and octets provided by US-ASCII).

Section 2.4 doesn't give hard rules on when this single-byte operation must be performed. It only says that:

- the octets in the unreserved set SHOULD be represented as single octets (i.e. no "percent-encoding"). We want to change this to MUST, at least for the canonical form.
- each implementation should determine whether an octet in the reserved set, that is not used as a delimiter, should be "percent-encoded" or can be safely copied directly to the output. Here we want a simple rule:

  - : can be used directly.
  - the other PURL separators MUST be "percent-encoded"
  - we didn't say anything about the remaining reserved characters that are not used in PURL (e.g. [). IMHO these can be used directly.

johnmhoran · 2025-04-24T01:35:56Z

As I just noted in a comment in @mprpic's encoding discussion, on reflection I'm leaning towards @matt-phylum's suggestion that our RFC 3986 section 2 reference be limited to just section 2.1's percent-encoding process, which could reduce ambiguity and ease implementation. Would we lose anything we can't afford (or don't want) to lose?


ppkarwasz · 2025-04-24T19:58:58Z

PURL-SPECIFICATION.rst

+In the "Rules for each ``purl`` component" section above, each component
+defines when and how to apply percent-encoding and decoding to its content.
+
+When percent-encoding is required, all Permitted Characters MUST be encoded as
+UTF-8 and then percent-encoded except for the following:


Here on the other hand we are talking about the decoded form of a component (e.g. a package named Pan語).

The domain of the percent-encode function is not restricted to the "Permitted Characters", but any Unicode character can be present (components can restrict this set).
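As a small illustration of the decoded form mentioned above, the following Python sketch percent-encodes the name Pan語 by first encoding it as UTF-8. It uses the standard-library `urllib.parse.quote` with an empty `safe` set; that choice of safe set is an assumption made for the example, not the set the specification mandates.

```python
from urllib.parse import quote, unquote

name = "Pan語"                      # decoded form of the component
utf8_bytes = name.encode("utf-8")   # the non-ASCII codepoint becomes 3 bytes
encoded = quote(name, safe="")      # UTF-8 encode, then percent-encode

print(utf8_bytes)  # b'Pan\xe8\xaa\x9e'
print(encoded)     # Pan%E8%AA%9E

# Decoding restores the original Unicode string.
assert unquote(encoded) == name
```

The input is an arbitrary Unicode string; only the output is restricted to ASCII.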

pombredanne

@johnmhoran Thanks! See my comments for your consideration.


pombredanne · 2025-04-24T20:40:06Z

PURL-SPECIFICATION.rst

+(https://datatracker.ietf.org/doc/html/rfc3986#section-2).  In the event of any
+conflict between this specification and RFC 3986 section 2, this specification
+governs.


@matt-phylum can you elaborate? I am not sure I fully understand your point. I think that what we are trying to convey is that:

- this spec defines WHICH characters to encode and WHERE/WHEN (e.g., with specifics for separators and components)
- we defer to RFC 3986 to define HOW to encode characters we want encoded.

Would this be clearer?

johnmhoran · 2025-04-24T21:58:42Z

@pombredanne @matt-phylum @ppkarwasz @mjherzog Thank you for your thoughtful comments. I'll give them a closer read a bit later today and then draft an update (if I miss any of your points in that update, please advise and I'll correct). Thanks as well for the language proposals -- super helpful -- keep them coming! 👍

@jkowalleck and @mprpic (and others) -- please share your thoughts as well.

Reference: #438 Signed-off-by: John M. Horan <[email protected]>

Signed-off-by: John M. Horan <[email protected]>

johnmhoran · 2025-04-27T23:29:37Z

I've just pushed a proposed update that addresses many, though not all, of the points and comments so far; the remaining ones still need to be agreed upon. The open items include:

- referring to serialization when defining the percent-encoding process -- this latest update takes a different approach, but perhaps serialization is better -- whatever we agree on works for me
- the proposed change from Permitted Characters to Permitted Bytes and related changes, including questions about the way we currently use the term "character"

ppkarwasz

Looks very nice! 💯

My 2 cents.


ppkarwasz · 2025-04-28T07:12:02Z

PURL-SPECIFICATION.rst

+  - the percent sign '%' when used to represent a percent-encoded character,
+  - a ``purl`` separator when being used as a ``purl`` separator, and
+  - the colon ':', whether used as a ``purl`` separator or otherwise.


Since each component specifies the arguments of the percent-encode method, I think this is not necessary.
The argument of percent-encode will never contain "characters used as purl separators"; those characters will be added afterwards.

For example:

1. percent-encode each segment of the namespace.
2. join the encoded segments with the / character.
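The two steps above can be sketched in Python as follows. `encode_namespace` is a hypothetical helper, and the safe set passed to `urllib.parse.quote` (its always-safe characters plus the colon) is an assumption based on the rules discussed in this thread. The point is only that the `/` separator is added after encoding, so the encoder never sees a separator.

```python
from urllib.parse import quote


def encode_namespace(segments: list[str]) -> str:
    """Hypothetical helper: encode each segment, then add separators."""
    # Step 1: percent-encode each segment on its own.
    encoded = [quote(segment, safe=":") for segment in segments]
    # Step 2: join the already-encoded segments with the '/' separator.
    return "/".join(encoded)


print(encode_namespace(["@angular"]))       # %40angular
print(encode_namespace(["my org", "c:d"]))  # my%20org/c:d
```

Because the separators are introduced by the join step, the encoding rule does not need a special case for "a separator used as a separator".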

Thanks for this comment @ppkarwasz . Not sure if you're addressing all 3 lines you excerpt or just line 278, so just in case:

line 277: that is definitely more than needed -- was just trying to be thorough ;-) no objections at all to deleting if others agree

line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded? See, e.g., Percent encoding spec and : and /; imho this is needed to avoid such issues in the future and make the use of a PURL as clear as possible.

line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?

line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded?

Lines 242-243 say:

purl-spec/PURL-SPECIFICATION.rst

Lines 242 to 243 in e391329

These ``purl`` separator characters MUST NOT be percent-encoded when used as

``purl`` separators:

Do we need to repeat it here too?

line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?

Sure we need to say that colon : does not need to be percent-encoded, but I think we don't need to repeat that it also does not need to be encoded when used as a separator.

Maybe we could make this paragraph less descriptive and more imperative like:

To percent-encode a string of characters:

1. encode it using UTF-8,
2. for each byte of the encoded string:

   - if the byte corresponds to:

     - an alphanumeric ASCII character (``A to Z``, ``a to z``, ``0 to 9``)
     - or one of the ASCII characters `.`, `-`, `_`, `~` and `:`,

     copy the byte to the output.
   - otherwise, append the percent-encoding of the byte to the output, as defined in RFC 3986 section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
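A minimal Python sketch of the procedure proposed above, written byte by byte rather than with a library call so that each step is visible. The names `UNENCODED` and `percent_encode` are hypothetical, and the use of uppercase hex digits follows the recommendation in RFC 3986 section 2.1. This illustrates the proposal as worded in this comment, not necessarily the text that was finally merged.

```python
# Bytes copied to the output unchanged: ASCII alphanumerics plus . - _ ~ :
UNENCODED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b".-_~:"
)


def percent_encode(text: str) -> str:
    """Hypothetical helper implementing the two-step procedure."""
    out = []
    # Step 1: encode the string using UTF-8.
    for byte in text.encode("utf-8"):
        # Step 2: copy allowed bytes, percent-encode everything else.
        if byte in UNENCODED:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


print(percent_encode("Pan語"))       # Pan%E8%AA%9E
print(percent_encode("1.0+build"))  # 1.0%2Bbuild
print(percent_encode("a:b/c"))      # a:b%2Fc
```

Under this rule the plus sign in a version is encoded while the colon is not, which matches the behaviour discussed earlier in the thread.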

Reference: #438 Signed-off-by: John M. Horan <[email protected]>

johnmhoran · 2025-05-09T01:54:16Z

@pombredanne I've committed and pushed the latest set of changes to the character-encoding section. Looking forward to comments and suggestions.

ppkarwasz

Looks good to me!

Update "Character encoding" and related provisions #438

0beba16

Reference: #438 Signed-off-by: John M. Horan <[email protected]>

johnmhoran added the labels PURL core specification (Format and syntax that define PURL; excludes PURL type definitions), PURL encoding, and Ecma specification (Work on the core specification) Apr 22, 2025

matt-phylum reviewed Apr 22, 2025


ppkarwasz reviewed Apr 24, 2025


pombredanne reviewed Apr 24, 2025


johnmhoran added 2 commits April 27, 2025 16:07

Clarify percent-encoding references #438

91f07a0

Reference: #438 Signed-off-by: John M. Horan <[email protected]>

Merge branch 'main' into 438-update-character-encoding

e391329

Signed-off-by: John M. Horan <[email protected]>

ppkarwasz reviewed Apr 28, 2025


ppkarwasz mentioned this pull request May 6, 2025

fix: don't encode ':' or '/' as part of the canonical representation package-url/packageurl-java#161

Open

johnmhoran added 2 commits May 8, 2025 18:37

Restructure and clarify character-encoding section #438

e7119e8

Reference: #438 Signed-off-by: John M. Horan <[email protected]>

Merge branch 'main' into 438-update-character-encoding

90b017d

Reference: #438 Signed-off-by: John M. Horan <[email protected]>

ppkarwasz approved these changes May 9, 2025


pombredanne added this to Core PURL spec May 9, 2025
