@b_str removes backslashes twice #39092

Open
@mgkuhn

The byte-array literal syntax

julia> @show b"hi\n";
b"hi\n" = UInt8[0x68, 0x69, 0x0a]

is currently implemented as

"""
    @b_str

Create an immutable byte (`UInt8`) vector using string syntax.
"""
macro b_str(s)
    v = codeunits(unescape_string(s))
    QuoteNode(v)
end

This implementation hides a rather counter-intuitive and undocumented property: whenever backslashes directly precede a quotation mark, backslash removal is applied twice. As a result, a user needs no fewer than five backslashes to obtain the byte sequence of the two-character ASCII string \":

julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x22]

Julia's raw strings use the following escaping rule:

  • if a " is preceded by 2n+1 backslashes, these are replaced by n backslashes, and the " is passed through literally
  • if a " is preceded by 2n backslashes, these are replaced by n backslashes, and the " acts as the string terminator

(This is also the escaping mechanism that the Microsoft C runtime library uses when parsing quoted strings from the Windows command line into argv.)
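The two cases of this rule can be checked directly with `raw"..."` literals. The following is only an illustrative sketch; the variable names are made up for this example.

```julia
# 2n+1 backslashes before a quote: n backslashes remain and the quote is literal.
s3 = raw"\\\""      # 3 backslashes + quote -> 1 backslash + quote
s5 = raw"\\\\\""    # 5 backslashes + quote -> 2 backslashes + quote

# 2n backslashes before a quote: n backslashes remain and the quote terminates.
s2 = raw"\\"        # 2 backslashes -> 1 backslash
s4 = raw"\\\\"      # 4 backslashes -> 2 backslashes

# Backslashes that are not followed by a quote are left untouched.
sn = raw"\\n"       # stays as backslash, backslash, n

for s in (s3, s5, s2, s4, sn)
    println(repr(collect(codeunits(s))))
end
```

Printing the code units rather than the strings avoids a further layer of escaping in the REPL output.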

This removal of backslashes before " occurs not only in raw strings, but in all non-standard string literals, which are just macros ending in _str. This can be seen from the trivial implementation of the macro behind raw string literals, which is just the identity function:

macro raw_str(s); s; end

Therefore, when b"\\\\\"" is processed, backslashes are removed in the following two steps:

  1. The raw-string parser replaces 5 = 2×2+1 backslashes in front of the " with 2 backslashes
  2. The call to the unescape_string() function by macro @b_str() replaces the remaining \\ with \.
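The two steps can be reproduced outside the macro. In this sketch, a `raw"..."` literal stands in for the string that the parser hands to `@b_str`; the variable names are illustrative only.

```julia
# Step 1: raw-string processing of the literal's content.
# Five backslashes before the quote become two backslashes plus a literal quote.
s = raw"\\\\\""
step1 = collect(codeunits(s))

# Step 2: unescape_string() collapses the remaining backslash pair into one.
step2 = collect(codeunits(unescape_string(s)))

println(repr(step1))
println(repr(step2))
```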

This duplicate backslash reduction is entirely unnecessary in non-standard string literals whose macro calls unescape_string(), because that function already performs the same \\ → \ and \" → " mapping that is behind the 2n+1 rule of the raw-string processing. The redundant processing is also likely to surprise users, especially since the documentation does not warn about it at all. It certainly surprised me!
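That unescape_string() handles both mappings on its own can be checked in isolation. To keep literal escaping out of the picture, this sketch builds the inputs from individual characters; `bs` and `q` are names chosen for the example.

```julia
bs = '\\'   # a single backslash character
q  = '"'    # a single quotation-mark character

# Backslash, backslash -> one backslash
pair_result = unescape_string(string(bs, bs))

# Backslash, quote -> one quote
quote_result = unescape_string(string(bs, q))

println(repr(collect(codeunits(pair_result))))
println(repr(collect(codeunits(quote_result))))
```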

There is a simple workaround in the case of @b_str(), namely to undo the backslash removal performed by the raw-string processing, using Base.escape_raw_string:

import Base.@b_str
macro b_str(s)
    v = codeunits(unescape_string(Base.escape_raw_string(s)))
    QuoteNode(v)
end

Now we get

julia> @show b"\\\"";
b"\\\"" = UInt8[0x5c, 0x22]

julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x5c, 0x22]

which seems much more intuitive and unsurprising.
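To compare the two behaviours without overriding `Base.@b_str`, the macro bodies can be written as ordinary functions applied to raw strings. This is a sketch only: `bytes_current` and `bytes_workaround` are hypothetical helper names, and the `raw"..."` literals stand in for what the parser would pass to the macro.

```julia
# Body of the current @b_str: unescape what the raw-string parser produced.
bytes_current(s) = collect(codeunits(unescape_string(s)))

# Body of the workaround: undo the raw-string backslash removal first.
bytes_workaround(s) = collect(codeunits(unescape_string(Base.escape_raw_string(s))))

three = raw"\\\""     # what the macro receives for b"\\\""
five  = raw"\\\\\""   # what the macro receives for b"\\\\\""

println(repr(bytes_current(three)), "  vs  ", repr(bytes_workaround(three)))
println(repr(bytes_current(five)),  "  vs  ", repr(bytes_workaround(five)))
```

With the workaround, each backslash pair in the literal yields one backslash byte, which matches how ordinary string literals behave.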

But @b_str() may be just one example of a non-standard string literal whose macro further processes the received string with unescape_string(), or with any other function that uses backslashes as escape symbols and therefore performs the same \\ → \ and \" → " mapping. If this is indeed the case, then perhaps the compiler mechanics behind non-standard string literals should not remove any backslashes at all, and leave this to the author of the macro? The 2n+1 vs 2n rule would then merely be used to identify the terminating quotation mark, and all characters before it would be passed through to the macro untouched.

Labels: breaking, macros, strings
