Description
The byte-array literal syntax
julia> @show b"hi\n";
b"hi\n" = UInt8[0x68, 0x69, 0x0a]
is currently implemented as
"""
@b_str
Create an immutable byte (`UInt8`) vector using string syntax.
"""
macro b_str(s)
    v = codeunits(unescape_string(s))
    QuoteNode(v)
end
This implementation hides a rather counter-intuitive and undocumented property: in certain situations, backslash unescaping is applied twice. As a result, a user needs no fewer than five (5) backslashes to obtain the byte sequence of the two-character ASCII string \":
julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x22]
Julia's raw strings use the following escaping rule:
- if a " is preceded by 2n+1 backslashes, these are replaced by n backslashes, and the " is passed through literally
- if a " is preceded by 2n backslashes, these are replaced by n backslashes, and the " acts as the string terminator
(This is also the escaping mechanism that the Microsoft C runtime library uses when parsing quoted strings from the Windows command line into argv.)
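The rule can be checked directly with raw string literals. The following is only a small sketch; the variable names are illustrative and not part of Base:

```julia
# Odd count (2n+1 = 3) before a quote: one backslash survives, the quote is literal.
odd = raw"\\\""

# Even count (2n = 2) before the closing quote: one backslash survives,
# and the quote terminates the string.
even = raw"a\\"

# Backslashes that are not followed by a quote are left untouched.
untouched = raw"a\\b"

@show collect(odd) collect(even) collect(untouched)
```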
This removal of backslashes before " occurs not only in raw strings, but in all non-standard string literals, which are just macros ending in _str. This can be seen from the trivial implementation of the macro behind raw string literals, which is just the identity function:
macro raw_str(s); s; end
Therefore, when b"\\\\\"" is processed, backslashes are removed in the following two steps:
- The raw-string parser replaces 5 = 2×2+1 backslashes in front of the " with 2 backslashes
- The call to the unescape_string() function by macro @b_str() replaces the remaining \\ with \.
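The two steps can be reproduced by hand. In this sketch, raw"..." stands in for the string that the macro receives, and the variable names are only illustrative:

```julia
# Step 1: the raw-string rule turns 5 backslashes + quote into 2 backslashes + quote.
received = raw"\\\\\""

# Step 2: unescape_string(), as called by @b_str, halves the backslashes again.
result = unescape_string(received)

@show collect(codeunits(received))
@show collect(codeunits(result))
```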
This duplicate backslash reduction is entirely unnecessary in non-standard string literals where the corresponding macro calls unescape_string(), because that function already performs the same \\ → \ and \" → " mapping that is behind the 2n+1 rule of the raw-string processing. This redundant processing is also likely to surprise users, especially since the documentation does not warn about it at all. It certainly surprised me!
There is a simple workaround in the case of @b_str(), namely to undo the backslash removal performed by the raw-string processing, using Base.escape_raw_string:
import Base.@b_str
macro b_str(s)
    v = codeunits(unescape_string(Base.escape_raw_string(s)))
    QuoteNode(v)
end
Now we get
julia> @show b"\\\"";
b"\\\"" = UInt8[0x5c, 0x22]
julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x5c, 0x22]
which seems much more intuitive and unsurprising.
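To try this out without overriding the definition in Base, the same body can be given a different name. The macro name @bytes_str below is hypothetical and chosen only for this sketch; it assumes a Julia version that provides the unexported Base.escape_raw_string:

```julia
# Hypothetical, separately named variant of the workaround, so that
# Base.@b_str itself stays untouched.
macro bytes_str(s)
    # Undo the raw-string backslash removal, then unescape exactly once.
    v = codeunits(unescape_string(Base.escape_raw_string(s)))
    QuoteNode(v)
end

three = bytes"\\\""     # 3 backslashes + quote in the source
five  = bytes"\\\\\""   # 5 backslashes + quote in the source

@show collect(three) collect(five)
```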
But @b_str() may be just one example of a type of non-standard string literal that further processes the string received with unescape_string(), or with any other function that uses backslashes as escape symbols, and therefore performs the same \\ → \ and \" → " mapping. If this is indeed the case, then perhaps the compiler mechanics behind non-standard string literals should not remove any backslashes at all, and leave this to the author of the macro? The 2n+1 vs 2n rule would then merely be used to identify the terminating quotation mark, but all characters before that would be passed through to the macro untouched.
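As a rough illustration of what a macro would see under this proposal, Base.escape_raw_string can be used to reconstruct the untouched source text from the string that macros receive today. This only emulates the proposed behaviour and is not how the parser works; the variable names are illustrative:

```julia
# What a _str macro receives today for the source text \\\\\" (5 backslashes + quote).
received = raw"\\\\\""

# Emulated pass-through: recover the characters as written in the source.
source = Base.escape_raw_string(received)

# A macro that unescapes exactly once would then produce 2 backslashes + quote.
passthrough_result = unescape_string(source)

@show source passthrough_result
```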