Description
The ambiguous Unicode character table generates by far the most LLVM instructions (1.3% of all generated instructions).
RUSTFLAGS="-Csymbol-mangling-version=v0" cargo llvm-lines -p ruff --lib | head -20
Lines Copies Function name
----- ------ -------------
1532601 35211 (TOTAL)
20559 (1.3%, 1.3%) 1 (0.0%, 0.0%) ruff[9ef597ff107e57e5]::rules::ruff::rules::ambiguous_unicode_character::CONFUSABLES::{closure#0}
8800 (0.6%, 1.9%) 1 (0.0%, 0.0%) <ruff[9ef597ff107e57e5]::codes::RuleCodePrefix>::iter
7408 (0.5%, 2.4%) 1 (0.0%, 0.0%) <ruff[9ef597ff107e57e5]::checkers::ast::Checker as ruff_python_ast[47d0f9a8c15cbca1]::visitor::Visitor>::visit_stmt
6898 (0.5%, 2.8%) 1 (0.0%, 0.0%) <ruff[9ef597ff107e57e5]::registry::Rule>::noqa_code
I think we can do better by generating fixed arrays that we use for lookups, similar to unicode-width or unicode-ident. The input is a mapping from each UTF-8 character to its ASCII counterpart (ideally as a character, not its numeric representation).
Why could this work? Because UTF-8 only uses a subset of the 4-byte space: the first bit is 0 for a single-byte (ASCII) character, lead bytes of multi-byte sequences start with 11, and continuation bytes always start with 10. That means each byte has at most 128 valid values, so it should be sufficient to have at most four 128-entry tables.
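As a quick check of the byte-layout argument, here is a minimal sketch that classifies a UTF-8 byte by its prefix and reports how many payload bits it carries. The function name `payload_bits` is made up for illustration and is not part of ruff.

```rust
/// Number of payload bits carried by one UTF-8 byte, derived from its prefix.
/// Returns `None` for bytes that can never start or continue a sequence.
/// (`payload_bits` is a hypothetical helper for this illustration.)
fn payload_bits(byte: u8) -> Option<u32> {
    match byte.leading_ones() {
        0 => Some(7), // 0xxxxxxx: single-byte (ASCII) character
        1 => Some(6), // 10xxxxxx: continuation byte
        2 => Some(5), // 110xxxxx: lead byte of a 2-byte sequence
        3 => Some(4), // 1110xxxx: lead byte of a 3-byte sequence
        4 => Some(3), // 11110xxx: lead byte of a 4-byte sequence
        _ => None,    // 11111xxx: never valid in UTF-8
    }
}

fn main() {
    // 'é' (U+00E9) encodes as a lead byte followed by one continuation byte.
    let mut buf = [0u8; 4];
    for &byte in 'é'.encode_utf8(&mut buf).as_bytes() {
        println!("{byte:#010b} carries {:?} payload bits", payload_bits(byte));
    }
}
```

No byte carries more than 7 payload bits, so a table indexed by the payload of a single byte never needs more than 128 entries. Continuation bytes only need 64.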
This should also improve performance because it simplifies the lookup to a few bit operations and a static array lookup.
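To make the lookup cost concrete, here is a rough sketch of one possible table layout. Note that it swaps in a two-level table indexed by code point (128-entry blocks) rather than the per-UTF-8-byte tables proposed above. The names `PAIRS`, `Tables`, `build`, and `confusable` are hypothetical, and the four sample pairs are purely illustrative rather than taken from the real table. A code generator would normally emit the arrays directly; the `const fn` here only keeps the example self-contained.

```rust
/// Illustrative sample only; the real confusables table is much larger.
const PAIRS: &[(char, u8)] = &[
    ('\u{0391}', b'A'), // GREEK CAPITAL LETTER ALPHA
    ('\u{0410}', b'A'), // CYRILLIC CAPITAL LETTER A
    ('\u{0430}', b'a'), // CYRILLIC SMALL LETTER A
    ('\u{2010}', b'-'), // HYPHEN
];

const BLOCK_BITS: u32 = 7;
const BLOCK_LEN: usize = 1 << BLOCK_BITS; // 128 entries per block
const INDEX_LEN: usize = (0x10FFFF >> BLOCK_BITS) + 1; // one slot per 128-code-point page
const MAX_BLOCKS: usize = 8; // assumption: enough for the sample above

struct Tables {
    /// Block number for each page; 0 means "no confusables on this page".
    index: [u8; INDEX_LEN],
    /// ASCII replacement per code point within a block; 0 means "no match".
    blocks: [[u8; BLOCK_LEN]; MAX_BLOCKS],
}

/// Builds both tables at compile time from `PAIRS`.
const fn build() -> Tables {
    let mut t = Tables {
        index: [0; INDEX_LEN],
        blocks: [[0; BLOCK_LEN]; MAX_BLOCKS],
    };
    let mut next: u8 = 1; // block 0 stays all-zero and is shared by empty pages
    let mut i = 0;
    while i < PAIRS.len() {
        let cp = PAIRS[i].0 as u32 as usize;
        let page = cp >> BLOCK_BITS;
        if t.index[page] == 0 {
            t.index[page] = next;
            next += 1;
        }
        t.blocks[t.index[page] as usize][cp & (BLOCK_LEN - 1)] = PAIRS[i].1;
        i += 1;
    }
    t
}

static TABLES: Tables = build();

/// Lookup: one shift, one mask, and two array reads.
fn confusable(c: char) -> Option<u8> {
    let cp = c as u32 as usize;
    let block = TABLES.index[cp >> BLOCK_BITS] as usize;
    match TABLES.blocks[block][cp & (BLOCK_LEN - 1)] {
        0 => None,
        ascii => Some(ascii),
    }
}

fn main() {
    println!("{:?}", confusable('\u{0410}').map(char::from));
    println!("{:?}", confusable('x'));
}
```

With this layout the generated code is a handful of data arrays instead of a large initializer closure. Whether it actually beats the current implementation would need to be measured.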