Reduce ambiguous unicode character table

The ambiguous Unicode character table generates by far the most LLVM instructions of any single function in the crate (1.3% of all generated instructions).

```bash
RUSTFLAGS="-Csymbol-mangling-version=v0" cargo llvm-lines -p ruff --lib | head -20

Lines                  Copies               Function name
-----                  ------               -------------
1532601                35211                (TOTAL)
  20559 (1.3%,  1.3%)      1 (0.0%,  0.0%)  ruff[9ef597ff107e57e5]::rules::ruff::rules::ambiguous_unicode_character::CONFUSABLES::{closure#0}
   8800 (0.6%,  1.9%)      1 (0.0%,  0.0%)  <ruff[9ef597ff107e57e5]::codes::RuleCodePrefix>::iter
   7408 (0.5%,  2.4%)      1 (0.0%,  0.0%)  <ruff[9ef597ff107e57e5]::checkers::ast::Checker as ruff_python_ast[47d0f9a8c15cbca1]::visitor::Visitor>::visit_stmt
   6898 (0.5%,  2.8%)      1 (0.0%,  0.0%)  <ruff[9ef597ff107e57e5]::registry::Rule>::noqa_code
```

I think we can do better by generating fixed arrays that we use for lookups, similar to [unicode-width](https://github.com/dtolnay/unicode-ident/blob/master/src/tables.rs) or [unicode-ident](https://github.com/dtolnay/unicode-ident/blob/master/src/tables.rs). The input is a mapping from a UTF-8 character to its ASCII counterpart (ideally stored as a character rather than as its numeric representation).
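As a rough illustration of the "fixed array" idea, the mapping could be emitted as a sorted static array and searched with a binary search. This is only a sketch: the name `CONFUSABLE_PAIRS`, the function `confusable_to_ascii`, and the four entries are hypothetical and not taken from the generated table in ruff.

```rust
/// Hypothetical excerpt of a generated table: (confusable code point, ASCII replacement).
/// The entries must stay sorted by code point so the array can be binary-searched.
static CONFUSABLE_PAIRS: [(u32, u8); 4] = [
    (0x0391, b'A'), // GREEK CAPITAL LETTER ALPHA
    (0x0410, b'A'), // CYRILLIC CAPITAL LETTER A
    (0x0435, b'e'), // CYRILLIC SMALL LETTER IE
    (0x2010, b'-'), // HYPHEN
];

/// Returns the ASCII look-alike for `c`, or `None` if `c` is not in the table.
fn confusable_to_ascii(c: char) -> Option<char> {
    CONFUSABLE_PAIRS
        .binary_search_by_key(&(c as u32), |&(code_point, _)| code_point)
        .ok()
        .map(|index| CONFUSABLE_PAIRS[index].1 as char)
}
```

Because the array is plain static data, the compiler only has to emit the data itself plus one small search function, rather than code that builds a map at startup.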

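Why could this work? Because UTF-8 only uses a subset of the 4-byte space: the first bit is 0 only for single-byte (ASCII) characters, the lead byte of a multi-byte sequence starts with `11`, and every following (continuation) byte starts with `10`. That means each byte of a non-ASCII character has at most 128 valid values (continuation bytes have only 64), and it should be sufficient to have at most four tables with 128 entries each.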
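For reference, the byte layout that UTF-8 uses can be sketched as a small classifier over the leading bits of a byte. The `ByteKind` enum and `classify` function are hypothetical names introduced for illustration only.

```rust
/// The role a single byte can play in a UTF-8 sequence.
#[derive(Debug, PartialEq)]
enum ByteKind {
    Ascii,        // 0xxxxxxx: a complete single-byte character
    Continuation, // 10xxxxxx: carries 6 payload bits, so 64 possible values
    Lead2,        // 110xxxxx: first byte of a 2-byte sequence
    Lead3,        // 1110xxxx: first byte of a 3-byte sequence
    Lead4,        // 11110xxx: first byte of a 4-byte sequence
    Invalid,      // never appears in well-formed UTF-8
}

fn classify(byte: u8) -> ByteKind {
    match byte {
        0x00..=0x7F => ByteKind::Ascii,
        0x80..=0xBF => ByteKind::Continuation,
        0xC2..=0xDF => ByteKind::Lead2, // 0xC0 and 0xC1 would only encode overlong forms
        0xE0..=0xEF => ByteKind::Lead3,
        0xF0..=0xF4 => ByteKind::Lead4, // larger lead bytes would exceed U+10FFFF
        _ => ByteKind::Invalid,
    }
}
```

For example, the Cyrillic capital letter `А` (U+0410) encodes as the two bytes `0xD0 0x90`, a `Lead2` byte followed by a `Continuation` byte.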

This should also improve performance because it simplifies the lookup to a few bit operations and a static array lookup. 
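One possible shape for such a lookup is a two-level table indexed by code point: a shift selects a 128-entry page, and a mask selects the slot within it. The sketch below is an assumption about how this could look, not the layout used by unicode-ident or ruff. All names (`PAGE_INDEX`, `PAGES`, `build_page`, `lookup_ascii`) and the three sample entries are hypothetical, and the index only covers code points below U+0800 to keep the example short.

```rust
const PAGE_BITS: u32 = 7;
const PAGE_SIZE: usize = 1 << PAGE_BITS; // 128 slots per page

/// Builds one page at compile time from (offset within page, ASCII replacement) pairs.
/// A slot value of 0 means "not a confusable".
const fn build_page(entries: &[(usize, u8)]) -> [u8; PAGE_SIZE] {
    let mut page = [0u8; PAGE_SIZE];
    let mut i = 0;
    while i < entries.len() {
        page[entries[i].0] = entries[i].1;
        i += 1;
    }
    page
}

/// Page 0 is shared by every 128-code-point block without confusables.
static PAGES: [[u8; PAGE_SIZE]; 3] = [
    build_page(&[]),
    build_page(&[(0x11, b'A')]),               // U+0391 GREEK CAPITAL LETTER ALPHA
    build_page(&[(0x10, b'A'), (0x35, b'e')]), // U+0410, U+0435 (Cyrillic)
];

/// Maps `code_point >> 7` to a page number in `PAGES`.
static PAGE_INDEX: [u8; 16] = [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0];

fn lookup_ascii(c: char) -> Option<char> {
    let code_point = c as u32;
    // Code points beyond the indexed range are treated as "not a confusable".
    let page = *PAGE_INDEX.get((code_point >> PAGE_BITS) as usize)? as usize;
    let ascii = PAGES[page][(code_point & (PAGE_SIZE as u32 - 1)) as usize];
    (ascii != 0).then(|| ascii as char)
}
```

The lookup itself is one shift, one mask, and two array reads, and blocks without any confusables all share the same empty page, which keeps the static data small.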
