
Implement code overrides matching via tokens parsing #347


What are you trying to do?

Add token-based code overrides. Code overrides make the existing language servers work with the non-standard syntax which kernels add on top of the underlying language. For example:

  • IPython (a Jupyter kernel for Python) implements so-called magics, which are prefixed with special characters that would otherwise be invalid syntax in pure Python. An example IPython magic is %ls (variable names cannot start with % in Python, so IPython uses the % prefix to mark some of its magics). The IPython interpreter modifies the Abstract Syntax Tree, executing a special action when such a symbol is found: it calls ipython.run_line_magic('ls', ''), where ipython represents the global IPython instance. This call returns a value and can have side effects. The proposed code override mechanism makes it possible to replace this magic with valid Python, as the pure Python equivalent would be:
from IPython import get_ipython
get_ipython().run_line_magic('ls', '')

The reverse code replacement translates the above back to %ls. This is needed for the LSP features which modify the document (e.g. rename, reformat, quick fix) to work.

Some magics take variables as arguments. For example, %store data would save the value of the data variable. A customised code replacement for this IPython magic would call a function using this variable, e.g.:

from IPython import get_ipython
get_ipython().run_line_magic('store', (data))  # type: ignore

This has several advantages:

  • if the user renames the data variable elsewhere in the code using LSP, the reference in the magic will be renamed too
  • the linter will see that the variable is referenced at least once and will not warn about an unused variable
  • if the variable is not defined, a warning will be shown

How is it done today?

We only support regular-expression-based code overrides. This started as a "band-aid" to make LSP work with IPython, but we are hitting the limitations of regular expressions.
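For illustration only (the names and patterns below are hypothetical and much simpler than the real implementation), a regex override boils down to a pattern/replacement pair plus its inverse:

// Hypothetical, simplified sketch of a regex-based override pair.
const lineMagicOverride = {
  // match a line magic such as `%ls -la` at the start of a line
  pattern: /^%(\w+)(.*)$/gm,
  replacement: "get_ipython().run_line_magic('$1', '$2')",
  reverse: {
    pattern: /^get_ipython\(\)\.run_line_magic\('(\w+)', '(.*)'\)$/gm,
    replacement: '%$1$2'
  }
};

'%ls -la'.replace(lineMagicOverride.pattern, lineMagicOverride.replacement);
// → "get_ipython().run_line_magic('ls', ' -la')"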

What are the limitations of using regular expressions?

  • screening code blocks with multiple complex regular expressions is expensive
  • regular expressions cannot distinguish escaped or quoted code from actual magics; this is a fundamental limitation of regular grammars, which lack the memory needed to track context
    • an example of code in which the magic cannot be correctly substituted: x['a = !ls'] = !ls (possibly more apparent in: x['\'a = !ls\''] = !ls); see the sketch after this list
      • explanation: the string fragment which does NOT contain a magic would be incorrectly substituted by the regular expression:
        • we cannot simply disallow matching ' and ", as x['a'] = !ls is a valid expression with a magic
        • we cannot start parsing from =, as ! and % magics have to start from a new line or an assignment; for example, x == !ls is not a valid magic
      • every magic in a docstring (a multi-line string literal, commonly used for documentation in Python) is currently prone to incorrect substitution, as a regular expression is not context-aware (the context here being the docstring)
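To make the failure concrete, here is a minimal sketch using a hypothetical, simplified pattern for ! shell-capture assignments; the regex finds the first `= !` occurrence, which sits inside the string literal, and mangles the code:

// Hypothetical, simplified pattern for `lhs = !command` assignments.
const shellAssign = /= !(.+)$/gm;
const replacement = "= get_ipython().getoutput('$1')";

const code = "x['a = !ls'] = !ls";
console.log(code.replace(shellAssign, replacement));
// → "x['a = get_ipython().getoutput('ls'] = !ls')"
// The substitution happened INSIDE the string literal and swallowed
// the actual magic. A tokenizer would emit 'a = !ls' as a single
// string token and never consider it for substitution.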

The proposal

  • Each kernel should be able to provide JSON (as a file or a message) with its default code overrides, defined in terms of tokens. With a good token parser we can hugely reduce the computational complexity and work with syntax which has to be parsed as a context-sensitive grammar. A hypothetical shape of such a document is sketched after this list.
  • Users will still be able to install plugins providing extra code overrides for libraries which define magics with additional side effects (such as assignment to variables, or referencing variables as in the %store magic; for examples see the rpy2 magics or ipython-sql magics).
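As a rough illustration of the first point (the shape below is hypothetical; defining the actual schema is exactly what this proposal is about), a kernel could ship something like:

// Hypothetical top-level shape of a kernel-provided overrides document.
// Tokenizers would be declared once and referenced by id from each override.
const kernelOverridesDocument = {
  tokenizers: {
    python: { /* tokenizer specification, or a reference to a well-known one */ }
  },
  overrides: [
    {
      tokenizer: 'python', // reference by id
      match: [ /* token matchers, see the draft interface below */ ],
      replace: [ /* replacement fragments */ ],
      reverse: { /* the inverse override */ }
    }
  ]
};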

Why should it be declared by the kernel? Shouldn't the LSP servers work harder to adapt to the language variation?

  • the latter could impede adoption of LSP, as writing your own LSP server is a challenge.
  • extending existing servers could be feasible, but does not really scale. For example, the Python servers would now need to install IPython to parse the code (this is fine); but for pyls - which is jedi, black, mypy, pyflakes, flake8 etc. bundled together - each of the underlying tools would need to be adapted to work with IPython - a huge task.
    • adding a new tool to pyls would mean it has to be adapted for IPython
    • adding a new magic to IPython would trigger a need for updates in dozens of packages
  • there is a ton of custom user-defined magics which would benefit greatly from custom overrides; these could not be handled upstream, as most magics are unknown to the kernel developers, and even less so to the static analysis tool developers.

Now, I believe that magics are fun. Magics are what makes IPython and other kernels easier to work with in an interactive setting. I would like to enable kernel developers, especially developers of kernels for languages which are not so well established (I do not expect to see huge changes in IPython magics), to add, change and refine magics without the fear that the cool language features will stop working for them.

Design notes

Before reading further, you may want to have a quick look at the current regular-expression-based implementation:

Design considerations:

  • it could be as simple as three JS function callbacks: find_matches, replace, reverse_replace
  • we would prefer something which can be serialized to JSON and provided by the kernel easily
  • we would want to offer DocumentStart and DocumentEnd tokens, so that we can prepend the document with necessary code. Regular expressions cannot do that, as ^ matches either the beginning of the document OR of a line. This would allow us, for example, to:
    • in IPython, import the things which are available as IPython built-ins, e.g. the get_ipython() function
    • in C, prepend the code with int main() { and append return 0; }
  • we need a better way of tracking the substitutions; currently the reverse transformation is limited by what we remember about the form of the original expression. For example:
    • two rpy2 magic calls, one with the -o argument and one with --output, lead to the same result and will always be reversed to -o
    • similarly, implementing %Rpull would conflict with the above
    • int? and ?int are evaluated to the same code, so they cannot be distinguished during reverse substitution
  • therefore we want some kind of argument embedding in comments; a simple "let's embed the whole thing in a comment" is not a solution, as we do want the code to be dynamic, variables referenced in magics to be properly renamed, etc. A single system which can be used by every override would greatly simplify writing overrides. To keep it language-agnostic, it could be encoded in comments as JSON (perhaps collapsed to a single line); a sketch follows below. The actual responsibility for generating the comment would be left to the code override provider, as what is considered a valid comment differs between languages.
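For instance (a purely hypothetical encoding; the helper and the lsp-override marker are made up for illustration), the original spelling of an argument could be serialized into a trailing comment, so that reverse substitution can restore --output instead of collapsing it to -o:

// Hypothetical helper: serialize the original argument spellings into
// a trailing comment so that the reverse transformation can restore
// them verbatim.
function embedArguments(
  replacement: string,
  original: Record<string, string>,
  options: { commentPrefix: string; commentSuffix?: string }
): string {
  // JSON collapsed to a single line keeps the scheme language-agnostic;
  // the override provider decides what constitutes a valid comment.
  const payload = JSON.stringify(original);
  return `${replacement}  ${options.commentPrefix} lsp-override: ${payload}${options.commentSuffix ?? ''}`;
}

// embedArguments('rpy2_output(df)', { outputFlag: '--output' }, { commentPrefix: '#' })
// → "rpy2_output(df)  # lsp-override: {\"outputFlag\":\"--output\"}"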

Draft of the interface (to be converted to JSON schema once finished):

interface IArgumentStorageOptions {
  commentPrefix: string;
  commentSuffix?: string;
  /** Which characters should be escaped. */
  charactersToEscape: string[];
  /** Character(s) to be prepended before characters to be escaped */
  escapeCharacter: string;
}

type ITokenType = string; // e.g. 'operator', 'variable', 'separator'

interface IToken {
  type: ITokenType;
  value: string;
}

interface ITokenMatcher {
  type: ITokenType;
  pattern?: string;
  /** If true, this is what should be extracted from the match (compare with regexp capturing groups) */
  capture?: boolean;
}

interface ITokenGroupMatcher {
  tokens: ITokenMatcher[];
  /** How many repetitions of the group should be supported; can be Infinity to support any number */
  repeats: number;
}

interface IArgument {
  id: string;
  match: (ITokenMatcher | ITokenGroupMatcher)[];
}

interface ITokenizer {
  // TODO
}

interface IArgumentMatch {
  id: string;
  /** How to join values if multiple are captured under the same id */
  join?: string;
}

interface ITokenCodeOverride {
  tokenizer: ITokenizer;
  argumentStorage: IArgumentStorageOptions;
  match: (IToken | IArgument)[];
  replace: (string | IArgumentMatch)[];
  reverse: ITokenCodeOverride;
}

Example IPython magic for %store:

{
  tokenizer: 'python',
  argumentStorage: {
    commentPrefix: '#'
  },
  match: [
    {type: 'operator', pattern: '%'} as ITokenMatcher,
    {type: 'variable', pattern: 'store'} as ITokenMatcher,
    {
      id: 'store-argument',
      match: [
        {type: 'variable', capture: true} as ITokenMatcher,
        {
          tokens: [
            {type: 'separator'} as ITokenMatcher,
            {type: 'variable', capture: true} as ITokenMatcher
          ],
          repeats: Infinity
        } as ITokenGroupMatcher
      ]
    } as IArgument
  ],
  replace: [
    "get_ipython().run_line_magic('store', (",
    {id: 'store-argument', join: ', '} as IArgumentMatch,
    "))"
  ],
  reverse: {} // TODO
}
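As a sanity check of the draft, a heavily simplified matcher for a flat sequence of ITokenMatchers (ignoring groups, arguments and reversal, and treating pattern as an exact value) could look like this:

// Simplified sketch: try to match a flat sequence of token matchers
// against the token stream starting at a given position.
function matchAt(
  tokens: IToken[],
  start: number,
  matchers: ITokenMatcher[]
): string[] | null {
  const captured: string[] = [];
  for (let i = 0; i < matchers.length; i++) {
    const token = tokens[start + i];
    const matcher = matchers[i];
    if (!token || token.type !== matcher.type) return null;
    // a real implementation would treat `pattern` as a regular expression
    if (matcher.pattern !== undefined && token.value !== matcher.pattern) return null;
    if (matcher.capture) captured.push(token.value);
  }
  return captured;
}

// Given tokens for `%store data`:
//   [{type: 'operator', value: '%'}, {type: 'variable', value: 'store'},
//    {type: 'separator', value: ' '}, {type: 'variable', value: 'data'}]
// and matchers for [%, store, separator, captured variable]:
// matchAt(tokens, 0, matchers) → ['data']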

Questions to consider

  • should we try to build our own ITokenizer, or should we:
    • re-use tokenizers from CodeMirror? It is probably the easiest solution, as it gives us implementations for most languages out of the box. The con is that other frontends would need to play along - but they need to use a tokenizer anyway, so why not stick to this one? Ideally there would be a package/standard concerned with tokens only, which focuses on compliance with the language specification rather than on visuals/editing usability as the CodeMirror modes/tokenizers do. A sketch of wrapping CodeMirror follows after this list.
    • use ANTLR (they got a testimonial from Guido van Rossum, among others ;)), which is a more dedicated tool for grammar specification and parsing; I have not looked into it in depth, but it might produce a full syntax tree rather than linear tokens, which could be overkill for what we are trying to achieve here.
  • ideally, the tokenizer would be specified once and then only referenced by id.
  • it would be good of us to build a solution which other front-ends could adopt should they wish to, to prevent fragmentation of the ecosystem.
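If CodeMirror were chosen, an ITokenizer could plausibly be a thin adapter over its runMode addon; a sketch assuming CodeMirror 5, with the mapping from CodeMirror styles to standardized ITokenType values glossed over:

import * as CodeMirror from 'codemirror';
import 'codemirror/addon/runmode/runmode';
import 'codemirror/mode/python/python';

// Sketch: adapt CodeMirror's runMode addon to produce a linear stream
// of IToken; the CodeMirror styles ('string', 'operator', ...) would
// need to be mapped onto the standardized ITokenType values.
function tokenize(code: string, mode: string): IToken[] {
  const tokens: IToken[] = [];
  CodeMirror.runMode(code, mode, (value: string, style: string | null) => {
    tokens.push({ type: style ?? 'text', value });
  });
  return tokens;
}

// tokenize("x['a = !ls'] = !ls", 'python') yields the literal 'a = !ls'
// as a single token with type 'string', which is exactly the context
// the regular expressions were missing.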

How to make this work for kernel developers?

Testing overrides would need to be made easy for the kernel developers.

  • we could host a website on GitHub Pages with a playground where a developer provides text and JSON with an ITokenCodeOverride, and the result of applying the overrides is presented.
    • we could define an ITestCase interface and accept ITestCase[] on the static website, to give developers a hint of how it should all work; a possible shape is sketched after this list
  • for use in CI, we could make a conda package with a test runner accepting an ITokenCodeOverride and ITestCase[]
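A possible shape for such test cases (hypothetical, to be refined together with the override schema itself):

// Hypothetical test case shape for the playground and the CI runner.
interface ITestCase {
  /** Human-readable name shown in the playground/test output */
  name: string;
  /** Code as the user would type it (with magics) */
  input: string;
  /** Expected code after applying the override */
  expected: string;
  /** If true, reversing the override should reproduce `input` exactly */
  roundTrip?: boolean;
}

const storeTestCases: ITestCase[] = [
  {
    name: 'store a single variable',
    input: '%store data',
    expected: "get_ipython().run_line_magic('store', (data))",
    roundTrip: true
  }
];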

Will this work?

  • Yes, this will technically work.
  • Not sure - will kernel developers in the Jupyter community be willing to go this route, or would they prefer that an LSP server be implemented for each kernel, or that existing language servers be amended for each kernel?
