feat: add configurable file indexing logic #967

Merged
merged 42 commits into from
Apr 17, 2025

Conversation

taldekar
Contributor

@taldekar taldekar commented Apr 15, 2025

Problem

When the LSP is initialized, local project context needs to crawl the workspaces, build a repo map, and create an index from it that is used to provide additional context to Q. Which files are included in that index needs to be configurable.

Solution

File indexing options are exposed to the user through the workspace/configuration call. This information is used to determine what files to include in the index by the processWorkspaceFolders method in localProjectContextController.
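As a rough illustration of the approach described above, the sketch below merges an untrusted configuration payload with defaults. The field names mirror the private fields visible in the review diff (`ignoreFilePatterns`, `includeSymlinks`, `maxFileSizeMb`), but the payload layout, the `FileIndexingOptions` type, and the `resolveIndexingOptions` helper are assumptions made for illustration, not code from this PR.

```typescript
// Hypothetical shape of the indexing options. The field names follow the
// private fields shown in the diff; the payload layout itself is assumed.
interface FileIndexingOptions {
    ignoreFilePatterns: string[]
    includeSymlinks: boolean
    maxFileSizeMb: number
}

const DEFAULT_OPTIONS: FileIndexingOptions = {
    ignoreFilePatterns: [],
    includeSymlinks: false, // the author states symlinks are off by default
    maxFileSizeMb: 10, // same value as DEFAULT_MAX_FILE_SIZE in the diff
}

// Merge an untrusted configuration payload with the defaults, keeping only
// values that have the expected type.
function resolveIndexingOptions(raw: unknown): FileIndexingOptions {
    const cfg = (typeof raw === 'object' && raw !== null ? raw : {}) as Record<string, unknown>
    const patterns = cfg.ignoreFilePatterns
    const size = cfg.maxFileSizeMb
    return {
        ignoreFilePatterns:
            Array.isArray(patterns) && patterns.every(p => typeof p === 'string')
                ? (patterns as string[])
                : DEFAULT_OPTIONS.ignoreFilePatterns,
        includeSymlinks:
            typeof cfg.includeSymlinks === 'boolean' ? cfg.includeSymlinks : DEFAULT_OPTIONS.includeSymlinks,
        maxFileSizeMb:
            typeof size === 'number' && Number.isFinite(size) && size > 0 ? size : DEFAULT_OPTIONS.maxFileSizeMb,
    }
}
```

Validating each field separately means one malformed setting from a client falls back to its default without discarding the rest.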

Justification for newly introduced libraries

  • fdir: small (< 2KB), well-tested library that significantly speeds up directory crawling (claims to be capable of crawling 1 million files in < 1 second) when building the initial file directory.
  • picomatch: required by fdir under the hood to apply a glob filter. This is used to limit the results of the crawl to file extensions supported by the language server.
  • ignore: well-tested, small (~63 kB) package that allows us to enforce file exclusion rules in the .gitignore convention. These rules are configurable by different IDEs and are passed in through the workspace/configuration call. It is also used to parse the user's .gitignore files for further filtering.
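The PR does not show how the extension glob is assembled, so the following is only a sketch of one way to do it. `buildExtensionGlob` is a hypothetical helper; it assumes the extension list may contain entries with or without a leading dot, and it produces a brace-style glob of the kind picomatch understands.

```typescript
// Build a single glob that matches files with any of the given extensions.
// Accepts extensions with or without a leading dot (".ts" or "ts").
function buildExtensionGlob(extensions: string[]): string {
    const bare = Array.from(new Set(extensions.map(e => e.replace(/^\./, '')).filter(e => e.length > 0)))
    if (bare.length === 0) {
        throw new Error('at least one file extension is required')
    }
    // A brace group is only used for two or more alternatives;
    // a lone extension is written directly.
    return bare.length === 1 ? `**/*.${bare[0]}` : `**/*.{${bare.join(',')}}`
}
```

For example, `buildExtensionGlob(['.ts', 'js'])` yields `**/*.{ts,js}`, a string that could then be handed to fdir's glob filter.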

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@taldekar taldekar requested a review from a team as a code owner April 15, 2025 21:36
@taldekar taldekar requested a review from justinmk3 April 15, 2025 23:52
const localGitIgnoreFiles: string[] = []

const crawler = new fdir()
    .withSymlinks({ resolvePaths: !includeSymLinks })
Contributor

symlinks can cause loops, or just really poor performance on complex directories. I assume includeSymLinks decides whether symlinks are followed. It should probably be false by default.

Contributor Author

Yup it's false by default. Not sure if we want to surface this setting to the user yet, but loops and poor performance are definitely a possible risk. We have an upper bound on the index size though, so we wouldn't explicitly break anything.
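To make the two safeguards in this exchange concrete, here is a small sketch over an in-memory directory model. It is not how fdir handles symlinks and not code from this PR; `FakeFs`, `Entry`, and `collectFiles` are hypothetical names. It shows a visited set breaking a symlink cycle and an upper bound capping how many files are collected.

```typescript
// In-memory stand-in for a file system: each directory maps to its entries.
// A 'dir' entry points at another directory, as a followed symlink would.
type Entry = { kind: 'file'; name: string } | { kind: 'dir'; target: string }
type FakeFs = Map<string, Entry[]>

function collectFiles(fsModel: FakeFs, root: string, maxFiles: number): string[] {
    const visited = new Set<string>() // resolved directories already walked
    const files: string[] = []
    const stack: string[] = [root]
    while (stack.length > 0 && files.length < maxFiles) {
        const dir = stack.pop() as string
        if (visited.has(dir)) continue // breaks symlink cycles
        visited.add(dir)
        for (const entry of fsModel.get(dir) ?? []) {
            if (entry.kind === 'dir') {
                stack.push(entry.target)
            } else if (files.length < maxFiles) {
                files.push(`${dir}/${entry.name}`)
            }
        }
    }
    return files
}
```

The cap alone guarantees termination only if every visited directory eventually contributes files, which is why the visited set is kept as well.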

this.fileExtensions = Object.keys(languageByExtension)
private ignoreFilePatterns?: string[]
private includeSymlinks?: boolean
private maxFileSizeMb?: number
Contributor

Should this be MB? Or are they actually representing bits?

Contributor Author

Oh I see, yup it should be MB. I'll make the change.

Contributor Author

@jpinkney-aws fixed this!
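The unit question matters because the limit is eventually compared against file sizes in bytes. The diff defines `DEFAULT_MAX_FILE_SIZE = 10` and `MB_TO_BYTES = 1024 * 1024`; the sketch below shows how a megabyte setting might be converted and applied. The helper names, the renamed default constant, and the inclusive comparison are assumptions.

```typescript
// Hypothetical name; the diff calls this DEFAULT_MAX_FILE_SIZE = 10.
const DEFAULT_MAX_FILE_SIZE_MB = 10
const MB_TO_BYTES = 1024 * 1024

// Convert the configured limit (in MB) to bytes, using the default when unset.
function maxFileSizeBytes(maxFileSizeMb?: number): number {
    return (maxFileSizeMb ?? DEFAULT_MAX_FILE_SIZE_MB) * MB_TO_BYTES
}

// Assumption: a file exactly at the limit is still accepted.
function isWithinSizeLimit(fileSizeBytes: number, maxFileSizeMb?: number): boolean {
    return fileSizeBytes <= maxFileSizeBytes(maxFileSizeMb)
}
```

With the default of 10 MB the byte limit is 10 * 1024 * 1024 = 10,485,760.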

private readonly DEFAULT_MAX_FILE_SIZE = 10
private readonly MB_TO_BYTES = 1024 * 1024

constructor(clientName: string, workspaceFolders: WorkspaceFolder[], logging?: Logging) {
Contributor

why has the logger now become optional?

Contributor Author

I made this change for local testing. I'll revert it before merging.

Contributor Author

@jpinkney-aws fixed this as well!

Comment on lines 15 to 17
const ignore = require('ignore')
const { fdir } = require('fdir')
const fs = require('fs')
Contributor

Do these need to be loaded with require, or can they be imported as well?

Contributor Author

I believe these should be importable as well. I can make the change.

@@ -43,13 +73,28 @@ export class LocalProjectContextController {
return this.instance
}

- public async init(vectorLib?: any): Promise<void> {
+ public async init({
Contributor

@kmile kmile Apr 17, 2025

Do we have a way to deliver a no-op version of the entire workspace indexing that has no dependencies on os and fs and vectorLib? We need to make sure that this server can be loaded in a browser environment eventually, with local indexing either disabled or replaced by something suitable for the browser. This is not urgent but we'll need to come up with a solution eventually.

Contributor Author

I believe if the vector lib cannot be resolved (in the browser environment for instance), all calls to the controller are effectively no-ops. But @breedloj would be able to answer that better.
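One way the fallback discussed here might look is sketched below. It is an assumption about a possible design, not the controller's actual code: `ProjectIndexer`, `NoOpIndexer`, `InMemoryIndexer`, and `createIndexer` are hypothetical, and the in-memory class merely stands in for an indexer backed by a vector library.

```typescript
// Hypothetical minimal interface; the real controller's API is not shown here.
interface ProjectIndexer {
    readonly enabled: boolean
    indexFiles(paths: string[]): number // returns how many files were indexed
    query(text: string): string[]
}

// Used when no vector library is available, e.g. in a browser build.
class NoOpIndexer implements ProjectIndexer {
    readonly enabled = false
    indexFiles(_paths: string[]): number {
        return 0
    }
    query(_text: string): string[] {
        return []
    }
}

// Trivial in-memory indexer standing in for one backed by a vector library.
class InMemoryIndexer implements ProjectIndexer {
    readonly enabled = true
    private readonly paths = new Set<string>()
    indexFiles(paths: string[]): number {
        paths.forEach(p => this.paths.add(p))
        return paths.length
    }
    query(text: string): string[] {
        return Array.from(this.paths).filter(p => p.includes(text))
    }
}

// Pick an implementation based on whether the (hypothetical) loader succeeds.
function createIndexer(loadVectorLib: () => unknown): ProjectIndexer {
    try {
        return loadVectorLib() ? new InMemoryIndexer() : new NoOpIndexer()
    } catch {
        return new NoOpIndexer()
    }
}
```

Keeping `os` and `fs` access behind the real implementation, rather than at module load time, is what would let the no-op variant load in a browser.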

@taldekar taldekar merged commit dd49420 into aws:main Apr 17, 2025
6 checks passed
@taldekar taldekar deleted the file-indexing branch April 17, 2025 23:40
4 participants