Refactor chunkDocument in chunk.ts to avoid processing all chunks upfront

### Validations

- [x] I believe this is a way to improve. I'll try to join the [Continue Discord](https://discord.gg/NWtdYexhMs) for questions
- [x] I'm not able to find an [open issue](https://github.com/continuedev/continue/issues?q=is%3Aopen+is%3Aissue+label%3Aenhancement) that requests the same enhancement

### Problem

The current implementation of chunkDocument builds a full list of chunk promises before yielding any results. This is inefficient and unnecessary, especially when only the first n chunks are needed in practice (for example, in BaseRetrievalPipeline.ts).

Because the function collects all chunkPromises before yielding, it wastes memory and processing time on large documents.
It's currently structured like this:

```ts
const chunkPromises: Promise<Chunk | undefined>[] = [];
for await (const chunkWithoutId of chunkDocumentWithoutId(...)) {
  chunkPromises.push(new Promise(...));
}
for await (const chunk of chunkPromises) {
  yield chunk;
}
```
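
To make the cost of this structure concrete, here is a minimal, self-contained sketch. It is not the actual Continue code: `RawChunk`, `produceRawChunks`, `eagerChunks`, and `firstEagerChunk` are hypothetical names, and `produceRawChunks` is only a stand-in for chunkDocumentWithoutId that counts how many chunks it has produced.

```ts
// Hypothetical minimal chunk shape, for illustration only.
interface RawChunk {
  content: string;
}

// Hypothetical stand-in for chunkDocumentWithoutId: it records how many
// chunks have been produced so the amount of upstream work can be observed.
async function* produceRawChunks(
  total: number,
  log: { produced: number },
): AsyncGenerator<RawChunk> {
  for (let i = 0; i < total; i++) {
    log.produced++;
    yield { content: `chunk ${i}` };
  }
}

// Mirrors the eager structure: every chunk is wrapped in a promise and
// stored before the first result is yielded.
async function* eagerChunks(
  total: number,
  log: { produced: number },
): AsyncGenerator<RawChunk> {
  const pending: Promise<RawChunk>[] = [];
  for await (const raw of produceRawChunks(total, log)) {
    pending.push(Promise.resolve(raw));
  }
  for await (const chunk of pending) {
    yield chunk;
  }
}

// A consumer that only wants the first chunk still pays for all of them.
async function firstEagerChunk(
  total: number,
): Promise<{ first: RawChunk | undefined; produced: number }> {
  const log = { produced: 0 };
  let first: RawChunk | undefined;
  for await (const chunk of eagerChunks(total, log)) {
    first = chunk;
    break;
  }
  return { first, produced: log.produced };
}
```

Under these assumptions, `firstEagerChunk(1000)` resolves with `produced` equal to 1000 even though the consumer stopped after one chunk, which is the waste described above.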


### Solution

- Lazily evaluate and yield each chunk as soon as it's ready.
- Skip chunks that exceed the maxChunkSize immediately.
- Avoid allocating memory for unused chunks.

Refactor chunkDocument like this:

```ts
export async function* chunkDocument({
  filepath,
  contents,
  maxChunkSize,
  digest,
}: ChunkDocumentParam): AsyncGenerator<Chunk> {
  let index = 0;

  for await (const chunkWithoutId of chunkDocumentWithoutId(
    filepath,
    contents,
    maxChunkSize,
  )) {
    const tokenCount = await countTokensAsync(chunkWithoutId.content);

    if (tokenCount > maxChunkSize) {
      continue; // skip oversized chunks
    }

    yield {
      ...chunkWithoutId,
      digest,
      index,
      filepath,
    };

    index++;
  }
}
```
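
The effect on a consumer that only needs the first n chunks can be sketched as follows. This is an illustration under stated assumptions, not the real retrieval pipeline: `RawChunk`, `produceRawChunks`, `countTokens` (a naive whitespace counter standing in for countTokensAsync), `lazyChunks`, `takeFirst`, and `demoLazy` are all hypothetical names introduced here.

```ts
// Hypothetical minimal chunk shape, for illustration only.
interface RawChunk {
  content: string;
}

// Hypothetical stand-in for chunkDocumentWithoutId. Every third chunk is
// deliberately oversized (10 words) so the skip path is exercised.
async function* produceRawChunks(
  total: number,
  log: { produced: number },
): AsyncGenerator<RawChunk> {
  for (let i = 0; i < total; i++) {
    log.produced++;
    const content = i % 3 === 2 ? "word ".repeat(10).trim() : `chunk ${i}`;
    yield { content };
  }
}

// Naive whitespace token counter, standing in for countTokensAsync.
async function countTokens(text: string): Promise<number> {
  return text.split(/\s+/).filter(Boolean).length;
}

// Lazy version: each chunk is checked and yielded as soon as it is ready.
async function* lazyChunks(
  total: number,
  maxChunkSize: number,
  log: { produced: number },
): AsyncGenerator<RawChunk & { index: number }> {
  let index = 0;
  for await (const raw of produceRawChunks(total, log)) {
    if ((await countTokens(raw.content)) > maxChunkSize) {
      continue; // skip oversized chunks
    }
    yield { ...raw, index };
    index++;
  }
}

// Collects at most n items, then stops pulling from the source.
async function takeFirst<T>(source: AsyncIterable<T>, n: number): Promise<T[]> {
  const out: T[] = [];
  if (n <= 0) {
    return out;
  }
  for await (const item of source) {
    out.push(item);
    if (out.length >= n) {
      break; // breaking ends the generator, so no further chunks are produced
    }
  }
  return out;
}

async function demoLazy(): Promise<{ indices: number[]; produced: number }> {
  const log = { produced: 0 };
  const firstThree = await takeFirst(lazyChunks(1000, 5, log), 3);
  return { indices: firstThree.map((c) => c.index), produced: log.produced };
}
```

With these stand-ins, `demoLazy()` resolves with indices 0, 1, 2 after producing only 4 of the 1000 raw chunks (one oversized chunk is skipped along the way). Breaking out of `for await` calls the generator's `return()`, which is what stops the upstream work.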

In practice, this saves a lot of memory and time on large files (more than 1000 lines). I can open a pull request for this if needed.

Refactor chunkDocument in chunk.ts to avoid processing all chunks upfront #6258