Refactor chunkDocument in chunk.ts to avoid processing all chunks upfront #6258

Open
@clmentcoutet

Description

Validations

  • I believe this is a way to improve. I'll try to join the Continue Discord for questions
  • I'm not able to find an open issue that requests the same enhancement

Problem

The current implementation of chunkDocument builds the full list of chunk promises before yielding any result. This is inefficient and unnecessary when only the first n chunks are needed in practice (for example in BaseRetrievalPipeline.ts).

The function collects all chunkPromises before yielding.
As a result, memory and processing time are wasted on large documents.
It is currently structured like this:

const chunkPromises: Promise<Chunk | undefined>[] = [];
for await (const chunkWithoutId of chunkDocumentWithoutId(...)) {
  chunkPromises.push(new Promise(...));
}
for await (const chunk of chunkPromises) {
  yield chunk;
}
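
To make the cost concrete, here is a minimal self-contained sketch of the same eager pattern. It does not use the real Continue functions: `fakeCountTokens`, `fakeChunkSource`, `eagerChunks`, and `firstEagerChunk` are hypothetical stand-ins, and the counter exists only to show how often the per-chunk work runs.

```typescript
// Hypothetical stand-ins: none of these names exist in the Continue codebase.
let tokenCountCalls = 0;

// Pretend token counter; bumps a counter so we can see how often it runs.
async function fakeCountTokens(content: string): Promise<number> {
  tokenCountCalls++;
  return content.length;
}

// Pretend chunker that produces `total` small chunks.
async function* fakeChunkSource(total: number): AsyncGenerator<string> {
  for (let i = 0; i < total; i++) {
    yield `chunk-${i}`;
  }
}

// Eager pattern from the issue: start the work for every chunk, then yield.
async function* eagerChunks(total: number): AsyncGenerator<string> {
  const promises: Promise<string>[] = [];
  for await (const content of fakeChunkSource(total)) {
    promises.push(fakeCountTokens(content).then(() => content));
  }
  for await (const chunk of promises) {
    yield chunk;
  }
}

// Consumer that only needs the first chunk.
async function firstEagerChunk(
  total: number,
): Promise<{ first: string | undefined; calls: number }> {
  tokenCountCalls = 0;
  let first: string | undefined;
  for await (const chunk of eagerChunks(total)) {
    first = chunk;
    break; // too late: every chunk has already been token-counted
  }
  return { first, calls: tokenCountCalls };
}
```

With this sketch, `firstEagerChunk(1000)` resolves to the first chunk, but `calls` is 1000: the consumer stopping early saves nothing because all the work was started before the first yield.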

Solution

  • Lazily evaluate and yield each chunk as soon as it's ready.
  • Skip chunks that exceed the maxChunkSize immediately.
  • Avoid allocating memory for unused chunks.

Refactor chunkDocument like this:

export async function* chunkDocument({
  filepath,
  contents,
  maxChunkSize,
  digest,
}: ChunkDocumentParam): AsyncGenerator<Chunk> {
  let index = 0;

  for await (const chunkWithoutId of chunkDocumentWithoutId(
    filepath,
    contents,
    maxChunkSize,
  )) {
    const tokenCount = await countTokensAsync(chunkWithoutId.content);

    if (tokenCount > maxChunkSize) {
      continue; // skip oversized chunks
    }

    yield {
      ...chunkWithoutId,
      digest,
      index,
      filepath,
    };

    index++;
  }
}
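
On the consumer side, a caller that needs only the first n chunks could stop iterating early, and a lazy generator then does no further work. The sketch below illustrates this under stated assumptions: `takeFirst` is a hypothetical helper, and `lazyNumbers` is a stand-in for the refactored chunkDocument, with a counter in place of the token count. Neither is part of Continue.

```typescript
// Hypothetical helper, not part of Continue: collect at most `n` items, then stop.
async function takeFirst<T>(source: AsyncIterable<T>, n: number): Promise<T[]> {
  const taken: T[] = [];
  if (n <= 0) {
    return taken; // do not pull anything from the source
  }
  for await (const item of source) {
    taken.push(item);
    if (taken.length >= n) {
      break; // leaving the loop calls return() on the generator
    }
  }
  return taken;
}

// Counts how many items the stand-in generator has actually processed.
let processed = 0;

// Stand-in for a lazy chunkDocument: per-item work happens only on demand.
async function* lazyNumbers(total: number): AsyncGenerator<number> {
  for (let i = 0; i < total; i++) {
    processed++; // stands in for the token count on each chunk
    yield i;
  }
}
```

Calling `takeFirst(lazyNumbers(1000), 3)` resolves to `[0, 1, 2]` and leaves `processed` at 3, because the generator is never resumed after the third item.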

In practice, this is a significant memory and time gain for large files (more than 1000 lines). I can open a pull request for this if needed.

Labels

  • javascript (Pull requests that update Javascript code)
  • kind:enhancement (Indicates a new feature request, improvement, or extension)
