Validations
- I believe this is a way to improve. I'll try to join the Continue Discord for questions
- I'm not able to find an open issue that requests the same enhancement
Problem
The current implementation of chunkDocument builds the full list of chunk promises before yielding any results, which is inefficient and unnecessary when we only need the first n chunks in practice (for example in BaseRetrievalPipeline.ts).
- The function collects all chunkPromises before yielding anything.
- We waste memory and processing time when dealing with large documents.
It's currently structured like this:

```ts
const chunkPromises: Promise<Chunk | undefined>[] = [];
for await (const chunkWithoutId of chunkDocumentWithoutId(...)) {
  chunkPromises.push(new Promise(...));
}
for await (const chunk of chunkPromises) {
  yield chunk;
}
```
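To make the cost concrete, here is a sketch of a caller that only needs the first n chunks, which is the access pattern in BaseRetrievalPipeline.ts. firstNChunks is a hypothetical helper written for illustration, not code from the repository: with the eager implementation, breaking out of the loop early saves almost nothing, because a promise for every chunk in the document has already been created, and its work started, before the first chunk is yielded.

```ts
// Hypothetical consumer, for illustration only.
async function firstNChunks(
  params: ChunkDocumentParam,
  n: number,
): Promise<Chunk[]> {
  const chunks: Chunk[] = [];
  for await (const chunk of chunkDocument(params)) {
    chunks.push(chunk);
    if (chunks.length >= n) {
      // Too late to save work: with the eager implementation, all
      // chunk promises were already allocated before the first yield.
      break;
    }
  }
  return chunks;
}
```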
Solution
- Lazily evaluate and yield each chunk as soon as it's ready.
- Skip chunks whose token count exceeds maxChunkSize immediately.
- Avoid allocating memory for unused chunks.
Refactor chunkDocument like this:

```ts
export async function* chunkDocument({
  filepath,
  contents,
  maxChunkSize,
  digest,
}: ChunkDocumentParam): AsyncGenerator<Chunk> {
  let index = 0;
  for await (const chunkWithoutId of chunkDocumentWithoutId(
    filepath,
    contents,
    maxChunkSize,
  )) {
    const tokenCount = await countTokensAsync(chunkWithoutId.content);
    if (tokenCount > maxChunkSize) {
      continue; // skip oversized chunks
    }
    yield {
      ...chunkWithoutId,
      digest,
      index,
      filepath,
    };
    index++;
  }
}
```
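With the generator form, the same early-exit pattern genuinely short-circuits the work: break inside a for await loop calls the generator's return(), so chunkDocumentWithoutId is never resumed and no further countTokensAsync calls run. A minimal usage sketch, assuming filepath, contents, maxChunkSize, and digest are in scope inside an async function (the limit of 5 is arbitrary):

```ts
// Only the first five chunks are ever computed; `break` finishes the
// generator, so the rest of the document is never chunked or tokenized.
const firstChunks: Chunk[] = [];
for await (const chunk of chunkDocument({ filepath, contents, maxChunkSize, digest })) {
  firstChunks.push(chunk);
  if (firstChunks.length >= 5) {
    break;
  }
}
```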
In practice, this is a huge memory and time gain for large files (more than 1000 lines). I can open a pull request for this one if needed.