Defect when parsing large tables in RAG scenarios #7481
👀 @bsq1989 Thank you for raising an issue. We will investigate the matter and get back to you as soon as possible.
Hello @bsq1989! I'm here to help you solve bugs, answer questions, and assist you in becoming a contributor. While you wait for a human maintainer, feel free to let me know how I can assist you! To address the issue of handling large Excel tables in the RAG process without exceeding the embedding model's input limit, you can implement a chunking strategy that ensures the data is divided into manageable parts. Here's a refined approach based on your modification:
Here's a code snippet reflecting these changes:

```ts
// Step 1: Decode the Base64 encoded string to binary
const binaryBuffer = Buffer.from(base64EncodedString, 'base64');

// Step 2: Decompress the binary data into the original partition elements
const elements = await this.decompressGzip(binaryBuffer);

// Merge the table back into a single element only when it is small enough
const mergeFlag = element.type === 'Table' && elements[0].text.length < 5000;

if (mergeFlag) {
  // Continuation chunks of an already-merged table can be skipped
  if (element.metadata.is_continuation) {
    return;
  }
  compositeElements = [...compositeElements, elements[0]];
  originElements = [...originElements, elements[0]];
  return;
}

// For larger tables, keep the pre-chunked elements instead of the merged table
compositeElements = [...compositeElements, element];
originElements = originElements.concat(
  elements.map(
    (e) => ({ ...e, compositeId: element.element_id }) as UnstructuredPartitionElement,
  ),
);
```

This approach ensures that large tables are split into smaller chunks, preventing issues with the embedding model's input limits, while maintaining the integrity of the original elements.
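The snippet above calls a `this.decompressGzip` helper that is not shown. As a rough sketch only (the element shape and the helper below are assumptions, not the project's actual implementation), such a helper could decompress the gzip payload with Node's built-in zlib and parse the resulting JSON array:

```ts
import { promisify } from 'node:util';
import { gunzip } from 'node:zlib';

// Assumed element shape; the real UnstructuredPartitionElement type lives in the project.
interface UnstructuredPartitionElement {
  element_id: string;
  metadata: { is_continuation?: boolean; [key: string]: unknown };
  text: string;
  type: string;
  compositeId?: string;
}

const gunzipAsync = promisify(gunzip);

// Decompress the gzip-encoded orig_elements payload and parse it as a JSON array
async function decompressGzip(buffer: Buffer): Promise<UnstructuredPartitionElement[]> {
  const decompressed = await gunzipAsync(buffer);
  return JSON.parse(decompressed.toString('utf8')) as UnstructuredPartitionElement[];
}
```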
📦 Deployment environment
Docker
📦 Deployment mode
Server-side mode (lobe-chat-database image)
📌 Software version
1.73.0
💻 System environment
Other Linux, Ubuntu
🌐 Browser
Chrome
🐛 Problem description
```ts
// Your Base64 encoded string
const base64EncodedString = element.metadata.orig_elements as string;
delete element.metadata.orig_elements;
```
I run a self-hosted unstructured service. When unstructured parses an Excel file, the content of the entire table is stored, base64-encoded, in orig_elements, and the normally chunked elements are marked with is_continuation. The logic here, for the sake of overall RAG quality, passes the content of the whole table on to the downstream embedding pipeline.
There is a hidden problem with this: when the table is very large, it exceeds the embedding model's input limit. Taking bge-m3 as an example, the input limit is 8k tokens; once the total length of the table exceeds that limit, vectorization fails and the rest of the pipeline cannot continue.
In addition, it seems that originElements are stored in the unstructured_chunck table, but this table is never queried during RAG?
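To make the failure mode concrete, a rough guard along the following lines would show when a merged table is about to exceed the model's input window. This is only an illustration: the helper is hypothetical, and the 4-characters-per-token ratio is a crude estimate, not how bge-m3 actually tokenizes.

```ts
// Crude token estimate (~4 characters per token); illustrative only
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// bge-m3 accepts roughly 8k input tokens; a merged table beyond this fails to embed
const BGE_M3_TOKEN_LIMIT = 8192;

const canEmbedWholeTable = (tableText: string): boolean =>
  estimateTokens(tableText) <= BGE_M3_TOKEN_LIMIT;
```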
📷 Reproduction steps
Prepare an Excel file and write a large amount of text into a single sheet.
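For example, a sufficiently large test file could be generated with the SheetJS xlsx package (a hypothetical helper script, not part of the original report):

```ts
import * as XLSX from 'xlsx';

// Build a single sheet whose cells together hold far more text than an 8k-token
// embedding window, to reproduce the oversized-table failure.
const rows = Array.from({ length: 2000 }, (_, i) => [
  `row ${i}`,
  'some long descriptive text '.repeat(10),
]);

const sheet = XLSX.utils.aoa_to_sheet(rows);
const workbook = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(workbook, sheet, 'Sheet1');
XLSX.writeFile(workbook, 'large-table.xlsx');
```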
🚦 Expected result
📝 Additional information
I made a simple modification to the code here and it passed my local tests. If you think it is acceptable, feel free to include this logic in a new version. My time is rather limited at the moment, so I cannot submit a PR directly.

```ts
// Step 1: Decode the Base64 encoded string to binary
const binaryBuffer = Buffer.from(base64EncodedString, 'base64');
```
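The full modification is not included above. Purely as an illustration (this is a sketch, not the author's actual patch), splitting an oversized merged table back into pieces that fit the embedding model could look something like the following, assuming a character budget derived from the model's token limit:

```ts
// Hypothetical budget: ~8k tokens for bge-m3 at a rough 4 characters per token.
// Both numbers are heuristics and would need tuning for the real tokenizer.
const MAX_TABLE_CHARS = 8192 * 4;

// Split a merged table's text into pieces small enough for the embedding model,
// so vectorization does not fail when one sheet contains a huge amount of text.
function splitTableText(text: string, maxChars: number = MAX_TABLE_CHARS): string[] {
  const chunks: string[] = [];
  for (let offset = 0; offset < text.length; offset += maxChars) {
    chunks.push(text.slice(offset, offset + maxChars));
  }
  return chunks;
}
```

Each resulting piece could then be appended to compositeElements as its own element instead of the single merged table.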