RAG 场景大Table解析缺陷 #7481


Open
bsq1989 opened this issue Apr 19, 2025 · 2 comments
Labels
🐛 Bug Something isn't working | 缺陷 unconfirm 未被维护者确认的问题

Comments


bsq1989 commented Apr 19, 2025

📦 Deployment environment

Docker

📦 Deployment mode

Server-side mode (lobe-chat-database image)

📌 Software version

1.73.0

💻 System environment

Other Linux, Ubuntu

🌐 Browser

Chrome

🐛 Bug description

  • Code location: src/libs/unstructured/index.ts, lines 108–124

```ts
// Your Base64 encoded string
const base64EncodedString = element.metadata.orig_elements as string;
delete element.metadata.orig_elements;

if (!base64EncodedString) return;

// Step 1: Decode the Base64 encoded string to binary
const binaryBuffer = Buffer.from(base64EncodedString, 'base64');

// Step 2: Decompress the binary data
const elements = await this.decompressGzip(binaryBuffer);

// if the element is a Table, use the original (unsplit) table content
if (element.type === 'Table') {
  // skip continuation tables produced by the chunking strategy
  if (element.metadata.is_continuation) {
    return;
  }

  compositeElements = [...compositeElements, elements[0]];
  originElements = [...originElements, elements[0]];
  return;
}

compositeElements = [...compositeElements, element];

originElements = originElements.concat(
  elements.map(
    (e) => ({ ...e, compositeId: element.element_id }) as UnstructuredPartitionElement,
  ),
);
```
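For context, `decompressGzip` itself is not shown above. A minimal Node.js sketch of what it plausibly does, assuming `orig_elements` is a gzip-compressed JSON array of elements (the function name matches the call site, but the implementation here is an illustration, not the project's actual code):

```typescript
import { promisify } from 'node:util';
import { gunzip, gzipSync } from 'node:zlib'; // gzipSync is only used in the round-trip demo

const gunzipAsync = promisify(gunzip);

// Sketch (an assumption, not the project's implementation): treat the payload
// as a gzip-compressed JSON array of partition elements.
async function decompressGzip(buffer: Buffer): Promise<any[]> {
  const decompressed = await gunzipAsync(buffer);
  return JSON.parse(decompressed.toString('utf8'));
}

// Round-trip demo with a fake orig_elements payload.
const demo = [{ text: 'a,b\n1,2', type: 'Table' }];
const base64 = gzipSync(Buffer.from(JSON.stringify(demo))).toString('base64');
decompressGzip(Buffer.from(base64, 'base64')).then((elements) => {
  console.log(elements[0].type); // prints "Table"
});
```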
I run a self-hosted unstructured service. When unstructured parses an Excel file, it stores the entire table's content in the base64-encoded `orig_elements`, and the normally chunked elements are marked `is_continuation`. The logic here, for the sake of overall RAG quality, passes the whole table content on to the downstream embedding step.

This hides a problem: when the table is very large, it exceeds the embedding model's input limit. Taking bge-m3 as an example, its input limit is 8k tokens; once the table's total length exceeds that, vectorization fails and the rest of the pipeline cannot continue.

Also, it looks like `originElements` is stored in the `unstructured_chunck` table, but that table is never queried during RAG?
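One possible direction for a fix, beyond just skipping the merge for large tables, is to split an oversized table's text row by row into pieces under a size budget before embedding. A rough sketch: `splitTableText` and the budget value are illustrative, not part of the codebase, and a character count is only a crude proxy for the tokenizer's actual limit:

```typescript
// Split table text into chunks no longer than maxChars, breaking on row
// boundaries. A single row longer than maxChars is kept whole (it would
// need a finer-grained split in practice).
function splitTableText(text: string, maxChars: number): string[] {
  const rows = text.split('\n');
  const chunks: string[] = [];
  let current = '';
  for (const row of rows) {
    if (current && current.length + row.length + 1 > maxChars) {
      chunks.push(current);
      current = '';
    }
    current = current ? `${current}\n${row}` : row;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk could then be emitted as its own element (with `compositeId` pointing back at the table element) so the existing embedding path handles it unchanged.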

📷 Reproduction steps

Prepare an Excel file and write a large amount of text into a single sheet.

🚦 Expected result

  • A large Excel table is chunked correctly without exceeding the embedding model's input limit.

📝 Additional information

I made a simple change to the code, shown below, and it passed my local tests. If you think it is acceptable, feel free to include this logic in a future release. My time is limited right now, so I cannot submit a PR directly.
```ts
// Step 1: Decode the Base64 encoded string to binary
const binaryBuffer = Buffer.from(base64EncodedString, 'base64');

// Step 2: Decompress the binary data
const elements = await this.decompressGzip(binaryBuffer);

// merge only when the table's full text is small enough for the embedding model
const merge_Flag = element.type === 'Table' && elements[0].text.length < 5000;

if (merge_Flag) {
  if (element.metadata.is_continuation) {
    return;
  }

  compositeElements = [...compositeElements, elements[0]];
  originElements = [...originElements, elements[0]];
  return;
}

compositeElements = [...compositeElements, element];
originElements = originElements.concat(
  elements.map(
    (e) => ({ ...e, compositeId: element.element_id }) as UnstructuredPartitionElement,
  ),
);
```

@bsq1989 bsq1989 added the unconfirm 未被维护者确认的问题 label Apr 19, 2025
@lobehubbot
Member

👀 @bsq1989

Thank you for raising an issue. We will investigate into the matter and get back to you as soon as possible.
Please make sure you have given us as much context as possible.


dosubot bot commented Apr 19, 2025

Hello @bsq1989! I'm here to help you solve bugs, answer questions, and assist you in becoming a contributor. While you wait for a human maintainer, feel free to let me know how I can assist you!

To address the issue of handling large Excel tables in the RAG process without exceeding the embedding model's input limit, you can implement a chunking strategy that ensures the data is divided into manageable parts. Here's a refined approach based on your modification:

  1. Chunking Strategy: Implement a condition to check the length of the text in the table elements. If the text length is below a certain threshold (e.g., 5000 characters), you can process it as a single element. Otherwise, apply a chunking strategy to split the data into smaller parts.

  2. Handling Large Tables: For large tables, ensure that the data is split into chunks that do not exceed the model's token limit. This prevents vectorization failures due to input size constraints.

  3. Storing and Querying Origin Elements: Ensure that originElements are correctly stored and can be queried during the RAG process. This involves appending decompressed elements to originElements and ensuring they are accessible for subsequent processing.

Here's a code snippet reflecting these changes:

```ts
// Step 1: Decode the Base64 encoded string to binary
const binaryBuffer = Buffer.from(base64EncodedString, 'base64');

// Step 2: Decompress the binary data
const elements = await this.decompressGzip(binaryBuffer);

// Check if the table element should be merged or split
const merge_Flag = element.type === 'Table' && elements[0].text.length < 5000;

if (merge_Flag) {
  if (element.metadata.is_continuation) {
    return;
  }

  compositeElements = [...compositeElements, elements[0]];
  originElements = [...originElements, elements[0]];
  return;
}

// For larger elements, handle them appropriately
compositeElements = [...compositeElements, element];
originElements = originElements.concat(
  elements.map(
    (e) => ({ ...e, compositeId: element.element_id }) as UnstructuredPartitionElement,
  ),
);
```

This approach ensures that large tables are split into smaller chunks, preventing issues with the embedding model's input limits, and maintains the integrity of the originElements for querying during the RAG process.


@dosubot dosubot bot added the 🐛 Bug Something isn't working | 缺陷 label Apr 19, 2025
@github-project-automation github-project-automation bot moved this to Roadmap - Chat 1.x in Lobe Chat Routine Apr 19, 2025