[Bug]: Additional exception handling needed in retriever.py #7791
Comments
🤖 Thank you for reporting this issue. It seems like the problem arises when the number of chunks is smaller than k from `similarity_top_k=k`, causing a `KeyError: '-1'`. To prevent this error, you can add a check to ensure that the length of `query_result.ids` is not less than the `similarity_top_k` value. If it is, you can either raise an error or adjust the `similarity_top_k` value to match the length of `query_result.ids`. Here is how you can modify the `_build_node_list_from_query_result` method to include this check:

```python
def _build_node_list_from_query_result(
    self, query_result: VectorStoreQueryResult
) -> List[NodeWithScore]:
    if query_result.nodes is None:
        if query_result.ids is None:
            raise ValueError(
                "Vector store query result should return at "
                "least one of nodes or ids."
            )
        assert isinstance(self._index.index_struct, IndexDict)
        # Check if the length of query_result.ids is less than similarity_top_k
        if len(query_result.ids) < self._similarity_top_k:
            raise ValueError(
                "The number of chunks is smaller than the length k from similarity_top_k=k."
            )
        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
        ]
        nodes = self._docstore.get_nodes(node_ids)
        query_result.nodes = nodes
    else:
        for i in range(len(query_result.nodes)):
            source_node = query_result.nodes[i].source_node
            if (not self._vector_store.stores_text) or (
                source_node is not None and source_node.node_type != ObjectType.TEXT
            ):
                node_id = query_result.nodes[i].node_id
                if node_id in self._docstore.docs:
                    query_result.nodes[i] = self._docstore.get_node(node_id)

    log_vector_store_query_result(query_result)

    node_with_scores: List[NodeWithScore] = []
    for ind, node in enumerate(query_result.nodes):
        score: Optional[float] = None
        if query_result.similarities is not None:
            score = query_result.similarities[ind]
        node_with_scores.append(NodeWithScore(node=node, score=score))

    return node_with_scores
```

In this modification, if the length of `query_result.ids` is less than the `similarity_top_k` value, a `ValueError` is raised with a message indicating the issue. This will prevent the `KeyError` from occurring. This issue seems similar to a couple of previously solved issues in the LlamaIndex repository, namely "[Bug]: KeyError in retriever.retrieve when sent index does not include all nodes in document store" and "Potential bug with 0.6.0 shared storage_context and vector store retrieval". The solutions provided in those issues might also be helpful for you. Please try this solution and let me know if it resolves your issue. If you have any further questions or if the problem persists, feel free to ask.
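Note that the added length check does not actually catch the failing case described below: when fewer than k chunks match, FAISS pads the id list to exactly k entries with -1, so `len(query_result.ids)` still equals `similarity_top_k`. A tiny sketch using the ids from the bug description below:

```python
# Values taken from the bug description: 2 real chunks plus one -1 pad.
query_result_ids = ["0", "1", "-1"]
similarity_top_k = 3

# The suggested guard never fires, because the padded list is still length k.
assert not (len(query_result_ids) < similarity_top_k)
```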
Sorry, there was a typo in my original post. Inside the `_build_node_list_from_query_result()` function in retriever.py, the original line was `node_ids = [self._index.index_struct.nodes_dict[idx] for idx in query_result.ids]`.

This needs to be changed to `node_ids = [self._index.index_struct.nodes_dict[idx] for idx in query_result.ids if idx in self._index.index_struct.nodes_dict]`.

The solution provided by the bot doesn't work (I believe it's just the original code). So to summarize, the `if idx in self._index.index_struct.nodes_dict` guard is what's needed to fix this problem.
Hi, @mphipps2, I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you reported a bug in retriever.py that caused a KeyError when the number of chunks was smaller than the length k from similarity_top_k=k. I provided a potential solution, but you clarified that there was a typo in your original post and provided the corrected code snippet to fix the problem. JingggW reacted with a thumbs up to your clarification. Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your contribution to the LlamaIndex repository! If you have any further questions or updates, feel free to reach out.
I needed to apply the fix manually because it is still not updated in the PR; my retriever.py was also still the old version from last year. The error only occurs when no chunks are returned at all, i.e. when a search query contains a particular combination of words that somehow doesn't yield any results. I found that if I modified or omitted just one word from my query text, the retriever would find chunks again and the problem wouldn't surface, so it also depends on the query text. I am wondering whether it also depends on how the vector store and index were built. I am using faiss-cpu with the "BAAI/bge-small-en-v1.5" embedding model from Hugging Face.
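This matches how FAISS behaves when a search finds fewer hits than the requested k: the missing slots in the returned id array are padded with -1, and that placeholder is what later reaches the retriever as the string '-1'. A minimal sketch using faiss-cpu alone (toy dimensions, no LlamaIndex involved):

```python
import faiss
import numpy as np

d = 8  # toy embedding dimension
index = faiss.IndexFlatL2(d)
index.add(np.random.rand(2, d).astype("float32"))  # only 2 vectors stored

# Ask for more neighbors than the index contains.
distances, ids = index.search(np.random.rand(1, d).astype("float32"), 5)
print(ids)  # e.g. [[0 1 -1 -1 -1]] -- unfilled slots are padded with -1
```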
Bug Description
This line from `_build_node_list_from_query_result` in retriever.py:

```python
node_ids = [
    self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
]
```

causes a crash with `KeyError: '-1'` when the number of chunks is smaller than k from `similarity_top_k=k`. This happens because `query_result.ids = ['0', '1', '-1']`, so the `'-1'` then causes the crash. This should be an easy fix; something like this would do the trick:
**Edit (10/03/23): fixing previous typo. The correct solution is:**
```python
node_ids = [
    self._index.index_struct.nodes_dict[idx]
    for idx in query_result.ids
    if idx in self._index.index_struct.nodes_dict
]
```
or, at the very least, some sort of exception-handling logic should be in place.
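For illustration, the exception-handling variant could look roughly like the sketch below. This is one possible shape for the guard, not the code that was actually merged:

```python
# Sketch: skip ids with no entry in nodes_dict instead of crashing.
# FAISS pads its results with -1 when fewer than k chunks match, and
# those placeholder ids are never present in nodes_dict.
node_ids = []
for idx in query_result.ids:
    try:
        node_ids.append(self._index.index_struct.nodes_dict[idx])
    except KeyError:
        continue
```

This behaves the same as the `if idx in ...` filter above: placeholder ids are silently dropped and only real chunks are looked up in the docstore.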
Version
0.8.30
Steps to Reproduce
Running a simple document QA pipeline with a custom LLM class, a local embedding model, and FAISS.
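A minimal reproduction might look roughly like the following. This is a sketch assuming 0.8.x-era import paths and the embedding model mentioned in the comments above; the document text and query are illustrative, not taken from the actual pipeline:

```python
import faiss
from llama_index import Document, ServiceContext, StorageContext, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.vector_stores import FaissVectorStore

# "BAAI/bge-small-en-v1.5" produces 384-dimensional embeddings.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(384))

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index a single short document, so far fewer than top-k chunks exist.
index = VectorStoreIndex.from_documents(
    [Document(text="A single tiny document.")],
    storage_context=storage_context,
    service_context=service_context,
)

# Requesting 5 results when at most 1 chunk can match makes FAISS pad
# the returned ids with -1, which triggered KeyError: '-1' in 0.8.30.
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("some query text")
```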
Relevant Logs/Tracebacks