[Bug]: Additional exception handling needed in retriever.py #7791

Closed
mphipps2 opened this issue Sep 22, 2023 · 4 comments
Labels
bug (Something isn't working) · triage (Issue needs to be triaged/prioritized)

Comments


mphipps2 commented Sep 22, 2023

Bug Description

This line from _build_node_list_from_query_result in retriever.py:

node_ids = [
    self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
]

causes a crash and raises a KeyError: '-1' when the number of chunks is smaller than the k from similarity_top_k=k. This happens because query_result.ids comes back as, e.g., ['0', '1', '-1'], and the '-1' id is not in nodes_dict. This should be an easy fix; something like this would do the trick:

Edit (10/03/23): fixing previous typo. The correct solution is:

node_ids = [
    self._index.index_struct.nodes_dict[idx] for idx in query_result.ids if idx in self._index.index_struct.nodes_dict
]

or, at the very least, some sort of exception handling logic should be in place.
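
For context, the '-1' entries come from FAISS itself: when the index holds fewer vectors than the requested k, FAISS pads the unfilled result slots with an id of -1, which the FAISS vector store then passes along as the string id '-1'. A minimal standalone sketch of that behavior (independent of LlamaIndex):

import faiss
import numpy as np

d = 8  # toy dimension
index = faiss.IndexFlatL2(d)
index.add(np.random.rand(2, d).astype("float32"))  # only 2 vectors stored

# ask for more neighbors than there are stored vectors
distances, ids = index.search(np.random.rand(1, d).astype("float32"), k=5)
print(ids)  # e.g. [[ 0  1 -1 -1 -1]] -- the unfilled slots are -1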

Version

0.8.30

Steps to Reproduce

Running a simple document QA pipeline with a custom LLM class, a local embedding model, and a FAISS vector store.
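
The reporter's exact setup isn't shown, but a minimal sketch along these lines reproduces the same failure mode on llama-index 0.8.x (assuming faiss-cpu is installed; the document text, query, and the use of the bundled local embedding model are placeholder choices, not the reporter's actual components):

import faiss
from llama_index import Document, ServiceContext, StorageContext, VectorStoreIndex
from llama_index.vector_stores import FaissVectorStore

# FAISS dimension must match the embedding model; 384 fits the default local
# bge-small model used below (swap in your own embedding model / dimension)
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(384))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(embed_model="local", llm=None)

# a single short document -> fewer chunks than similarity_top_k below
index = VectorStoreIndex.from_documents(
    [Document(text="one short document, so only one chunk in the index")],
    storage_context=storage_context,
    service_context=service_context,
)

retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("some query")  # KeyError: '-1' before the fix

With only one chunk stored, FAISS pads the remaining top-5 ids with -1, which is exactly what surfaces as the KeyError: '-1' in the traceback below.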

Relevant Logs/Tracebacks

Traceback (most recent call last):
File "/python3.9/site-packages/llama_index/indices/vector_store/retrievers/retriever.py", line 165, in _get_nodes_with_embeddings
    return self._build_node_list_from_query_result(query_result)
File "/python3.9/site-packages/llama_index/indices/vector_store/retrievers/retriever.py", line 129, in <listcomp>
    self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
KeyError: '-1'
mphipps2 added the bug and triage labels Sep 22, 2023

dosubot bot commented Sep 22, 2023

🤖

Thank you for reporting this issue. It seems like the problem arises when the number of chunks is smaller than the length k from similarity_top_k=k, causing a KeyError: '-1'.

To prevent this error, you can add a check to ensure that the length of the query_result.ids is not less than the similarity_top_k value. If it is, you can either raise an error or adjust the similarity_top_k value to match the length of query_result.ids.

Here is how you can modify the _build_node_list_from_query_result method to include this check:

def _build_node_list_from_query_result(
    self, query_result: VectorStoreQueryResult
) -> List[NodeWithScore]:
    if query_result.nodes is None:
        if query_result.ids is None:
            raise ValueError(
                "Vector store query result should return at "
                "least one of nodes or ids."
            )
        assert isinstance(self._index.index_struct, IndexDict)

        # Check if the length of query_result.ids is less than similarity_top_k
        if len(query_result.ids) < self._similarity_top_k:
            raise ValueError(
                "The number of chunks is smaller than the length k from similarity_top_k=k."
            )

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
        ]
        nodes = self._docstore.get_nodes(node_ids)
        query_result.nodes = nodes
    else:
        for i in range(len(query_result.nodes)):
            source_node = query_result.nodes[i].source_node
            if (not self._vector_store.stores_text) or (
                source_node is not None and source_node.node_type != ObjectType.TEXT
            ):
                node_id = query_result.nodes[i].node_id
                if node_id in self._docstore.docs:
                    query_result.nodes[i] = self._docstore.get_node(node_id)

    log_vector_store_query_result(query_result)

    node_with_scores: List[NodeWithScore] = []
    for ind, node in enumerate(query_result.nodes):
        score: Optional[float] = None
        if query_result.similarities is not None:
            score = query_result.similarities[ind]
        node_with_scores.append(NodeWithScore(node=node, score=score))

    return node_with_scores

In this modification, if the length of query_result.ids is less than the similarity_top_k value, a ValueError is raised with a message indicating the issue. This will prevent the KeyError from occurring.

This issue seems to be similar to a couple of previously solved issues in the LlamaIndex repository, namely "[Bug]: KeyError in retriever.retrieve when sent index does not include all nodes in document store" and "Potential bug with 0.6.0 shared storage_context and vector store retrieval". The solutions provided in these issues might also be helpful for you.

Please try this solution and let me know if it resolves your issue. If you have any further questions or if the problem persists, feel free to ask.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

mphipps2 (Author) commented Oct 3, 2023

Sorry, there was a typo in my original post. Inside the _build_node_list_from_query_result() function in retriever.py, the original line was:

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
        ]

This needs to be changed to:

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids if idx in self._index.index_struct.nodes_dict
        ]

The solution provided by the bot doesn't work (I believe it's essentially just the original code). So, to summarize, this is what's needed to fix the problem:

def _build_node_list_from_query_result(
    self, query_result: VectorStoreQueryResult
) -> List[NodeWithScore]:
    if query_result.nodes is None:
        # NOTE: vector store does not keep text and returns node indices.
        # Need to recover all nodes from docstore
        if query_result.ids is None:
            raise ValueError(
                "Vector store query result should return at "
                "least one of nodes or ids."
            )
        assert isinstance(self._index.index_struct, IndexDict)

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids if idx in self._index.index_struct.nodes_dict
        ]

        nodes = self._docstore.get_nodes(node_ids)
        query_result.nodes = nodes
    else:
        # NOTE: vector store keeps text, returns nodes.
        # Only need to recover image or index nodes from docstore
        for i in range(len(query_result.nodes)):
            source_node = query_result.nodes[i].source_node
            if (not self._vector_store.stores_text) or (
                source_node is not None and source_node.node_type != ObjectType.TEXT
            ):
                node_id = query_result.nodes[i].node_id
                if self._docstore.document_exists(node_id):
                    query_result.nodes[
                        i
                    ] = self._docstore.get_node(  # type: ignore[index]
                        node_id
                    )

    log_vector_store_query_result(query_result)

    node_with_scores: List[NodeWithScore] = []
    for ind, node in enumerate(query_result.nodes):
        score: Optional[float] = None
        if query_result.similarities is not None:
            score = query_result.similarities[ind]
        node_with_scores.append(NodeWithScore(node=node, score=score))

    return node_with_scores


dosubot bot commented Jan 30, 2024

Hi, @mphipps2,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you reported a bug in retriever.py that caused a KeyError when the number of chunks was smaller than the length k from similarity_top_k=k. I provided a potential solution, but you clarified that there was a typo in your original post and provided the corrected code snippet to fix the problem. JingggW reacted with a thumbs up to your clarification.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LlamaIndex repository! If you have any further questions or updates, feel free to reach out.

dosubot bot added the stale label (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) Jan 30, 2024
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 6, 2024
dosubot bot removed the stale label Feb 6, 2024
@bruhacsA

I needed to apply the fix manually because it still hasn't been updated in a PR; my retriever.py was also still the old version from last year.

The error only occurs when no chunks are returned at all, i.e. when a search query contains a particular combination of words that somehow doesn't yield any results. I found that if I modified or omitted just one word from my query text, the retriever would find chunks again and the problem wouldn't surface. So it also depends on the query text.

I am wondering if it also depends on how the vector store and index were built. I am using faiss-cpu with the "BAAI/bge-small-en-v1.5" embedding model from Hugging Face.
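
For anyone in the same spot, one quick way to check whether the installed copy already contains the guard before patching by hand — just a sketch; the module path and class name are taken from the traceback above, so adjust them if your version lays things out differently:

import inspect

from llama_index.indices.vector_store.retrievers import retriever

# look for the "if idx in ... nodes_dict" filter inside the installed method
src = inspect.getsource(
    retriever.VectorIndexRetriever._build_node_list_from_query_result
)
if "if idx in self._index.index_struct.nodes_dict" in src:
    print("already patched")
else:
    print("still the old version -- apply the fix manually")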
