[Bug]: Additional exception handling needed in retriever.py #7791

Closed
mphipps2 opened this issue Sep 22, 2023 · 4 comments
Labels
bug (Something isn't working) · triage (Issue needs to be triaged/prioritized)

Comments


mphipps2 commented Sep 22, 2023

Bug Description

This line from _build_node_list_from_query_result in retriever.py:

node_ids = [
    self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
]

causes a crash and raises a KeyError: '-1' when the number of chunks is smaller than the k from similarity_top_k=k. This happens because query_result.ids comes back as, e.g., ['0', '1', '-1'], and the '-1' id is not in nodes_dict. This should be an easy fix; something like this would do the trick:

Edit (10/03/23): fixing previous typo. The correct solution is:

node_ids = [
    self._index.index_struct.nodes_dict[idx] for idx in query_result.ids if idx in self._index.index_struct.nodes_dict
]

or, at the very least, some sort of exception handling logic should be in place.
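
For context, the '-1' entries come from FAISS itself: when the index holds fewer vectors than the requested k, FAISS pads the unfilled result slots with an id of -1, which the FAISS vector store then passes along as the string id '-1'. A minimal standalone sketch of that behavior (independent of LlamaIndex):

import faiss
import numpy as np

d = 8  # toy dimension
index = faiss.IndexFlatL2(d)
index.add(np.random.rand(2, d).astype("float32"))  # only 2 vectors stored

# ask for more neighbors than there are stored vectors
distances, ids = index.search(np.random.rand(1, d).astype("float32"), k=5)
print(ids)  # e.g. [[ 0  1 -1 -1 -1]] -- the unfilled slots are -1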

Version

0.8.30

Steps to Reproduce

Running a simple document QA pipeline with a custom LLM class, a local embedding model, and a FAISS vector store.
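
The reporter's exact setup isn't shown, but a minimal sketch along these lines reproduces the same failure mode on llama-index 0.8.x (assuming faiss-cpu is installed; the document text, query, and the use of the bundled local embedding model are placeholder choices, not the reporter's actual components):

import faiss
from llama_index import Document, ServiceContext, StorageContext, VectorStoreIndex
from llama_index.vector_stores import FaissVectorStore

# FAISS dimension must match the embedding model; 384 fits the default local
# bge-small model used below (swap in your own embedding model / dimension)
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(384))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(embed_model="local", llm=None)

# a single short document -> fewer chunks than similarity_top_k below
index = VectorStoreIndex.from_documents(
    [Document(text="one short document, so only one chunk in the index")],
    storage_context=storage_context,
    service_context=service_context,
)

retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("some query")  # KeyError: '-1' before the fix

With only one chunk stored, FAISS pads the remaining top-5 ids with -1, which is exactly what surfaces as the KeyError: '-1' in the traceback below.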

Relevant Logs/Tracebacks

Traceback (most recent call last):
File "/python3.9/site-packages/llama_index/indices/vector_store/retrievers/retriever.py", line 165, in _get_nodes_with_embeddings
    return self._build_node_list_from_query_result(query_result)
File "/python3.9/site-packages/llama_index/indices/vector_store/retrievers/retriever.py", line 129, in <listcomp>
    self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
KeyError: '-1'
mphipps2 added the bug and triage labels Sep 22, 2023

dosubot bot commented Sep 22, 2023

🤖

Thank you for reporting this issue. It seems like the problem arises when the number of chunks is smaller than the length k from similarity_top_k=k, causing a KeyError: '-1'.

To prevent this error, you can add a check to ensure that the length of the query_result.ids is not less than the similarity_top_k value. If it is, you can either raise an error or adjust the similarity_top_k value to match the length of query_result.ids.

Here is how you can modify the _build_node_list_from_query_result method to include this check:

def _build_node_list_from_query_result(
    self, query_result: VectorStoreQueryResult
) -> List[NodeWithScore]:
    if query_result.nodes is None:
        if query_result.ids is None:
            raise ValueError(
                "Vector store query result should return at "
                "least one of nodes or ids."
            )
        assert isinstance(self._index.index_struct, IndexDict)

        # Check if the length of query_result.ids is less than similarity_top_k
        if len(query_result.ids) < self._similarity_top_k:
            raise ValueError(
                "The number of chunks is smaller than the length k from similarity_top_k=k."
            )

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
        ]
        nodes = self._docstore.get_nodes(node_ids)
        query_result.nodes = nodes
    else:
        for i in range(len(query_result.nodes)):
            source_node = query_result.nodes[i].source_node
            if (not self._vector_store.stores_text) or (
                source_node is not None and source_node.node_type != ObjectType.TEXT
            ):
                node_id = query_result.nodes[i].node_id
                if node_id in self._docstore.docs:
                    query_result.nodes[i] = self._docstore.get_node(node_id)

    log_vector_store_query_result(query_result)

    node_with_scores: List[NodeWithScore] = []
    for ind, node in enumerate(query_result.nodes):
        score: Optional[float] = None
        if query_result.similarities is not None:
            score = query_result.similarities[ind]
        node_with_scores.append(NodeWithScore(node=node, score=score))

    return node_with_scores

In this modification, if the length of query_result.ids is less than the similarity_top_k value, a ValueError is raised with a message indicating the issue. This will prevent the KeyError from occurring.

This issue seems to be similar to a couple of previously solved issues in the LlamaIndex repository, namely "[Bug]: KeyError in retriever.retrieve when sent index does not include all nodes in document store" and "Potential bug with 0.6.0 shared storage_context and vector store retrieval". The solutions provided in these issues might also be helpful for you.

Please try this solution and let me know if it resolves your issue. If you have any further questions or if the problem persists, feel free to ask.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

mphipps2 (Author) commented Oct 3, 2023

Sorry, there was a typo in my original post. Inside the _build_node_list_from_query_result() function in retriever.py, the original line was:

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
        ]

This needs to be changed to:

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids if idx in self._index.index_struct.nodes_dict
        ]

The solution provided by the bot doesn't work (I believe it's essentially just the original code). So, to summarize, this is what's needed to fix the problem:

def _build_node_list_from_query_result(
    self, query_result: VectorStoreQueryResult
) -> List[NodeWithScore]:
    if query_result.nodes is None:
        # NOTE: vector store does not keep text and returns node indices.
        # Need to recover all nodes from docstore
        if query_result.ids is None:
            raise ValueError(
                "Vector store query result should return at "
                "least one of nodes or ids."
            )
        assert isinstance(self._index.index_struct, IndexDict)

        node_ids = [
            self._index.index_struct.nodes_dict[idx] for idx in query_result.ids if idx in self._index.index_struct.nodes_dict
        ]

        nodes = self._docstore.get_nodes(node_ids)
        query_result.nodes = nodes
    else:
        # NOTE: vector store keeps text, returns nodes.
        # Only need to recover image or index nodes from docstore
        for i in range(len(query_result.nodes)):
            source_node = query_result.nodes[i].source_node
            if (not self._vector_store.stores_text) or (
                source_node is not None and source_node.node_type != ObjectType.TEXT
            ):
                node_id = query_result.nodes[i].node_id
                if self._docstore.document_exists(node_id):
                    query_result.nodes[
                        i
                    ] = self._docstore.get_node(  # type: ignore[index]
                        node_id
                    )

    log_vector_store_query_result(query_result)

    node_with_scores: List[NodeWithScore] = []
    for ind, node in enumerate(query_result.nodes):
        score: Optional[float] = None
        if query_result.similarities is not None:
            score = query_result.similarities[ind]
        node_with_scores.append(NodeWithScore(node=node, score=score))

    return node_with_scores


dosubot bot commented Jan 30, 2024

Hi, @mphipps2,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you reported a bug in retriever.py that caused a KeyError when the number of chunks was smaller than the length k from similarity_top_k=k. I provided a potential solution, but you clarified that there was a typo in your original post and provided the corrected code snippet to fix the problem. JingggW reacted with a thumbs up to your clarification.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LlamaIndex repository! If you have any further questions or updates, feel free to reach out.

dosubot bot added the stale label (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) Jan 30, 2024
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 6, 2024
dosubot bot removed the stale label Feb 6, 2024
@bruhacsA

I needed to apply the fix manually because it still hasn't been updated in a PR; my retriever.py was also still the old version from last year.

The error only occurs when no chunks are returned at all, i.e. when a search query contains a particular combination of words that somehow doesn't yield any results. I found that if I modified or omitted just one word from my query text, the retriever would find chunks again and the problem wouldn't surface. So it also depends on the query text.

I am wondering if it also depends on how the vector store and index were built. I am using faiss-cpu with the "BAAI/bge-small-en-v1.5" embedding model from Hugging Face.
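
For anyone in the same spot, one quick way to check whether the installed copy already contains the guard before patching by hand — just a sketch; the module path and class name are taken from the traceback above, so adjust them if your version lays things out differently:

import inspect

from llama_index.indices.vector_store.retrievers import retriever

# look for the "if idx in ... nodes_dict" filter inside the installed method
src = inspect.getsource(
    retriever.VectorIndexRetriever._build_node_list_from_query_result
)
if "if idx in self._index.index_struct.nodes_dict" in src:
    print("already patched")
else:
    print("still the old version -- apply the fix manually")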
