[Question]: Similarity_Top_K returns fewer nodes than requested #13457
Comments
The inconsistency in the number of returned nodes from Pinecone index B compared to index A could be due to several factors specific to the configuration and behavior of your vector databases. Here are some potential reasons and areas to investigate:
To address these issues, review the configuration and behavior of your indexes with a focus on the areas mentioned above. Ensure that both indexes are configured similarly in terms of vector handling, filtering, querying parameters, and data insertion practices. For more detailed investigation, consider the specific implementation of the [...]. If the problem persists, examining the Pinecone documentation or reaching out to their support with specifics about your index configurations and the observed behavior might provide additional insights.
@dosu There is no filtering. The parameters are identical between the indexes. Both indexes use 3072 dimensions. The indexes are both capable of returning the number of documents I request, since index B if asked for 100 docs can return 70, so it must be capable of returning 17 if only asked for 17.
Given the details you've shared, the difference in chunking between index A and index B (1024 vs. 2048) is a significant factor to consider, especially since both indexes are otherwise identical in configuration and parameters. This chunking difference can affect the internal representation and handling of vectors within Pinecone, potentially leading to the observed discrepancies in the number of returned nodes.

Chunking affects how data is stored, indexed, and retrieved. Larger chunks might lead to a situation where fewer, more densely packed vectors are considered closely similar to the query vector, especially in a high-dimensional space like 3072 dimensions. This could result in fewer nodes being returned as the system may deem fewer vectors as being within the top K most similar, based on the internal similarity calculations and thresholds.

To mitigate this issue, consider the following steps:
The goal of these steps is to identify and adjust factors that might be influencing the retrieval process due to the difference in chunking strategies. By closely monitoring how these changes affect the number of nodes returned, you can better understand the underlying cause and find a more consistent and predictable retrieval performance across both indexes.
@Jez77 there is a step that removes duplicates in the base retriever class -- perhaps you have duplicate data in your pinecone index?
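To make the duplicate hypothesis concrete, here is a minimal, self-contained sketch of how a deduplication pass can shrink a result set below the requested top K. It does not call LlamaIndex or Pinecone. The helper names (`content_hash`, `dedupe_by_content`, `duplicate_report`) and the sample data are hypothetical, and it assumes duplicates are identified by a hash of the chunk text; the exact key the retriever uses is not stated in this thread.

```python
import hashlib
from collections import Counter


def content_hash(text: str) -> str:
    """Hash the chunk text (assumed duplicate key; the library's real key may differ)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_by_content(matches):
    """Keep the first occurrence of each distinct text.

    `matches` is a list of (vector_id, text) tuples, standing in for
    whatever the vector store returned for a top-K query.
    """
    seen = set()
    unique = []
    for vector_id, text in matches:
        key = content_hash(text)
        if key not in seen:
            seen.add(key)
            unique.append((vector_id, text))
    return unique


def duplicate_report(matches):
    """Map each content hash that occurs more than once to its count."""
    counts = Counter(content_hash(text) for _, text in matches)
    return {h: c for h, c in counts.items() if c > 1}


# Hypothetical data: 5 matches returned by the store, 2 of them repeats.
matches = [
    ("a1", "alpha"),
    ("a2", "beta"),
    ("a3", "alpha"),
    ("a4", "gamma"),
    ("a5", "beta"),
]
unique = dedupe_by_content(matches)
print(f"store returned {len(matches)}, after dedupe {len(unique)}")
```

If the same check is run against the raw matches from index B and the count after deduplication equals the number of nodes the retriever returns, duplicate vectors in the index are the likely cause.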
Could you manually set the similarity thresholds to see if it does anything? I've never used pinecone but wondering if they have some kind of dynamic threshold |
Question Validation
Question
I get inconsistent performance when retrieving documents. I am not sure if this is down to something in llamaindex or something perhaps in Pinecone, my vector db store. I have two indexes: A and B. Both contain substantially the same content except that B is chunked differently to A.
When I run a similarity top K = 17 on index A, I get 17 results, as expected. When I run the same code but set the Pinecone index to B, I get some random number of nodes, always fewer than 17. If I ask for 7 nodes I may get back 2 or 4. If I ask for 100 I get back 70 or 75; it varies. I can't fathom why I don't get back the number of nodes I request (17 or 100, or whatever), as I do with my other index. Can anyone shed any light on why this might be the case please?
The only setting I'm changing is the name of the Pinecone index in this code:
pinecone_index = pc.Index(name=os.environ['PINECONE_NAME'])
This is all I'm using for retrieval:
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=17,
)
I'm looking at the number of returned nodes using
len(response)
and looking through the actual nodes via
for node_with_score in response:
    node = node_with_score.node
    # etc.