
[Bug]: Unable to delete pinecone stored vectors with new pinecone serverless index architecture #13451

Open
OmriNach opened this issue May 12, 2024 · 3 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@OmriNach

Bug Description

Pinecone is forcing all free plans to migrate to serverless indexes. According to their release, with serverless you are no longer able to delete vectors by metadata filtering, which is the primary way to delete using llama_index:
```python
def delete(self, ref_doc_id: str, **delete_kwargs: Any) -> None:
    """
    Delete nodes with the given ref_doc_id.

    Args:
        ref_doc_id (str): The doc_id of the document to delete.

    """
    # delete by filtering on the doc_id metadata
    self._pinecone_index.delete(
        filter={"doc_id": {"$eq": ref_doc_id}},
        namespace=self.namespace,
        **delete_kwargs,
    )
```

Currently, the new way to remove IDs belonging to the same document is to use:

```python
# To iterate over all result pages using a generator function
for ids in index.list(prefix='doc1#', namespace='ns1'):
    print(ids)
    index.delete(ids=ids, namespace=namespace)
```

This requires knowing the vector IDs that were assigned when llama_index called the vector_store.add method:

```python
self._pinecone_index.upsert(
    entries,
    namespace=self.namespace,
    batch_size=self.batch_size,
    **self.insert_kwargs,
)
return ids
```
However, when you use a vector_store as part of the ingestion pipeline class, the vector ids are not returned in the run method:

```python
if self.vector_store is not None:
    self.vector_store.add([n for n in nodes if n.embedding is not None])

return nodes
```

How should one delete from Pinecone now? I think it would be quite useful to store the Pinecone vector IDs, both for tracking purposes and for deletion, now that having them is essential.
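Until the pipeline exposes the IDs, one workaround is to record the IDs returned by `vector_store.add` yourself, keyed by `ref_doc_id`, so they can later be passed to `index.delete(ids=...)`. This is only a sketch: the `IdRegistry` class and its method names are hypothetical, not part of llama_index or the Pinecone client.

```python
from collections import defaultdict
from typing import Dict, List


class IdRegistry:
    """Hypothetical helper: remembers which vector IDs belong to which document."""

    def __init__(self) -> None:
        self._ids: Dict[str, List[str]] = defaultdict(list)

    def record(self, ref_doc_id: str, vector_ids: List[str]) -> None:
        # Call this right after vector_store.add(...) returns its IDs.
        self._ids[ref_doc_id].extend(vector_ids)

    def ids_for(self, ref_doc_id: str) -> List[str]:
        # These are the IDs to pass to index.delete(ids=..., namespace=...).
        return list(self._ids.get(ref_doc_id, []))


# Usage sketch: ids = vector_store.add(nodes); registry.record("doc1", ids)
registry = IdRegistry()
registry.record("doc1", ["doc1#0", "doc1#1"])
print(registry.ids_for("doc1"))  # ['doc1#0', 'doc1#1']
```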

Version

0.10.30

Steps to Reproduce

Index a document to pinecone and try to remove it

Relevant Logs/Tracebacks

No response

@OmriNach OmriNach added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 12, 2024

dosubot bot commented May 12, 2024

To adapt to Pinecone's new serverless index architecture and handle deletions without direct metadata filtering, you can follow a two-step approach:

  1. Retrieve Vector IDs with a Query: First, use a query to find the vectors you want to delete by applying metadata filters. This step involves querying the index to get the IDs of the vectors that match your deletion criteria.

  2. Delete Vectors Using IDs: With the vector IDs obtained from the query, proceed to delete the vectors by these IDs.

Here's a code snippet to guide you through this process:

```python
def delete_vectors_by_metadata(self, metadata_filters):
    # Query to get vector IDs based on metadata filters
    filter = _to_pinecone_filter(metadata_filters)  # Convert metadata filters to Pinecone's format
    query_result = self._pinecone_index.query(
        vector=[0.0] * self._pinecone_index.describe_index_stats()["dimension"],
        filter=filter,
        top_k=10000,
        include_metadata=False,
        namespace=self.namespace,
    )
    vector_ids = [match.id for match in query_result.matches]

    # Delete vectors by IDs
    for vector_id in vector_ids:
        self._pinecone_index.delete(ids=[vector_id], namespace=self.namespace)

This approach requires you to first identify the vectors to be deleted through a query and then delete them using their IDs. It's a shift from the previous method where metadata filtering directly facilitated deletions.
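Note that the loop above issues one API call per vector. Pinecone's `delete` accepts a list of IDs, so grouping the IDs into batches reduces the number of calls. A minimal sketch, assuming a `chunked` helper (not part of any library):

```python
from typing import Iterator, List


def chunked(ids: List[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield successive batches of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Each batch becomes one call: index.delete(ids=batch, namespace=ns)
batches = list(chunked([f"vec-{i}" for i in range(2500)], size=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```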

Regarding the vector_store.add method, it does return the IDs of the vectors added to the Pinecone index, which you can store and use later for deletion purposes. Here's the relevant part of the add method that handles this:

```python
def add(
    self,
    nodes: List[BaseNode],
    **add_kwargs: Any,
) -> List[str]:
    ids = []
    entries = []
    for node in nodes:
        node_id = node.node_id
        # Prepare the entry with node ID, vector, and metadata
        # (node_to_metadata_dict comes from llama_index.core.vector_stores.utils)
        entry = {
            "id": node_id,
            "values": node.get_embedding(),
            "metadata": node_to_metadata_dict(node),
        }
        entries.append(entry)
        ids.append(node_id)
    self._pinecone_index.upsert(
        entries,
        namespace=self.namespace,
        batch_size=self.batch_size,
        **self.insert_kwargs,
    )
    return ids
```

By following this updated approach, you can effectively manage deletions in Pinecone's serverless indexes, ensuring you can still remove vectors as needed despite the architectural changes.

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

@OmriNach
Author

@dosu although the add method returns the vector IDs, when it is called from the ingestion pipeline, the pipeline's run method does not return those IDs.

Also, the approach you provided seems rather convoluted for something as simple as deleting vectors. I think a better solution is to modify the PineconeVectorStore class: just as its init method detects the Pinecone version (serverless or not) in order to initialize the Pinecone client, the delete method could do the same.


dosubot bot commented May 12, 2024

To address the challenge of deleting vectors in Pinecone's serverless architecture without direct metadata filtering, especially when vector IDs are not returned by the ingestion pipeline's run method, consider implementing a more streamlined approach within your PineconeVectorStore class. This involves enhancing the ingestion and deletion processes to ensure vector IDs are tracked and utilized efficiently for deletion, even when they're not directly returned by certain methods.

  1. Enhance the Ingestion Process: Modify the ingestion pipeline to ensure vector IDs are captured and stored at the time of vector addition. This can be achieved by adjusting the add method within your vector store class to return vector IDs, and then modifying the pipeline's run method to capture these IDs. Store these IDs in a manner that allows them to be easily retrieved for deletion purposes, such as a database or a structured file.

  2. Streamline the Deletion Process: Implement a deletion method in your PineconeVectorStore class that can handle deletion by vector IDs efficiently. This method should take advantage of Pinecone's deletion capabilities by directly using vector IDs, thus bypassing the need for metadata filtering. For batch deletions, consider implementing batching logic to group deletion requests, reducing the number of API calls and potentially improving performance.

  3. Adapt to Pinecone's Serverless Architecture: Given Pinecone's shift to a serverless model, it's crucial to adapt your deletion strategy to align with this architecture. This may involve revisiting how your application interacts with Pinecone's API, especially regarding rate limits and operational constraints associated with serverless indexes.
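One way to avoid ID bookkeeping entirely is to assign deterministic, prefix-structured vector IDs at ingestion time, so that `index.list(prefix=...)` can recover every chunk of a document later. A minimal sketch; the `make_vector_id` helper and the `#` separator are assumptions, following the convention in Pinecone's own `doc1#` example above:

```python
def make_vector_id(ref_doc_id: str, chunk_index: int) -> str:
    """Build an ID like 'doc1#00042' so all chunks share the 'doc1#' prefix."""
    return f"{ref_doc_id}#{chunk_index:05d}"


# At delete time, every ID for the document is then reachable by prefix:
#   for ids in index.list(prefix="doc1#", namespace=ns):
#       index.delete(ids=ids, namespace=ns)
ids = [make_vector_id("doc1", i) for i in range(3)]
print(ids)  # ['doc1#00000', 'doc1#00001', 'doc1#00002']
```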

By focusing on these key areas, you can create a more efficient and effective solution for managing vector deletions in Pinecone's serverless architecture. This approach not only addresses the immediate challenge of deleting vectors without metadata filtering but also enhances the overall robustness and flexibility of your vector management strategy.

