
[Bug]: Found duplicated primary key value Runtime::runtimecall, which violates the uniqueness constraint of the primary key column. #13462

Open
goodrahstar opened this issue May 13, 2024 · 3 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@goodrahstar

Bug Description

When I try to create a KG using Kuzu and run the following code:

index = KnowledgeGraphIndex.from_documents(documents=all_docs,
                                           kg_triple_extract_template=new_kg_triple_extract,
                                           max_triplets_per_chunk=6,
                                           storage_context=storage_context,
                                           service_context=service_context,
                                           show_progress=True,
                                           include_embeddings=True)

I get this error about a duplicate primary key. I understand that in a KG all primary keys need to be unique, but I don't see a place to fix this.
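For context, this is the kind of deduplication I would expect to be able to apply before indexing. This is just a sketch; the `Doc` class and `doc_id` field here are hypothetical stand-ins for the loaded documents (llama-index Documents expose an id via `get_doc_id()`):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document."""
    doc_id: str
    text: str

def dedupe_by_id(docs):
    """Keep only the first document seen for each id, preserving order."""
    seen = set()
    unique = []
    for d in docs:
        if d.doc_id not in seen:
            seen.add(d.doc_id)
            unique.append(d)
    return unique

docs = [Doc("a", "x"), Doc("b", "y"), Doc("a", "z")]
print([d.doc_id for d in dedupe_by_id(docs)])  # → ['a', 'b']
```

Even with the raw documents deduplicated like this, the error persists, which makes me think the duplicate is in the extracted entities rather than the documents themselves.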

Version

llama-index 0.10.36

Steps to Reproduce

Running GH repo reader

branch = "master"
all_docs = []
for repo in list_of_gits:
  owner, repo = extract_owner_and_repo(repo)
  try:
    github_client = GithubClient(github_token=github_token, verbose=True)

    documents = GithubRepositoryReader(
        github_client=github_client,
        owner=owner,
        repo=repo,
        use_parser=False,
        verbose=False,
        filter_file_extensions=(
            [
                ".md", ".rs"
            ],
            GithubRepositoryReader.FilterType.INCLUDE,
        )
    ).load_data(branch=branch)
    all_docs.extend(documents)
  except Exception as exc:
    # Skip repos that fail to load, but record which one failed
    print(f"Failed to load {repo}: {exc}")

Running indexer

storage_context = StorageContext.from_defaults(graph_store=graph_store)

# NOTE: can take a while!
index = KnowledgeGraphIndex.from_documents(documents=all_docs,
                                           kg_triple_extract_template=new_kg_triple_extract,
                                           max_triplets_per_chunk=6,
                                           storage_context=storage_context,
                                           service_context=service_context,
                                           show_progress=True,
                                           include_embeddings=False)

Relevant Logs/Tracebacks

Parsing nodes: 100%|██████████| 3390/3390 [00:32<00:00, 104.49it/s]
Processing nodes:   0%|          | 45/37805 [01:42<23:52:27,  2.28s/it]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[28], line 4
      1 storage_context = StorageContext.from_defaults(graph_store=graph_store)
      3 # NOTE: can take a while!
----> 4 index = KnowledgeGraphIndex.from_documents(documents=all_docs,
      5                                            kg_triple_extract_template=new_kg_triple_extract,
      6                                            max_triplets_per_chunk=6,
      7                                            storage_context=storage_context,
      8                                            service_context=service_context,
      9                                            show_progress=True,
     10                                            include_embeddings=False)

File ~/myenv/lib/python3.12/site-packages/llama_index/core/indices/base.py:145, in BaseIndex.from_documents(cls, documents, storage_context, show_progress, callback_manager, transformations, service_context, **kwargs)
    136     docstore.set_document_hash(doc.get_doc_id(), doc.hash)
    138 nodes = run_transformations(
    139     documents,  # type: ignore
    140     transformations,
    141     show_progress=show_progress,
    142     **kwargs,
    143 )
--> 145 return cls(
    146     nodes=nodes,
    147     storage_context=storage_context,
    148     callback_manager=callback_manager,
...
     82                        parameters)
     83 if not _query_result.isSuccess():
     84     raise RuntimeError(_query_result.getErrorMessage())

RuntimeError: Runtime exception: Found duplicated primary key value Runtime::runtimecall, which violates the uniqueness constraint of the primary key column
@goodrahstar goodrahstar added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 13, 2024

dosubot bot commented May 13, 2024

The error you're encountering is due to a duplicated primary key value Runtime::runtimecall, which violates the uniqueness constraint for the primary key column in your Knowledge Graph (KG). This typically happens when there are attempts to insert a node or document with a primary key that already exists in the index. To resolve this issue, follow these steps:

  1. Ensure Unique Primary Keys Before Insertion: Make sure each primary key is unique before inserting nodes or documents into the index. This can be done by maintaining a set of used primary keys and checking against it before insertion.

  2. Use Document Hashes Wisely: The docstore.set_document_hash(doc.get_doc_id(), doc.hash) method sets a unique hash for each document based on its ID. Ensure that the document ID is unique across all documents to prevent collisions.

  3. Leverage allow_update Parameter: When adding documents to the document store, the allow_update=True parameter allows for existing documents with the same ID to be updated instead of causing a duplication error. Ensure this behavior is desirable for your use case.

  4. Custom Uniqueness Checks: Implement custom logic to check for the existence of a node or document before attempting to insert it into the index. This can be done by querying the existing index or document store.

  5. Handle Updates and Deletions Carefully: Ensure that operations do not inadvertently introduce duplicates. For updates, consider deleting the old node or document before inserting the updated one if the primary key or ID is changing.

Here's a code snippet to help with implementing a uniqueness check:

import logging
from typing import Any

from llama_index.core import Document

logger = logging.getLogger(__name__)

def insert_unique_document(self, document: Document, **insert_kwargs: Any) -> None:
    """Insert a document only if its ID is not already in the docstore."""
    existing_doc_hash = self._docstore.get_document_hash(document.get_doc_id())
    if existing_doc_hash is None:
        # No hash recorded for this ID: the document is new, so insert it
        self.insert(document, **insert_kwargs)
    else:
        # A document with this ID already exists; skip (or update, as needed)
        logger.warning(f"Document with ID {document.get_doc_id()} already exists.")

Integrate this uniqueness check into your workflow where documents or nodes are inserted into the index. Adjust the logic according to your specific requirements, such as updating existing documents instead of skipping them. This approach should help you avoid the duplication error and ensure the uniqueness constraint of the primary key column is not violated during the indexing process.
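The same guard pattern, stripped of the llama-index specifics, can be sketched as a plain function. The names here are illustrative, with a plain dict standing in for the docstore:

```python
import logging

logger = logging.getLogger(__name__)

def insert_if_unique(store: dict, key: str, value: str) -> bool:
    """Insert value under key only if key is not already present.

    Returns True if the insert happened, False if it was skipped.
    Mirrors the docstore hash check, with a dict standing in for the store.
    """
    if key in store:
        logger.warning("Key %s already exists; skipping insert.", key)
        return False
    store[key] = value
    return True

store = {}
insert_if_unique(store, "Runtime::runtimecall", "node-1")  # inserted
insert_if_unique(store, "Runtime::runtimecall", "node-2")  # skipped
print(store)  # → {'Runtime::runtimecall': 'node-1'}
```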

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

@goodrahstar
Author

None of the above solutions solves this problem.


dosubot bot commented May 14, 2024

If none of the previously suggested solutions have resolved the issue with the duplicated primary key value Runtime::runtimecall when creating your Knowledge Graph, it might be necessary to delve deeper into the specifics of how your Knowledge Graph Index is handling the insertion and uniqueness of nodes or triplets. Here are some refined steps to consider:

  1. Review the Implementation of Your Graph Store: Ensure that the GraphStore or its specific implementation, like SimpleGraphStore, is designed to handle uniqueness checks or is capable of doing so with minor adjustments. This involves verifying how upsert_triplet manages the insertion of new triplets and whether it checks for existing ones before proceeding.

  2. Debug and Log Key Operations: Increase the verbosity of logging around the insertion points in your code. This can help identify exactly where and why the duplicate entries are being attempted. Pay special attention to the values being inserted and any patterns that might suggest why duplicates are occurring.

  3. Manual Data Preprocessing: As a temporary workaround, consider preprocessing your data to remove duplicates before it reaches the point of insertion. This could involve scripting a quick deduplication step based on the primary keys or triplets you're working with.

  4. Database and Transaction Management: If your backend supports it, leveraging database constraints or transaction management can provide a safety net against duplicates. This might involve setting up unique constraints on the database level or wrapping your insertion logic in transactions to rollback in case of an error.

  5. Directly Address the Source of Runtime::runtimecall: Since the error specifically mentions Runtime::runtimecall, identifying the source of this value and how it's being generated or inserted multiple times could provide a direct path to resolving the issue. There might be a logical or data flow error leading to its repeated insertion.

Given the complexity of the issue and the potential for it to be rooted in the specific implementation details of your Knowledge Graph Index or the underlying data, these steps are aimed at guiding you towards identifying and resolving the root cause of the duplication error.
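As a concrete sketch of the preprocessing step above (step 3), exact-duplicate triplets can be dropped before insertion. This is a minimal, order-preserving dedup over (subject, relation, object) tuples; the example triplets are illustrative, not taken from your data:

```python
def dedupe_triplets(triplets):
    """Drop exact duplicate (subject, relation, object) triplets, keeping order."""
    seen = set()
    out = []
    for t in triplets:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

triplets = [
    ("Runtime", "calls", "runtimecall"),
    ("Runtime", "calls", "runtimecall"),  # exact duplicate
    ("Runtime", "defines", "Executor"),
]
print(dedupe_triplets(triplets))
# → [('Runtime', 'calls', 'runtimecall'), ('Runtime', 'defines', 'Executor')]
```

Note this only removes exact duplicates; if the extractor emits the same entity with different relations, the node itself may still be upserted more than once, which is where the graph store's own uniqueness handling matters.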

