
[Bug]: Found duplicated primary key value Runtime::runtimecall, which violates the uniqueness constraint of the primary key column. #13462

Open
goodrahstar opened this issue May 13, 2024 · 3 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@goodrahstar

Bug Description

When I try to create a KG using Kuzu and run the following code:

index = KnowledgeGraphIndex.from_documents(documents=all_docs,
                                           kg_triple_extract_template=new_kg_triple_extract,
                                           max_triplets_per_chunk=6,
                                           storage_context=storage_context,
                                           service_context=service_context,
                                           show_progress=True,
                                           include_embeddings=True)

I get this error about a duplicate primary key. I understand that in a KG all primary keys need to be unique, but I don't see a place to fix this.
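For context, this is the kind of deduplication I would expect to be able to apply before indexing. This is just a sketch; the `Doc` class and `doc_id` field here are hypothetical stand-ins for the loaded documents (llama-index Documents expose an id via `get_doc_id()`):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document."""
    doc_id: str
    text: str

def dedupe_by_id(docs):
    """Keep only the first document seen for each id, preserving order."""
    seen = set()
    unique = []
    for d in docs:
        if d.doc_id not in seen:
            seen.add(d.doc_id)
            unique.append(d)
    return unique

docs = [Doc("a", "x"), Doc("b", "y"), Doc("a", "z")]
print([d.doc_id for d in dedupe_by_id(docs)])  # → ['a', 'b']
```

Even with the raw documents deduplicated like this, the error persists, which makes me think the duplicate is in the extracted entities rather than the documents themselves.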

Version

llama-index 0.10.36

Steps to Reproduce

Running GH repo reader

branch = "master"
all_docs = []
for repo in list_of_gits:
  owner, repo = extract_owner_and_repo(repo)
  try:
    github_client = GithubClient(github_token=github_token, verbose=True)

    documents = GithubRepositoryReader(
        github_client=github_client,
        owner=owner,
        repo=repo,
        use_parser=False,
        verbose=False,
        filter_file_extensions=(
            [
                ".md", ".rs"
            ],
            GithubRepositoryReader.FilterType.INCLUDE,
        )
    ).load_data(branch=branch)
    all_docs.extend(documents)
  except Exception as exc:
    # Skip repos that fail to load, but record which one failed
    print(f"Failed to load {repo}: {exc}")

Running indexer

storage_context = StorageContext.from_defaults(graph_store=graph_store)

# NOTE: can take a while!
index = KnowledgeGraphIndex.from_documents(documents=all_docs,
                                           kg_triple_extract_template=new_kg_triple_extract,
                                           max_triplets_per_chunk=6,
                                           storage_context=storage_context,
                                           service_context=service_context,
                                           show_progress=True,
                                           include_embeddings=False)

Relevant Logs/Tracebacks

Parsing nodes: 100%|██████████| 3390/3390 [00:32<00:00, 104.49it/s]
Processing nodes:   0%|          | 45/37805 [01:42<23:52:27,  2.28s/it]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[28], line 4
      1 storage_context = StorageContext.from_defaults(graph_store=graph_store)
      3 # NOTE: can take a while!
----> 4 index = KnowledgeGraphIndex.from_documents(documents=all_docs,
      5                                            kg_triple_extract_template=new_kg_triple_extract,
      6                                            max_triplets_per_chunk=6,
      7                                            storage_context=storage_context,
      8                                            service_context=service_context,
      9                                            show_progress=True,
     10                                            include_embeddings=False)

File ~/myenv/lib/python3.12/site-packages/llama_index/core/indices/base.py:145, in BaseIndex.from_documents(cls, documents, storage_context, show_progress, callback_manager, transformations, service_context, **kwargs)
    136     docstore.set_document_hash(doc.get_doc_id(), doc.hash)
    138 nodes = run_transformations(
    139     documents,  # type: ignore
    140     transformations,
    141     show_progress=show_progress,
    142     **kwargs,
    143 )
--> 145 return cls(
    146     nodes=nodes,
    147     storage_context=storage_context,
    148     callback_manager=callback_manager,
...
     82                        parameters)
     83 if not _query_result.isSuccess():
     84     raise RuntimeError(_query_result.getErrorMessage())

RuntimeError: Runtime exception: Found duplicated primary key value Runtime::runtimecall, which violates the uniqueness constraint of the primary key column
@goodrahstar goodrahstar added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 13, 2024

dosubot bot commented May 13, 2024

The error you're encountering is due to a duplicated primary key value Runtime::runtimecall, which violates the uniqueness constraint for the primary key column in your Knowledge Graph (KG). This typically happens when there are attempts to insert a node or document with a primary key that already exists in the index. To resolve this issue, follow these steps:

  1. Ensure Unique Primary Keys Before Insertion: Make sure each primary key is unique before inserting nodes or documents into the index. This can be done by maintaining a set of used primary keys and checking against it before insertion.

  2. Use Document Hashes Wisely: The docstore.set_document_hash(doc.get_doc_id(), doc.hash) method sets a unique hash for each document based on its ID. Ensure that the document ID is unique across all documents to prevent collisions.

  3. Leverage allow_update Parameter: When adding documents to the document store, the allow_update=True parameter allows for existing documents with the same ID to be updated instead of causing a duplication error. Ensure this behavior is desirable for your use case.

  4. Custom Uniqueness Checks: Implement custom logic to check for the existence of a node or document before attempting to insert it into the index. This can be done by querying the existing index or document store.

  5. Handle Updates and Deletions Carefully: Ensure that operations do not inadvertently introduce duplicates. For updates, consider deleting the old node or document before inserting the updated one if the primary key or ID is changing.

Here's a code snippet to help with implementing a uniqueness check:

import logging
from typing import Any

from llama_index.core import Document

logger = logging.getLogger(__name__)

def insert_unique_document(self, document: Document, **insert_kwargs: Any) -> None:
    """Insert a document only if its ID is not already in the docstore."""
    existing_doc_hash = self._docstore.get_document_hash(document.get_doc_id())
    if existing_doc_hash is None:
        # No hash recorded for this ID: the document is new, so insert it
        self.insert(document, **insert_kwargs)
    else:
        # A document with this ID already exists; skip (or update, as needed)
        logger.warning(f"Document with ID {document.get_doc_id()} already exists.")

Integrate this uniqueness check into your workflow where documents or nodes are inserted into the index. Adjust the logic according to your specific requirements, such as updating existing documents instead of skipping them. This approach should help you avoid the duplication error and ensure the uniqueness constraint of the primary key column is not violated during the indexing process.
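The same guard pattern, stripped of the llama-index specifics, can be sketched as a plain function. The names here are illustrative, with a plain dict standing in for the docstore:

```python
import logging

logger = logging.getLogger(__name__)

def insert_if_unique(store: dict, key: str, value: str) -> bool:
    """Insert value under key only if key is not already present.

    Returns True if the insert happened, False if it was skipped.
    Mirrors the docstore hash check, with a dict standing in for the store.
    """
    if key in store:
        logger.warning("Key %s already exists; skipping insert.", key)
        return False
    store[key] = value
    return True

store = {}
insert_if_unique(store, "Runtime::runtimecall", "node-1")  # inserted
insert_if_unique(store, "Runtime::runtimecall", "node-2")  # skipped
print(store)  # → {'Runtime::runtimecall': 'node-1'}
```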

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

@goodrahstar
Author

None of the above solutions solves this problem.


dosubot bot commented May 14, 2024

If none of the previously suggested solutions have resolved the issue with the duplicated primary key value Runtime::runtimecall when creating your Knowledge Graph, it might be necessary to delve deeper into the specifics of how your Knowledge Graph Index is handling the insertion and uniqueness of nodes or triplets. Here are some refined steps to consider:

  1. Review the Implementation of Your Graph Store: Ensure that the GraphStore or its specific implementation, like SimpleGraphStore, is designed to handle uniqueness checks or is capable of doing so with minor adjustments. This involves verifying how upsert_triplet manages the insertion of new triplets and whether it checks for existing ones before proceeding.

  2. Debug and Log Key Operations: Increase the verbosity of logging around the insertion points in your code. This can help identify exactly where and why the duplicate entries are being attempted. Pay special attention to the values being inserted and any patterns that might suggest why duplicates are occurring.

  3. Manual Data Preprocessing: As a temporary workaround, consider preprocessing your data to remove duplicates before it reaches the point of insertion. This could involve scripting a quick deduplication step based on the primary keys or triplets you're working with.

  4. Database and Transaction Management: If your backend supports it, leveraging database constraints or transaction management can provide a safety net against duplicates. This might involve setting up unique constraints on the database level or wrapping your insertion logic in transactions to rollback in case of an error.

  5. Directly Address the Source of Runtime::runtimecall: Since the error specifically mentions Runtime::runtimecall, identifying the source of this value and how it's being generated or inserted multiple times could provide a direct path to resolving the issue. There might be a logical or data flow error leading to its repeated insertion.

Given the complexity of the issue and the potential for it to be rooted in the specific implementation details of your Knowledge Graph Index or the underlying data, these steps are aimed at guiding you towards identifying and resolving the root cause of the duplication error.
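As a concrete sketch of the preprocessing step above (step 3), exact-duplicate triplets can be dropped before insertion. This is a minimal, order-preserving dedup over (subject, relation, object) tuples; the example triplets are illustrative, not taken from your data:

```python
def dedupe_triplets(triplets):
    """Drop exact duplicate (subject, relation, object) triplets, keeping order."""
    seen = set()
    out = []
    for t in triplets:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

triplets = [
    ("Runtime", "calls", "runtimecall"),
    ("Runtime", "calls", "runtimecall"),  # exact duplicate
    ("Runtime", "defines", "Executor"),
]
print(dedupe_triplets(triplets))
# → [('Runtime', 'calls', 'runtimecall'), ('Runtime', 'defines', 'Executor')]
```

Note this only removes exact duplicates; if the extractor emits the same entity with different relations, the node itself may still be upserted more than once, which is where the graph store's own uniqueness handling matters.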

