
[Question]: Why is indexing so slow? #13313

Closed
1 task done
jiaoqianji opened this issue May 7, 2024 · 6 comments
Labels
question Further information is requested

Comments

@jiaoqianji

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I'm trying to create indexes for a database that is 50 GB in size, and I've spent more than 5 days running the code below without any results. Then I noticed that the GPU utilization remains at 0; what should I do to speed up the indexing process?

(screenshot: GPU utilization at 0%)

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

data_path = "/databases/"
PERSIST_DIR = "/data_cache/fin_databases"
embedding_model_path = "/Qwen1.5-0.5B/"

# Load every file under data_path into memory as Document objects
documents = SimpleDirectoryReader(data_path).load_data()

# Local HuggingFace embedding model, pinned to GPU 7
embed_model = HuggingFaceEmbedding(
    model_name=embedding_model_path,
    device="cuda:7",
)

# Build the index, passing the embed model as the only transformation
index = VectorStoreIndex.from_documents(
    documents, show_progress=True, transformations=[embed_model]
)

index.storage_context.persist(persist_dir=PERSIST_DIR)
jiaoqianji added the question label on May 7, 2024

dosubot bot commented May 7, 2024

To address the issue of 0% GPU utilization during the indexing process with your NVIDIA A40 GPU, follow these targeted steps:

  1. Ensure CUDA Compatibility: Confirm that CUDA is correctly installed and that its version is compatible with your GPU. This is crucial for enabling GPU acceleration in Python.

  2. Framework Version Check: Make sure you're using a version of PyTorch or TensorFlow that supports your CUDA version. Compatibility between these components is essential for GPU utilization.

  3. Optimize with LocalTensorRTLLM: For efficient NVIDIA GPU utilization, particularly with TensorRT for language models, use the LocalTensorRTLLM class from llama_index.legacy.llms.nvidia_tensorrt. This class is specifically designed to leverage NVIDIA GPUs:

    from llama_index.legacy.llms.nvidia_tensorrt import LocalTensorRTLLM
    
    tensorrt_llm = LocalTensorRTLLM(
        model_path="path/to/your/model",
        tokenizer_dir="path/to/your/tokenizer",
        temperature=0.1,
        max_new_tokens=512,
        context_window=2048,
        verbose=True
    )
  4. Verify GPU Device Identifier: Double-check that cuda:7 correctly identifies your NVIDIA A40 GPU. Use nvidia-smi to list all CUDA-capable devices and ensure the identifier matches.

  5. Monitor GPU Utilization: Utilize nvidia-smi to monitor GPU utilization in real-time, confirming that your GPU is being engaged during the indexing operation.

  6. Debugging: If the issue persists, simplify your code to a basic operation that should engage the GPU, helping to isolate the problem.

This approach focuses on ensuring compatibility and proper configuration for GPU utilization, leveraging specific classes designed for NVIDIA GPUs, and monitoring to confirm active GPU engagement.


@logan-markewich
Collaborator

50 GB is a ton of data to load into memory. I wonder if it's getting stuck there? If you hit Ctrl+C, the traceback will tell us where it is in the code.

I would also suggest testing the embedding model on its own first to ensure it uses the GPU; embeddings = embed_model.get_text_embedding("Hello World!") is an easy test.

If you have the memory, you can also increase the batch size, e.g. HuggingFaceEmbedding(..., embed_batch_size=32) (the default is 10).

@logan-markewich
Collaborator

Also, by specifying transformations=[embed_model], you are removing any text splitting (not sure if that's intended or not). See the sketch below for a version that keeps chunking.

@logan-markewich
Collaborator

You also didn't specify a vector DB, so all the embeddings (embeddings for 50 GB of text!) will be held in memory -- probably also not ideal; one concrete setup is sketched below.

@logan-markewich
Collaborator

If I were to rewrite this, it might look like:

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=128)
vector_store = ...  # some remote/hosted/optimized vector store: qdrant, chroma, weaviate, etc.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=[], storage_context=storage_context, embed_model=embed_model)

# iterate and insert in batches; iter_data() yields the documents one file at a time
cur_batch = []
for docs in SimpleDirectoryReader(data_path).iter_data():
    cur_batch.extend(docs)
    if len(cur_batch) >= 100:
        nodes = splitter(cur_batch)
        index.insert_nodes(nodes)
        cur_batch = []

# don't forget the last partial batch
if cur_batch:
    nodes = splitter(cur_batch)
    index.insert_nodes(nodes)

@jiaoqianji
Author

Thanks for your help, really appreciate it. I realized it was stuck and split the data into small batches. Thanks a lot.
