
[Question]: Suddenly slow indexing #10379

Closed · 1 task done
a3lita opened this issue Jan 31, 2024 · 5 comments
Labels: question (Further information is requested)

Comments

a3lita commented Jan 31, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Hi, I'm working on a RAG application to parse complex PDFs and hosting it on Streamlit. Up until recently, indexing took maybe 10-15 seconds. Now, it's taking minutes (and there are a lot more progress bars) and the derived responses seem to be lower quality. I did not change my indexing function at all, so I'm at a loss for what caused this behavior. Below is my indexing function.

# imports assume the pre-0.10 llama-index package layout in use when this issue was filed
import qdrant_client
from llama_index import VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.extractors import TitleExtractor
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import SentenceSplitter
from llama_index.vector_stores.qdrant import QdrantVectorStore

def get_index(documents):
    client = qdrant_client.QdrantClient(location=":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name="test_store")

    # create the ingestion pipeline with transformations
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=200, chunk_overlap=10), # adjust chunk size and overlap
            TitleExtractor(), # metadata extraction (extracts title)
            OpenAIEmbedding(model='text-embedding-ada-002'), # embeddings are calculated as part of the pipeline
        ],
        vector_store=vector_store, # set vector store to qdrant store
    )

    # run the pipeline; the transformed nodes are also written into the vector store
    nodes = pipeline.run(documents=documents)

    # create the index on top of the already-populated vector store
    index = VectorStoreIndex.from_vector_store(vector_store)

    return index
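
For context, the returned index is presumably then queried by the app, roughly along these lines (a hypothetical sketch; the actual query code isn't part of this issue):

# hypothetical usage sketch; `documents` comes from the app's PDF loader (not shown)
index = get_index(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What does the report say about ...?")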

Any suggestions would be much appreciated, thanks!

a3lita added the question (Further information is requested) label on Jan 31, 2024
dosubot bot commented Jan 31, 2024

🤖

Hey @a3lita, great to see you back! Hope you've been doing well.

The slowdown in the indexing process and the decrease in the quality of derived responses could be due to several reasons. Here are a few possibilities based on the provided context:

  1. Increased Data Size: If the size of your documents or the number of questions has increased significantly, it could lead to a slower indexing process. The functions generate_synthetic_queries_over_documents and generate_ce_fine_tuning_dataset both iterate over the entire list of documents and questions respectively, which could be time-consuming for large datasets.

  2. Model Performance: The LLM model used (OpenAI model "gpt-3.5-turbo-16k") might be underperforming due to various reasons such as network latency or server-side issues. This could affect both the speed and quality of responses.

  3. Retrieval Top K: The top_k parameter in the generate_ce_fine_tuning_dataset function determines the number of top similar nodes to retrieve for each question. If this value is high, it could slow down the process as the model has to evaluate more nodes.

  4. Tokenizer Performance: The get_tokenizer() function is used in both generate_synthetic_queries_over_documents and generate_ce_fine_tuning_dataset functions. If the tokenizer is slow or inefficient, it could slow down the entire process.

  5. Inefficient Code: The code might have some inefficiencies. For example, in the generate_ce_fine_tuning_dataset function, the llm.complete(msg_prompt) is called for each retrieved node for each question. This could be quite slow if there are many questions and nodes.

Without more specific information about the system's configuration, the data being processed, and the exact nature of the slowdown, it's difficult to provide a more precise answer.

You can find more details in the llama_index/finetuning/cross_encoders/dataset_gen.py file.
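
To narrow this down without more information, one option (not from the original thread) is to time the ingestion pipeline with the transformations added one at a time, so the slow stage stands out. A minimal sketch, reusing the same transformations and import paths as the pipeline above:

import time

from llama_index.embeddings import OpenAIEmbedding
from llama_index.extractors import TitleExtractor
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import SentenceSplitter

def time_pipeline_stages(documents):
    # rebuild the pipeline with the transformations added one at a time and time
    # each run; the stage that dominates the total is the one to investigate
    splitter = SentenceSplitter(chunk_size=200, chunk_overlap=10)
    stages = [
        ("splitter only", [splitter]),
        ("splitter + title extractor", [splitter, TitleExtractor()]),
        ("splitter + title extractor + embeddings",
         [splitter, TitleExtractor(), OpenAIEmbedding(model='text-embedding-ada-002')]),
    ]
    for name, transformations in stages:
        pipeline = IngestionPipeline(transformations=transformations)
        start = time.perf_counter()
        pipeline.run(documents=documents)
        print(f"{name}: {time.perf_counter() - start:.1f}s")

# Usage (with the same documents passed to get_index):
# time_pipeline_stages(documents)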

As for the average size of the documents being processed by your application, you can calculate it by taking the sum of the lengths of all the documents and dividing it by the number of documents. Here is a Python code snippet that calculates the average size of the documents:

from typing import List

from llama_index import Document

def calculate_average_document_size(documents: List[Document]) -> float:
    total_size = sum(len(doc.text) for doc in documents)
    average_size = total_size / len(documents)
    return average_size

# Usage:
average_size = calculate_average_document_size(documents)

Please note that this code snippet assumes that the size of a document is determined by the length of its text. If the size should be calculated differently, please provide the appropriate method.

You can find more details in the tests/indices/query/conftest.py file.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

-Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

logan-markewich (Collaborator) commented:

Your code seems fine to me.

As the bot was kind of suggesting, did your data change? If you want, you can also set pipeline.run(..., num_workers=4) or similar to enable multiprocessing.
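
For reference, a minimal sketch of that change inside get_index (num_workers=4 is just an example value, not something from the original code):

    # inside get_index: enable multiprocessing for the ingestion pipeline
    # (num_workers=4 is an arbitrary example; tune it to the local machine)
    nodes = pipeline.run(documents=documents, num_workers=4)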

a3lita (Author) commented Jan 31, 2024

@logan-markewich Nope, the data didn't change at all. Interestingly, the indexing actually still seems to work fine when the app is hosted on Streamlit community cloud, but when I run the Streamlit app locally, the slowdown happens (it used to work fine both ways). I pulled the same code from my repo as is being used for the hosted app to make sure there weren't any discrepancies, so I'm pretty confused about what's suddenly causing this.

The multiprocessing seemed to help some, though -- thanks!

a3lita (Author) commented Feb 1, 2024

@dosu-bot thank you for the suggestions. I am actually privately hosting this app on Streamlit community cloud, and indexing still takes a reasonable amount of time there (~45 seconds), whereas it suddenly takes a couple of minutes or more locally (and shows a lot more progress bars than it used to). The code hasn't changed at all (I also tested this with an earlier version of my code that ran fine locally before), so I'm confused as to what could've caused this. I've tried multiprocessing in my ingestion pipeline to speed things up, but it still doesn't index nearly as fast as it used to. The data I'm using has not changed at all, either. I'm assuming the issue is with my machine or perhaps dependencies, but any ideas would be much appreciated, thanks!

dosubot bot commented Feb 1, 2024

🤖

Given the information you've provided, it seems like the slowdown might be related to your local machine's resources or the environment in which you're running the application. Here are a few things you could check:

  1. System Requirements: Ensure your local machine meets the system requirements for running LlamaIndex efficiently. This includes having the correct versions of Python and other dependencies installed. You can find the full list of requirements in the pyproject.toml file in the LlamaIndex repository.

  2. Concurrency Limitations: The run_jobs function in LlamaIndex uses a semaphore to limit the number of concurrent jobs to the number of workers specified. If the number of workers is set too low, this could slow down the indexing process as tasks are queued and have to wait for a free worker. You might want to experiment with increasing the number of workers.

  3. Batch Size: In the batch_gather function, tasks are processed in chunks or batches. If the batch size is set too high, it could lead to memory issues and slow down the process. Conversely, if it's set too low, it could lead to inefficiencies as the overhead of context switching between tasks could outweigh the benefits of concurrency. You might want to experiment with adjusting the batch size.

  4. Environment Support: The run_async_tasks function tries to use the tqdm.asyncio module to show progress. However, if an exception occurs (perhaps because tqdm.asyncio is not supported in the environment), it falls back to using the standard asyncio module. This could potentially slow down the process if the exception handling is triggered frequently. You might want to check if tqdm.asyncio is supported in your environment.

  5. I/O Bound Tasks: If the tasks being run are I/O bound (such as reading/writing from disk or network operations), they could be slowing down the indexing process. This is because these tasks are dependent on the speed of the I/O operations and not the CPU. You might want to check if there are any I/O operations that could be optimized.

  6. CPU Bound Tasks: If the tasks being run are CPU bound (such as heavy computations), they could be slowing down the indexing process. This is because these tasks are dependent on the speed of the CPU and not the I/O operations. You might want to check if there are any CPU-intensive tasks that could be optimized.

You can find more information about these potential issues in the async_utils.py file in the LlamaIndex repository.

If none of these suggestions help, it would be helpful to have more information about your local machine's specifications and the environment in which you're running the application. This could help pinpoint the cause of the slowdown.
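
One concrete way to gather that environment information is to print the Python and package versions in both environments and diff them. A minimal sketch; the package list below is an assumption based on the imports in this thread, not something from the issue:

import sys
from importlib.metadata import PackageNotFoundError, version

# packages guessed from the code in this thread; adjust to match your requirements file
PACKAGES = ["llama-index", "qdrant-client", "openai", "streamlit", "tqdm"]

print("python", sys.version)
for name in PACKAGES:
    try:
        print(name, version(name))
    except PackageNotFoundError:
        print(name, "not installed")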

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot added the stale label (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) on May 2, 2024
dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on May 9, 2024
dosubot removed the stale label on May 9, 2024