[Question]: Suddenly slow indexing #10379
Comments
🤖 Hey @a3lita, great to see you back! Hope you've been doing well. The slowdown in the indexing process and the decrease in the quality of derived responses could be due to several reasons.
Without more specific information about the system's configuration, the data being processed, and the exact nature of the slowdown, it's difficult to provide a more precise answer. You can find more details in the llama_index/finetuning/cross_encoders/dataset_gen.py file.

As for the average size of the documents being processed by your application, you can calculate it by summing the lengths of all the documents and dividing by the number of documents. Here is a Python snippet that calculates the average document size:

```python
from typing import List

from llama_index import Document  # on llama_index >= 0.10, import from llama_index.core instead


def calculate_average_document_size(documents: List[Document]) -> float:
    """Return the mean text length, in characters, across all documents."""
    total_size = sum(len(doc.text) for doc in documents)
    return total_size / len(documents)


# Usage:
average_size = calculate_average_document_size(documents)
```

Please note that this snippet assumes that the size of a document is determined by the length of its text. If the size should be calculated differently, please provide the appropriate method. You can find more details in the tests/indices/query/conftest.py file.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

-Dosu
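For completeness, `documents` in the snippet above is whatever your app loads before indexing. With the default reader that typically looks like the sketch below; the `data/` directory name and the pre-0.10 import path are assumptions, not details from this thread:

```python
from llama_index import SimpleDirectoryReader  # llama_index.core on 0.10+

# Load every file under ./data into Document objects
# (the directory name here is an assumption).
documents = SimpleDirectoryReader("data").load_data()
```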
Your code seems fine to me. As the bot was kind of suggesting, did your data change? If you want, you can also set …
@logan-markewich Nope, the data didn't change at all. Interestingly, the indexing actually still seems to work fine when the app is hosted on Streamlit community cloud, but when I run the Streamlit app locally, the slowdown happens (it used to work fine both ways). I pulled the same code from my repo as is being used for the hosted app to make sure there weren't any discrepancies, so I'm pretty confused on what's suddenly causing this. The multiprocessing seemed to help some, though -- thanks!
@dosu-bot thank you for the suggestions. I am actually privately hosting this app on Streamlit community cloud, and indexing still takes a reasonable amount of time there (~45 seconds) whereas it suddenly takes a good couple of minutes+ locally (and shows a lot more progress bars than it used to). The code hasn't changed at all (I also tested this with an earlier version of my code that ran fine locally before), so I'm confused as to what could've caused this. I've tried multiprocessing during my ingestion pipeline to speed things up, but it still doesn't index nearly as fast as it used to. The data I'm using has not changed at all, either. I'm assuming the issue is with my machine or perhaps dependencies, but any ideas would be much appreciated, thanks!
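For reference, the parallelized ingestion I tried looks roughly like the sketch below. The specific transformations, chunk sizes, and worker count are illustrative assumptions rather than my exact code, and the import paths are for llama_index 0.9.x (on 0.10+ they move under `llama_index.core`):

```python
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        # Split documents into chunks, then embed each chunk.
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        OpenAIEmbedding(),
    ]
)

# num_workers > 1 spreads the transformations across processes,
# which can help on multi-core machines.
nodes = pipeline.run(documents=documents, num_workers=4)
```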
🤖 Given the information you've provided, it seems like the slowdown might be related to your local machine's resources or the environment in which you're running the application. Things worth checking include available CPU and memory, other processes competing for those resources, and whether your local dependency versions match the ones used by the hosted app.

You can find more information about these potential issues in the async_utils.py file in the LlamaIndex repository. If none of these suggestions help, it would be useful to have more information about your local machine's specifications and the environment in which you're running the application; that could help pinpoint the cause of the slowdown.
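One quick way to check the dependency angle is to compare the package versions installed locally against those on Streamlit community cloud. A minimal sketch using only the standard library; the package list below is an assumption based on the stack described in this thread:

```python
from importlib.metadata import PackageNotFoundError, version

# Print versions of the packages most likely to affect indexing behavior,
# so the local and hosted environments can be diffed.
for pkg in ("llama-index", "openai", "streamlit"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

If the lists differ, pinning your local versions to match the hosted app's requirements file is a quick way to test the dependency theory.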
Question Validation
Question
Hi, I'm working on a RAG application to parse complex PDFs and hosting it on Streamlit. Up until recently, indexing took maybe 10-15 seconds. Now, it's taking minutes (and there are a lot more progress bars) and the derived responses seem to be lower quality. I did not change my indexing function at all, so I'm at a loss for what caused this behavior. Below is my indexing function.
Any suggestions would be much appreciated, thanks!