anything-llm: Can't process large pdf files (72 pages)
When adding a 72-page PDF (72 pages of tables within a single PDF), the console shows this error:
2023-08-02 16:27:44 [2023-08-02 20:27:44 +0000] [13] [CRITICAL] WORKER TIMEOUT (pid:22)
2023-08-02 16:27:44 [2023-08-02 20:27:44 +0000] [22] [INFO] Worker exiting (pid: 22)
2023-08-02 16:27:44 Processing portfolio.pdf
2023-08-02 16:27:44 fetch failed
2023-08-02 16:27:44 Python processing API was not able to process document portfolio.pdf. Reason: fetch failed
2023-08-02 16:27:45 [2023-08-02 20:27:45 +0000] [91] [INFO] Booting worker with pid: 91
The UI does not display any error message.
When taking only the first 5 pages of the PDF, it works correctly. I'm working locally with Chroma DB.
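As a workaround while the timeout issue stands, the large PDF can be split into smaller batches before upload, mirroring the "first 5 pages works" observation. A minimal sketch, assuming the `pypdf` package is available (the function and file names here are illustrative, not part of anything-llm):

```python
def chunk_ranges(total_pages, pages_per_chunk):
    """Return (start, end) page-index ranges covering all pages."""
    return [(start, min(start + pages_per_chunk, total_pages))
            for start in range(0, total_pages, pages_per_chunk)]

def split_pdf(path, pages_per_chunk=10):
    """Write each chunk of `path` out as its own smaller PDF."""
    from pypdf import PdfReader, PdfWriter  # assumed installed: pip install pypdf
    reader = PdfReader(path)
    for n, (start, end) in enumerate(chunk_ranges(len(reader.pages),
                                                  pages_per_chunk), 1):
        writer = PdfWriter()
        for page in reader.pages[start:end]:
            writer.add_page(page)
        out = f"{path.rsplit('.', 1)[0]}_part{n}.pdf"
        with open(out, "wb") as f:
            writer.write(f)
```

Each resulting part stays well under the processor's timeout, e.g. a 72-page document split at 10 pages per chunk yields 8 uploads.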
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (10 by maintainers)
Indeed it works, thanks. However, I'd suggest going higher than 300 seconds for very large documents, along with some way to indicate processing progress.
Fixed by 1b4e29a3b9a642bad4f5d1358fd568b3491674aa, where we bump the timeout of the document processor to 300 seconds. The default was 30, which will not succeed for larger PDFs.
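For context, the `WORKER TIMEOUT` line in the log is Gunicorn killing a worker that exceeded its timeout (default 30 s). The exact invocation inside anything-llm's processor isn't shown here, but raising Gunicorn's timeout generally looks like this (the `app:api` module path is a hypothetical placeholder):

```shell
# Gunicorn's default --timeout is 30 seconds; a long PDF parse
# exceeds it, so the worker is killed mid-job ("fetch failed" upstream).
gunicorn --timeout 300 app:api
```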
It’s proprietary. Some indication, though: it’s an Excel document converted to PDF, 10 columns × 800 rows, with each cell containing mostly text (like a business directory, with some text and essential information for each entry). The PDF is in landscape mode, standard letter format, and is 72 pages long.
Edit: After doing some tests, I believe it’s the conversion from Office to PDF that makes it crash.
It happens when uploading into the document uploader.