llama_index: [Bug]: Python `get_nodes_from_documents` is notably slow on large text files

Bug Description

We are migrating from the TypeScript `getNodesFromDocuments` to the Python `get_nodes_from_documents`. TS takes ~3 seconds to generate TextNodes from an ~8MB text file, while Python takes ~60 seconds to generate TextNodes from the same file.

One notable observation: SimpleDirectoryReader loads the whole ~8MB file into a single Document object. If we split the text file into multiple Document objects ourselves, the same file generates TextNodes much faster. We tested this by using nltk.tokenize.sent_tokenize to put ~50 sentences in each Document object; passing that list to get_nodes_from_documents took only ~7 seconds to generate TextNodes. A sketch of this workaround follows.
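
A minimal sketch of this workaround, assuming llama_index 0.9.x import paths (the filepath value is a placeholder):

import nltk
from llama_index import Document
from llama_index.node_parser import SentenceSplitter

nltk.download("punkt")  # one-time download of the sentence tokenizer model

filepath = "large_file.txt"  # placeholder for the ~8MB text file

with open(filepath, encoding="utf-8") as f:
    sentences = nltk.tokenize.sent_tokenize(f.read())

# pack ~50 sentences into each Document, as described above
batch = 50
documents = [
    Document(text=" ".join(sentences[i : i + batch]))
    for i in range(0, len(sentences), batch)
]

splitter = SentenceSplitter.from_defaults(chunk_overlap=32, chunk_size=256)
llama_nodes = splitter.get_nodes_from_documents(documents=documents)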

Version

0.9.39

Steps to Reproduce

TypeScript code:

// assuming the top-level "llamaindex" package exports used by the reporter
import { SimpleDirectoryReader, SimpleNodeParser, SentenceSplitter } from "llamaindex";

// `filepath` and `filename` come from the surrounding application code
const reader = new SimpleDirectoryReader();
const docs = (await reader.loadData({ filepath })).map((doc) => {
  doc.metadata.filename = filename;
  return doc;
});
const nodeParser = new SimpleNodeParser({
  textSplitter: new SentenceSplitter({
    chunkOverlap: 32,
    chunkSize: 256,
  }),
});
const nodes = nodeParser.getNodesFromDocuments(docs);

Python code:

from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SentenceSplitter  # import paths as of v0.9.x

# filepath points at the same ~8MB text file
reader = SimpleDirectoryReader(input_files=[filepath])
documents = reader.load_data()
splitter = SentenceSplitter.from_defaults(
    chunk_overlap=32,
    chunk_size=256,
)
llama_nodes = splitter.get_nodes_from_documents(documents=documents)

Both of these snippets were run over a ~8MB text file.

Relevant Logs/Tracebacks

No response

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

🤖

It’s great that you’ve identified the performance discrepancy and even found a potential solution by splitting the text file into multiple Document objects. Your observation about the SimpleDirectoryReader loading the entire file into one Document object is insightful. If you’re able to implement the change and test it further, it would be valuable to see your proposed solution in a pull request. Keep up the good work!

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@preemoDez sorry, I meant in “my case”.

Ok, I investigated the issue further and found two problems. One is with the algorithm above, but that is the smaller issue in this specific case:

Execution time for split: 12.53 seconds  ------> $O(n^2)$ above
Execution time for merge: 0.78 seconds
Execution time for split_text_metadata_aware: 13.31 seconds
Execution time for build_nodes_from_splits: 50.06 seconds  ----> ???
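
For reference, a minimal timing harness in the spirit of these measurements, assuming the splitter and documents objects from the reproduction snippet above (split_text_metadata_aware is the internal v0.9.x method name):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.perf_counter()
    yield
    print(f"Execution time for {label}: {time.perf_counter() - start:.2f} seconds")

with timed("split_text_metadata_aware"):
    splits = splitter.split_text_metadata_aware(documents[0].text, metadata_str="")

with timed("get_nodes_from_documents"):
    nodes = splitter.get_nodes_from_documents(documents=documents)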

When looking further, I noticed that building involves adding the relationship to the source document:

relationships={NodeRelationship.SOURCE: ref_doc.as_related_node_info()},

and I looked into as_related_node_info() and noticed that the last line is problematic:

hash=self.hash,

When I checked the code, it turned out that the hash is not cached; it is recomputed over the whole 8 MB of text for every node. With ~10K nodes, that is a huge performance overhead.

I'm not sure caching is the right solution, but you could also compute that value once before the loop and pass it in; a sketch of that follows.
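
A minimal sketch of that suggestion (not the actual library patch), assuming ref_doc and text_splits are in scope as they are inside build_nodes_from_splits in v0.9.x:

from llama_index.schema import NodeRelationship, TextNode

# compute the RelatedNodeInfo -- and therefore the full-document hash --
# once, instead of once per node
source_info = ref_doc.as_related_node_info()

nodes = []
for text_chunk in text_splits:
    nodes.append(
        TextNode(
            text=text_chunk,
            relationships={NodeRelationship.SOURCE: source_info},
        )
    )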

Fixing the latter issue yields the larger performance gain and is easier than redesigning the split algorithm to be linear.