langchain: RecursiveCharacterTextSplitter strange behavior after v0.0.226
System Info
After v0.0.226, the RecursiveCharacterTextSplitter seems to no longer separate properly at the end of sentences and now cuts many sentences mid-word.
Who can help?
No response
Information
- The official example notebooks/scripts
- My own modified scripts
Related Components
- LLMs/Chat Models
- Embedding Models
- Prompts / Prompt Templates / Prompt Selectors
- Output Parsers
- Document Loaders
- Vector Stores / Retrievers
- Memory
- Agents / Agent Executors
- Tools / Toolkits
- Chains
- Callbacks/Tracing
- Async
Reproduction
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
    # separators=["\n\n", "\n", ".", " ", ""],  # tried with and without this
)
Expected behavior
The splitter should break chunks at newlines or periods instead of cutting in the middle of a word.
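To put a number on "cuts many sentences mid-word", a small check like the one below could be run over the splitter output in two versions. This is only a sketch: `ends_mid_word` and `count_mid_word_cuts` are hypothetical helper names, not part of langchain, and the check assumes each chunk appears verbatim in the source text (it looks up the first occurrence only, so repeated passages can be misattributed).

```python
from typing import Iterable


def ends_mid_word(text: str, chunk: str) -> bool:
    """Return True if `chunk` stops between two alphanumeric characters of `text`.

    Hypothetical helper: assumes `chunk` is a verbatim substring of `text`.
    """
    if not chunk:
        return False
    pos = text.find(chunk)  # first occurrence only
    if pos == -1:
        raise ValueError("chunk is not a verbatim substring of text")
    end = pos + len(chunk)
    # A cut is mid-word when the last character of the chunk and the
    # character that follows it in the source are both alphanumeric.
    return end < len(text) and text[end - 1].isalnum() and text[end].isalnum()


def count_mid_word_cuts(text: str, chunks: Iterable[str]) -> int:
    """Count how many chunks end in the middle of a word."""
    return sum(ends_mid_word(text, chunk) for chunk in chunks)
```

Passing the original text and the list returned by the splitter to `count_mid_word_cuts` gives a single figure that can be compared before and after v0.0.226.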
About this issue
- State: open
- Created a year ago
- Reactions: 1
- Comments: 16 (4 by maintainers)
@dosubot This problem is not solved in the latest version.
Hello there. I am also experiencing the same issue. I did some analysis on it and shared it in two comments on #8142.
@IlyaMichlin There is another thing that confuses me about TokenTextSplitter. When I use the token splitter to chunk by 100 tokens, it actually gives me splits of around 50 tokens, as counted by llm4.get_num_tokens(texts[0]). What is the actual mechanism inside this splitter? Thanks a lot!
@dudesparsh I'm not sure I follow your comment. The default separators list, which I tried as well, is ["\n\n", "\n", " ", ""]. Starting in v0.0.226, when I inspect the context retrieved in my RAG system, this retriever along with the default separators list seems to produce very different (and in my opinion poorer) results than it did in previous versions. It does not seem to prioritize separating on newlines or at the ends of sentences, and instead quite often separates directly in the middle of words.
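For reference, the priority order described above (paragraph breaks first, then newlines, then spaces, and a hard character cut only as a last resort) can be sketched in plain Python. This is not langchain's implementation: `split_by_priority` is a hypothetical function, it ignores `chunk_overlap`, and it drops the separator at chunk boundaries. It only illustrates the behavior the default separators list is expected to give.

```python
from typing import List


def split_by_priority(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Tries the separators in order and falls back to the next one only for
    pieces that are still too long. Simplified illustration, no overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters (may split words).
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with lower-priority separators.
            chunks.extend(split_by_priority(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks
```

With `separators=["\n\n", "\n", " ", ""]`, a word is only cut when it is itself longer than `chunk_size`, which is the behavior that appears to have changed in the affected versions.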