langchain: RecursiveCharacterTextSplitter strange behavior after v0.0.226

System Info

After v0.0.226, RecursiveCharacterTextSplitter no longer seems to split properly at sentence boundaries and now cuts many sentences mid-word.

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
    # separators=["\n\n", "\n", ".", " ", ""],  # tried with and without this
)
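
To make the mid-word cuts easy to see, here is a minimal check I run; sample_text is just a placeholder for my real input, which is several pages of prose with paragraphs separated by blank lines:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# placeholder document standing in for my real data
sample_text = ("This is a sentence. " * 60 + "\n\n") * 5

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
)
chunks = splitter.split_text(sample_text)

# print where each chunk starts and ends to spot mid-word cuts
for i, chunk in enumerate(chunks):
    print(i, repr(chunk[:15]), "...", repr(chunk[-15:]))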

Expected behavior

I would like the splitter to break at newlines or sentence-ending periods rather than in the middle of words.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 16 (4 by maintainers)

Most upvoted comments

@dosubot This problem is not solved in the latest version.

Hello there. I'm also experiencing the same issue. I did some analysis on it and shared it in two comments on #8142.

@IlyaMichlin There is something else that confuses me, about TokenTextSplitter. When I split with a chunk size of 100 tokens, the resulting chunks come out at only around 50 tokens when measured with llm4.get_num_tokens(texts[0]). What is the actual mechanism inside this splitter? Thanks a lot!

from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import TokenTextSplitter

llm4 = ChatOpenAI(model_name="gpt-4-0613", temperature=0.0, request_timeout=120, streaming=True)

# Chinese financial-news sample text
text = """外围市场来看,美国就业数据大超预期引发投资者对美联储加息预期产生分歧,短期引发市场担忧。 该机构认为,虽目前市场短期有一定波动,但中期中国经济企稳增长预期将不断兑现、市场流动性充足支撑、盈利面逐步回暖等对市场形成支撑。同时春节并未带来二次疫情高峰使内资对疫情反复的担忧下降,未来内资或将逐步接棒北向资金。从风格角度看,2022年11月以来,市场春季躁动效应演绎充分,大盘股长势较好。"""

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(text)
llm4.get_num_tokens(texts[0])  # reports roughly 50 tokens, not 100
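
My guess (not verified against the source) is that TokenTextSplitter counts tokens with a default tiktoken encoding that differs from the cl100k_base encoding gpt-4 uses, so 100 splitter tokens can correspond to far fewer gpt-4 tokens, especially on Chinese text. A rough way to check, reusing text and llm4 from the snippet above and assuming the model_name parameter works the way I expect:

import tiktoken
from langchain.text_splitter import TokenTextSplitter

# compare how two encodings tokenize the same Chinese text
gpt2_enc = tiktoken.get_encoding("gpt2")          # the splitter's presumed default
gpt4_enc = tiktoken.encoding_for_model("gpt-4")   # cl100k_base

print(len(gpt2_enc.encode(text)), len(gpt4_enc.encode(text)))

# asking the splitter to count with gpt-4's tokenizer should bring the
# chunk sizes back in line with llm4.get_num_tokens
text_splitter = TokenTextSplitter(model_name="gpt-4", chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(text)
print(llm4.get_num_tokens(texts[0]))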

@dudesparsh I’m not sure if I follow your comment. The default separators list, which I tried as well, is ["\n\n", "\n", " ", ""].

Starting in v0.0.226, when I inspect the context retrieved by my RAG system, this splitter with the default separators list produces very different (and, in my opinion, poorer) chunks than it did in previous versions. It no longer seems to prioritize splitting on newlines or sentence boundaries; instead it frequently splits directly in the middle of words. For now I'm experimenting with the workaround sketched below.
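
The workaround is simply an explicit separator list that tries sentence-ending punctuation before bare spaces; keep_separator is an assumption on my part and may not exist or behave this way in every version:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
    # put ". " ahead of " " so sentence ends are tried before bare spaces
    separators=["\n\n", "\n", ". ", " ", ""],
    keep_separator=True,  # assumption: keeps the period attached to the chunk
)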