langchain: RecursiveCharacterTextSplitter strange behavior after v0.0.226

System Info

After v0.0.226, RecursiveCharacterTextSplitter no longer seems to split properly at sentence boundaries and now cuts many sentences mid-word.

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
    # separators=["\n\n", "\n", ".", " ", ""],  # tried with and without this
)
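
To make the mid-word cuts easy to see, here is a minimal check I run; sample_text is just a placeholder for my real input, which is several pages of prose with paragraphs separated by blank lines:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# placeholder document standing in for my real data
sample_text = ("This is a sentence. " * 60 + "\n\n") * 5

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
)
chunks = splitter.split_text(sample_text)

# print where each chunk starts and ends to spot mid-word cuts
for i, chunk in enumerate(chunks):
    print(i, repr(chunk[:15]), "...", repr(chunk[-15:]))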

Expected behavior

I would like the splitter to break at newlines or sentence-ending periods rather than in the middle of words.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 16 (4 by maintainers)

Most upvoted comments

@dosubot This problem is not solved in the latest version.

Hello there. I'm also experiencing the same issue. I did some analysis on it and shared it in two comments on #8142.

@IlyaMichlin There is something else that confuses me, about TokenTextSplitter. When I split with a chunk size of 100 tokens, the resulting chunks come out at only around 50 tokens when measured with llm4.get_num_tokens(texts[0]). What is the actual mechanism inside this splitter? Thanks a lot!

from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import TokenTextSplitter

llm4 = ChatOpenAI(model_name="gpt-4-0613", temperature=0.0, request_timeout=120, streaming=True)

# Chinese financial-news sample text
text = """外围市场来看,美国就业数据大超预期引发投资者对美联储加息预期产生分歧,短期引发市场担忧。 该机构认为,虽目前市场短期有一定波动,但中期中国经济企稳增长预期将不断兑现、市场流动性充足支撑、盈利面逐步回暖等对市场形成支撑。同时春节并未带来二次疫情高峰使内资对疫情反复的担忧下降,未来内资或将逐步接棒北向资金。从风格角度看,2022年11月以来,市场春季躁动效应演绎充分,大盘股长势较好。"""

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(text)
llm4.get_num_tokens(texts[0])  # reports roughly 50 tokens, not 100
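
My guess (not verified against the source) is that TokenTextSplitter counts tokens with a default tiktoken encoding that differs from the cl100k_base encoding gpt-4 uses, so 100 splitter tokens can correspond to far fewer gpt-4 tokens, especially on Chinese text. A rough way to check, reusing text and llm4 from the snippet above and assuming the model_name parameter works the way I expect:

import tiktoken
from langchain.text_splitter import TokenTextSplitter

# compare how two encodings tokenize the same Chinese text
gpt2_enc = tiktoken.get_encoding("gpt2")          # the splitter's presumed default
gpt4_enc = tiktoken.encoding_for_model("gpt-4")   # cl100k_base

print(len(gpt2_enc.encode(text)), len(gpt4_enc.encode(text)))

# asking the splitter to count with gpt-4's tokenizer should bring the
# chunk sizes back in line with llm4.get_num_tokens
text_splitter = TokenTextSplitter(model_name="gpt-4", chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(text)
print(llm4.get_num_tokens(texts[0]))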

@dudesparsh I’m not sure if I follow your comment. The default separators list, which I tried as well, is ["\n\n", "\n", " ", ""].

Starting in v0.0.226, when I inspect the context retrieved by my RAG system, this splitter with the default separators list produces very different (and, in my opinion, poorer) chunks than it did in previous versions. It no longer seems to prioritize splitting on newlines or sentence boundaries; instead it frequently splits directly in the middle of words. For now I'm experimenting with the workaround sketched below.
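
The workaround is simply an explicit separator list that tries sentence-ending punctuation before bare spaces; keep_separator is an assumption on my part and may not exist or behave this way in every version:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
    # put ". " ahead of " " so sentence ends are tried before bare spaces
    separators=["\n\n", "\n", ". ", " ", ""],
    keep_separator=True,  # assumption: keeps the period attached to the chunk
)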