langchain: MarkdownTextSplitter removes formatting and line breaks
I was trying to use MarkdownTextSplitter to translate a document and maintain formatting, but I noticed that the splitter removed formatting from the markdown when splitting it.
As an example, splitting the following markdown with chunk_size=200 removes the "## " from the Features line, as well as the line breaks preceding and following that line.
# Dillinger
- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨
## Features
- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--
When split using this code:
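from langchain.text_splitter import MarkdownTextSplitter

# markdown_document holds the markdown text shown above
markdown_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_document])
for doc in docs:
    print(doc.page_content)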
The output becomes:
# Dillinger
- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨
Features
- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--
The formatting and line breaks around the “Features” line are removed. Expected behavior would be that each split doc, when combined, would be the original text.
A solution would be to never remove formatting and line breaks, or to store the removed prefix/suffix in metadata or other keys so they can be used to reconstruct the document with its formatting intact.
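To make the expected round-trip behavior concrete, here is a minimal sketch of a splitter that never drops characters. It is not langchain's implementation; `split_lines_lossless` is a hypothetical helper that packs whole lines into chunks, so joining the chunks always gives back the original text.

```python
def split_lines_lossless(text, chunk_size):
    """Pack whole lines into chunks of at most chunk_size characters.

    Nothing is stripped, so "".join(chunks) == text always holds.
    A single line longer than chunk_size becomes its own oversized chunk.
    """
    chunks, current = [], ""
    # keepends=True keeps every line break attached to its line
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


sample = "# Dillinger\n- Type some Markdown on the left\n\n## Features\n- Import a HTML file\n"
chunks = split_lines_lossless(sample, chunk_size=40)
assert "".join(chunks) == sample  # the round trip is lossless
```

Splitting on line boundaries is cruder than splitting on markdown structure, but it shows that the round-trip property itself is cheap to guarantee.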
~: pip show langchain
Name: langchain
Version: 0.0.138
About this issue
- State: closed
- Created a year ago
- Reactions: 6
- Comments: 19 (3 by maintainers)
Commits related to this issue
- fix: markdown splitting (#2) Remove `langchain` and replace [broken `langchain.text_splitter.MarkdownTextSplitter`](https://github.com/hwchase17/langchain/issues/2836) with a naive in-house implement... — committed to move-fast-and-break-things/minerva by yurijmikhalevich a year ago
- Add option to preserve headers in MarkdownHeaderTextSplitter (#14433) - **Description:** `MarkdownHeaderTextSplitter` currently strips header lines from chunked content. Many applications require th... — committed to langchain-ai/langchain by finnless 6 months ago
definitely in favor of this
would be great to add a bunch of test cases for this when doing
It is not easy to deal with Markdown.
Code should be split by functions or code blocks or at least entire lines. In python the function decorators and comments should be together with the functions. And other languages have their own peculiarities. Code blocks should be properly split for the model to be able to understand the code.
Tables should be split keeping entire rows of content.
The same applies to Jupyter notebooks, and even normal books, papers, blogs…
It would be better to have a model to do this job, either a general purpose model (using proper prompt) or one trained/fine-tuned specifically for this task
It is even worse for the PythonCodeTextSplitter. The `class` and `def` keywords are removed.
These separators are removed from the text using the `split()` function here.
Some separators should be re-added at the beginning of the text, like `##`, and others at the end, like `.`
This would mean that for each separator we should specify how it should be re-added.
Additionally, line breaks are removed here
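The per-separator idea above could be sketched roughly like this. Everything here is an assumption for illustration: `SEPARATOR_POLICY` and `split_keeping_separator` are hypothetical names, not langchain APIs, and the sketch relies on `re.split` accepting zero-width matches (Python 3.7+).

```python
import re

# Hypothetical policy table (not part of langchain): where each separator
# is re-attached after splitting.
SEPARATOR_POLICY = {
    "\n## ": "prefix",     # heading marker stays with the chunk that follows
    "\nclass ": "prefix",  # keyword stays with the class it introduces
    "\ndef ": "prefix",    # keyword stays with the function it introduces
    ". ": "suffix",        # sentence end stays with the chunk that precedes
}


def split_keeping_separator(text, separator):
    """Split on `separator` without consuming it."""
    escaped = re.escape(separator)
    if SEPARATOR_POLICY[separator] == "prefix":
        pattern = f"(?={escaped})"   # zero-width match just before the separator
    else:
        pattern = f"(?<={escaped})"  # zero-width match just after the separator
    return [piece for piece in re.split(pattern, text) if piece]


code = "import os\nclass A:\n    pass\ndef f():\n    return 1\n"
pieces = split_keeping_separator(code, "\ndef ")
assert "".join(pieces) == code  # the def keyword survives the split
```

Because the split points are zero-width, no character is ever consumed, so nothing needs to be re-added afterwards.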
+1. Besides, would be great to have something like splitting by headings only mode, which only splits by headings, not by code blocks etc. This will be very helpful to build an assistant that looks up code documentation.
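A headings-only mode could look something like the sketch below. It is a simplified assumption, not an existing langchain option: `split_by_headings` is a hypothetical helper that only recognizes ATX headings (`#` to `######` followed by a space) and skips `#` lines inside fenced code blocks so code comments are not mistaken for headings.

```python
import re

HEADING = re.compile(r"^#{1,6} ")
FENCE = re.compile(r"^(`{3}|~{3})")  # opening or closing code fence


def split_by_headings(markdown):
    """Start a new section at each heading line; keep all text intact."""
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines(keepends=True):
        if FENCE.match(line):
            in_fence = not in_fence
        # A heading outside a code fence closes the section collected so far
        if not in_fence and HEADING.match(line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections
```

Each section keeps its own heading line, so a documentation assistant can show or embed the heading together with the content it introduces.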
I wouldn’t mind giving fixing this a try but would like to have some feedback/go-ahead from a maintainer for my solution proposal first:
I'm thinking split_text could be refactored to also return the separator used, which could be stored with each document returned (perhaps in a new class, SplitDocument, with a separator/prefix key).
Any maintainer who thinks this sounds like a good or bad idea?
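For discussion, the proposal might look roughly like this. It is only a sketch under assumptions: `SplitDocument` does not exist in langchain, and `split_text_with_separators` and `reconstruct` are hypothetical helpers, not a refactor of the real split_text.

```python
import re
from dataclasses import dataclass


@dataclass
class SplitDocument:
    page_content: str
    separator: str = ""  # separator that was removed just before this chunk


def split_text_with_separators(text, separator_pattern):
    """Split on a regex and remember which separator preceded each chunk.

    separator_pattern must not contain capturing groups of its own.
    """
    # With one capturing group, re.split returns content, sep, content, sep, ...
    parts = re.split(f"({separator_pattern})", text)
    docs = [SplitDocument(page_content=parts[0])]
    for i in range(1, len(parts), 2):
        docs.append(SplitDocument(page_content=parts[i + 1], separator=parts[i]))
    return docs


def reconstruct(docs):
    """Rebuild the original text from the chunks and their stored separators."""
    return "".join(doc.separator + doc.page_content for doc in docs)
```

With this shape, page_content can stay stripped (as it is today) for embedding or translation, while the separator field carries what is needed to restore the "## " and the surrounding line breaks.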