langchain: MarkdownTextSplitter removes formatting and line breaks

I was trying to use MarkdownTextSplitter to translate a document and maintain formatting, but I noticed that the splitter removed formatting from the markdown when splitting it.

As an example, when the following markdown is split with chunk_size=200, the "## " is removed from the Features line, along with the line breaks preceding and following that line.

# Dillinger

- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨

## Features

- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--

When split using this code:

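from langchain.text_splitter import MarkdownTextSplitter

# markdown_document holds the markdown text shown above
markdown_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_document])

# Print each chunk to see what the splitter kept
for doc in docs:
    print(doc.page_content)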

The output becomes:

# Dillinger

- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨
Features
- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--

The formatting and line breaks around the “Features” line are removed. The expected behavior is that the split docs, when combined, reproduce the original text.

A solution would be to never remove formatting and line breaks, or to add the removed prefix/suffix to metadata or other keys so they could be used to reconstruct the document with its formatting intact.
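To make the "never remove anything" option concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `split_lossless` is a hypothetical helper that packs whole lines (line breaks included) into chunks of at most `chunk_size` characters, so joining the chunks gives back the original text. A single line longer than `chunk_size` is kept whole in this sketch.

```python
def split_lossless(text: str, chunk_size: int) -> list[str]:
    """Hypothetical helper: greedy line packing that never drops a character."""
    chunks, current = [], ""
    # keepends=True keeps every line break attached to its line
    for line in text.splitlines(keepends=True):
        # Close the current chunk if adding this line would exceed the limit
        if current and len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


markdown_document = (
    "# Dillinger\n"
    "\n"
    "- Type some Markdown on the left\n"
    "- See HTML in the right\n"
    "- ✨Magic ✨\n"
    "\n"
    "## Features\n"
    "\n"
    "- Import a HTML file and watch it magically convert to Markdown\n"
    "- Drag and drop images (requires your Dropbox account be linked)\n"
    "- Import and save files from GitHub, Dropbox, Google Drive and One Drive\n"
)

chunks = split_lossless(markdown_document, chunk_size=200)

# The round-trip property asked for above: combined chunks equal the input
assert "".join(chunks) == markdown_document
```

The same round-trip assertion could serve as a regression test for any splitter that is meant to preserve formatting.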

Full code example

~: pip show langchain
Name: langchain
Version: 0.0.138

About this issue

State: closed
Created a year ago
Reactions: 6
Comments: 19 (3 by maintainers)

Commits related to this issue

fix: markdown splitting (#2) Remove `langchain` and replace [broken `langchain.text_splitter.MarkdownTextSplitter`](https://github.com/hwchase17/langchain/issues/2836) with a naive in-house implement... — committed to move-fast-and-break-things/minerva by yurijmikhalevich a year ago
Add option to preserve headers in MarkdownHeaderTextSplitter (#14433) - **Description:** `MarkdownHeaderTextSplitter` currently strips header lines from chunked content. Many applications require th... — committed to langchain-ai/langchain by finnless 6 months ago

Most upvoted comments

definitely in favor of this

would be great to add a bunch of test cases for this when doing so

hwchase17 on Apr 17, 2023

It is not easy to deal with Markdown.

Code should be split by functions or code blocks, or at least by entire lines. In Python, the function decorators and comments should stay together with the functions. Other languages have their own peculiarities. Code blocks should be properly split for the model to be able to understand the code.

Tables should be split keeping entire rows of content.

The same applies to Jupyter notebooks, and even normal books, papers, blogs…

It would be better to have a model do this job, either a general-purpose model (using a proper prompt) or one trained/fine-tuned specifically for this task.

kroggen on May 2, 2023
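As a rough illustration of the "keep code blocks whole, split on entire lines" idea from the comment above, here is a small sketch in plain Python. The helper name `split_blocks` is made up for this example and is not part of LangChain. It cuts only at blank lines that fall outside a fenced code block, so a fence is never broken apart and table rows (single lines) always stay intact.

```python
FENCE = "`" * 3  # the Markdown code-fence marker


def split_blocks(text: str) -> list[str]:
    """Hypothetical helper: split at blank lines, never inside a code fence."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines(keepends=True):
        # A fence line toggles whether we are inside a code block
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        current.append(line)
        # A blank line ends the block, but only outside a fence
        if not in_fence and line.strip() == "":
            blocks.append("".join(current))
            current = []
    if current:
        blocks.append("".join(current))
    return blocks


sample = (
    "Intro\n"
    "\n"
    + FENCE + "python\n"
    "def f():\n"
    "\n"
    "    return 1\n"
    + FENCE + "\n"
    "\n"
    "Outro\n"
)
blocks = split_blocks(sample)

# The blank line inside the fence did not split the code block
assert len(blocks) == 3
assert "".join(blocks) == sample
```

This assumes well-formed fences (every opening fence has a closing one); nested or indented fences would need more care.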

It is even worse for the PythonCodeTextSplitter. The class and def keywords are removed:

class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    """Attempts to split the text along Python syntax."""

    def __init__(self, **kwargs: Any):
        """Initialize a MarkdownTextSplitter."""
        separators = [
            # First, try to split along class definitions
            "\nclass ",
            "\ndef ",
            "\n\tdef ",
            # Now split by the normal type of lines
            "\n\n",
            "\n",
            " ",
            "",
        ]
        super().__init__(separators=separators, **kwargs)

kroggen on Apr 14, 2023
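The effect described above can be reproduced with plain `str.split`, independent of LangChain. The sample source string below is made up for illustration; the separators are the ones listed in the class above.

```python
source = "import os\n\nclass Foo:\n    pass\n\ndef bar():\n    return 1\n"

# str.split consumes the separator, so the keyword disappears from the pieces
by_class = source.split("\nclass ")
by_def = source.split("\ndef ")

assert by_class == ["import os\n", "Foo:\n    pass\n\ndef bar():\n    return 1\n"]
assert by_def == ["import os\n\nclass Foo:\n    pass\n", "bar():\n    return 1\n"]

# Neither list can be joined back into the original without the separator
assert "".join(by_class) != source
```

Any piece after the first one starts right after the keyword, which is why the chunks no longer contain `class` or `def`.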

These separators are removed from the text using the split() function here

Some separators should be re-added at the beginning of the text, like "##", and others at the end, like ".".

This would mean that for each separator we should specify how it should be re-added

Additionally, line breaks are removed here

kroggen on Apr 13, 2023
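One way to express "for each separator, specify how it is re-added" is a small policy table. The sketch below is an assumption about how that could look, not existing LangChain code: `ATTACH` and `split_and_reattach` are hypothetical names. It uses `re.split` with a capturing group, which returns the matched separators alongside the pieces.

```python
import re

# Hypothetical policy: where each separator goes back after the split
ATTACH = {
    "\n## ": "prefix",  # headings belong to the text that follows
    ". ": "suffix",     # sentence ends belong to the text before
}


def split_and_reattach(text: str, attach: dict[str, str] = ATTACH) -> list[str]:
    # The capturing group makes re.split return [piece, sep, piece, sep, ...]
    pattern = "(" + "|".join(re.escape(sep) for sep in attach) + ")"
    tokens = re.split(pattern, text)
    chunks = [tokens[0]]
    for sep, piece in zip(tokens[1::2], tokens[2::2]):
        if attach[sep] == "suffix":
            chunks[-1] += sep          # glue onto the end of the previous chunk
            chunks.append(piece)
        else:
            chunks.append(sep + piece)  # glue onto the start of the next chunk
    return [c for c in chunks if c]


text = "Intro. More text\n## Features\nDone. "
chunks = split_and_reattach(text)

assert chunks == ["Intro. ", "More text", "\n## Features\nDone. "]
assert "".join(chunks) == text
```

Merging these pieces up to a chunk size is left out here; the point is only that no separator is lost.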

+1. Besides, it would be great to have something like a split-by-headings-only mode, which splits only at headings, not at code blocks etc. This would be very helpful for building an assistant that looks up code documentation.

ifsheldon on Apr 26, 2023
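A headings-only mode could look roughly like the following sketch. It is written in plain Python under the assumption that only ATX headings (`#` to `######` followed by a space) count; `split_by_headings` is a hypothetical name. Lines starting with `#` inside a fenced code block (for example shell or Python comments) are not treated as headings, and each heading line stays in its section.

```python
import re

FENCE = "`" * 3                   # the Markdown code-fence marker
HEADING = re.compile(r"#{1,6} ")  # ATX heading at the start of a line


def split_by_headings(text: str) -> list[str]:
    """Hypothetical helper: one section per heading, headings kept in place."""
    sections, current, in_fence = [], [], False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Start a new section at a heading, but ignore "#" lines inside code
        if not in_fence and HEADING.match(line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


doc = (
    "# API\n"
    "Intro\n"
    "## Install\n"
    + FENCE + "sh\n"
    "# not a heading\n"
    "pip install pkg\n"
    + FENCE + "\n"
    "## Usage\n"
    "Call it.\n"
)
sections = split_by_headings(doc)

assert len(sections) == 3
assert "".join(sections) == doc
```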

I wouldn’t mind trying to fix this, but would like some feedback or a go-ahead from a maintainer on my proposed solution first:

I'm thinking split_text could be refactored to return the separator used as well, and this could be stored in the list of documents returned (maybe as a new class, SplitDocument, with a separator/prefix key).

Does any maintainer think this sounds like a good or bad idea?

vbelius on Apr 14, 2023
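For discussion, the proposal above could be sketched like this. Everything here is hypothetical: `SplitDocument`, `split_with_separator` and `reconstruct` are illustrative names, not existing LangChain classes or functions, and the sketch handles a single separator only.

```python
from dataclasses import dataclass


@dataclass
class SplitDocument:
    """Hypothetical document type that remembers the separator it lost."""
    page_content: str
    separator: str = ""  # separator removed from in front of page_content


def split_with_separator(text: str, separator: str) -> list[SplitDocument]:
    parts = text.split(separator)
    # The first part had no separator in front of it
    docs = [SplitDocument(parts[0])]
    docs += [SplitDocument(part, separator) for part in parts[1:]]
    return docs


def reconstruct(docs: list[SplitDocument]) -> str:
    # Put each stored separator back in front of its content
    return "".join(doc.separator + doc.page_content for doc in docs)


text = "# Title\nintro\n## A\nbody a\n## B\nbody b\n"
docs = split_with_separator(text, "\n## ")

assert [d.page_content for d in docs] == ["# Title\nintro", "A\nbody a", "B\nbody b\n"]
assert reconstruct(docs) == text
```

With this shape, a translation workflow could process `page_content` alone and still rebuild the document with its formatting afterwards.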