langchain: MarkdownTextSplitter removes formatting and line breaks

I was trying to use MarkdownTextSplitter to translate a document and maintain formatting, but I noticed that the splitter removed formatting from the markdown when splitting it.

For example, splitting the following markdown with chunk_size=200 removes the "## " from the Features line, as well as the line breaks before and after that line.

# Dillinger

- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨

## Features

- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--

When split using this code:

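from langchain.text_splitter import MarkdownTextSplitter

# markdown_document holds the sample markdown shown above as a single string
markdown_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_document])

for doc in docs:
    print(doc.page_content)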

The output becomes:

# Dillinger

- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨
Features
- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--

The formatting and line breaks around the “Features” line are removed. Expected behavior would be that each split doc, when combined, would be the original text.

A solution would be to never remove formatting and line breaks, or to store the removed prefix/suffix in metadata or other keys so they can be used to reconstruct the document with its formatting intact.

Full code example

~: pip show langchain
Name: langchain
Version: 0.0.138

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 6
  • Comments: 19 (3 by maintainers)

Most upvoted comments

definitely in favor of this

It would be great to add a bunch of test cases for this when doing so.

It is not easy to deal with Markdown.

Code should be split by functions or code blocks or at least entire lines. In python the function decorators and comments should be together with the functions. And other languages have their own peculiarities. Code blocks should be properly split for the model to be able to understand the code.
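As a rough illustration of splitting Python by whole functions, here is a minimal sketch that uses the standard-library ast module (Python 3.8+) to cut a source string at top-level statements while keeping decorators and directly preceding comment lines with the function they belong to. The name split_top_level is hypothetical and is not part of langchain; the sketch also assumes the source parses and uses ordinary newline characters.

```python
import ast
from typing import List


def split_top_level(source: str) -> List[str]:
    """Split Python source into one chunk per top-level statement."""
    lines = source.splitlines(keepends=True)
    starts = []
    prev_end = 0  # 1-based last line of the previous top-level node
    for node in ast.parse(source).body:
        # A decorated def/class starts at its first decorator, not at "def".
        decorators = getattr(node, "decorator_list", [])
        first = min([node.lineno] + [d.lineno for d in decorators])
        idx = first - 1  # 0-based index of the node's first line
        # Pull in comment lines that sit directly above the node,
        # without reaching back into the previous node.
        while idx > prev_end and lines[idx - 1].lstrip().startswith("#"):
            idx -= 1
        starts.append(idx)
        prev_end = node.end_lineno
    if not starts:
        return [source] if source else []
    starts[0] = 0  # keep anything before the first statement
    bounds = starts + [len(lines)]
    return ["".join(lines[a:b]) for a, b in zip(bounds, bounds[1:])]


example_source = (
    "import os\n"
    "\n"
    "# Greets politely.\n"
    "@decorator\n"
    "def hello():\n"
    "    return 'hi'\n"
    "\n"
    "def bye():\n"
    "    return 'bye'\n"
)
chunks = split_top_level(example_source)
```

Because chunks are cut on line boundaries and nothing is dropped, joining them gives back the original source. Chunk size limits are ignored here; a real splitter would still need a fallback for functions that are too long.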

Tables should be split keeping entire rows of content.

The same applies to Jupyter notebooks, and even normal books, papers, blogs…

It would be better to have a model to do this job, either a general purpose model (using proper prompt) or one trained/fine-tuned specifically for this task

It is even worse for the PythonCodeTextSplitter. The class and def keywords are removed:

class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    """Attempts to split the text along Python syntax."""

    def __init__(self, **kwargs: Any):
        """Initialize a PythonCodeTextSplitter."""
        separators = [
            # First, try to split along class definitions
            "\nclass ",
            "\ndef ",
            "\n\tdef ",
            # Now split by the normal type of lines
            "\n\n",
            "\n",
            " ",
            "",
        ]
        super().__init__(separators=separators, **kwargs)

These separators are removed from the text using the split() function here
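The effect can be reproduced with plain str.split(), used here as a stand-in for the library's code path rather than a copy of it. The sample source string is made up for illustration.

```python
# One of the separators from the list above.
separator = "\nclass "

source = "import os\nclass A:\n    pass\nclass B:\n    pass\n"

# str.split() consumes every occurrence of the separator.
parts = source.split(separator)

# The "class" keyword no longer appears in any chunk.
keyword_survives = any("class" in part for part in parts)

# The text is only recoverable if the caller still knows which separator was used.
restored = separator.join(parts)
```

Here parts holds three chunks with no class keyword left in any of them, and only re-joining with the same separator restores the original string.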

Some separators, like ##, should be re-added at the beginning of the text, and others, like a trailing period, at the end.

This would mean that for each separator we should specify how they should be re-added
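One way to express such a per-separator rule is sketched below, assuming non-empty literal separators. SeparatorRule and split_keeping are hypothetical names, not langchain classes. The sketch uses zero-width lookahead and lookbehind patterns with re.split (Python 3.7+), so the separator is never consumed and stays attached to the chunk on the chosen side.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SeparatorRule:  # hypothetical helper, not part of langchain
    text: str    # literal separator, assumed non-empty
    attach: str  # "prefix": stays with the next chunk; "suffix": stays with the previous one


def split_keeping(text: str, rule: SeparatorRule) -> List[str]:
    """Split text at rule.text without dropping the separator."""
    escaped = re.escape(rule.text)
    if rule.attach == "prefix":
        pattern = f"(?={escaped})"   # split just before the separator
    elif rule.attach == "suffix":
        pattern = f"(?<={escaped})"  # split just after the separator
    else:
        raise ValueError(f"unknown attach mode: {rule.attach!r}")
    return [piece for piece in re.split(pattern, text) if piece]


heading_chunks = split_keeping(
    "intro\n## A\ntext\n## B\nmore", SeparatorRule("\n## ", "prefix")
)
sentence_chunks = split_keeping("One. Two. Three.", SeparatorRule(". ", "suffix"))
```

With this approach the chunks always concatenate back to the input, so no extra metadata is needed to rebuild the document.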

Additionally, line breaks are removed here

+1. Besides, it would be great to have something like a headings-only mode, which splits only by headings and not by code blocks etc. This would be very helpful for building an assistant that looks up code documentation.
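A headings-only mode could look roughly like the sketch below. split_by_headings is a hypothetical function, not an existing langchain API. It assumes ATX-style headings (one to six # characters followed by a space) at the start of a line and triple-backtick code fences, and it skips anything inside a fence so that a Python comment is not mistaken for a heading.

```python
import re
from typing import List

FENCE = "`" * 3  # triple-backtick code fence marker
HEADING = re.compile(r"#{1,6} ")


def split_by_headings(markdown: str) -> List[str]:
    """Split markdown into sections that each start at a heading line."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced code block
        is_heading = not in_fence and HEADING.match(line) is not None
        if is_heading and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


sample = (
    "# Title\n\nintro\n\n"
    + FENCE + "python\n# not a heading\nx = 1\n" + FENCE + "\n\n"
    "## Usage\n\ntext\n"
)
sections = split_by_headings(sample)
```

Each section keeps its heading markers and blank lines, so the sections join back into the original document. Chunk size is not considered; long sections would still need a secondary split.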

I wouldn’t mind giving this fix a try, but I would like some feedback or a go-ahead from a maintainer on my proposed solution first:

I’m thinking split_text could be refactored to also return the separator used, which could then be stored in the list of documents returned (perhaps in a new class, SplitDocument, with a separator/prefix key).

Any maintainer who thinks this sounds like a good or bad idea?
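To make the proposal concrete, here is a minimal sketch of what it might look like. SplitDocument is the hypothetical class named above; split_text_with_separator and reconstruct are made-up helper names. None of this is existing langchain code, and it handles a single non-empty separator only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitDocument:  # hypothetical class from the proposal
    page_content: str
    prefix: str = ""  # separator that was removed in front of this chunk


def split_text_with_separator(text: str, separator: str) -> List[SplitDocument]:
    """Split on a non-empty separator and remember it on each following chunk."""
    parts = text.split(separator)
    docs = [SplitDocument(page_content=parts[0])]
    docs.extend(SplitDocument(page_content=p, prefix=separator) for p in parts[1:])
    return docs


def reconstruct(docs: List[SplitDocument]) -> str:
    """Rebuild the original text by putting each stored prefix back."""
    return "".join(doc.prefix + doc.page_content for doc in docs)


original = "# Dillinger\n\n- Magic\n\n## Features\n\n- Import a HTML file\n"
split_docs = split_text_with_separator(original, "\n\n## ")
rebuilt = reconstruct(split_docs)
```

The chunk text stays clean (for translation, say), while the stored prefix is enough to reassemble the document exactly.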