vscode_rainbow_csv: Supporting linebreaks inside double quoted csv fields

Key,Text
X,"Hello, mechatroner,
This, incorrectly interpreted as a new row with two columns"

image

According to RFC 4180 the quoted entry spanning a couple lines should be a single cell (i.e. everything after “Hello” in this example should be in blue)

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 36
  • Comments: 26 (16 by maintainers)

Commits related to this issue

Most upvoted comments

Jut want to comment I would love to see this implemented in the vs code verison

I made a proof of concept in Vim

image

When highlighting gets glitchy this command fixes it:

:syntax sync fromstart

@jjspace Thanks, I think it is possible to adjust CSVLint so that it can accept quoted multi-line strings, and I agree that this could be a useful improvement in some cases.

There were no major issues reported related to the new “dynamic csv” decoration-based highlighting since its launch, so we can assume that it works well enough. I am finally closing this issue. Further discussion of this topic can be held in the new more focused github issues. Thanks, everyone!

Thank you very much, @ajhyndman! Your syntax is very impressive, took me some time to understand how it works. I was absolutely sure that there is no way to implement something like this using TextMate grammar, turns out I was wrong about it. I will definitely consider using your syntax as a backup plan or as an improvement for the solution that I am currently developing in the rfc branch - I recently discovered two other approaches using VSCode tokenization API, one of them being vscode.languages.registerDocumentRangeSemanticTokensProvider - it allows to highlight just the lines which are currently visible in the editor window + some margin lines around them. This allows to fully support RFC 4180 and also use dynamic highlighting for all possible separator strings (even multi-char separators) and also highlight comment lines. But the semantic tokenization method has some drawbacks too, soo it is great that we now have this alternative solution! I hope to publish something within the next month.

I just published version 3.0.0 which supports multiline fields with some limitations and raw edges, the new RFC4180-compatible dialect is implemented a separate “syntax” which is called “dynamic CSV” - it means that highlighting is done through VSCode syntax tokens mechanism instead of a pre-build grammar. The correct multiline “dynamic CSV” dialect can trigger in 2 cases:

  1. Automatically, when Rainbow CSV detects that the newly opened CSV file contains multiline fields.
  2. Manually, you need to select separator (currently only , and ; are supported) and run Rainbow CSV: Set separator - Quoted Multiline Fields command.

If everything goes well I will soon publish version 4.0.0 with the syntax proposed by @ajhyndman which will combine two CSV dialects into one.
BTW version 3.0.0 has also some other features such as comment lines highlighting and support for infinite number of single-character and multicharacter separators - this is the reason why dynamic syntax highlighting is crucial and will remain as one of the highlighting mechanisms of the extension anyway. But pre-build grammars are still superior in many ways so I am planning to continue using them for popular dialects such as csv, tsv, pipe-separated files, and some others.

OK, You are right, it would be reasonable to support linebreaks in fields in a special mode. So I spent some time trying to implement a grammar that would do it. I’ve started with the grammar that I use in Atom, it is similar to VSCode grammar (since both editors use TextMate-based grammars), but has some differences. So the main regex in Atom consists of 10 identical parts:

(?:((?:"(?:[^"]*"")*[^"]*"(?:,|$))|(?:[^",]*(?:,|$)))|(.+))?

This smaller regex, in turn, consists of 3 parts separated by OR | character:

  1. (?:"(?:[^"]*"")*[^"]*"(?:,|$)) - here we try to match double quoted csv field e.g. the second field in this line: 1234,"hello,""world""",1234
  2. (?:[^",]*(?:,|$)) - tries to match unquoted csv field
  3. (.+) - If neither of 1 and 2 has matched - that means that csv line is incorrectly formatted, so we produce “error” match until the end of the line.

So it is easy to modify part 1 and 3, so that quoted field would match across linebreaks and “error” would match till the end of the file. part 1 would look like this (?:"(?:(?:[^"]|\n)*"")*(?:[^"]|\n)*"(?:,|$)) and part 3 like this: ((?:.|\n)+). But this will not work, because Oniguruma/TextMate regexps will not match across linebreaks quote:

Note that the regular expressions are matched against only a single line of the document at a time. That means it is not possible to use a pattern that matches multiple lines. The reason for this is technical: being able to restart the parser at an arbitrary line and having to re-parse only the minimal number of lines affected by an edit. In most situations it is possible to use the begin/end model to overcome this limitation.

So it is currently not possible to use the modified regexp to match across multiple lines (unless VSCode would introduce a special “xlmatch” grammar keyword that would allow usage of \n in it). Someone could try to use the mentioned “begin/end” TextMate model, but I would be very suprised if it would work.

I will probably try to modify rainbow grammar in Vim to support linebreaks in fields just to see whether it would work or not.

And again there is a fundamental difference between csv which allow linebreaks and csv that doesn’t. Many algorithms (like syntax highlighting, column number query, random sampling) will work in O(N) on the former and in O(1) on the later.

Confirm the large file works locally for me (although it was briefly auto-detected as Dynamic CSV before my markdown extension highjacked the language mode to markdown and I had to manually set it back). The large file also works on vscode.dev

The small file also works locally now on 3.1.0 - worth noting though that the comma character must be selected when invoking Rainbow CSV: Set separator - Quoted Multiline Fields

image

Thanks! 🥳

I stumbled on this issue the other day, and decided to try my hand at it.

I am not certain it can be extended to all valid CSV separators, but I found a way to achieve multiline quote support using nested “begin” and “end” matchers. I thought I’d share it in case it turns out to be useful.

Basically, each pattern matches the entire rest of the row, but defines a nested pattern for the next color. Once you get to the last color, you use "$self" to loop back to the start of the grammar. I also use a lookahead matcher to define “end”, so that a single newline character can end all active rainbow patterns.

{
  "name": "csv syntax",
  "scopeName": "text.csv",
  "fileTypes": ["csv"],
  "patterns": [
    {
      "name": "variable.other.rainbow1",
      "begin": "^|\\,",
      "end": "(?=\\n)",
      "patterns": [
        { "include": "#quotedvalue" },
        {
          "name": "keyword.rainbow2",
          "begin": "\\,",
          "end": "(?=\\n)",
          "patterns": [
            { "include": "#quotedvalue" },
            {
              "name": "entity.name.function.rainbow3",
              "begin": "\\,",
              "end": "(?=\\n)",
              "patterns": [
                { "include": "#quotedvalue" },
                { "include": "$self" }
              ]
            }
          ]
        }
      ]
    }
  ],
  "repository": {
    "quotedvalue": {
      "begin": "\"",
      "end": "\""
    }
  },
  "uuid": "ca03e352-04ef-4340-9a6b-9b99aae1c418"
}

I implemented multiline fields support in Sublime Text version of “Rainbow CSV”: https://packagecontrol.io/packages/rainbow_csv Still no solution in sight for VSCode.

Based on the duplicate issues being opened, there really is a practical need to have this feature. From my personal perspective, I’d prefer to have the special mode on by default (in our environment, we write and read CSV files programmatically with csv libraries, and in all of them multi-line values are treated as valid cells), but I’d be happy to manually switch to “the multi-line mode” explicitly, if necessary.

Sorry for the late response.

Yes, working with CSVs containing commas and newlines inside of a cell value is a pain…

It does mean that CSV is not context free and the file line count doesn’t equal the row count. It’s also the kind of place where Rainbow CSV would be the most helpful. If my CSVs had no newlines or commas in records I wouldn’t have needed Rainbow CSV. It’s these more complex files where I really wish it supported quoted values. Consider a setting?

It’s a valid stance to say CSV has no definitive standard and you chose to not handle quoted values at all, but I believe that’s less useful in practice. I would assume RFC 4180-ish parsing would be the more common expectation, but maybe I’m wrong?

Yes RFC 4180’s choice of line endings is bizarre and a sad concession for Excel compatibility. I believe any sane parser would handle any style of line endings. Ours does. Most open-source parsers that I have looked at do.

I’ve tried a number of popular CSV editors including Excel, Google Sheets, Libre Office, CSVPad, and various less popular editors. All of them allow values containing commas and newlines inside of a cell and quote cells using these special characters.

As is Rainbow CSV’s value is a bit limited for anyone working with CSVs containing arbitrary user data emitted by any popular editor. My use case is translation tables which often have text containing commas and newlines. We can’t forbid commas, so we follow RFC 4180 with exception of the line endings. I’m editing parser test suite data by hand and generally avoiding opening a more heavyweight editor for small table tweaks. Editing a non-trivial RFC 4180-ish CSV by hand with no visual assist can be challenging. Rainbow CSV would be all I’d need.

Comments on your numbered points:

  1. Many “programmers” can’t even handle fizzbuzz. I’m not sure if most programmers making a simplistic assumption is an argument for not supporting it? As tool/library writers our job is to think about those things for the people who won’t. For any use case with user data you can’t just say “no commas allowed.” and in that case line.split(',') falls apart.

For all the rest of your points… Those are all real problems. Ignoring them is a an option but I’m not sure if it’s the most appropriate one? They might just be intrinsic to the format…

@JanisE Good question! Yes, it may seem that many other grammars in VSCode don’t have any problems with multiline strings. But this is because they are using begin/end matching which Rainbow CSV can’t use. The 2 types of matching are described here: https://macromates.com/manual/en/language_grammars#language_rules Rainbow CSV has to rely only on the first matching type which is provided in a single match expression. We can’t use begin/end matching because CSV-s are non-recursive.

RFC 4180 is not a “standard”. And according to RFC 4180 the only legal line separator in csv files is CRLF, so if we were to follow the letter of that memo we would have to stop highlighting half of the files because they have wrong LF separators. So we can’t follow the letter of the memo, but maybe we should follow the spirit of it instead? Actually Rainbow CSV already almost does it because it correctly handles escaped separators and double quotes inside quoted fields when there is no line break inside that field. The only discrepancy is when there is a record separator inside a field, as you’ve noticed. Now I have some arguments why allowing linebreaks inside a field in csv is a terrible idea:

  1. Most people do not know that some fields in csv can contain linebreaks. Actually some significant portion of programmers would even think that parsing csv as fields = line.split(',') is a good idea
  2. csv with linebreak is not context free anymore. If we have one line with unbalanced double quote it would break the whole file, whereas in many applications where csv’s are used it is acceptable to just throw away some corrupted lines.
  3. suppose we have a very large file, and we opened it first at the beginning and then in the middle, and we see something like this:
Name,Age,Comment
Jane,29,Nothing special
...
John,30,Nothing special
Mary,25,Drives a red car
John,31,Nothing special
Adam,25,Likes cats
Nancy,19,Nothing special
Tom,40,Nothing special
...

Now if we allow line breaks in fields we can’t tell whether or not “Adam” is in the database, because somewhere at the beginning of the file we could have something like this:

Name,Age,Comment
Jane,29,Nothing special
...
Jack,30,Likes to play football
Sarrah,40,"Has a long list of imaginary friends:
...
John,30,Nothing special
Mary,25,Drives a red car
John,31,Nothing special
Adam,25,Likes cats
Nancy,19,Nothing special
Tom,40,Nothing special
...
Jack,30,Suspects something"
Mike,25,Doesn't like cats
...
Mike,30,Nothing special
  1. grep, ‘wc -l’, head, tail would fail on csv files with linebreaks.
  2. If we would implement a correct VS Code grammar to handle linebreaks in fields, how it would handle a case when a double quote is inserted by mistake into a large csv file without any double quotes? Probably there are some mechanisms that would prevent a spectacular fail, but anyway it could be funny.

So currently my opinion is that CSV files with linebreaks is not practical format, and because of that Rainbow CSV should treat lines with unbalanced double quotes as an error and should not try to find a matching double quote somewhere below.