json: Buggy support for binary string data

I wanted to check whether I could use this library to store binary data without first encoding it to base64, so I wrote the following simple function:

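#include <iostream>
#include <string>
#include "json.hpp"  // nlohmann/json single header; adjust the path to your setup

using json = nlohmann::json;

static void TestBinDump()
{
    // Six raw bytes: 'A', 0x13, 'B', 0x8B, 0x00, 'C'
    const char* data = "A\x13""B\x8B\0C";
    std::string s(data, data + 6);

    json js;
    js["key"] = s;

    // Serialize with 4-space indentation and print the original size
    auto dump = js.dump(4);
    std::cout << s.size() << " " << dump << std::endl;

    // Parse the dump back; this is the call that throws
    auto jd = json::parse(dump.begin(), dump.end());
    std::string v = jd.at("key");
    std::cout << v.size();
}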

This correctly outputs 6 { "key": "A\u0013Bï\u0000C" }. The first surprise is that the byte data has been encoded as 16-bit Unicode escapes, so I wanted to see whether the “8-bit” Unicode characters would get translated back to 8-bit values. I didn’t get that far, because parsing threw an invalid_argument exception with the message "parse error - unexpected '\"'".

I’m using version 2.1.1.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

As you quote, a JSON string is a sequence of “Unicode code points” – but the technical term “code point” has NOTHING to do with bits and bytes. Bits and bytes happen once you start talking about encodings, like UTF-8 or UTF-16.

For example, if you have the “Unicode code point U+0058” (the letter X), then we still have no idea what that looks like in memory. It depends on what encoding you attach to your in-memory string. It could look like “58” (UTF-8), “0058” (UTF-16BE), or “5800” (UTF-16LE), etc.
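To make the code point versus encoding distinction concrete, here is a minimal sketch that turns a single code point into bytes under different encodings. The names encode_utf8, encode_utf16 and to_hex are made up for this illustration and are not part of the nlohmann library; the sketch also assumes the input is a valid Unicode scalar value and does no error checking.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Encode one code point as UTF-8 (1 to 4 bytes).
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
Bytes encode_utf8(std::uint32_t cp) {
    Bytes out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Encode one code point as UTF-16, in big- or little-endian byte order.
Bytes encode_utf16(std::uint32_t cp, bool big_endian) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Code points above U+FFFF become a surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    Bytes out;
    for (std::uint16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        out.push_back(big_endian ? hi : lo);
        out.push_back(big_endian ? lo : hi);
    }
    return out;
}

// Render bytes as lowercase hex, two digits per byte.
std::string to_hex(const Bytes& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string s;
    for (std::uint8_t b : bytes) {
        s += digits[b >> 4];
        s += digits[b & 0x0F];
    }
    return s;
}
```

With these helpers, to_hex(encode_utf8(0x58)) gives "58", to_hex(encode_utf16(0x58, true)) gives "0058", and to_hex(encode_utf16(0x58, false)) gives "5800". The code point U+008B, by contrast, takes two bytes in UTF-8: "c28b".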

JSON cannot encode arbitrary binary, just sequences of code points – which, again, has nothing to do with binary representations 😃

The nlohmann library has chosen that strings be encoded in UTF-8 (see the README); its author could have chosen any of the Unicode encoding formats (or supported them all!), but he had to pick one of them, and UTF-8 is a great one (see http://utf8everywhere.org/ for a great read on Unicode and encodings; it’s a fascinating topic).

Your input is not valid UTF-8 (the byte 0x8B is a continuation byte with no lead byte before it), and therefore it should have been rejected by the encoder.
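As a rough illustration of why that input fails, here is a sketch of a standalone UTF-8 well-formedness check. The function name is_valid_utf8 is hypothetical and this is not how the library itself validates anything; it is only meant to show the rules a byte sequence has to satisfy.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Return true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes
    // (index 0 is unused); anything lower is an overlong encoding.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t b = static_cast<std::uint8_t>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;  // stray continuation byte (0x80..0xBF) or 0xF8..0xFF

        if (len > n - i) return false;  // sequence is cut off at the end

        for (std::size_t k = 1; k < len; ++k) {
            const std::uint8_t c = static_cast<std::uint8_t>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[len]) return false;                // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                   // beyond Unicode
        i += len;
    }
    return true;
}
```

Run on the six bytes from the report (41 13 42 8B 00 43), the check fails at 0x8B. If the intent was the code point U+008B, the UTF-8 form would be the two bytes C2 8B, and the resulting seven-byte string passes.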

And, in case you still don’t believe me, do a quick search for “JSON encoding binary” and you’ll see lots of people trying to figure out good ways to do this – which should be a hint that it’s not natively supported by JSON. (For example, this popular StackOverflow question: http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64)
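For completeness, the usual workaround is to turn the bytes into plain ASCII before handing them to the JSON library. Below is a minimal sketch of a standard-alphabet base64 encoder and decoder; base64_encode and base64_decode are hypothetical helper names written for this example, not functions provided by the library, and the decoder simply stops at the first '=' padding character.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Widen a char to an unsigned 32-bit value without sign extension.
static std::uint32_t byte_at(const std::string& s, std::size_t i) {
    return static_cast<std::uint8_t>(s[i]);
}

// Encode arbitrary bytes as base64 (standard alphabet, '=' padding).
std::string base64_encode(const std::string& in) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((in.size() + 2) / 3 * 4);
    std::size_t i = 0;
    // Full groups: 3 input bytes become 4 output characters.
    for (; i + 2 < in.size(); i += 3) {
        const std::uint32_t v =
            (byte_at(in, i) << 16) | (byte_at(in, i + 1) << 8) | byte_at(in, i + 2);
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63];
        out += tbl[v & 63];
    }
    // Trailing 1 or 2 bytes are padded with '='.
    const std::size_t rem = in.size() - i;
    if (rem == 1) {
        const std::uint32_t v = byte_at(in, i) << 16;
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += "==";
    } else if (rem == 2) {
        const std::uint32_t v = (byte_at(in, i) << 16) | (byte_at(in, i + 1) << 8);
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63];
        out += '=';
    }
    return out;
}

// Decode base64 back to bytes; throws on characters outside the alphabet.
std::string base64_decode(const std::string& in) {
    std::string out;
    std::uint32_t acc = 0;
    int bits = 0;
    for (char ch : in) {
        std::uint32_t v;
        if (ch >= 'A' && ch <= 'Z')      v = static_cast<std::uint32_t>(ch - 'A');
        else if (ch >= 'a' && ch <= 'z') v = static_cast<std::uint32_t>(ch - 'a' + 26);
        else if (ch >= '0' && ch <= '9') v = static_cast<std::uint32_t>(ch - '0' + 52);
        else if (ch == '+')              v = 62;
        else if (ch == '/')              v = 63;
        else if (ch == '=')              break;  // padding: nothing more to read
        else throw std::invalid_argument("character outside the base64 alphabet");
        acc = (acc << 6) | v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += static_cast<char>((acc >> bits) & 0xFF);
        }
    }
    return out;
}
```

For the six bytes in the report this yields "QRNCiwBD", which is plain ASCII and therefore valid UTF-8, so something like js["key"] = base64_encode(s) should round-trip through dump() and parse(); the receiver then calls base64_decode on the string to get the original bytes back. The cost is the roughly 33% size overhead that the StackOverflow question above is trying to avoid.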