json: Soften the landing when dumping non-UTF8 strings (type_error.316 exception)

When dumping a json object that contains an std::string, the type_error.316 exception is thrown if the string contains invalid UTF-8 sequences. While this sounds like a reasonable thing to do, it actually causes more problems than it fixes. In every case where user-entered (unsafe) strings end up stored in nlohmann::json objects, the developer now has to “do something” before assigning a string value to a nlohmann::json variable. This effectively renders the whole string-assignment functionality unsafe and defeats its purpose.
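
For reference, a minimal sketch of the behavior described above (using the same invalid byte sequence that appears later in this report); the assignment is accepted, and only the later dump throws:

nlohmann::json j;
j["name"] = "\xF0\xA4\xAD\xC0"; // truncated/invalid UTF-8: accepted without complaint
std::string s = j.dump();       // throws nlohmann::json::type_error (error 316) here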

Below is the wrapper code I had to write in order to investigate the random crashes my application was suffering due to the 316 exception.

// Declaration (a const member function of my MANAGER class); __builtin_FILE() and
// __builtin_LINE() are GCC/Clang built-ins that capture the caller's location
// through the default arguments:
std::string dump_json(const nlohmann::json &json, const int indent = -1, const char* file = __builtin_FILE(), int line = __builtin_LINE()) const;

// Definition
std::string MANAGER::dump_json(const nlohmann::json &json, const int indent, const char* file, int line) const {
    std::string result;
    try {
        result = json.dump(indent);
    }
    catch (nlohmann::json::type_error& e) {
        bug("%s: %s (%s:%d)", __FUNCTION__, e.what(), file, line);
    }
    return result;
}
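
A call site then looks roughly like this (the manager object and the log-event contents are placeholders for illustration only):

nlohmann::json log_event;
log_event["message"] = user_entered_text;           // may contain invalid UTF-8
std::string payload = manager.dump_json(log_event); // never throws; this file/line is logged on failure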

This led me to discover the code in my application that was sending JSON-formatted log events to the log server. The log event has to store user-entered data, and I would expect the dump function to deal with invalid UTF-8 sequences.

If I have to use my json dump wrapper everywhere in my application code, then of what use is the original dump function to begin with? What is more, I’d actually have to extend the wrapper to iterate over all strings stored in the json object and do something about possible invalid UTF-8 sequences. Not very convenient.

Therefore, I would propose that the default behavior of the dump function be to simply ignore (skip or replace) invalid UTF-8 sequences and NOT throw. Or perhaps add a nothrow parameter to the string assignment operator =.

If this is left as it is, I probably won’t be the last developer to assume that the dump method is safe to use. After all, if the lib allows me to assign a value to the json object, then how can it also let me assign values that later invalidate the whole json? This is illogical.

// nothrow assignment would look like this
nlohmann::json j_nothrow = (std::nothrow) "\xF0\xA4\xAD\xC0"; // Does not throw.
// I am not 100% sure if this can be done, but since the `new` operator can have such
// a parameter I would suppose the assignment operator = could also be customized
// like that.

std::string dumped = j_nothrow.dump(); // Never throws, just works.

nlohmann::json j_throw;
try {
    j_throw = "\xF0\xA4\xAD\xC0"; // Throws immediately.
}
catch (nlohmann::json::type_error& e) {
    // handle the invalid UTF8 string
}

One more thing: why not include the file and line where the exception originated in the exception description? You can see in my wrapper function above how I use __builtin_LINE() and __builtin_FILE() to capture the file and line from which the wrapper was called. It is super useful when debugging, and the exception description is all about debugging.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 67 (32 by maintainers)

Most upvoted comments

I think it is out of scope of the library to fix invalid UTF-8. I further do not think that silently fixing invalid inputs makes the world a better place; rather, such inputs should be rejected with as much noise and as early as possible.

@fawdlstty This library only supports UTF-8, sorry.

I opened a work-in-progress branch (https://github.com/nlohmann/json/compare/feature/codec_errors) for the idea sketched in https://github.com/nlohmann/json/issues/1198#issuecomment-428508453. Feedback is more than welcome! The branch currently only handles serialization.

Couldn’t you just create a function “sanitize_unicode_with_replacement_char(json&)” that recursively crawls the json structure to fix up any invalid strings the way you like? You could do this immediately after importing the structure (ie, reading the file) or right before exporting it (ie, dumping to file) in order to implement it at the right layer for your purpose.

The issue I have with doing anything by default is that someone is going to want something different. So, we’d have to create a policy-based mechanism, which adds a great amount of complexity to the implementation and to the API.

The benefit of the current status quo is that it’s “lossless” and permissive in the sense that it allows you to implement any of the above behaviors by doing some extra work on top of what the library does.
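
A hypothetical sketch of such a recursive sanitizer; sanitize_string is an assumed helper (not part of the library) that replaces invalid sequences, e.g. with U+FFFD:

#include <nlohmann/json.hpp>
#include <string>

std::string sanitize_string(std::string s); // assumed helper that repairs invalid UTF-8

void sanitize_unicode_with_replacement_char(nlohmann::json& j) {
    if (j.is_string()) {
        j = sanitize_string(j.get<std::string>());
    } else if (j.is_object() || j.is_array()) {
        for (auto& element : j) { // iterates the values of objects and arrays
            sanitize_unicode_with_replacement_char(element);
        }
    }
    // numbers, booleans and null are left untouched
}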

Let me try to summarize the options we have in this situation. Not all of them have been mentioned, but I am trying to get a picture of all the possible ways to proceed:

  • Assignment of invalid Unicode:

    • Option 1: The library should “fix” invalid Unicode input (e.g. by removing invalid parts) and not throw.
    • Option 2: The library should throw an exception.
    • Option 3: The library should just assign the string (this is the status quo).
  • Serialization of previously stored invalid Unicode:

    • Option 1: This should be made impossible, e.g. by options 1 or 2 above.
    • Option 2: The library should dump the invalid Unicode as is.
    • Option 3: The library should “fix” the invalid Unicode (e.g. by not dumping invalid parts) and proceed.
    • Option 4: The library should throw an exception (this is the status quo).

Did I miss anything?

@gregmarr you do realize that string assignments in their current form are dangerous? Sure, build a big application using this lib under the assumption that assignments always receive valid UTF-8, and then one day you change some unrelated code and everything breaks because now, all of a sudden, some string may contain invalid bytes. So yes, I know why you are saying that, but do you know what I am saying? It’s easy to repeat your point, but I can see it’s not so easy to listen to what the other person wants to say. And speaking of @nlohmann, he’s the one who was rude to begin with; just because it’s his project doesn’t mean his shit doesn’t stink. These days we could always fork, but is this really the solution you want to propagate? A fork should be the last resort, which will unfortunately happen more often if one of the parties is arrogant enough not to discuss things. And speaking of egomania, who names their lib after themselves? Not only is it long and tedious to write, it’s also pathetic.

@jaredgrubb contrary to @nlohmann’s response, I respect yours, for the reason that you actually bothered to explain. I like your reasoning; @nlohmann should try to learn from it.

You do realize that this is his project, and he’s free to do with it as he wishes, right? For you to come in here and insist that he must “prove” or “admit” or “learn” anything just because you say so is extremely rude.

@1Hyena or you could just do this, which also avoids the exceptions:

std::string invalid_utf8 = "\xF0\xA4\xAD\xC0";
nlohmann::json json = fix_my_invalid_string(invalid_utf8); // This would work and would never throw.

One good thing about handling this situation in the dump function (rather than during assignment) is that it respects the principle of lazy evaluation. One could argue that not all json instances end up getting dumped, so checking for invalid UTF-8 in those would make no difference and would only waste CPU.

I reworked the code and added some test cases. The error handler is now a new optional parameter to the dump function. I am not sure whether I implemented Python’s behavior exactly, so more test cases may be required.
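
For reference, a sketch of how the new optional parameter is used, assuming the signature as it was later released (indent, indent_char, ensure_ascii, error_handler):

nlohmann::json j = "\xF0\xA4\xAD\xC0"; // invalid UTF-8, stored as-is

// j.dump() alone uses error_handler_t::strict and throws type_error.316;
// the other handlers avoid the exception:
std::string replaced = j.dump(-1, ' ', false, nlohmann::json::error_handler_t::replace); // invalid bytes become U+FFFD
std::string ignored  = j.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);  // invalid bytes are dropped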

@jaredgrubb I took the names from Python. You’re right, the public API needs better documentation, but I would like to focus on the implementation first.

It is already possible to get an invalid UTF-8 sequence in strings while decoding a JSON file ("\u0080" is a valid JSON string), so the most consistent option would probably be to replace invalid UTF-8 sequences with \u-escaped sequences instead of throwing an exception while serializing strings. This will result in valid (UTF-8 encoded) JSON files, and invalid UTF-8 (or other binary data) would round-trip. It’s then up to the user to check the strings stored in a JSON value for correct encodings.

Ok, forget this… it’s actually nonsense:

E.g., serializing \x80 would result in \u0080. While deserializing \u0080, the resulting character U+0080 will be encoded in UTF-8 and result in \xC2\x80.
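
In code, the asymmetry looks roughly like this (a sketch; \xC2\x80 is simply the UTF-8 encoding of U+0080):

nlohmann::json j = nlohmann::json::parse("\"\\u0080\""); // the JSON document "\u0080"
std::string s = j.get<std::string>();                    // s == "\xC2\x80", not the single byte "\x80"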

FWIW, I’m very sympathetic. Having the library throw during serialization five bytes from the end is about the worst time to have it happen, and having to manually do these checks means that it’s going to get forgotten/overlooked and lead to a vulnerability.

But, it’s a hard problem without an obvious (let alone universal) answer. It’s made worse by the fact that our STL hasn’t even acknowledged that strings might (should? must!) have an encoding.

I think the problem with the current behavior is that you have to run into it before you know that it causes an exception. In our case (referenced above), our SPARQL engine worked completely fine with a >100 GB knowledge base (Wikidata) only to fail on a smaller one (Freebase), because we were unlucky and cut a UTF-8 sequence at the wrong point, or there was simply a wrong sequence in the data.

So our system worked fine on more than 100 GB of (nice) text input before we hit that exception.

Also, the suggested sanitize-function workaround would mean going over the data and checking for wrong sequences twice. The current code already detects wrong sequences and reacts to them in a very non-benign way. So all I’m arguing is that it should give the user a way to control how that case is handled, and I think allowing the user to provide a function to handle it (maybe even with a default of the current behavior) is easy, elegant and versatile. Especially since checking for UTF-8 correctness, while not very hard, isn’t trivial either, and it’s very natural to have different preferences for how such an error should be handled (replacing, throwing an exception, ignoring, …).

@nlohmann what do you think?

I think it’s perfectly reasonable for this library to assume the strings it gets are valid (i.e., well-formed UTF-8).

Imagine the chaos if programmers claimed that strdup should fix UTF-8 for you, or that printf and fopen and std::regex_match and all other library functions needed to become encoding-aware and must also sanitize their string-inputs better. No, on the contrary, each library function has a precondition that the programmer must meet, and this library says that “strings are valid UTF-8” – which is perfectly reasonable.

Sure, this library is about text parsing and needs to be aligned towards text-processing things, but that doesn’t mean this library should be burdened with a grab-bag of Unicode transcoding functions. There are libraries whose sole purpose is to do exactly that, and they’ll do it far better, far faster, and with far fewer bugs than if this library took those on as a nice-to-have side project.

If the programmer is unsure whether a string may or may not be valid UTF-8 at any given point in the program, the onus is on them to make sure their code translates such strings into “valid” inputs for whatever functions they call (and gregmarr provides a great example of how to do that in his comment).

@nlohmann exactly! Thanks for at least getting my point. I would definitely argue that if invalid UTF-8 is going to throw, then it should throw during assignment, not during dumping.

@sfinktah In case you didn’t know, JSON has no formal meaning for non-UTF-8 byte sequences; it’s as if JSON were color-blind to a certain group of byte sequences. Do color-blind people get thrown an exception every time they look at something that is either red or green? No, that would be silly. Do you suffer a seizure when infrared light gets into your eyes? Think of nlohmann’s json library the same way. Its main purpose is to make the developer’s life easier, not harder. Stripping out invalid UTF-8 sequences is the way to go. I already proposed a std::nothrow version of the assignment, so if you love your exceptions so much, you would have to do nothing, as the proposed solution would be opt-in. I would advise you to rethink your position.

I have slapped together a general-purpose UTF-8 trimming function that you may find useful.

It takes an unsafe std::string by reference and removes all invalid UTF-8 sequences from it in place.

#include <string>
#include <vector>

// Removes every byte that belongs to an invalid UTF-8 sequence from `hairy`, in place.
void trim_utf8(std::string& hairy) {
    std::vector<bool> results;
    std::string smooth;
    size_t len = hairy.size();
    results.reserve(len);
    smooth.reserve(len);
    const unsigned char *bytes = (const unsigned char *) hairy.c_str();

    // Decodes one code point starting at *pos. On error it rolls *pos back so that only
    // the first byte of the bad sequence remains consumed, and returns 0xDC00 + that byte
    // as an error marker.
    auto read_utf8 = [](const unsigned char *bytes, size_t len, size_t *pos) -> unsigned {
        int code_unit1 = 0;
        int code_unit2, code_unit3, code_unit4;

        if (*pos >= len) goto ERROR1;
        code_unit1 = bytes[(*pos)++];

             if (code_unit1 < 0x80) return code_unit1;
        else if (code_unit1 < 0xC2) goto ERROR1; // continuation or overlong 2-byte sequence
        else if (code_unit1 < 0xE0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; //2-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            return (code_unit1 << 6) + code_unit2 - 0x3080;
        }
        else if (code_unit1 < 0xF0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 3-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; // overlong
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
        }
        else if (code_unit1 < 0xF5) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 4-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xF0 && code_unit2 <  0x90) goto ERROR2; // overlong
            if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; // > U+10FFFF
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            if (*pos >= len) goto ERROR3;
            code_unit4 = bytes[(*pos)++];
            if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
            return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
        }
        else goto ERROR1; // > U+10FFFF

        ERROR4:
        (*pos)--;
        ERROR3:
        (*pos)--;
        ERROR2:
        (*pos)--;
        ERROR1:
        return code_unit1 + 0xDC00;
    };

    unsigned c;
    size_t pos = 0;
    size_t pos_before;
    size_t inc = 0;
    bool valid;

    for (;;) {
        pos_before = pos;
        c = read_utf8(bytes, len, &pos);
        inc = pos - pos_before;
        if (!inc) break; // End of string reached.

        valid = false;

        if ( (                 c <= 0x00007F)
        ||   (c >= 0x000080 && c <= 0x0007FF)
        ||   (c >= 0x000800 && c <= 0x000FFF)
        ||   (c >= 0x001000 && c <= 0x00CFFF)
        ||   (c >= 0x00D000 && c <= 0x00D7FF)
        ||   (c >= 0x00E000 && c <= 0x00FFFF)
        ||   (c >= 0x010000 && c <= 0x03FFFF)
        ||   (c >= 0x040000 && c <= 0x0FFFFF)
        ||   (c >= 0x100000 && c <= 0x10FFFF) ) valid = true;

        if (c >= 0xDC00 && c <= 0xDCFF) { // error marker returned by read_utf8
            valid = false;
        }

        do results.push_back(valid); while (--inc);
    }

    size_t sz = results.size();
    for (size_t i = 0; i < sz; ++i) {
        if (results[i]) smooth.append(1, hairy.at(i));
    }

    hairy.swap(smooth);
}
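
For example, running user input through this function before handing it to the library avoids the exception entirely (a sketch):

std::string user_input = "valid text \xF0\xA4\xAD\xC0 more text"; // contains an invalid sequence
trim_utf8(user_input);         // the four invalid bytes are stripped in place
nlohmann::json j = user_input; // safe to assign...
std::string dumped = j.dump(); // ...and to dump; no type_error.316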