fluent-bit: Multiple problems parsing, escaping and encoding JSON
I’d like to preface all the following with a note to the developers: Thank you for your work on fluent-bit, and please take the following as an attempt to help others / constructive criticism. These issues have been hounding me for a while and apologies if some of that frustration peeks through in my language. 🙂
I was investigating various problems with JSON parsing in fluent-bit. There’s a lot of other Issues about this, but I felt it’d be useful to get a minimal reproduction, against latest master, and with all the issues I found in one place.
Issues I believe are related, at least based on a quick scan: #278, #580, #615, #617, #673, #831, #907, #965, #1021, #1044, #1114, #1130, #1170
If you came here looking for a quick fix
As far as I can tell, there’s currently no way to configure fluent-bit to correctly parse a JSON string value.
You can get most of the way there with a config that applies the escaped_utf8 decoder followed by the escaped decoder. This will work for everything except strings that contain literal backslashes. See below for detail.
Alternately, if you want to parse nested JSON, you should apply the json decoder only. Note that any string values contained in this nested JSON will then ALSO need to be decoded as above.
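For example, assuming the field to decode is called MY_FIELD (the name is purely illustrative, as in the example further below), such a parser would look roughly like this:
[PARSER]
    Name   test
    Format json
    Decode_Field_As escaped_utf8 MY_FIELD do_next
    Decode_Field_As escaped      MY_FIELD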
My setup
All the following is based on commit 00dd804420e1c5aff2166fd32e8306ff4e457f28, which is master at time of writing.
Here is my config file. I am using the tail input and file output for testing purposes and to keep the tests free of other distractions.
[SERVICE]
Log_Level debug
Parsers_File parsers.conf
Daemon off
[INPUT]
Name tail
Tag test
Path test.in
Parser test
[OUTPUT]
Name file
Path test.out
Format out_file
and parsers.conf:
[PARSER]
Name test
Format json
# We will be adding decoders here
Parsing JSON
The crux of the whole problem is with how fluent-bit parses JSON values that contain strings.
This is done by flb_pack_json(), which converts the incoming buffer to a list of tokens using the jsmn library. jsmn returns tokens corresponding to objects, arrays, strings and primitives. These tokens are then handled by tokens_to_msgpack, which converts this list of tokens to a msgpack-encoded string.
The issue lies in the JSMN_STRING case:
flen = (t->end - t->start);
switch (t->type) {
...
case JSMN_STRING:
msgpack_pack_str(&pck, flen);
msgpack_pack_str_body(&pck, js + t->start, flen);
break;
The string tokens returned by jsmn point to the section of the buffer that is between the opening and closing quotes of the string. For example, in the JSON buffer {"foo": "two\nlines"}, the token corresponding to the "two\nlines" value points to:
{"foo": "two\nlines"}
         ^start    ^end
Note how this substring still contains the escaped newline as a backslash followed by an n, not as an unescaped actual newline character.
fluent-bit does no further processing on this value, but copies it into the new msgpack value verbatim. The result is a msgpack value which contains the literal characters 't', 'w', 'o', '\', 'n', 'l', 'i', 'n', 'e', 's'.
Hence, if no other decoding is done, this string would be converted back into JSON as {"foo": "two\\nlines"}, as the encoder will see the backslash and escape it.
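To make the round trip concrete (omitting the tag/timestamp framing that the file output adds around each record):
test.in:             {"foo": "two\nlines"}
re-encoded as JSON:  {"foo": "two\\nlines"}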
Now, I would argue that the program is already badly behaved because its JSON parser does not, in fact, correctly parse JSON. However, the documentation (somewhat obliquely) indicates that you are meant to use a decoder to rectify this situation.
If you know in advance exactly what fields you need to apply this to, then you can attempt to use decoders to improve this. However, I’d like to point out that at this point we are already no longer able to process arbitrary JSON objects, since there’s no way to apply a decoder to “all string fields” without hard-coding that set of fields in your config file.
Let’s now consider the behaviour if you attempt to use various decoders on this string value.
Option 1: Use the escaped decoder
The escaped decoder will convert the escapes \n, \a, \b, \t, \v, \f and \r to their respective standard interpretations (eg. \n becomes a newline). Any other escaped characters (eg. \", \\ or \w) will be converted to the escaped character, eg. ", \ and w respectively.
This is problematic for un-escaping a JSON string because it lacks support for the \uXXXX escape, converting the \u to a u. For example, {"foo": "Pok\u00e9mon"}, when decoding the foo key in this way, becomes {"foo": "Poku00e9mon"}.
It also happens to handle escapes that JSON does not support, such as \v, but this is fine since any valid JSON input won’t have those escapes, and jsmn checks for validity when parsing.
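A minimal rule for this option, using the foo field name from the examples above, would be:
Decode_Field_As escaped foo
With this rule the two\nlines example decodes correctly, but {"foo": "Pok\u00e9mon"} becomes {"foo": "Poku00e9mon"} as described above.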
Option 2: Use the escaped_utf8 decoder
This decoder will successfully convert \uXXXX escapes. However, this line:
if (*in_buf == '\\' && !is_json_escape(next)) {
means that any other JSON escapes will be ignored.
In addition, this decoder has a somewhat subtle bug: it doesn’t handle \\ correctly.
The code loops through each character, checking if it is a backslash and the next char is not in the list of chars that would make it a JSON escape. That list includes \\, so it treats the first backslash as not being part of any escape, then moves on to the next character. The next character, of course, is another backslash. This also gets interpreted as an escape.
So, for example, "\\v" will be interpreted as \, then \v, which gets converted to a vertical tab, and the output will be \<VT>.
Amusingly, if \\ is the last thing in the string, then the final backslash is instead joined with the terminating nul character of the string, ie. \, \, NUL becomes \, NUL. The nul terminator is skipped over, and one byte past the end of the buffer is dereferenced before the length check kicks in and halts the loop.
Turning this minor buffer overflow into a segfault or worse is left as an exercise for the reader.
Since fluent-bit keeps string lengths alongside all its strings, the fact that we have an extra NUL terminator doesn’t seem to affect things too much. It shows up in the output, eg. {"foo": "\\\u0000"}.
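The corresponding rule for this option would be:
Decode_Field_As escaped_utf8 foo
This handles the Pok\u00e9mon example correctly, but leaves the other JSON escapes such as \n untouched and mangles \\ as described above.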
Option 3: Both
If you use a config like this:
Decode_Field_As escaped_utf8 MY_FIELD do_next
Decode_Field_As escaped MY_FIELD
then all escapes will be decoded. You will still have to deal with the escaped-backslash issues above, further filtered through the second layer of escaping, for example \\h -> h, \\u002f -> /.
This is the least broken option I’ve been able to find.
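As a concrete illustration of the remaining backslash problem under this combined config:
test.in:            {"foo": "\\u002f"}
correct decoding:   the six characters \u002f
actual result:      /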
The ‘json’ decoder (nested json)
For fields which contain strings intended to be parsed as JSON, eg. {"log": "{\"inner\": \"foo\"}"}, the json decoder is provided.
One important aspect of this decoder (and not documented anywhere as far as I can tell) is that it first does an escaped decode on the string before attempting to parse it.
ie. Continuing the above example of {"log": "{\"inner\": \"foo\"}"}, the log field is initially parsed as {\"inner\": \"foo\"}, then the json decoder unescapes it to {"inner": "foo"}, which is then parsed as JSON.
However, note that as discussed above, the escaped decoder does not correctly parse unicode escapes. This is not generally an issue, since there are no characters in valid JSON which would need to be escaped using a unicode escape (\ is escaped as \\, " is escaped as \", etc).
But there are valid (albeit pathological) JSON documents which will therefore fail to parse, as any character may be replaced with a unicode-escaped version even if it’s otherwise valid. For example, if the same example as above was written as {"log": "{\u0022inner\u0022: \u0022foo\u0022}"}, the escaped decoder would result in the string {u0022inneru0022: u0022foou0022}, which will then fail to parse as JSON.
You can mitigate this by passing the field through the escaped_utf8 decoder first, but that then exposes you to the \\ bugs noted above, which is likely worse since any escaped characters inside the nested JSON will cause \\s to appear in the string. For example, under this scheme {"log": "{\"ctcp\": \"\\u0001PING\\u0001\"}"} will first be decoded by the escaped_utf8 decoder to {\"ctcp\": \"\<SOH>PING\<SOH>\"}, then by the escaped decoder to {"ctcp": "<SOH>PING<SOH>"} (where in both cases <SOH> indicates a literal 0x01 character). This is not valid JSON due to the unescaped control code and will be rejected by the JSON decoder.
Finally, it should be noted that the implementation currently also has a check that the first character of the JSON string is a { or [:
/* It must be a map or array */
if (dec->buffer[0] != '{' && dec->buffer[0] != '[') {
return -1;
}
Hence any JSON document with leading whitespace will not be parsed despite being a valid object.
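For example, this record fails that check purely because of the leading space inside the nested document, even though the nested value is otherwise valid JSON:
{"log": " {\"inner\": \"foo\"}"}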
Note that any fields parsed into JSON using this decoder will themselves need to have the escaped_utf8 and escaped decoders applied to them in order for escapes in them to be processed.
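Putting it together, a sketch of a parser using the json decoder for the nested example above (the log field name comes from that example):
[PARSER]
    Name   test
    Format json
    Decode_Field_As json log
As the note above says, any string values inside the parsed nested JSON would then also need the escaped_utf8/escaped treatment, assuming the decoder rules can be pointed at those nested fields at all.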
Output encoding issues
In addition to all the above issues with parsing JSON fields, there are also issues with outputting data in JSON format.
The code which encodes a record to JSON uses non-JSON escapes \a and \v. For example, the input {"vtab": "\u000b"}, properly decoded, should produce the same output, but instead produces {"vtab": "\v"}, which is not valid JSON.
The code is also not capable of outputting non-BMP unicode characters. This could arguably be considered a parsing bug, not an encoding bug.
Non-BMP unicode chars are encoded in JSON as a surrogate pair in \u notation (RFC 4627). For example, 🙂 (U+1F642 SLIGHTLY SMILING FACE) is encoded in JSON as \ud83d\ude42.
Fluent-bit does not handle this case specially, instead (assuming you’ve used the escaped_utf8 decoder) decoding the two code points individually and storing them internally in utf-8. Surrogate pairs are not valid in utf-8 strings, and when these characters are attempted to be encoded, the implementation (arguably correctly) reports [pack] invalid UTF-8 bytes, skipping and drops those characters.
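For reference, here is a minimal standalone sketch (not fluent-bit code) of how a surrogate pair from \u escapes combines into a single code point; an encoder fix would need the reverse step when emitting JSON:
#include <stdint.h>
#include <stdio.h>

/* Combine a UTF-16 surrogate pair (as produced by JSON \uXXXX escapes)
 * into a single Unicode code point, per RFC 4627 / UTF-16 rules. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

int main(void)
{
    /* \ud83d\ude42 should yield U+1F642 SLIGHTLY SMILING FACE */
    printf("U+%04X\n", combine_surrogates(0xD83D, 0xDE42));
    return 0;
}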
Suggested fixes
The following are obviously bugs (ie. not intended behaviour) and should be fixed:
- The behaviour of \\ in escaped_utf8
- The use of non-JSON escapes (\a, \v) in JSON output
- The inability for nested JSON fields to begin with whitespace
After that, things get trickier. I’m not sure why escaped_utf8 does not parse JSON escapes (and in fact this directly contradicts the examples in its documentation), but it appears to be intentional. I suggest fixing this to bring it in line with its documentation, so that users need only decode with escaped_utf8 in order to get correct JSON string decoding.
As part of this change, the json decoder should also use this decoding step instead of escaped, so it can correctly interpret unicode-escaped characters. Decoding surrogate pairs during escaped_utf8 decoding (and then in turn encoding them back into surrogate pairs when encoding as JSON) should also be done.
However, none of this deals with the case where the user cannot hard-code in their config file the full list of fields that need to be decoded. Not to mention that the current scheme of splitting the full JSON parsing process over two independent steps (JSON parser + escaped_utf8 decoder) is extremely confusing to users.
Unfortunately at this point fixing that behaviour would introduce breaking changes.
I suggest either:
- Deprecate the existing JSON parser and introduce a new parser which correctly decodes escapes when parsing the JSON (ie. in tokens_to_msgpack()). Call it json2 or json_full or json_decoded or something. A standalone sketch of such decoding appears after this list.
- Add a flag that affects the behaviour of the JSON parser.
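To make the first option concrete, here is a small standalone sketch (not fluent-bit code; the function name and buffer handling are illustrative only) of the kind of decoding that could run on each JSMN_STRING token before packing it. For brevity it handles \uXXXX for BMP characters only, so the surrogate-pair step shown earlier would still be needed on top of it:
#include <stdio.h>
#include <string.h>

/* Parse four hex digits; returns the value or -1 on malformed input. */
static int hex4(const char *p)
{
    int v = 0;
    for (int i = 0; i < 4; i++) {
        char c = p[i];
        v <<= 4;
        if (c >= '0' && c <= '9')      v |= c - '0';
        else if (c >= 'a' && c <= 'f') v |= c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') v |= c - 'A' + 10;
        else return -1;
    }
    return v;
}

/* Decode 'len' bytes of a raw JSON string token (the bytes between the
 * quotes) from 'in' into 'out'. Returns bytes written, or -1 on error. */
static int json_unescape(const char *in, int len, char *out)
{
    int o = 0;
    for (int i = 0; i < len; i++) {
        if (in[i] != '\\') { out[o++] = in[i]; continue; }
        if (++i >= len) return -1;
        switch (in[i]) {
        case '"':  out[o++] = '"';  break;
        case '\\': out[o++] = '\\'; break;
        case '/':  out[o++] = '/';  break;
        case 'b':  out[o++] = '\b'; break;
        case 'f':  out[o++] = '\f'; break;
        case 'n':  out[o++] = '\n'; break;
        case 'r':  out[o++] = '\r'; break;
        case 't':  out[o++] = '\t'; break;
        case 'u': {
            if (i + 4 >= len) return -1;
            int cp = hex4(in + i + 1);
            if (cp < 0) return -1;
            i += 4;
            /* Encode the code point as UTF-8 (BMP only in this sketch). */
            if (cp < 0x80) {
                out[o++] = (char)cp;
            } else if (cp < 0x800) {
                out[o++] = (char)(0xC0 | (cp >> 6));
                out[o++] = (char)(0x80 | (cp & 0x3F));
            } else {
                out[o++] = (char)(0xE0 | (cp >> 12));
                out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (char)(0x80 | (cp & 0x3F));
            }
            break;
        }
        default: return -1;
        }
    }
    return o;
}

int main(void)
{
    const char *tok = "Pok\\u00e9mon, two\\nlines";
    char buf[64];
    int n = json_unescape(tok, (int)strlen(tok), buf);
    fwrite(buf, 1, n, stdout);   /* prints "Pokémon, two" then "lines" on a new line */
    putchar('\n');
    return 0;
}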
Another option would be to add a new form of decoder specification which applies to all fields, eg. by allowing a wildcard for field name or via a new directive. This would allow all fields to be decoded after being parsed, but will break in cases (such as when parsing nested JSON) where it is unknown which fields are “new” and need decoding and which are not (and double-decoding is a bug, as it will eg. remove literal backslashes or turn \\n into a newline instead of \n).
In any case, the documentation should be updated to very clearly instruct the user on how to configure the parser to correctly parse all valid JSON objects.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 46
- Comments: 83 (45 by maintainers)
Commits related to this issue
- unescape: utf8: unescape special characters too (#1278) The current implementation only decoded/unescaped \u characters, this patch extend it support to: \", \\, \/, \n, \a, \b, \t, \v, \f and \r. S... — committed to fluent/fluent-bit by edsiper 5 years ago
- pack: when packaging strings, unescape them first (#1278) When converting from JSON to MessagePack, a string was packed without any processing leading to many issues. This has been the root cause of ... — committed to fluent/fluent-bit by edsiper 5 years ago
- parser: decoder: json: do not unescape and skip empty spaces (#1278) Due to recent improvements in the JSON -> Msgpack encoder, it's not longer necessary to run a new 'unescape' routine since JSON st... — committed to fluent/fluent-bit by edsiper 5 years ago
- filter_kubernetes: fix unescaping and pack on all cases (#1278) When Merge_Log option is set, Filter Kubernetes aims to work with or without the help of a previous parser/decoded step in the original... — committed to fluent/fluent-bit by edsiper 5 years ago
- conf: parsers: update decoding rule for 'docker' (#1278) The new improvements on JSON->Msgpack encoding helps to reduce the work on custom decoders used before. From now using a simple 'json' decoder... — committed to fluent/fluent-bit by edsiper 5 years ago
- tests: runtime: filter_kubernetes: adjust test cases for expected results (#1278) Some test cases were expecting some extra backslashes that are coming from a previous decoding/encoding issues in Flu... — committed to fluent/fluent-bit by edsiper 5 years ago
- unescape: utf-8: do not skip byte if character cannot be encoded (#1278) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 5 years ago
- filter_kubernetes: do not unescape and fix handling of a merged map (#1278) The filter used to unescape a log value before trying to pack it as a JSON->Msgpack map, this is not necessary and avoid so... — committed to fluent/fluent-bit by edsiper 5 years ago
- utils: fix JSON encoding for special characters (#1278 #1415) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 5 years ago
- tests: internal: pack: add unit test for problems reported on #1278 #1415 Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 5 years ago
All,
first of all, thanks to @ekimekim for taking the time to report the issue, troubleshoot, provide insights and suggest the proper fixes. This ticket was a very valuable contribution, thanks!
I would say “most” of the problems were addressed; the remaining encoding issue is the handling of surrogates (re: emojis and other chars). That implementation will be handled separately and is already on the roadmap.
The fixes will be part of the Fluent Bit v1.2 release that will be available during the next week. At this point, and after 11k pulls of the test image, I am closing this ticket as fixed.
Thanks again to everyone who helped out providing feedback and testing.
FYI: working on this.
I am making some progress here. Here are the results of a simple modification where only a ‘json’ parser, without decoders, does the work.
stdout Msgpack output:
stdout JSON output:
I have not tested “stringified” JSONs yet… work in progress.
Update: we are ready to test out the new changes
Please use the following new Docker image:
To make sure it works properly your Docker parser must look like this:
Note: it’s fine to skip the decode_field_as line too, since the kubernetes filter will do the parsing if it has not been done before (if the Merge_Log option is set).
If you want to see the recent changes in the code base refer to the following link:
https://github.com/fluent/fluent-bit/commits/issue-1278
Please test it and share some feedback!
Are nested escaped messages covered by the above too? Our Kubernetes log is structured like:
i.e. strings in the RenderedMessage property value are always quoted and so escaped.
@rafaelmagu please open a new ticket and share your configmap
@rafaelmagu as stated above you should not use decoders
@nicosto
In that case, the tcp input plugin does not support configurable parsing; it accepts only json or msgpack format. To work around the missing parsing of your data you can use Filter Parser, e.g:
my parsers.conf looks like:
Final output printed with jq is:
Yeah, we should add the parsing option to in_tcp as an enhancement request, but as of now the data processing looks good.
@gitfool Your app JSON output becomes the following content in the Docker log:
Processing with Fluent Bit
1. msgpack output:
2. JSON output:
It looks good to me; please confirm whether this is the expected output on your side.