spec: handling of datacontenttype is inconsistent

CloudEvents 1.0

Consider this example, straight from the spec:

{
    ...
    "datacontenttype" : "text/xml",
    "data" : "<much wow=\"xml\"/>"
}

Clearly, data is some structure that has been encoded using the XML format and put into the event as a string (binary). Naturally, I’d assume the same behavior for JSON encoding:

{
    ...
    "datacontenttype" : "application/json",
    "data" : "{\"foo\": \"bar\"}"
}

However, that doesn’t seem to be the case; as the example in the HTTP protocol binding spec shows, the JSON object is not sent in its encoded form but rather nested directly into the event:

{
    ...
    "datacontenttype" : "application/json",
    "data" : {
        "foo": "bar
    }
}

Note that removing the optional datacontenttype attribute doesn’t change this, as the spec clearly states:

A JSON-format event with no datacontenttype is exactly equivalent to one with datacontenttype=“application/json”.

To sum it up: it is not possible to put a JSON-encoded data blob into a CloudEvent, and a parser needs to treat application/json differently from any other datacontenttype.
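
To make the asymmetry concrete, here is a rough sketch of the special-casing a producer ends up with (plain JavaScript, not taken from any SDK; setData and its behavior are illustrative assumptions):

// Hypothetical sketch; only the data handling is shown.
function setData(event, payload, contentType) {
  event.datacontenttype = contentType;
  if (contentType === "application/json") {
    // A JSON payload is nested directly into the event as a JSON value ...
    event.data = payload;            // e.g. { foo: "bar" }
  } else {
    // ... while any other payload is carried as an encoded string.
    event.data = String(payload);    // e.g. '<much wow="xml"/>'
  }
  return event;
}

setData({ specversion: "1.0" }, { foo: "bar" }, "application/json");
setData({ specversion: "1.0" }, '<much wow="xml"/>', "text/xml");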

HTTP Protocol Binding 1.0

For structured content mode, the spec says:

The chosen event format defines how all attributes, and data, are represented.

Does this mean that datacontenttype must be present and set to the event format? Or does structured mode implicitly change the default of datacontenttype from application/json to whatever event format is in use? What if datacontenttype is present and set to a different encoding - must a parser treat this event as malformed?

JSON Event Format 1.0

As a side note, the JSON Format spec makes this even more confusing:

If the implementation determines that the type of data is Binary, the value MUST be represented as a JSON string expression containing the Base64 encoded binary value, and use the member name data_base64 to store it inside the JSON object.

This basically says that you have to Base64-encode any plain string value (which is, after all, just binary data). Also, if a receiver does not implement the optional (!) JSON Format spec, it won’t be able to parse the data_base64 member; consequently, implementing the JSON Format spec as a sender effectively means breaking compatibility with receivers that implement only the core CloudEvents spec.
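
To illustrate the consequence, here is a rough sketch of a sender following the quoted rule (plain JavaScript with Node.js Buffer; whether a plain string counts as “Binary” is exactly the unclear part):

// Hypothetical sender-side sketch of the quoted JSON Format rule.
function encodeData(envelope, data) {
  if (data instanceof Uint8Array) {
    // "Binary" data: Base64-encode it and store it under data_base64.
    envelope.data_base64 = Buffer.from(data).toString("base64");
  } else {
    // Everything else is stored under data as a JSON value; but if a plain
    // string is considered binary, it would have to take the branch above instead.
    envelope.data = data;
  }
  return envelope;
}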


Most upvoted comments

Regarding these two comments about data_json, data_text and data_base64: https://github.com/cloudevents/spec/issues/558#issuecomment-873890684 https://github.com/cloudevents/spec/issues/558#issuecomment-876218637

I would recommend handling it the same way it is done for protobuf: only one of the data_* fields would be allowed. This way it is up to the producer to determine whether something is text, binary, or JSON, without any hidden contract or ambiguity for the consumer or intermediary. Unfortunately I don’t see how this could be introduced as a non-breaking change, so we still need a solution for specversion: 1.0. This issue is on the agenda for today’s CloudEvents call. Let’s see if someone comes up with a clever proposal.

I realize this is about solving the issue in spec version 1.0 without introducing a breaking change, but going beyond that, is there any discussion anywhere about the next version that would allow for breaking changes like this?

I’d prefer to see a dataencoding attribute “re”-added with a value of either json, text or base64, and then only a single data attribute to hold the payload. I’m not seeing the benefit of instead defining individual attributes such as the data_json, data_text or data_base64 mentioned above. It sounds like a dataencoding attribute was once part of the spec but was dropped; maybe it needs to be re-introduced. This would remove any “special” case of */json or */*+json for the datacontenttype attribute and clear up the whole confusion here. Or maybe I’m missing why it wouldn’t.

I also question the attribute naming format for consistency. The other attributes are all lowercase, neither camel case nor snake case, so why does data_base64 all of a sudden use snake case? For consistency it should be database64. But to avoid this inconsistency altogether, and to avoid adding any more data_xxx fields later, I propose using just data and adding dataencoding to specify the encoding format.

This issue is still open, so I thought I would add my suggestion. I’m a bit confused by the merges as to whether this is considered fixed for spec 1.0 or not now, but I’m suggesting how I think it could be simplified for a future version anyways.

Examples

JSON as JSON

If dataencoding is json, then only datacontenttype of */json or */*+json is allowed.

"dataencoding": "json",
"datacontenttype": "application/json",
"data": {
    "value": 1
}

To read this would be: var value = event.data.value;

JSON as text

"dataencoding": "text",
"datacontenttype": "application/json",
"data": "{ \"value\": 1 }"

To read this would be: var value = parseJson(event.data).value;

XML as text

"dataencoding": "text",
"datacontenttype": "application/xml",
"data": "<much wow=\"xml\"/>"

To read this would be: var wow = parseXml(event.data).attr("wow");

JSON as bytes

"dataencoding": "base64",
"datacontenttype": "application/json",
"data": "ew0KICAgIHZhbHVlOiAxDQp9"

To read this would be: var value = parseJson(toUtf8String(fromBase64(event.data))).value;

XML as bytes

"dataencoding": "base64",
"datacontenttype": "application/xml",
"data": "PG11Y2ggd293PSJ4bWwiLz4="

To read this would be: var wow = parseXml(toUtf8String(fromBase64(event.data))).attr("wow");

Binary as bytes

"dataencoding": "base64",
"datacontenttype": "image/png",
"data": "c29tZWltYWdlZGF0YQ=="

To read this would be: var imageBytes = fromBase64(event.data);
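
Pulling these examples together, a consumer-side decoder for this proposal might look roughly like the sketch below (plain JavaScript with Node.js Buffer; nothing here is part of the spec):

// Hypothetical decoder for the proposed dataencoding attribute.
function decodeData(event) {
  switch (event.dataencoding) {
    case "json":
      return event.data;                        // already a JSON value
    case "text":
      return event.data;                        // a string; parse it per datacontenttype
    case "base64":
      return Buffer.from(event.data, "base64"); // raw bytes; decode per datacontenttype
    default:
      throw new Error("unknown dataencoding: " + event.dataencoding);
  }
}

// For the "JSON as text" example above:
// JSON.parse(decodeData(event)).value === 1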

Thank you.

@duglin @deissnerk This is coming up in my work on the Ruby SDK, and I want to bring up a clarification question.

To summarize a conclusion from above:

In the following CE:

const ce1 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/json",
  data: "{\"foo\": \"bar\"}"
});

… it sounds like the data should be considered a JSON value of type string. The fact that the string’s value happens to look like serialized JSON is irrelevant. It is simply a string. Therefore, if we were to serialize this CE in HTTP Binary mode, it might look like this:

CE-SpecVersion: 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/json

"{\"foo\" : \"bar\"}"

The data must be “escaped” in this way, so that a receiver parsing this content with the application/json content type will end up with a JSON string and not an object.

As a corollary, when deserializing an HTTP Binary mode CE with Content-Type: application/json, the HTTP protocol handler must parse the JSON and set the data attribute in memory to the actual JSON value (rather than the string representation of the JSON document). Otherwise, the content’s semantics will change when the CE gets re-serialized. And this, of course, all implies that an SDK’s HTTP protocol handler (and perhaps other protocol handlers as well) must understand JSON, even if the JSON structured format is not in use.
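
For illustration, here is a minimal sketch of that receiving path (a made-up helper, not taken from any SDK; only the JSON special case matters here):

// Hypothetical binary-mode receiver.
function fromBinary(headers, body) {
  const event = {
    specversion: headers["ce-specversion"],
    id: headers["ce-id"],
    source: headers["ce-source"],
    type: headers["ce-type"],
    datacontenttype: headers["content-type"],
  };
  if (event.datacontenttype === "application/json") {
    // Parse the body so that re-serializing the event keeps the same semantics;
    // the escaped body above yields the JSON string '{"foo": "bar"}', not an object.
    event.data = JSON.parse(body);
  } else {
    event.data = body; // kept as an opaque string
  }
  return event;
}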

Taking that as given, consider this implication:

Earlier a comparison was made with application/xml, noting a possible inconsistency. Consider this parallel example:

const ce2 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/xml",
  data: "<much wow=\"xml\"/>"
});

If we were to treat this XML data consistently with how we treated the earlier JSON data, we would consider this data as a string node in an XML document, whose contents just happen to look like XML. Hence, serializing this as HTTP Binary might yield something like:

CE-SpecVersion: 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/xml

&lt;much wow="xml"/&gt;

However, my understanding of the spec, and my understanding of the current behavior of the SDKs, suggests we are not doing that. (And indeed I’m glad, because that would, in turn, imply that all protocol handlers would also need to understand XML.) Instead, we actually consider the above data as semantically an XML document and not a string. Hence, serializing this as HTTP Binary actually looks like:

CE-SpecVersion: 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/xml

<much wow="xml"/>

In other words, our handling of the XML content-type appears to be inconsistent with our handling of the JSON content-type.

So my clarification question is:

  1. Am I correct in my interpretation that the spec intentionally treats data with content-type application/json specially, differently from string data with content-type application/xml (or indeed any other content-type), as illustrated above?

If so, follow-up questions:

  1. Is the reason for this that we (for some reason) consider JSON uniquely special among all content types in the universe, or is the reason simply that the spec currently happens to include a JSON format but not an XML format to define how data with that datacontenttype is rendered? Suppose a future spec version adds an XML format, YAML format, Protobuf format, etc. Would we at that time need to change the behavior of those formats to be like JSON (which would be a breaking change)?
  2. How do we precisely identify which content types are to be treated in this special way? For example, application/json is obvious, but what if the datacontenttype is itself application/cloudevents+json (i.e. a cloudevent whose payload is another cloudevent)? If we do consider JSON special, it seems it might be a good idea for the spec to state that explicitly, and define how it is identified, perhaps with reference to fields in RFC 2046 or similar.
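
For what it’s worth, here is a sketch of the kind of check an implementation might use today, assuming the +json structured-syntax suffix convention is what is intended (which is exactly what the spec would need to state explicitly):

// Hypothetical check; how application/cloudevents+json should be treated
// is part of the open question above.
function isJsonContentType(contentType) {
  if (!contentType) return true; // spec: an absent datacontenttype defaults to application/json
  const mediaType = contentType.split(";")[0].trim().toLowerCase();
  return mediaType === "application/json" || mediaType.endsWith("+json");
}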

To my mind, one problem is that “dataencoding” is a perfectly valid context attribute name, but its use here isn’t really part of the CloudEvent itself. I would be happier with “data_encoding”, to indicate that it’s metadata about the “data” property rather than a separate context attribute.

In terms of supporting both: I’d prefer not to do that, personally. We can’t make this change until 2.0 (it would be a breaking change) and I’d really like to aim for 2.0 to be very, very long-lived. Instead, I think it makes sense for a CloudEvent 1.0 to use the existing format, and a CloudEvent 2.0 to use “whatever we decide is best” - individual SDKs can decide which versions of CloudEvents they support. They may decide to support both 1.0 and 2.0 forever, or drop 1.0 support after 2.0 is widely adopted. Making that an SDK choice rather than having both options in the spec itself feels like a more flexible approach.

I’m a bit confused by the merges as to whether this is considered fixed for spec 1.0 or not now, but I’m suggesting how I think it could be simplified for a future version anyways.

To my mind, it’s as “fixed for spec 1.0” as it can reasonably be (modulo clarifying tweaks etc). Yes, there’s a lot that could change for 2.0, although the larger the change in 2.0 (not just here, but everywhere), the harder it will be for SDKs etc to support both 1.0 and 2.0. I haven’t heard any detailed discussions of expectations around timelines for a 2.0 - I think more of the activity is around getting Discovery etc across the line first.

I have been looking into this for cloudevents/sdk-javascript. At the moment, the SDK would produce A as the result of this transformation. In the example below, I’ve left out the transformation from a structured event, and am just creating the event from whole cloth, since this is how it would look after deserialization anyway.

> const e = new CloudEvent({
...     "source":"mySource",
...     "type":"a.clever.CloudEvent",
...     "id":"123",
...     "datacontenttype":"application/happy+json",
...     "data":"I'm just a string"
... })
undefined
> const s = HTTP.binary(e)
undefined
> s
{
  headers: {
    'content-type': 'application/happy+json',
    'ce-id': '123',
    'ce-time': '2021-08-02T19:14:34.229Z',
    'ce-type': 'a.clever.CloudEvent',
    'ce-source': 'mySource',
    'ce-specversion': '1.0'
  },
  body: "I'm just a string"
}

In @deissnerk’s example, the event representation in A isn’t actually a code representation. It’s the representation on the wire. In my illustration above, the s object is the in-memory representation of the event as a Message object as defined in the SDK. For users of the SDK, it is their responsibility to push this data across the wire. The reasoning behind this is that the networking world in Node.js is rife with lots and lots of competing frameworks, and the underlying Node.js native APIs need a lot of scaffolding around them to be very user friendly. So, we’ve just provided interfaces for developers to implement, and we send/receive through whatever framework they want as long as what they hand us conforms to our API.

When the user ultimately sends the event with something like

const resp = axios.post(url, s.body, { headers: s.headers });

That string is just a string, and nothing is wrapping it in quotes. So over the wire, there are no quotes.

Which got me wondering: what if a binary event arrives and it looks like A? Is it invalid? Should the SDK wrap it in quote marks? It’s not very clear.

@Thoemmeli I agree with your point, but "{\"foo\": \"bar\"}" is a string and therefore also a JSON value. In that sense the example is valid, but the datacontenttype in this case refers to the string and not to the escaped JSON object.

@deissnerk thank you for the explanation! Got one more question:

CE also defines the “dataschema” property as “Identifies the schema that data adheres to.” Could we then not define our schema as, for example, { "type": "string", "contentEncoding": "base64", "contentMediaType": "image/png" } (https://json-schema.org/understanding-json-schema/reference/non_json_data.html), which would then clearly identify the contents of data (and data_base64 would not be needed at all)?

@n3wscott In the JSON Schema data is defined as being one of two types, object or string:

"data": {
   "type": ["object", "string"]
}

but array type should not be allowed afaik.

You can however of course:

"data" : {
 "foobar": [ ... ] 
}

@duglin I think the important text in the section you referenced is

For any other type, the implementation MUST translate the data value into a JSON value, and use the member name data to store it inside the JSON object.

So, if data is already a JSON value, no translation is needed.