influxdb: Select a tag key when the tag and field conflict

Problem

This is a single issue to discuss #4630.

When points are written to the same measurement where a tag key and a field key conflict with each other, it is impossible to query the tag key with InfluxQL. The following examples (rendered to different measurements to demonstrate these are different examples) leaves the tag not queryable.

Example 1

example1,host=server01 value=2
example1 host='server02',value=3

Example 2

example2,host=server01 host='server02',value=2

The following queries become ambiguous and cannot be resolved correctly, so they default to selecting the field rather than the tag.

SELECT host, value FROM example1
SELECT host, value FROM example2

Background

There is a bit of a debate on exactly what a tag is. Before talking about what a tag is, here is a list of exactly what capabilities a tag has:

  • They are indexed.
  • They are always strings.
  • Points that have the same tag keys/values make up a series.
  • They are associated with the series, not the individual point.

Since a tag is associated with the series and not the individual point, tags do not have a time associated with them. This makes it impossible to query for only a tag in a measurement since a select works by finding the time associated with a field and returning the value at that time. Since tags are not associated with points, they also do not have a time and cannot be iterated over. To query a tag with a select, they are wholly dependent on being queried with a field in the same query.

There is also no way to differentiate between a field and a tag in InfluxQL at the moment. The search path for how to treat a variable reference is:

  • Check if the measurement (not series) has a field with the value of the key.
  • Check if the series has a tag with the value of the key.
  • Pretend the cursor is a tag with a blank value.

Because of the first step, it makes it impossible to reference a tag once any series in a measurement writes uses a key as a field even if there was previous data with that tag.

Possible Solutions

Solution 1

Change the prioritization so it will read a tag over a field (if present) and restrict points from being written with a conflicting field and tag key as described and implemented in #6410.

This solution makes it so example 1 is possible and works correctly when querying so that the first series will return the value of the tag and the second will return the value in the field. It makes example 2 into an error so it becomes impossible to get into this situation to begin with.

Pros:

  • Tags become equivalent to indexed fields that are always strings.
  • No additional complexity is added to the query language or query engine beyond a small fix to how it creates cursors.
  • Makes SELECT value, host FROM cpu GROUP BY host use a consistent value (currently it will use the tag for the GROUP BY and the field for the actual selection).

Cons:

  • Tags only sometimes act as fields. As described earlier, you cannot query a tag by itself. If you do SELECT host FROM example1 you will get a weird result where it will only return the second point.
  • Old data written where the field and the tag were written as part of the same series will still be unreadable, but you won’t be able to write any new data.
  • Breaking change.

Solution 2

Introduce new syntax to specify whether you meant to grab a tag or a field to InfluxQL as described and implemented in #6509 or as proposed in #4823.

This solution makes it so both examples are queryable and doesn’t require any change to existing data or the point writer, but does require users to use new syntax to reference the correct value.

Pros:

  • Backwards compatible.
  • Keeps tags and fields as completely separate entities rather than tags being a special type of field.
  • May allow us to detect when someone tries to select only a tag as we will be able to notify the user that they must select at least one field. This benefit is only if this new syntax is made required to access tags (not backwards compatible anymore) or if somebody actually uses it.

Cons:

  • The implementation is more complex and InfluxQL will have to be modified.
  • To maintain backwards compatibility, you still have to allow tags to be referenced with the same variable reference as fields. If you write host as a tag and then write it once as a field to one series in the measurement, all of your previous queries will break. It is much more optimal to require that the new syntax is always used for tags, but then we’re not backwards compatible anymore.
Proposed Syntax
Solution 1

Prepend the variable reference with tag or field.

SELECT tag.host, field.value FROM cpu

The syntax for this looks pretty ugly in my personal opinion and I dislike that it adds more reserved keywords to a language that already has a lot of them. It becomes impossible to have a key with these names which makes this a backwards incompatible change. We then also might have to add code that strips the front of the variable reference from the ident as the query engine passes around auxiliary fields as a raw string. That means the identifier cannot just be stripped at parse time since we need to send it to the underlying engine which doesn’t accept the AST structs as arguments. We can modify the iterator options struct to include references for the auxiliary fields, but this changes the wire protocol for RPC and might make things more complicated with protobuf.

Solution 2
SELECT @host, value FROM cpu

Prepend the input with a special sigil to signify that a tag is being referenced. The syntax for this is common in programming languages like Perl and Ruby. Some programmers may be offput by the syntax and so it might not be desirable (Perl and Ruby’s use of sigils is a holy war among programmers). The positive benefit is that it is one character (very easy to check and strip efficiently) and so it is easier to pass around as a string without involving complex AST structures. This is also a backwards compatible change since the @ symbol is not used like this anywhere else.

What sigil is used specifically is debatable. I would like to keep $ reserved for future InfluxQL support for a Template node in the AST since I think this will make some things, like Chronograf or Influx Stress, easier to implement using the influxql package. I also avoided # since that’s a comment character even though I was very tempted to use it because tag and hashtag both reference tags (even if its real name is the pound sign).

Solution 3
SELECT host::tag, value::string FROM cpu

This syntax alleviates one of the problems from solution 1 because it is less likely to conflict with a real measurement name, but it still may be difficult to strip off the end in an efficient manner when passing it through auxiliary fields. Likely, the best solution would be to change the function signature of IteratorOptions and just accept the wire protocol will change. The syntax is supposed to correspond to the casting syntax from PostreSQL, but we’re not technically doing a cast, but a selection. While true, we can extend this to also allow specifying the field type of a field. If a field type is given, that will be considered a cast for a field and will override the default field type chosen. Tag would be treated as a separate type.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 18 (16 by maintainers)

Commits related to this issue

Most upvoted comments

I personally like syntax 3 (::). It’s not technically a cast although I can see us using that syntax for forcing a data type in the future and I think it works well from a user’s point of view.

e.g. SELECT value::string FROM cpu