OpenRefine: Behaviour of date, number and boolean values in a text/list facet is inconsistent

Version of OpenRefine used (Google Refine 2.6, OpenRefine 2.8, another distribution?):

Tested on OpenRefine 2.8 & 3.0 beta, but I suspect the same behaviour exists in earlier versions of OR as well

Current behaviour

There are a variety of scenarios - some described here:

  1. Create a ‘text facet’ on a column that contains date values (i.e. as dates, not strings)

  2. Try selecting rows by clicking the date value in the facet

  3. See that no rows are selected (and in 2.8 at least an additional value appears in the facet)


Create a column containing non-string values and strings that have the same visible value - e.g.:

true -> boolean
"true" -> string

Look at the facet and see the “count” is the number you would expect if you treated the string and non-string values as being equivalent.

Try selecting the value in the facet - see that only the rows containing the non-string values, or only the rows containing the string values, are selected (which set is selected depends on the order of the cells in the project)


This latter behaviour also affects numbers and dates
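To make the mismatch concrete, here is a rough Python sketch. It is a toy model of the symptoms described above, not OpenRefine's actual code: the names `display`, `facet_counts` and `select` are made up for illustration, and the assumption that selection compares against the first cell seen with a given label is only a guess that reproduces the order-dependent behaviour.

```python
from collections import Counter

def display(value):
    """Label a cell the way the facet appears to show it (assumed)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)

def facet_counts(cells):
    # Counting groups cells by their displayed text, so True and "true" merge.
    return Counter(display(v) for v in cells)

def select(cells, label):
    # Assumed selection rule: match the type and value of the first cell
    # that carries this label, so cells of the other type are skipped.
    first = next(v for v in cells if display(v) == label)
    return [i for i, v in enumerate(cells)
            if type(v) is type(first) and v == first]

cells = [True, "true", "true", False]
counts = facet_counts(cells)   # the "true" choice is counted 3 times
rows = select(cells, "true")   # but only the boolean row is selected
```

In this model the facet shows a count of 3 for "true" while clicking it selects a single row, and reordering the cells so that a string comes first flips the selection to the string rows.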

If you are allowed and are OK with making your data public, it would be awesome if you can include the data causing the issue or a URL pointing to where the data is (if you're concerned about keeping your data private, ping us on our mailing list):

Expected Behaviour:

There is a fundamental question here of how date, number and boolean values should be treated in text/list facets. They are currently counted as if they are equivalent to strings, and if you try to use ‘mass edit’ they are also treated as if they are the same as the equivalent string (e.g. in the example above, an edit from the facet would change both the boolean true and the string “true” in the project).

I think trying to treat objects as equivalent to some strings in this situation is probably a bad idea. I can see three options

  1. Similarly to how timeline/number facets treat other values, OR could simply not include the non-string values in the facet and have a checkbox controlling whether they are included in the filter or not (see screenshot)

  2. Similar to how nulls/empty strings are handled in text facets, we could have a bucket facet value for “dates”, “numbers”, “booleans” which would allow the user to select the set of boolean values but not see counts of true vs false (a further boolean facet would be needed to see that)

  3. Dates, Booleans and Numbers could be included in the facet but as separate values in the facet - so all boolean true are grouped and counted separately to all string “true”

These are not necessarily exclusive options - we could implement them in combination
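To compare options 2 and 3, here is a minimal Python sketch of how each might group the same column. This is only an illustration under my own assumptions: `type_name`, `bucket_facet` and `typed_facet` are hypothetical names, and Python values stand in for OpenRefine cell values.

```python
from collections import Counter
from datetime import datetime

def type_name(value):
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, datetime):
        return "date"
    return "string"

def bucket_facet(cells):
    """Option 2: strings keep their own choices; every other type
    collapses into a single bucket such as "(booleans)"."""
    return Counter(
        v if type_name(v) == "string" else f"({type_name(v)}s)"
        for v in cells
    )

def typed_facet(cells):
    """Option 3: every value is its own choice, keyed by (type, text),
    so the boolean true and the string "true" stay apart."""
    return Counter((type_name(v), str(v)) for v in cells)

cells = [True, "true", "true", False, 42]
by_bucket = bucket_facet(cells)
by_type = typed_facet(cells)
```

With this data, option 2 gives "true" (2), "(booleans)" (2) and "(numbers)" (1), while option 3 keeps boolean true (1) apart from string "true" (2). One limitation of the sketch: a string cell whose text is literally "(booleans)" would collide with the bucket label, so a real implementation would need to keep buckets separate from ordinary choices.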

These are my views - please make suggestions for other behaviours or indicate which of the 3 I’ve listed individually or in combination would make most sense to you @thadguidry @ettorerizza (feel free to ping others to get feedback)

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 26 (25 by maintainers)

Most upvoted comments

Agree with @wetneb

The previous behaviour of the text facet was problematic (as documented above). For example in the previous behaviour a cell with the text string ‘true’ and a boolean cell or result of ‘true’ would cluster together but selection did not work (documented above). Not to mention the issues with dates (above).

I think a “type” facet would be helpful, but this isn’t what the OpenRefine 3.1 behaviour delivers.

The OpenRefine 3.1 behaviour delivers a “text” (or string) facet - and as with “number” and “date” facets, it doesn’t try to blindly convert non-text values into text values - that is left as something the user can choose to do if they want. I feel this brings consistency to the facet behaviour, although I absolutely acknowledge that this change to behaviour is a breaking change and needs users to amend their previous practices.

We must find a solution at least for Booleans. The current behavior of OR 3.1 no longer allows a “true vs false” text facet, for example to select the first 100 rows of a dataset with row.index < 100. Big breaking change.

OR 3.1: (screenshot)

OR 3 and previous: (screenshot)

Thanks @thadguidry that’s very clear. What you are suggesting:

Change the Text Facet so that users can only see String datatype values. Show at the top of the Text Facet something like “non Text Rows” (like we do with Blank and Null) and a count next to it

This definitely makes sense to me and was the approach I was trying to describe in my Number 2 above:

Similar to how nulls/empty strings are handled in text facets, we could have a bucket facet value for “dates”, “numbers”, “booleans” which would allow the user to select the set of boolean values but not see counts of true vs false (a further boolean facet would be needed to see that)

So with this method

[date 20170101T00:00:00Z]
20170101T00:00:00Z

Would give the facet

(dates)  (1)
20170101T00:00:00Z  (1)

I still wonder if there is a role for a mixed type ‘list’ facet (which is what I was suggesting in (3) above) where

[date 20170101T00:00:00Z]
20170101T00:00:00Z

Would give the facet

[date 20170101T00:00:00Z]  (1)
20170101T00:00:00Z  (1)

but I think you are probably right that we should keep the Text facet as Text, and then if there should be a mixed-type List facet we can look at that as a separate issue.

Thanks

@thadguidry I’m trying to understand what is the most sensible consistent behaviour here that we should aim for. It seems from what you say, you would like to see the values treated consistently as strings within the context of a text facet? So if we have data:

[date 20170101T00:00:00Z]
20170101T00:00:00Z

Then this should result in a facet:

20170101T00:00:00Z   (2)

And selecting the value in the facet would filter to both rows. Is that correct? Have I understood your preference for how it should work?
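For comparison, here is a small Python sketch of that “treat everything as a string” reading, where counting and selection both go through the same text conversion. It is an illustration only: `as_text`, `facet_counts` and `select` are hypothetical names, and the date format string is simply chosen to match the example value above.

```python
from collections import Counter
from datetime import datetime, timezone

def as_text(value):
    """Convert any cell to the text used for both counting and selection."""
    if isinstance(value, datetime):
        return value.strftime("%Y%m%dT%H:%M:%SZ")
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)

def facet_counts(cells):
    return Counter(as_text(v) for v in cells)

def select(cells, label):
    # Selection uses the same conversion as counting, so the two agree.
    return [i for i, v in enumerate(cells) if as_text(v) == label]

cells = [datetime(2017, 1, 1, tzinfo=timezone.utc), "20170101T00:00:00Z"]
counts = facet_counts(cells)
rows = select(cells, "20170101T00:00:00Z")
```

Here the facet shows one choice with a count of 2 and selecting it filters to both rows, which is consistent, at the cost of hiding the difference between the date and the string.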