OpenRefine: Columnize by key/value column causes misalignment of columns

Importing data from several data sets with version 2.7, gives a result like this:

Column1;Column2
SourceFile1;2
SourceFile1;3
SourceFile1;-1
SourceFile2;3
SourceFile2;4
SourceFile2;6
SourceFile3;-3
SourceFile3;4
SourceFile3;1

When applying "Columnize by key/value column to Column1, the expectation is to get something like this:

SourceFile1;SourceFile2;SourceFile3
2;3;-3
3;4;4
-1;6;1

but instead, the following result is obtained:

SourceFile1;SourceFile2;SourceFile3
2;;
3;;
-1;3;-3
;4;4
;6;1

About this issue

Original URL
State: closed
Created 7 years ago
Reactions: 2
Comments: 28 (27 by maintainers)

Commits related to this issue

Add test for issue #1214 — committed to OpenRefine/OpenRefine by wetneb 7 years ago
Adapt test about #1214 to show how the function currently works. With this test, if someone tries to fix this "bug" in the future, they will realize this use case should be preserved. — committed to OpenRefine/OpenRefine by wetneb 7 years ago
Adapt test about #1214 to show how the function currently works. With this test, if someone tries to fix this "bug" in the future, they will realize this use case should be preserved. — committed to OpenRefine/OpenRefine by wetneb 7 years ago
Adapt test about #1214 to show how the function currently works. With this test, if someone tries to fix this "bug" in the future, they will realize this use case should be preserved. — committed to OpenRefine/OpenRefine by wetneb 7 years ago

Most upvoted comments

But when we have an explicit Primary key (unique or composed), the function works exactly as expected and order doesn’t matter.

screencast

Dataset :

country,year,key,value Afghanistan,1999,cases,745 Afghanistan,2000,cases,2666 Brazil,1999,cases,37737 Brazil,2000,cases,80488 China,1999,cases,212258 China,2000,cases,213766 Afghanistan,1999,population,19987071 Afghanistan,2000,population,20595360 Brazil,1999,population,172006362 Brazil,2000,population,174504898 China,1999,population,1272915272 China,2000,population,1280428583

ettorerizza on Jul 28, 2017

@ostephens Yes, as explained above, this function requires indeed its input to be listed in a particular order.

The code assumes that the first key it encounters (here, Key 1) is the primary key in the resulting table (in other words, the first column). Then, for every other (key,value) pair (the ones with Key 2 and Key 3), OpenRefine assigns the key-value pair to the last primary key it has encountered.

So, let us do manually what the algorithm does.

Read the initial table sequentially:

Read (Key 1, Value 1). As this is the first row we are reading, we deduce that Key 1 is going to be the primary key in our new table. We create a row containing Value 1 in the newly-created Key 1 column.
Read (Key 1, Value 4). This is a new value for the primary key. We deduce that the record for the primary key previously seen (Value 1) is complete, so we leave the first row as it is (with only a value for Key 1) and create a new row containing Value 4 in the Key 1 column.
Read (Key 2, Value 2). We create a new column for Key 2 and insert Value 2 at the position of the primary key that we have seen last (Value 4).
We keep doing that for the other values: they are all assigned to the last primary key seen, Value 4.

To understand why such an algorithm is desirable, you can run it on my merchant-fruit-price example above.

So, yes, it is normal that changing the order of the rows beforehand breaks the result.

wetneb on Jul 28, 2017

I’m happy to improve the docs in this area

ostephens on Jul 26, 2017

@ostephens I agree it is confusing! (And I did understand that you were not sure it was a bug - my first comment was for @ettorerizza.) Here is how I understand the logic of the code.

It wants to support input cases like this one:

Key,Value
merchant,Katie
fruit,apple
price,1.2
fruit,pear
price,1.5
merchant,John
fruit,banana
price,3.1

which is (correctly) transposed to

merchant,fruit,price
Katie,apple,1.2
,pear,1.5
John,banana,3.1

(We insert a blank cell in the merchant column to create a record.)

To support this kind of input, the code needs to assume that the first column it encounters (here, SourceFile1) is the primary key. Then, for every other (key,value) pair (the ones that are not primary keys), OpenRefine assigns the pair to the last primary key it has encountered. This implies that the function is order-sensitive and explains the output @belm104 obtained.

As far as I can tell, this is a reasonable assumption. The input format @belm104 had to deal with seems rather uncommon. There could potentially be an option to switch between the two behaviours, but I suspect the current behaviour is useful for some users too (including myself actually)…

wetneb on Jul 26, 2017

@wetneb we welcome users always. But me saying antipattern does not reflect any negativity.

If you take the original 2 column dataset provided at the top of the issue… and in OpenRefine edit the Column2 cell of “-1” and change to “1” and then edit the Column2 cell of “1” and change to “-1”

Then you can see how the function works. But I am still wondering if any of you have an expectation of flattening somehow when Column 1 is sorted as @ostephens example forms ? That’s what I am trying to discover here. We might be able to solve that also with another option

thadguidry on Jul 26, 2017