frictionless-py: Adding a field inferencing "subrule", otherwise very often I would have wrong extracted values
Hi, daily I work with data where some columns have only numerical values in the cells, but which are actually string fields.
A typical case is that of administrative codes, of regions, provinces and cities (in the table below codice_regione, codice_provincia and codice_comune fields). In Italy, the city of Losine has the code 0870, and obviously it’s a string, and it must remain a string, otherwise I would lose the chance to JOIN this data with other data.
| comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica |
|---|---|---|---|---|---|---|
| 030151360 | 03 | 015 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 |
| 030120530 | 03 | 012 | 0530 | CAROBBIO DEGLI ANGELI | BG | 04/10/2021 |
| 020040580 | 02 | 004 | 0580 | SAINT-DENIS | AO | 23/09/2020 |
| 030150870 | 03 | 015 | 0870 | LOSINE | BS | 04/10/2021 |
| 190480090 | 19 | 048 | 0090 | CAPO D’ORLANDO | ME | 27/10/2021 |
But if I run extract on it
frictionless extract "https://gist.githubusercontent.com/aborruso/076e5fad847b658a535b16cbcf3abdfd/raw/887edc88ab75b01d15ea1ac4bb052ffaf8d5ef9c/tmp.csv" --csv
the codice_regione, codice_provincia and codice_comune fields are no longer strings, they are all numbers and the cell values have changed (in example the code of Losine becomes 870).
| comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica |
|---|---|---|---|---|---|---|
| 30151360 | 3 | 15 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 |
| 30120530 | 3 | 12 | 530 | CAROBBIO DEGLI ANGELI | BG | 04/10/2021 |
| 20040580 | 2 | 4 | 580 | SAINT-DENIS | AO | 23/09/2020 |
| 30150870 | 3 | 15 | 870 | LOSINE | BS | 04/10/2021 |
| 190480090 | 19 | 48 | 90 | CAPO D’ORLANDO | ME | 27/10/2021 |
I know, octal numbers exist, but in 99% of cases (I’m talking about my experience), if I have fields with cells starting with zero they are not octal numbers, but string codes.
If it were possible I would add a subrule for all fields mapped as numbers: if there are cells that start with a zero, not followed by a , or ., set that inferenced field as string and not as number.
Thank you.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (17 by maintainers)
@shashigharti I think we need to return None if it starts from ‘0’ here:
I guess it’s the right solution as leading zeros might be really confusing e.g. Python doesn’t allow them:
Thansk @aborruso, I’ve created an issue - https://github.com/frictionlessdata/framework/issues/1380
Yea it’s quite weird although it seems still to be allowed for integer type:
@roll I’m trying to investigate but there is something strange. In my opinion it refers to the fact that you can have “0.3265” or “.3265”, but not that you can represent “0.3265” as “03265”, or “3265” as “03265”.
So, for me the standard would be respected
Hi @aborruso,
Yea it might be an error in the metadata but the more general problem is that by the standard we can’t forbid leading zeroes as it’s explicitly allowed - https://www.w3.org/TR/xmlschema-2/#decimal (this is referenced in Table Schema standard)
I’m NOT against of detecting CODES but I think we just need to find another way to doing so e.g. having a special logic in
detector.detect_schema(cc @shashigharti maybe creating a new feature request?)Hi @roll , it seems that we have this problem still in 5 release.
If I run
I have wrong type error. The second columns contains a lot of
01values (and so on), it’s a string column, but it’s mapped as integer.I’m using 5.0.0b10.
Thank you
Wow @roll thank you.
I am very happy to have contributed in terms of ideas.
How to install the current 5 dev version?
Another question if
integer/number.bareNumberis set to False we might need to allow leading zeros to make possible to have a numeric column like thisThanks @aborruso, we will discuss it but I think it makes sense