frictionless-py: Adding a field inferencing "subrule", otherwise very often I would have wrong extracted values

Hi, daily I work with data where some columns have only numerical values in the cells, but which are actually string fields.

A typical case is that of administrative codes, of regions, provinces and cities (in the table below codice_regione, codice_provincia and codice_comune fields). In Italy, the city of Losine has the code 0870, and obviously it’s a string, and it must remain a string, otherwise I would lose the chance to JOIN this data with other data.

comune codice_regione codice_provincia codice_comune denominazione_comune sigla_provincia data_entrata_in_carica
030151360 03 015 1360 POLPENAZZE DEL GARDA BS 13/10/2021
030120530 03 012 0530 CAROBBIO DEGLI ANGELI BG 04/10/2021
020040580 02 004 0580 SAINT-DENIS AO 23/09/2020
030150870 03 015 0870 LOSINE BS 04/10/2021
190480090 19 048 0090 CAPO D’ORLANDO ME 27/10/2021

But if I run extract on it

frictionless extract "https://gist.githubusercontent.com/aborruso/076e5fad847b658a535b16cbcf3abdfd/raw/887edc88ab75b01d15ea1ac4bb052ffaf8d5ef9c/tmp.csv" --csv

the codice_regione, codice_provincia and codice_comune fields are no longer strings, they are all numbers and the cell values have changed (in example the code of Losine becomes 870).

comune codice_regione codice_provincia codice_comune denominazione_comune sigla_provincia data_entrata_in_carica
30151360 3 15 1360 POLPENAZZE DEL GARDA BS 13/10/2021
30120530 3 12 530 CAROBBIO DEGLI ANGELI BG 04/10/2021
20040580 2 4 580 SAINT-DENIS AO 23/09/2020
30150870 3 15 870 LOSINE BS 04/10/2021
190480090 19 48 90 CAPO D’ORLANDO ME 27/10/2021

I know, octal numbers exist, but in 99% of cases (I’m talking about my experience), if I have fields with cells starting with zero they are not octal numbers, but string codes.

If it were possible I would add a subrule for all fields mapped as numbers: if there are cells that start with a zero, not followed by a , or ., set that inferenced field as string and not as number.

Thank you.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

@shashigharti I think we need to return None if it starts from ‘0’ here:

I guess it’s the right solution as leading zeros might be really confusing e.g. Python doesn’t allow them:

SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers

Yea it might be an error in the metadata but the more general problem is that by the standard we can’t forbid leading zeroes as it’s explicitly allowed - https://www.w3.org/TR/xmlschema-2/#decimal (this is referenced in Table Schema standard)

@roll I’m trying to investigate but there is something strange. In my opinion it refers to the fact that you can have “0.3265” or “.3265”, but not that you can represent “0.3265” as “03265”, or “3265” as “03265”.

So, for me the standard would be respected

Hi @aborruso,

Yea it might be an error in the metadata but the more general problem is that by the standard we can’t forbid leading zeroes as it’s explicitly allowed - https://www.w3.org/TR/xmlschema-2/#decimal (this is referenced in Table Schema standard)

I’m NOT against of detecting CODES but I think we just need to find another way to doing so e.g. having a special logic in detector.detect_schema (cc @shashigharti maybe creating a new feature request?)

Hi @roll , it seems that we have this problem still in 5 release.

If I run

frictionless validate --buffer-size 25000000 "https://gist.githubusercontent.com/aborruso/c99d74e88d0f8037219c958a40fe744c/raw/6d8b602f5692306bbc16ee6b05210363be911c63/tmp.csv"

I have wrong type error. The second columns contains a lot of 01 values (and so on), it’s a string column, but it’s mapped as integer.

I’m using 5.0.0b10.

Thank you

Wow @roll thank you.

I am very happy to have contributed in terms of ideas.

How to install the current 5 dev version?

Another question if integer/number.bareNumber is set to False we might need to allow leading zeros to make possible to have a numeric column like this

Thanks @aborruso, we will discuss it but I think it makes sense