pandas: Cannot convert numbers to strings when reading an Excel spreadsheet
I’m reading some excel spreadsheets (xlsx format) into pandas using read_excel
, which generally works great. The problem I have is that when a column contains numbers, pandas converts these to float64 type, and I would like them to be treated as strings. After reading them in, I can convert the column to str:
my_frame.my_col = my_frame.my_col.astype('str')
This works as far as assigning the right type to the column, but when I view the values in this column, the strings are formatted in scientific-format e.g. 8.027770e+14, which is not what I want. I like to be able to tell Pandas to read the columns as strings. My current solution involves dropping down to xlrd to read the spreadsheet.
See stackoverflow question: http://stackoverflow.com/q/20970483/690890
About this issue
- Original URL
- State: closed
- Created 10 years ago
- Comments: 34 (17 by maintainers)
the ‘real’ soln here is to support a
dtype={}
argument similar toread_csv
, see issue #4932 which would allow you to specify that certain columns can be retained asobject
dtype (iow they wouldnt’ be coerced).PR anyone?
vote for suggested solution which is to support dtype for excel file.
Here is a workaround to the problem of leading zeros. If you know that your column has a fixed-width you can first turn the column into a string, and then add the gone padding zeros with the string method zfill, as done in here: http://stackoverflow.com/questions/339007/nicest-way-to-pad-zeroes-to-string
Overall, I agree that read_excel should have a dtype option.
well, my company receives several files of client data (normally Excel files) on a constant basis. I am in charge of ensuring that data is formatted according to our main import logic.
I built a Flask app to help our account managers formatting the files. This app relies on pandas heavily, because pandas is amazing.
However, since these files are coming straight from clients, my app doesn’t know what is on each file, so each account manager uses my app to decide if a column should be an int, or a date, etc.
So how i built the app it would turn every column into a dtype object column and then format every column according to the requirements.
This works great with csv files (because of the dtype arg), but with excel files the date columns are giving me a hard type.
How exactly are you doing that?
As far as I can see from the
xlrd
code, when it reads a number from an xlsx file it immediately converts it to afloat
before returning the cell object to the user.So unless I am missing something I don’t see how Pandas can be made to read the columns as strings. It will always get a float from
xlrd
for an xls or xlsx file if Excel has stored them as numbers.The CSV readers behave differently and can be type specified because they return all read data as strings regardless of whether it looks like a number of not.
Perhaps @jreback or @jtratner can suggest a different way to convert a column from
float64
to formatted strings.But apart from a workaround I don’t think this issue is fixable in the way that you would like.
Thanks, but the
dtype
argument isn’t available for read_excel