pandas: read_csv() force numeric dtype on column and set unparseable entry as missing (NaN)
How can I force a dtype on a column and ensure that any unparseable entries are filled with NaN? This is important when CSVs or database streams contain unpredictable data-entry errors that cannot be mapped to missing values a priori.
E.g., below I want column ‘a’ to be parsed as np.float, but the erroneous ‘Dog’ entry causes an exception. Is there a way to tell read_csv() to parse column ‘a’ as np.float and fill all non-parseable entries with NaN?
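import StringIO  # Python 2 module, matching the traceback below
import numpy as np
import pandas as pd
data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})  # raises on 'Dog'
df.dtypes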
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-12-cd8b6f868aec> in <module>()
1 data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
----> 2 df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
3 df.dtypes
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
389 buffer_lines=buffer_lines)
390
--> 391 return _read(filepath_or_buffer, kwds)
392
393 parser_f.__name__ = name
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
205 return parser
206
--> 207 return parser.read()
208
209 _parser_defaults = {
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
622 # self._engine.set_error_bad_lines(False)
623
--> 624 ret = self._engine.read(nrows)
625
626 if self.options.get('as_recarray'):
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
943
944 try:
--> 945 data = self._reader.read(nrows)
946 except StopIteration:
947 if nrows is None:
C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5785)()
C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6002)()
C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6870)()
C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7919)()
AttributeError: 'NoneType' object has no attribute 'dtype'
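For comparison, here is a minimal sketch of two ways the requested behaviour can be approximated outside the `dtype` option. It is written for Python 3 and assumes a pandas version that provides `pd.to_numeric` with `errors='coerce'`. The helper name `to_float_or_nan` is made up for this illustration. The first variant coerces after reading; the second does the coercion during parsing through the `converters` argument.

```python
import io

import numpy as np
import pandas as pd

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'

# Variant 1: read column 'a' as plain objects (strings) so that 'Dog'
# does not abort the parse, then coerce it afterwards.
df = pd.read_csv(io.StringIO(data), dtype={'a': object})
# Entries that cannot be parsed as numbers become NaN.
df['a'] = pd.to_numeric(df['a'], errors='coerce')


# Variant 2: convert each raw field while parsing.
def to_float_or_nan(text):
    """Hypothetical helper: float(text), or NaN if text is not a number."""
    try:
        return float(text)
    except ValueError:
        return np.nan


df2 = pd.read_csv(io.StringIO(data), converters={'a': to_float_or_nan})

print(df)
print(df.dtypes)
```

Variant 1 keeps the parsing vectorised; variant 2 calls a Python function for every field in the column, which is likely to be slower on large files.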
About this issue
- State: closed
- Created 12 years ago
- Reactions: 8
- Comments: 19 (12 by maintainers)
If you can find a way to fit this in with the existing dtype option, that would be preferable.
Maybe
dtype = {'foo': (float, 'coerce')}
or we introduce a helper function
dtype = {'foo': parser.coerce(float)}
where coerce just returns an instance
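To make the helper-function idea concrete, here is a rough user-level sketch. Nothing in it is pandas API: `Coerce`, `coerce` and `read_csv_coerced` are hypothetical names invented for this illustration, and the wrapper only emulates the proposal by reading the marked columns as objects and coercing them afterwards with `pd.to_numeric`.

```python
import io

import numpy as np
import pandas as pd


class Coerce:
    """Hypothetical marker meaning 'parse as this dtype, NaN on failure'."""

    def __init__(self, dtype):
        self.dtype = dtype


def coerce(dtype):
    """Hypothetical helper that just returns a marker instance."""
    return Coerce(dtype)


def read_csv_coerced(buffer, dtype):
    """Hypothetical wrapper around pd.read_csv; not part of pandas."""
    lenient = {col: d for col, d in dtype.items() if isinstance(d, Coerce)}
    strict = {col: d for col, d in dtype.items() if not isinstance(d, Coerce)}
    # Marked columns are read as objects so bad entries cannot raise.
    read_dtypes = {**strict, **{col: object for col in lenient}}
    df = pd.read_csv(buffer, dtype=read_dtypes)
    for col, marker in lenient.items():
        # Unparseable entries become NaN, so the target dtype must be
        # able to hold NaN (float works, plain int does not).
        df[col] = pd.to_numeric(df[col], errors='coerce').astype(marker.dtype)
    return df


data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = read_csv_coerced(io.StringIO(data), dtype={'a': coerce(float)})
print(df.dtypes)
```

A marker object keeps the `dtype` mapping as the single place where per-column parsing rules are declared, which is the point of the suggestion above; the tuple form `(float, 'coerce')` could be handled the same way by checking for a tuple instead of a `Coerce` instance.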