fastparquet: To Pandas doesn't work with parquet file - Type Error
Hi all, I’m loading some parquet files generated by a Spark ETL job.
I get this error when calling parquet_file.to_pandas().
```
AttributeError Traceback (most recent call last)
<ipython-input-9-7098f6946da6> in <module>()
----> 1 profiles.to_pandas()
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, timestamp96)
    332 self.read_row_group(rg, columns, categories, infile=f,
    333                     index=index, assign=parts,
--> 334                     timestamp96=timestamp96)
    335 start += rg.num_rows
    336 else:
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign, timestamp96)
    184 infile, rg, columns, categories, self.schema, self.cats,
    185 self.selfmade, index=index, assign=assign,
--> 186 timestamp96=timestamp96, sep=self.sep)
    187 if ret:
    188     return df
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, timestamp96, sep)
    336     raise RuntimeError('Going with pre-allocation!')
    337 read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 338                       cats, selfmade, assign=assign, timestamp96=timestamp96)
    339
    340 for cat in cats:
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign, timestamp96)
    313                  selfmade=selfmade, assign=out[name],
    314                  catdef=out[name+'-catdef'] if use else None,
--> 315                  timestamp96=mr)
    316
    317 if _is_map_like(schema_helper, column):
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef, timestamp96)
    237     skip_nulls = False
    238 defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
--> 239                                 skip_nulls, selfmade=selfmade)
    240 if rep is not None and assign.dtype.kind != 'O':  # pragma: no cover
    241     # this should never get called
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_data_page(f, helper, header, metadata, skip_nulls, selfmade)
    103                        dtype=np.uint8))
    104
--> 105 repetition_levels = read_rep(io_obj, daph, helper, metadata)
    106
    107 if skip_nulls and not helper.is_required(metadata.path_in_schema):
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_rep(io_obj, daph, helper, metadata)
     83                                metadata.path_in_schema)
     84 bit_width = encoding.width_from_max_int(max_repetition_level)
---> 85 repetition_levels = read_data(io_obj, daph.repetition_level_encoding,
     86                               daph.num_values,
     87                               bit_width)[:daph.num_values]
AttributeError: 'NoneType' object has no attribute 'repetition_level_encoding'
```
Has anyone seen anything like this before?
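For what it's worth, the traceback itself shows the mechanism: `read_data_page` hands `read_rep` a `daph` (data-page header) that is `None`, and the attribute access on it blows up. Here's a minimal stand-alone sketch of that failure mode; the `PageHeader` class below is a simplified stand-in for the Thrift-generated header type, not fastparquet's actual class:

```python
class PageHeader:
    """Simplified stand-in for the Thrift-generated parquet page header.

    Only one of the *_page_header fields is populated per page; for a
    dictionary page, data_page_header is None.
    """
    def __init__(self, data_page_header=None, dictionary_page_header=None):
        self.data_page_header = data_page_header
        self.dictionary_page_header = dictionary_page_header


def read_rep(daph):
    # Mirrors the failing access in fastparquet/core.py's read_rep:
    # daph is assumed to be a data-page header, but here it is None.
    return daph.repetition_level_encoding


header = PageHeader(dictionary_page_header=object())  # a dictionary page
try:
    read_rep(header.data_page_header)  # data_page_header is None
except AttributeError as e:
    print(e)  # → 'NoneType' object has no attribute 'repetition_level_encoding'
```

So one guess is that the reader hit a page it didn't expect at that point (e.g. a dictionary page) and treated its header as a data-page header.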
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 5
- Comments: 41 (18 by maintainers)
I have the same problem here.
OK, so: there appear to be multiple dictionary pages, which is not supposed to happen, but which I can deal with. Also, the encoding is “bit-packed (deprecated)”, which, as the name suggests, is not supposed to be around any more. I can maybe code it up, since the spec states it clearly, and I can compare the result against the ground truth as given by Spark. I’ll get back to you.
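For anyone curious what the deprecated encoding involves: BIT_PACKED (the deprecated one, as opposed to the RLE/bit-packing hybrid) packs each value into a fixed number of bits, most-significant-bit first. Below is a rough stand-alone sketch of the decode step, following the worked example in the Parquet format spec; this is an illustration, not fastparquet's actual implementation:

```python
def width_from_max_int(max_value):
    """Bits needed to represent values in [0, max_value] (0 when max is 0)."""
    return max_value.bit_length()


def decode_bit_packed_deprecated(data, bit_width, count):
    """Decode `count` values packed MSB-first at `bit_width` bits each."""
    out = []
    buf = 0        # bit accumulator, most significant bits first
    nbits = 0      # number of valid bits currently in buf
    pos = 0        # next byte to consume from data
    mask = (1 << bit_width) - 1
    for _ in range(count):
        while nbits < bit_width:
            buf = (buf << 8) | data[pos]
            pos += 1
            nbits += 8
        nbits -= bit_width
        out.append((buf >> nbits) & mask)
    return out


# The spec's example: the values 0..7 at bit width 3 pack into 0x05 0x39 0x77.
print(decode_bit_packed_deprecated(bytes([0x05, 0x39, 0x77]), 3, 8))
# → [0, 1, 2, 3, 4, 5, 6, 7]
```

The MSB-first bit order is exactly what makes this encoding different from (and incompatible with) the newer RLE/bit-packing hybrid, which packs LSB-first.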