koalas: ValueError when reading dict with None

I find that creating a Koalas DataFrame from a dict that contains a None value raises an error:

import databricks.koalas as ks

row = {'a': [1], 'b': [None]}
ks.DataFrame(row)

ValueError: can not infer schema from empty or null dataset

but with pandas there is no error:

import pandas as pd

row = {'a': [1], 'b': [None]}
print(pd.DataFrame(row))

   a     b
0  1  None

I have tried setting dtype=np.int64 but this has not helped.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 19 (7 by maintainers)

Most upvoted comments

@ederfdias Here is a possible workaround. Specify converters like below:

import numpy as np

df_ks = koalas.read_excel(
    ...
    converters={i: (lambda x: str(x) if x else np.NaN) for i in range(30)},  # read the first 30 columns as strings
)

JFYI… using the read_csv() function on a column without values, I don't receive any errors, but with read_excel() the same error is raised.

Now it works properly in pandas-on-Spark (available in Apache Spark 3.2 and above).

I'd recommend using pandas-on-Spark rather than Koalas, since Koalas is now in maintenance mode.

>>> import pyspark.pandas as ps
>>> ps.DataFrame()
Empty DataFrame
Columns: []
Index: []
>>> ps.DataFrame([{"A": [None]}])
        A
0  [None]

Apparently, np.NaN does the trick

import numpy as np
import databricks.koalas as koalas
from pyspark.sql import functions as F

row = {'a': [1], 'b': [np.NaN]}
koalas.DataFrame(row).to_spark().where(F.col("b").isNull()).show()

output

+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+

It's because PySpark, by default, tries to infer each column's type from the given data. If a column contains no data, or only nulls, PySpark cannot infer a data type for it when building the DataFrame.
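For illustration, here is a minimal PySpark sketch of that behaviour (this example assumes an active local SparkSession; the column names and types are arbitrary). Inference over an all-null column fails, while supplying an explicit schema sidesteps the inference step entirely:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

# Inference fails here: every value in column "b" is null, so Spark
# cannot decide on a type for it.
# spark.createDataFrame([(1, None)], ["a", "b"])  # raises an inference error

# With an explicit schema there is nothing to infer, so the same data works.
schema = StructType([
    StructField("a", LongType()),
    StructField("b", StringType()),
])
spark.createDataFrame([(1, None)], schema).show()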

>>> import pandas as pd
>>> row = {'a': [1], 'b': [None]}
>>> pd.DataFrame(row).dtypes
a     int64
b    object
dtype: object

pandas has an object dtype that can hold anything, whereas PySpark has no such catch-all type. So it's actually an issue in PySpark.
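As a related workaround on the Koalas side (a sketch, assuming the data starts out as a pandas DataFrame), you can give the all-null column a concrete dtype before converting, so Spark has a type to map it to:

import numpy as np
import pandas as pd
import databricks.koalas as ks

row = {'a': [1], 'b': [None]}
pdf = pd.DataFrame(row)
# Cast the all-null object column to a concrete dtype; None becomes NaN,
# which Spark can map to a DoubleType column.
pdf['b'] = pdf['b'].astype('float64')
kdf = ks.DataFrame(pdf)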