koalas: ValueError when reading dict with None

I find that creating a Koalas DataFrame from a dict that contains a None value raises an error:

import databricks.koalas as ks

row = {'a': [1], 'b': [None]}
ks.DataFrame(row)

ValueError: can not infer schema from empty or null dataset

but with pandas there is no error:

import pandas as pd

row = {'a': [1], 'b': [None]}
print(pd.DataFrame(row))

   a     b
0  1  None

I have tried setting dtype=np.int64 but this has not helped.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 19 (7 by maintainers)

Most upvoted comments

@ederfdias Here is a possible workaround. Specify converters like below:

import numpy as np

df_ks = koalas.read_excel(
    ...
    converters={i: (lambda x: str(x) if x else np.NaN) for i in range(30)},  # read the first 30 columns as strings
)

JFYI… using the read_csv() function on a column without values, I don't receive any errors, but with read_excel() the same error is raised.

Now it works properly in pandas-on-Spark (available in Apache Spark 3.2 and above).

I'd recommend using pandas-on-Spark rather than Koalas, since Koalas is now in maintenance mode.

>>> import pyspark.pandas as ps
>>> ps.DataFrame()
Empty DataFrame
Columns: []
Index: []
>>> ps.DataFrame([{"A": [None]}])
        A
0  [None]

Apparently, np.NaN does the trick

import numpy as np
import databricks.koalas as koalas
from pyspark.sql import functions as F

row = {'a': [1], 'b': [np.NaN]}
koalas.DataFrame(row).to_spark().where(F.col("b").isNull()).show()

output

+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+

It's because PySpark, by default, tries to infer each column's type from the given data. If a column contains no data, or only nulls, PySpark cannot infer a data type for it when building the DataFrame.
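For illustration, here is a minimal PySpark sketch of that behaviour (this example assumes an active local SparkSession; the column names and types are arbitrary). Inference over an all-null column fails, while supplying an explicit schema sidesteps the inference step entirely:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

# Inference fails here: every value in column "b" is null, so Spark
# cannot decide on a type for it.
# spark.createDataFrame([(1, None)], ["a", "b"])  # raises an inference error

# With an explicit schema there is nothing to infer, so the same data works.
schema = StructType([
    StructField("a", LongType()),
    StructField("b", StringType()),
])
spark.createDataFrame([(1, None)], schema).show()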

>>> import pandas as pd
>>> row = {'a': [1], 'b': [None]}
>>> pd.DataFrame(row).dtypes
a     int64
b    object
dtype: object

pandas has an object dtype that can hold anything, whereas PySpark has no such catch-all type. So it's actually an issue in PySpark.
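As a related workaround on the Koalas side (a sketch, assuming the data starts out as a pandas DataFrame), you can give the all-null column a concrete dtype before converting, so Spark has a type to map it to:

import numpy as np
import pandas as pd
import databricks.koalas as ks

row = {'a': [1], 'b': [None]}
pdf = pd.DataFrame(row)
# Cast the all-null object column to a concrete dtype; None becomes NaN,
# which Spark can map to a DoubleType column.
pdf['b'] = pdf['b'].astype('float64')
kdf = ks.DataFrame(pdf)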