pandas: df.to_stata fails when a column of type object contains only None

Code Sample, a copy-pastable example if possible

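import pandas as pd

# Works: the object column contains at least one string.
df = pd.DataFrame({'a': ['a', None]})
df.to_stata('test.dta')

# Works: the position of None does not matter.
df = pd.DataFrame({'a': [None, 'a']})
df.to_stata('test.dta')

# Fails: the object column contains only None.
df = pd.DataFrame({'a': [None, None]})
df.to_stata('test.dta')
# ValueError: Writing general object arrays is not supported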

Problem description

The df.to_stata() method writes columns containing None without error when the column has at least one string value, but fails if the column contains only None. It’s unclear what data type a column of only None should be written as, which may be why this isn’t supported. I propose that a column containing only None be written as str1 with empty strings.

I came across this error because I read in a Parquet file with pd.read_parquet() and was unable to write the result to Stata format. In the Parquet schema, the column had type BYTE_ARRAY UTF8, but since the column had only missing values, pandas read it in as a column containing only None.
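
As a stopgap, the proposal above can be approximated before calling to_stata(). This is only a sketch: fill_all_missing_object_columns is a hypothetical helper, not part of pandas, and it assumes that empty strings (which Stata treats as missing for string variables) are an acceptable replacement for the all-None column.

```python
import pandas as pd


def fill_all_missing_object_columns(df, fill_value=""):
    """Hypothetical helper (not a pandas API).

    Returns a copy of df in which every object column that contains
    only missing values is filled with fill_value. Columns that have
    at least one non-missing value are left untouched.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype == object and col.isna().all():
            # Assigning a scalar broadcasts it to every row.
            out[name] = fill_value
    return out


df = pd.DataFrame({"a": [None, None], "b": ["x", None]})
fixed = fill_all_missing_object_columns(df)
```

After this, fixed['a'] holds empty strings, so fixed.to_stata('test.dta') should go through the ordinary string path. I have not checked this against every pandas version.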

Expected Output

Stata file written to disk with missing values for the column with None.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.2
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.11.1
xarray: None
IPython: 6.5.0
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

About this issue

State: closed
Created 6 years ago
Comments: 20 (17 by maintainers)

Commits related to this issue

More helpful Stata object dtype error. (#23572) — committed to kylebarron/pandas by deleted user 6 years ago
ENH: Improve error message for empty object array Improve the error message shown when an object array is empty closes #23572 — committed to bashtage/pandas by bashtage 6 years ago
ENH: Improve error message for empty object array (#23718) * ENH: Improve error message for empty object array Improve the error message shown when an object array is empty closes #23572 * T... — committed to pandas-dev/pandas by bashtage 6 years ago
ENH: Improve error message for empty object array (#23718) * ENH: Improve error message for empty object array Improve the error message shown when an object array is empty closes #23572 * T... — committed to tm9k1/pandas by bashtage 6 years ago
ENH: Improve error message for empty object array (#23718) * ENH: Improve error message for empty object array Improve the error message shown when an object array is empty closes #23572 * T... — committed to Pingviinituutti/pandas by bashtage 6 years ago

Most upvoted comments

Great point, thanks @bashtage! Do you think it might be worth adding an option to to_stata so that something like this happens automatically?

diego898 on Dec 8, 2019

So, just to be clear, you’re only talking about raising in the all None case, right?

Above I was referencing the all None case.

No strong opinion on writing out a column full of None with df.to_stata().

I agree with @bashtage that raising an error is the best solution.

kylebarron on Nov 12, 2018
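
To illustrate the "raise an error" option discussed above, here is a rough sketch of a pre-check that names the offending columns before to_stata() is called. Both function names and the message wording are hypothetical; this is not how pandas implements its check, only an example of what a more helpful error could look like.

```python
import pandas as pd


def find_all_missing_object_columns(df):
    """Hypothetical helper: names of object columns with only missing values."""
    return [
        name
        for name in df.columns
        if df[name].dtype == object and df[name].isna().all()
    ]


def require_no_all_missing_object_columns(df):
    """Hypothetical pre-check to run before df.to_stata().

    Raises ValueError listing every object column that contains only
    missing values, so the caller knows which columns to convert.
    """
    bad = find_all_missing_object_columns(df)
    if bad:
        raise ValueError(
            "Object columns containing only missing values cannot be "
            "written to Stata format: {}. Convert them to strings or "
            "drop them first.".format(bad)
        )


df = pd.DataFrame({"a": [None, None], "b": ["x", None]})
try:
    require_no_all_missing_object_columns(df)
    message = None
except ValueError as err:
    message = str(err)
```

The check only looks at dtype and missingness, so an object column that mixes strings with other Python objects would still need the error pandas itself raises.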