pandas: read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2

Code Sample, a copy-pastable example if possible

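import io
import pandas as pd
import numpy as np

# Build a 1,000,000 x 10 frame of random floats with columns COL0..COL9.
df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
# Serialize to an in-memory CSV buffer (no disk I/O involved).
csv = io.StringIO(df.to_csv(index=False))
# The statement being profiled.
df2 = pd.read_csv(csv)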

Problem description

pd.read_csv() using _libs.parsers.TextReader read() method is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2.

 4244 function calls (4210 primitive calls) in 10.273 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.202   10.202   10.204   10.204 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.039    0.039    0.039    0.039 internals.py:5017(_stack_arrays)
        1    0.011    0.011   10.262   10.262 parsers.py:414(_read)
        1    0.011    0.011   10.273   10.273 <string>:1(<module>)
        1    0.004    0.004    0.004    0.004 parsers.py:1685(__init__)
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)
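
The listing above has the shape of cProfile output sorted by internal time. The exact command used to produce it is not shown in the report, so the following is only a minimal sketch of one way to get a comparable listing; `profile_call` is a hypothetical helper name introduced here for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=6, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text listing
    restricted to the `top` most expensive entries.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # "tottime" is the sort key that prints as "Ordered by: internal time".
    stats.sort_stats("tottime").print_stats(top)
    return result, buf.getvalue()
```

For the case in this issue, something like `profile_call(pd.read_csv, csv)`, with `csv` being the buffer from the code sample, should exercise the same `TextReader` read path.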

Expected Output

3229 function calls (3222 primitive calls) in 2.944 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.881    2.881    2.882    2.882 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.045    0.045    0.045    0.045 internals.py:4801(_stack_arrays)
        1    0.010    0.010    2.944    2.944 parsers.py:423(_read)
        1    0.004    0.004    0.004    0.004 parsers.py:1677(__init__)
      320    0.001    0.000    0.001    0.000 common.py:777(is_integer_dtype)
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.p
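
As a quick check of the "3.5X" figure, the ratio can be computed directly from the numbers in the two profiles above. The values below are copied from those listings; nothing else is assumed.

```python
# tottime of the TextReader read() call in each profile (seconds).
slow_read, fast_read = 10.202, 2.881
# Total profiled time from the first line of each profile (seconds).
slow_total, fast_total = 10.273, 2.944

read_ratio = slow_read / fast_read
total_ratio = slow_total / fast_total
print(f"TextReader.read: {read_ratio:.2f}x, total: {total_ratio:.2f}x")
# prints "TextReader.read: 3.54x, total: 3.49x"
```

Almost all of the difference sits in the single `read` call, which is why the two ratios are so close.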

Output of pd.show_versions() -- Latest Python 3.7.1 Pandas 0.23.4 : Slow Read CSV

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.2
pip: 18.1
setuptools: 40.4.3
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.11.0
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

Output of pd.show_versions() -- Older Python 3.5.2 Pandas 0.22.0 : Fast Read CSV

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 20.10.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 56 (24 by maintainers)

Most upvoted comments

I compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and 3.7.0a4 in the Visual Studio profiler. The culprit is the isdigit function called in the parsers extension module. On 3.7.0a3 the function is fast, at ~8% of samples. On 3.7.0a4 it is slow, at ~64% of samples, because it calls the _isdigit_l function, which seems to update and restore the locale in the current thread every time…

3.7.0a3:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	705	347	28.52%	14.04%	parsers.cp37-win_amd64.pyd
   isdigit	207	207	8.37%	8.37%	ucrtbase.dll
 - _errno	105	39	4.25%	1.58%	ucrtbase.dll
   toupper	24	24	0.97%	0.97%	ucrtbase.dll
   isspace	21	21	0.85%	0.85%	ucrtbase.dll
   [python37.dll]	1	1	0.04%	0.04%	python37.dll
3.7.0a4:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	8,613	478	83.04%	4.61%	parsers.cp37-win_amd64.pyd
 + isdigit	6,642	208	64.04%	2.01%	ucrtbase.dll
 + _isdigit_l	6,434	245	62.03%	2.36%	ucrtbase.dll
 + _LocaleUpdate::_LocaleUpdate	5,806	947	55.98%	9.13%	ucrtbase.dll
 + __acrt_getptd	2,121	1,031	20.45%	9.94%	ucrtbase.dll
   FlsGetValue	647	647	6.24%	6.24%	KernelBase.dll
 - RtlSetLastWin32Error	296	235	2.85%	2.27%	ntdll.dll
   _guard_dispatch_icall_nop	101	101	0.97%	0.97%	ucrtbase.dll
   GetLastError	46	46	0.44%	0.44%	KernelBase.dll
 + __acrt_update_multibyte_info	1,475	246	14.22%	2.37%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,229	513	11.85%	4.95%	ucrtbase.dll
 + __acrt_update_locale_info	1,263	235	12.18%	2.27%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,028	429	9.91%	4.14%	ucrtbase.dll
   _ischartype_l	383	383	3.69%	3.69%	ucrtbase.dll
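
Since the 3.7.0a4 profile points at locale handling inside isdigit, one cheap thing to compare between the two interpreters is the C locale they are running under. The sketch below only queries the current `LC_CTYPE` setting; `ctype_locale_report` is a hypothetical helper name, and whether the setting actually differs between 3.7.0a3 and 3.7.0a4 is an assumption to check, not something the profile above proves.

```python
import locale
import sys


def ctype_locale_report():
    """Return the interpreter version and the current LC_CTYPE setting."""
    # Passing None as the second argument queries the locale
    # without changing it.
    current = locale.setlocale(locale.LC_CTYPE, None)
    return {"python": sys.version.split()[0], "LC_CTYPE": current}


print(ctype_locale_report())
```

Running this under both interpreters on the same machine gives two small dictionaries that can be compared side by side.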

I redid everything, forcing --channel anaconda, and got the same results.

Indeed, I can confirm that there is a 3.5X slowdown when using Python 3.7.1 on Windows 10.

When I use Python 3.5.6, the performance is unchanged from 0.22.0 to 0.23.4.

These observations are consistent with what @dragoljub was observing and appear to suggest that this is a Cython / Python issue and not a pandas one.
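
When repeating this kind of comparison across interpreters, it helps to record the environment next to the timing so that results from different machines are not mixed up. A minimal sketch follows; `timed_call` is a hypothetical helper, and taking the best of a few wall-clock runs is an assumption about how to measure, not the method used in this thread.

```python
import platform
import time


def timed_call(func, *args, repeat=3, **kwargs):
    """Call func repeatedly and return the best wall-clock time
    together with basic environment details."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "best_seconds": best,
    }
```

Because a StringIO buffer is exhausted after one read, the statement from the code sample would need a fresh buffer on each run, for example `timed_call(lambda: pd.read_csv(io.StringIO(text)))` with `text = df.to_csv(index=False)`.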