Pandas read_csv out of memory even after adding chunksize

Code Sample, a copy-pastable example if possible

import pandas

# inputFolder, dataFile, fieldsToKeep, categoryExceptions and TimeMe are defined elsewhere.

def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.

    :type table: pandas.DataFrame
    :type exceptions: set
    :param exceptions: columns not to modify
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue

        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table

dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null',
                            usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)

Problem description

I have a 34 GB TSV file and I’ve been reading it with pandas’ read_csv function with chunksize set to 1000000. The command above works fine on an 8 GB file, but pandas crashes on the 34 GB file, which then crashes my IPython notebook.
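
One workaround when the file is larger than memory is to consume the chunked reader lazily and write each processed chunk back to disk, so only one chunk is ever held in memory. This is an illustrative sketch, not from the thread: it reuses foo(), inputFolder, dataFile, fieldsToKeep and categoryExceptions from the code sample above, and the output path is hypothetical.

import pandas

reader = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null',
                         usecols=fieldsToKeep, header=0, sep='\t')

outputFile = inputFolder + 'categorised.tsv'  # hypothetical output path
for i, table in enumerate(reader):
    table = foo(table, categoryExceptions)
    # Append the processed chunk to disk; only the current chunk stays in memory.
    table.to_csv(outputFile, sep='\t', mode='a', header=(i == 0), index=False)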

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

I’ve solved the memory error problem using chunks AND low_memory=False

import pandas as pd

chunksize = 100000
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)

I’ve solved the memory error problem using smaller chunks (size 1). It was about 3x slower, but it didn’t error out. low_memory=False didn’t work.

import pandas as pd

chunksize = 1
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
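
Since the original post converts columns to categories purely to save space, a related option is to let read_csv parse those columns as categories directly through its dtype argument, avoiding a separate conversion pass. A minimal sketch, with hypothetical column names for the OFMESSAGEARCHIVE.csv file used above:

import pandas as pd

# 'sender' and 'subject' are hypothetical column names; parsing them as 'category'
# stores each repeated string value only once instead of once per row.
df = pd.read_csv('OFMESSAGEARCHIVE.csv',
                 usecols=['sender', 'subject', 'body'],
                 dtype={'sender': 'category', 'subject': 'category'})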