go: encoding/csv: Reading is slow
$ go version
go version go1.7 linux/amd64
Reading of CSV files is, out of the box, quite slow (tl;dr: 3x slower than a simple Java program, 1.5x slower than the obvious Python code). A typical example:
package main

import (
    "bufio"
    "encoding/csv"
    "fmt"
    "io"
    "os"
)

func main() {
    f, _ := os.Open("mock_data.csv")
    defer f.Close()
    r := csv.NewReader(bufio.NewReader(f))
    for {
        line, err := r.Read()
        if err == io.EOF {
            break
        }
        if line[0] == "42" {
            fmt.Println(line)
        }
    }
}
Python3 equivalent:
import csv

with open('mock_data.csv') as f:
    r = csv.reader(f)
    for row in r:
        if row[0] == "42":
            print(row)
Equivalent Java code [EDIT: not actually equivalent, please see pauldraper comment below for a better test]:

import java.io.BufferedReader;
import java.io.FileReader;

public class ReadCsv {
    public static void main(String[] args) {
        BufferedReader br;
        String line;
        try {
            br = new BufferedReader(new FileReader("mock_data.csv"));
            while ((line = br.readLine()) != null) {
                String[] data = line.split(",");
                if (data[0].equals("42")) {
                    System.out.println(line);
                }
            }
        } catch (Exception e) {}
    }
}
Tested on a 50 MB, 1,000,002-line CSV file generated as:

data = ",Carl,Gauss,cgauss@unigottingen.de,Male,30.4.17.77\n"
with open("mock_data.csv", "w") as f:
    f.write("id,first_name,last_name,email,gender,ip_address\n")
    f.write(("1" + data) * int(1e6))
    f.write("42" + data)
Results:
- Go: avg 1.489 secs
- Python: avg 0.933 secs (1.5x faster than Go)
- Java: avg 0.493 secs (3.0x faster than Go)
Go's error reporting is obviously better than what you can get with that Java code, and I'm not sure about Python, but people have been complaining about encoding/csv slowness, so it's probably worth investigating whether the csv package can be made faster.
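A minimal benchmark sketch for tracking any such improvements, assuming the mock_data.csv generated by the script above sits in the package directory (run with go test -bench=Read):

package csvbench

import (
    "encoding/csv"
    "io"
    "os"
    "testing"
)

// BenchmarkRead reads mock_data.csv record by record, mirroring the
// Go program above without the field comparison.
func BenchmarkRead(b *testing.B) {
    for i := 0; i < b.N; i++ {
        f, err := os.Open("mock_data.csv")
        if err != nil {
            b.Fatal(err)
        }
        r := csv.NewReader(f)
        for {
            _, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                b.Fatal(err)
            }
        }
        f.Close()
    }
}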
Commits related to this issue
- encoding/csv: update and add CSV reading benchmarks Benchmarks broken off from https://golang.org/cl/24723 and modified to allocate less in the places we're not trying to measure. Updates #16791 Ch... — committed to golang/go by bradfitz 8 years ago
- encoding/csv: avoid allocations when reading records This commit changes parseRecord to allocate a single string per record, instead of per field, by using indexes into the raw record. Benchstat (do... — committed to golang/go by nussjustin 8 years ago
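The idea behind the second commit can be sketched with a hypothetical helper (not the actual CL code): build one string for the whole record and slice the fields out of it, so each record costs a single string allocation instead of one per field.

package main

import "fmt"

// fieldsFromRecord converts the raw record to one string, then
// slices each field out of it. bounds holds (start, end) index
// pairs that a parser would have produced for each field.
func fieldsFromRecord(raw []byte, bounds [][2]int) []string {
    rec := string(raw) // the single allocation for this record
    fields := make([]string, len(bounds))
    for i, b := range bounds {
        fields[i] = rec[b[0]:b[1]] // substrings share rec's backing array
    }
    return fields
}

func main() {
    raw := []byte("1,Carl,Gauss")
    fmt.Println(fieldsFromRecord(raw, [][2]int{{0, 1}, {2, 6}, {7, 12}}))
    // [1 Carl Gauss]
}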
Here’s a naive example: https://play.golang.org/p/zbMdK8rCTH
1 hour for a 50MB file? No way… Also, why does the profile show 99.8% of the time spent in cgocall? There must be something wrong with your code; please ask for a code review in one of the Go forums or on the mailing list.
Apologies if this is the wrong place for this comment (perhaps this should be its own feature request?). I was wondering if it would be possible to implement an efficient streaming API. I'm thinking of a single buffer reused across all rows, with calls that return a ([][]byte, error), where the subslices point into that buffer and are only valid until the next call to the API. The standard Read() and ReadAll() methods could use this API, allocating a new buffer before each call so as to provide the normal guarantee that the returned slices won't be affected by subsequent calls.
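A rough sketch of what such a zero-copy API could look like, with a hypothetical RawReader type and ReadRaw method, no quote handling, and no attempt at optimization:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "strings"
)

// RawReader returns each record as subslices of an internal buffer
// that is reused on every call, so the result is only valid until
// the next ReadRaw.
type RawReader struct {
    br     *bufio.Reader
    buf    []byte
    fields [][]byte
}

func NewRawReader(r io.Reader) *RawReader {
    return &RawReader{br: bufio.NewReader(r)}
}

func (r *RawReader) ReadRaw() ([][]byte, error) {
    line, err := r.br.ReadBytes('\n')
    if len(line) == 0 {
        return nil, err // typically io.EOF
    }
    // Copy the line into the reused buffer and split on commas;
    // the returned subslices alias r.buf.
    r.buf = append(r.buf[:0], bytes.TrimRight(line, "\r\n")...)
    r.fields = r.fields[:0]
    start := 0
    for i, c := range r.buf {
        if c == ',' {
            r.fields = append(r.fields, r.buf[start:i])
            start = i + 1
        }
    }
    r.fields = append(r.fields, r.buf[start:])
    return r.fields, err
}

func main() {
    r := NewRawReader(strings.NewReader("id,name\n42,Carl\n"))
    for {
        rec, err := r.ReadRaw()
        if rec != nil {
            fmt.Printf("%q\n", rec)
        }
        if err != nil {
            break // io.EOF ends the stream
        }
    }
}

A real version would also need to handle quoted fields and could avoid the per-line allocation inside ReadBytes (e.g. via ReadSlice).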
Replacing [rR]eadRune with [rR]eadByte gives me around an 18% speedup; see my implementation. This is obviously just for testing, since I rip out the rune handling altogether. From what I can see, we read runes only because the delimiter (and/or the comment character) can be a rune. We could add a private byte field comma to Reader and populate it iff the separator is byte-sized, which is both the default (an actual comma) and by far the most common case (has anyone ever split a CSV on a multi-byte rune?). And if we have a byte-sized separator, we can replace all rune reading and writing with byte reading and writing. I can imagine the Read method testing the separator's size and then dispatching to a different method that only uses byte reading, writing, and switching; see the sketch below. (Or it could equally be incorporated into parseField, but there would be a lot of switching back and forth between the byte and rune contexts.) (Handling of r.Comment would work similarly, but that's a rather minor issue.)
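To make that concrete, here is a sketch of the dispatch, using hypothetical field names on a stand-in type rather than the real csv.Reader internals:

package main

import (
    "fmt"
    "unicode/utf8"
)

// reader stands in for csv.Reader; commaByte and commaIsByte are
// the hypothetical private fields described above.
type reader struct {
    Comma       rune
    commaByte   byte
    commaIsByte bool
}

// init caches the separator as a byte iff it encodes to a single
// byte in UTF-8, which covers the default comma.
func (r *reader) init() {
    if r.Comma < utf8.RuneSelf {
        r.commaByte = byte(r.Comma)
        r.commaIsByte = true
    }
}

// path shows where Read would dispatch; the real methods would be
// a ReadByte-based and a ReadRune-based parser respectively.
func (r *reader) path() string {
    if r.commaIsByte {
        return "byte fast path"
    }
    return "rune slow path"
}

func main() {
    for _, sep := range []rune{',', '§'} {
        r := &reader{Comma: sep}
        r.init()
        fmt.Printf("separator %q: %s\n", sep, r.path())
    }
}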
(I don’t know much about runes, but could we maybe just read bytes and stop on the first byte of the delimiter rune, regardless of its length? And if it’s a multibyte character, then check the subsequent bytes for a match. I don’t know how many false positives this would yield, though.)
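For what it's worth, UTF-8 makes this safe: the byte that starts a rune's encoding never appears as a continuation byte of another character, so a first-byte hit only needs the remaining bytes checked. A sketch of that matching (essentially what bytes.IndexRune already does):

package main

import (
    "bytes"
    "fmt"
    "unicode/utf8"
)

// indexRune scans for the first byte of the delimiter's UTF-8
// encoding, then confirms the remaining bytes. A first-byte hit can
// be a false positive (another rune with the same lead byte), but
// the follow-up comparison resolves it.
func indexRune(data []byte, delim rune) int {
    var enc [utf8.UTFMax]byte
    n := utf8.EncodeRune(enc[:], delim)
    for i := 0; i+n <= len(data); i++ {
        if data[i] == enc[0] && bytes.Equal(data[i:i+n], enc[:n]) {
            return i
        }
    }
    return -1
}

func main() {
    fmt.Println(indexRune([]byte("a§b"), '§')) // prints 1
}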