go: encoding/csv: Reading is slow
$ go version
go version go1.7 linux/amd64
Reading of CSV files is, out of the box, quite slow (tl;dr: 3x slower than a simple Java program, 1.5x slower than the obvious Python code). A typical example:
package main

import (
    "bufio"
    "encoding/csv"
    "fmt"
    "io"
    "os"
)

func main() {
    f, _ := os.Open("mock_data.csv")
    defer f.Close()
    r := csv.NewReader(bufio.NewReader(f))
    for {
        line, err := r.Read()
        if err == io.EOF {
            break
        }
        if line[0] == "42" {
            fmt.Println(line)
        }
    }
}
Python3 equivalent:
import csv

with open('mock_data.csv') as f:
    r = csv.reader(f)
    for row in r:
        if row[0] == "42":
            print(row)
Equivalent Java code [EDIT: not actually equivalent, please see pauldraper comment below for a better test]:

import java.io.BufferedReader;
import java.io.FileReader;

public class ReadCsv {
    public static void main(String[] args) {
        BufferedReader br;
        String line;
        try {
            br = new BufferedReader(new FileReader("mock_data.csv"));
            while ((line = br.readLine()) != null) {
                String[] data = line.split(",");
                if (data[0].equals("42")) {
                    System.out.println(line);
                }
            }
        } catch (Exception e) {}
    }
}
Tested on a 50 MB, 1,000,002-line CSV file generated as:

data = ",Carl,Gauss,cgauss@unigottingen.de,Male,30.4.17.77\n"
with open("mock_data.csv", "w") as f:
    f.write("id,first_name,last_name,email,gender,ip_address\n")
    f.write(("1" + data) * int(1e6))
    f.write("42" + data)
Results:
- Go: avg 1.489 secs
- Python: avg 0.933 secs (1.5x faster than Go)
- Java: avg 0.493 secs (3.0x faster than Go)
Go's error reporting is obviously better than what you can get with that Java code, and I'm not sure about Python, but people have been complaining about encoding/csv slowness, so it's probably worth investigating whether the csv package can be made faster.
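A minimal benchmark sketch for tracking any such improvements, assuming the mock_data.csv generated by the script above sits in the package directory (run with go test -bench=Read):

package csvbench

import (
    "encoding/csv"
    "io"
    "os"
    "testing"
)

// BenchmarkRead reads mock_data.csv record by record, mirroring the
// Go program above without the field comparison.
func BenchmarkRead(b *testing.B) {
    for i := 0; i < b.N; i++ {
        f, err := os.Open("mock_data.csv")
        if err != nil {
            b.Fatal(err)
        }
        r := csv.NewReader(f)
        for {
            _, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                b.Fatal(err)
            }
        }
        f.Close()
    }
}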
Commits related to this issue
- encoding/csv: update and add CSV reading benchmarks Benchmarks broken off from https://golang.org/cl/24723 and modified to allocate less in the places we're not trying to measure. Updates #16791 Ch... — committed to golang/go by bradfitz 8 years ago
- encoding/csv: avoid allocations when reading records This commit changes parseRecord to allocate a single string per record, instead of per field, by using indexes into the raw record. Benchstat (do... — committed to golang/go by nussjustin 8 years ago
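The idea behind the second commit can be sketched with a hypothetical helper (not the actual CL code): build one string for the whole record and slice the fields out of it, so each record costs a single string allocation instead of one per field.

package main

import "fmt"

// fieldsFromRecord converts the raw record to one string, then
// slices each field out of it. bounds holds (start, end) index
// pairs that a parser would have produced for each field.
func fieldsFromRecord(raw []byte, bounds [][2]int) []string {
    rec := string(raw) // the single allocation for this record
    fields := make([]string, len(bounds))
    for i, b := range bounds {
        fields[i] = rec[b[0]:b[1]] // substrings share rec's backing array
    }
    return fields
}

func main() {
    raw := []byte("1,Carl,Gauss")
    fmt.Println(fieldsFromRecord(raw, [][2]int{{0, 1}, {2, 6}, {7, 12}}))
    // [1 Carl Gauss]
}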
Here’s a naive example: https://play.golang.org/p/zbMdK8rCTH
1 hour for a 50MB file? No way… Also, why does the profile show 99.8% of the time spent in cgocall? There must be something wrong with your code; please ask for a code review in one of the Go forums or on the mailing list.
Apologies if this is the wrong place for this comment (perhaps this should be its own feature request?). I was wondering if it would be possible to implement an efficient streaming API. I'm thinking of a single buffer reused across all rows, with calls that return a ([][]byte, error), where the subslices point into that buffer and are only valid until the next call to the API. The standard Read() and ReadAll() methods could use this API, allocating a new buffer before each call so as to provide the normal guarantee that the returned slices won't be affected by subsequent calls.
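A rough sketch of what such a zero-copy API could look like, with a hypothetical RawReader type and ReadRaw method, no quote handling, and no attempt at optimization:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "strings"
)

// RawReader returns each record as subslices of an internal buffer
// that is reused on every call, so the result is only valid until
// the next ReadRaw.
type RawReader struct {
    br     *bufio.Reader
    buf    []byte
    fields [][]byte
}

func NewRawReader(r io.Reader) *RawReader {
    return &RawReader{br: bufio.NewReader(r)}
}

func (r *RawReader) ReadRaw() ([][]byte, error) {
    line, err := r.br.ReadBytes('\n')
    if len(line) == 0 {
        return nil, err // typically io.EOF
    }
    // Copy the line into the reused buffer and split on commas;
    // the returned subslices alias r.buf.
    r.buf = append(r.buf[:0], bytes.TrimRight(line, "\r\n")...)
    r.fields = r.fields[:0]
    start := 0
    for i, c := range r.buf {
        if c == ',' {
            r.fields = append(r.fields, r.buf[start:i])
            start = i + 1
        }
    }
    r.fields = append(r.fields, r.buf[start:])
    return r.fields, err
}

func main() {
    r := NewRawReader(strings.NewReader("id,name\n42,Carl\n"))
    for {
        rec, err := r.ReadRaw()
        if rec != nil {
            fmt.Printf("%q\n", rec)
        }
        if err != nil {
            break // io.EOF ends the stream
        }
    }
}

A real version would also need to handle quoted fields and could avoid the per-line allocation inside ReadBytes (e.g. via ReadSlice).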
Replacing [rR]eadRune with [rR]eadByte gives me around an 18% speedup; see my implementation. This is obviously just for testing, since I rip out the rune handling altogether. From what I can see, we read runes only because the delimiter (and/or the comment character) can be a rune. We could add a private byte field comma to Reader and populate it iff the separator is byte-sized, which is both the default (an actual comma) and by far the most common case (has anyone ever split a CSV on a multi-byte rune?). And if we have a byte-sized separator, we can replace all rune reading and writing with byte reading and writing. I can imagine the Read method testing the separator's size and then dispatching to a different method that only uses byte reading, writing, and switching; see the sketch below. (Or it could equally be incorporated into parseField, but there would be a lot of switching back and forth between the byte and rune contexts.) (Handling of r.Comment would work similarly, but that's a rather minor issue.)
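To make that concrete, here is a sketch of the dispatch, using hypothetical field names on a stand-in type rather than the real csv.Reader internals:

package main

import (
    "fmt"
    "unicode/utf8"
)

// reader stands in for csv.Reader; commaByte and commaIsByte are
// the hypothetical private fields described above.
type reader struct {
    Comma       rune
    commaByte   byte
    commaIsByte bool
}

// init caches the separator as a byte iff it encodes to a single
// byte in UTF-8, which covers the default comma.
func (r *reader) init() {
    if r.Comma < utf8.RuneSelf {
        r.commaByte = byte(r.Comma)
        r.commaIsByte = true
    }
}

// path shows where Read would dispatch; the real methods would be
// a ReadByte-based and a ReadRune-based parser respectively.
func (r *reader) path() string {
    if r.commaIsByte {
        return "byte fast path"
    }
    return "rune slow path"
}

func main() {
    for _, sep := range []rune{',', '§'} {
        r := &reader{Comma: sep}
        r.init()
        fmt.Printf("separator %q: %s\n", sep, r.path())
    }
}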
(I don’t know much about runes, but could we maybe just read bytes and stop on the first byte of the delimiter rune, regardless of its length? And if it’s a multibyte character, then check the subsequent bytes for a match. I don’t know how many false positives this would yield, though.)
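For what it's worth, UTF-8 makes this safe: the byte that starts a rune's encoding never appears as a continuation byte of another character, so a first-byte hit only needs the remaining bytes checked. A sketch of that matching (essentially what bytes.IndexRune already does):

package main

import (
    "bytes"
    "fmt"
    "unicode/utf8"
)

// indexRune scans for the first byte of the delimiter's UTF-8
// encoding, then confirms the remaining bytes. A first-byte hit can
// be a false positive (another rune with the same lead byte), but
// the follow-up comparison resolves it.
func indexRune(data []byte, delim rune) int {
    var enc [utf8.UTFMax]byte
    n := utf8.EncodeRune(enc[:], delim)
    for i := 0; i+n <= len(data); i++ {
        if data[i] == enc[0] && bytes.Equal(data[i:i+n], enc[:n]) {
            return i
        }
    }
    return -1
}

func main() {
    fmt.Println(indexRune([]byte("a§b"), '§')) // prints 1
}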