ponyc: Extremely poor performance and high memory usage when iterating a FileLines instance

Given the program:

use "files"

actor Main
  new create(env: Env) =>
    try
      var count: U32 = 0
      var path = FilePath(env.root as AmbientAuth, "./googlebooks-eng-all-1gram-20120701-0")?
      var file = OpenFile(path) as File
      for line in FileLines(file) do
        count = count + 1
      end
    end

Run it on the data file https://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-0.gz (decompress it first).

The program takes an extremely long time to iterate over the file, and its RAM usage is far higher than I expected: about 5 GB for an input file of about 180 MB.

For comparison with python:

$ cat test.py
count = 0
for l in open("googlebooks-eng-all-1gram-20120701-0"):
	count += 1

$ time python test.py

real    0m2.092s
user    0m1.310s
sys     0m0.108s

$ time ./test-pony
<<Manually kill>>
real	0m18.638s
user	0m2.584s
sys	0m13.456s

Reproduced with Pony 0.18 on Windows 10 and Linux (Xubuntu).

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 26 (14 by maintainers)

Most upvoted comments

We have decided to go with the following approach:

  1. Remove line() from File, as it is surprisingly slow and its existence could end up biting people.
  2. Update the documentation to tell people to use FileLines.
  3. Audit File for other performance issues that might have come about when we switched from C-style handles to file descriptors with writev().
  4. Make the following changes to FileLines:
    • use buffered.Reader to store data
    • keep track of the position we have read up to
    • when it is time to “refill” our buffer:
      • determine the current seek position of the underlying file and save it so we can restore it later
      • move the seek position to the last point that our buffer read to
      • read more data
      • store where we read to, for the next time we need to “refill” our buffer
      • reset the file to the seek position it was at before we started reading
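
The refill steps above can be sketched roughly as follows. This is an illustration in Python (the language of the comparison script), not the actual Pony FileLines change. The class name SharedFileLines, the chunk_size parameter and the plain bytearray standing in for buffered.Reader are all assumptions made for the example.

```python
import io


class SharedFileLines:
    """Hypothetical sketch: iterate lines of a seekable binary file
    without disturbing the seek position seen by other users of it."""

    def __init__(self, f, chunk_size=65536):
        self._f = f
        self._chunk_size = chunk_size   # assumed refill size, not from the issue
        self._buf = bytearray()         # stands in for buffered.Reader
        self._read_pos = f.tell()       # position we have read up to
        self._eof = False

    def _refill(self):
        saved = self._f.tell()          # remember the caller's seek position
        self._f.seek(self._read_pos)    # go to where our last read ended
        data = self._f.read(self._chunk_size)
        self._read_pos = self._f.tell() # store where we read to for next time
        self._f.seek(saved)             # restore the caller's seek position
        if data:
            self._buf += data
        else:
            self._eof = True

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            i = self._buf.find(b"\n")
            if i >= 0:
                line = bytes(self._buf[:i])
                del self._buf[:i + 1]   # drop the line and its newline
                return line
            if self._eof:
                if self._buf:           # final line without a trailing newline
                    line = bytes(self._buf)
                    self._buf.clear()
                    return line
                raise StopIteration
            self._refill()


# Usage: the file's own position is untouched by iteration.
f = io.BytesIO(b"a\nbb\nccc")
lines = SharedFileLines(f, chunk_size=2)
first = next(lines)
pos_after_first = f.tell()
```

Data is read in chunks and split in memory, so the cost per line is a buffer scan rather than a system call, and saving and restoring the seek position around each refill keeps the iterator from interfering with other reads or writes on the same file.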