graphql-ruby: Returning lots of objects is slow

We’re seeing some performance issues with a certain field that returns a lot of data. To better understand how GraphQL performance depends on the size of the response, we wrote a benchmark script that uses the GraphQL gem to query for increasingly large lists of objects at once:

https://gist.github.com/lovitt/effa205b876d9a9b86ee58399cceaf31

All test data is pre-generated and in memory, so the script should only be measuring the overhead of the GraphQL system itself — the time it takes to resolve each of the fields, check the raw value against the expected type, etc., and build the response (i.e., not time to pull data out of some database). All returned objects are of the same type, and all 10 of its fields are requested in the query.
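The full script is in the gist above; in outline it looks something like this (written here with the class-based API, and the type, field, and constant names are simplified placeholders rather than the gist’s exact ones):

require "graphql"
require "benchmark/ips"

# A simple object type whose fields resolve straight out of a Hash.
# (The real benchmark type has 10 scalar fields; three is enough to show the shape.)
class ThingType < GraphQL::Schema::Object
  field :id, Integer, null: false
  field :name, String, null: false
  field :description, String, null: false
end

class QueryType < GraphQL::Schema::Object
  field :things, [ThingType], null: false do
    argument :first, Integer, required: true
  end

  def things(first:)
    # All data is pre-generated and in memory, so the benchmark measures
    # GraphQL execution overhead rather than data access.
    THINGS.first(first)
  end
end

class BenchSchema < GraphQL::Schema
  query QueryType
end

THINGS = 10_000.times.map do |i|
  { id: i, name: "Thing #{i}", description: "Description #{i}" }
end

QUERY = "{ things(first: %d) { id name description } }"

Benchmark.ips do |x|
  [10, 100, 1_000, 10_000].each do |n|
    x.report("Querying for #{n} objects - all fields") do
      BenchSchema.execute(QUERY % n)
    end
  end
  x.compare!
end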

Here are the results on my OS X dev machine:

Calculating -------------------------------------
Querying for 10 objects - all fields                   
                        225.168  (±26.6%) i/s -    952.000  in   5.041098s
Querying for 100 objects - all fields                  
                         29.132  (±24.0%) i/s -    130.000  in   5.038850s
Querying for 1000 objects - all fields                 
                          2.713  (± 0.0%) i/s -     14.000  in   5.346691s
Querying for 10000 objects - all fields                
                          0.262  (± 0.0%) i/s -      2.000  in   7.654611s

Comparison:
Querying for 10 objects - all fields                   :      225.2 i/s
Querying for 100 objects - all fields                  :       29.1 i/s - 7.73x  slower
Querying for 1000 objects - all fields                 :        2.7 i/s - 82.99x  slower
Querying for 10000 objects - all fields                :        0.3 i/s - 858.20x  slower

For this particular test schema and object type, there seems to be roughly 1.8ms of overhead introduced for every object in the response. Do these numbers sound right? Are these performance characteristics about what we should expect?

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 22 (14 by maintainers)

Most upvoted comments

I came back here to double-check my benchmark results and found I never posted them 🙈 ! So I ran them again.

Before

Calculating -------------------------------------
Querying for 10 objects - all fields
                        163.645  (±16.5%) i/s -    768.000  in   5.015027s
Querying for 100 objects - all fields
                         18.821  (±15.9%) i/s -     86.000  in   5.032493s
Querying for 1000 objects - all fields
                          1.807  (± 0.0%) i/s -     10.000  in   5.730450s
Querying for 10000 objects - all fields
                          0.163  (± 0.0%) i/s -      1.000  in   6.127659s

After

Using features on master:

+     use GraphQL::Analysis::AST
+     use GraphQL::Execution::Interpreter
Calculating -------------------------------------
Querying for 10 objects - all fields
                        251.892  (±12.3%) i/s -      1.224k in   5.007991s
Querying for 100 objects - all fields
                         29.337  (± 6.8%) i/s -    146.000  in   5.019815s
Querying for 1000 objects - all fields
                          2.954  (± 0.0%) i/s -     15.000  in   5.098905s
Querying for 10000 objects - all fields
                          0.268  (± 0.0%) i/s -      2.000  in   7.507030s

So, it seems about 50% faster.
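(If you want to try the same comparison on your own schema, those two opt-in features are enabled on a class-based schema like this; the schema and query type names are just placeholders.)

class BenchSchema < GraphQL::Schema
  query QueryType

  # Opt in to the AST-based analysis engine and the new interpreter runtime.
  # In later releases these became the defaults.
  use GraphQL::Analysis::AST
  use GraphQL::Execution::Interpreter
end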

Ran these again on d874f79 and saw some improvements. The benchmark is a good bit (~30%) faster:

Querying for 10 objects - all fields                   
                        295.518  (±14.2%) i/s -      1.428k in   5.007656s
Querying for 100 objects - all fields                  
                         37.471  (±13.3%) i/s -    180.000  in   5.054762s
Querying for 1000 objects - all fields                 
                          3.658  (± 0.0%) i/s -     18.000  in   5.016701s
Querying for 10000 objects - all fields                
                          0.335  (± 0.0%) i/s -      2.000  in   6.013074s

The overall memory usage is better too:

- Total allocated: 464321016 bytes (2940816 objects)
+ Total allocated: 260320824 bytes (2000815 objects)

The method profile shows a lot less time in .new and #initialize, but now much more in ===:

 %self      total      self      wait     child     calls  name
  9.52      1.119     1.119     0.000     0.000  6520102   Module#===
  8.69      1.303     1.021     0.000     0.283   410001   GraphQL::Query::Context::FieldResolutionContext#initialize
  5.61      1.963     0.659     0.000     1.304   410176  *Class#new
  4.04      0.678     0.475     0.000     0.203  1750009   Kernel#===
  3.96      2.428     0.465     0.000     1.962   410001   GraphQL::Query::Context::SharedMethods#spawn_child
  3.17      0.939     0.373     0.000     0.566   420001   GraphQL::Query::Context::FieldResolutionContext#value=
  2.49      0.293     0.293     0.000     0.000  1530036   Kernel#is_a?
  2.25      0.264     0.264     0.000     0.000  1660009   BasicObject#==
  1.62      0.190     0.190     0.000     0.000   610003   GraphQL::Query::Context::SharedMethods#invalid_null?
  1.48      0.399     0.174     0.000     0.224   120000   <Object::GraphQL::ScalarType>#coerce_result
  1.27      0.150     0.150     0.000     0.000   120000   String#encode
  1.27      0.149     0.149     0.000     0.000   880063   Kernel#nil?

So I’ll see if I can avoid some of those!
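(A quick note on that top row: Ruby calls Module#=== every time a case/when branch tests against a class or module, so a count that high usually points at type-dispatch code shaped roughly like this. The example is illustrative, not the gem’s actual source.)

# Every `when SomeClass` below is one Module#=== call at runtime.
def coerce_result(value)
  case value
  when String  then value.encode(Encoding::UTF_8)
  when Integer then value
  when Float   then value
  else raise TypeError, "unexpected #{value.class}"
  end
end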

I’ve migrated the benchmark above to the class-based schema so that I can try it with the new interpreter: https://gist.github.com/rmosolgo/eda7d1023cf0b096d52da6ba387b2785
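(For anyone following along, the migration is mostly mechanical; on a graphql-ruby version where both APIs coexist, a .define-style type becomes a class roughly like this. The field names are just examples.)

require "graphql"

# Before: the legacy .define API
LegacyThingType = GraphQL::ObjectType.define do
  name "Thing"
  field :name, !types.String
end

# After: the class-based API, which is what the new interpreter runs against
class ThingType < GraphQL::Schema::Object
  graphql_name "Thing"
  field :name, String, null: false
end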

Nothing to report on yet, but I wanted to share that I’m hacking away at it!

I haven’t looked into optimizations like that because it’s not a bottleneck for my application. (For us, it’s application logic and IO.) How about trying some Ruby profilers to make sure we’re hunting the right target? Here are two I’m familiar with:

  • ruby-prof
  • memory_profiler

The output of those profilers can be a bit … dense … but if you want to interpret it together, I’d be happy to take a look. Just open another issue and share your benchmark code along with the profiler output.

Here’s some example code showing how I do that when working on this gem:

https://github.com/rmosolgo/graphql-ruby/blob/1a21f665554a00c822b8f5212b2900e64ec17a91/benchmark/run.rb#L44-L52

https://github.com/rmosolgo/graphql-ruby/blob/1a21f665554a00c822b8f5212b2900e64ec17a91/benchmark/run.rb#L73-L77

(Those scripts write to stdout, so I write the result to a file, for example bundle exec rake:benchmark > before_profile.txt.)
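A minimal standalone version of that approach looks like this; it assumes the ruby-prof and memory_profiler gems, and MySchema and the query string stand in for your own schema and query.

require "ruby-prof"
require "memory_profiler"

query_string = "{ things(first: 1000) { id name description } }"

# CPU profile: which methods is execution time going to?
result = RubyProf.profile do
  MySchema.execute(query_string)   # MySchema is your own schema class
end
RubyProf::FlatPrinter.new(result).print(STDOUT, min_percent: 1)

# Memory profile: how many objects and bytes does one query allocate?
report = MemoryProfiler.report do
  MySchema.execute(query_string)
end
report.pretty_print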

Hi, just wanted to share an update. I’m working on a new runtime over at #1394. Instead of eagerly duping all the objects for different fields, it duplicates them only when needed. (Namely, when requested by extras [...] or when fields return promises.) So I’m optimistic that it will improve runtime performance a lot. It’s not quite ready yet though. Feel free to keep an eye on that issue if you’re interested.

Thanks for the detailed benchmark! That’s my understanding too, that the sheer number of objects allocated causes a lot of overhead.

Currently I’m pretty focused on landing the finishing touches to the built-in authorization stuff, and getting that deployed at GitHub, but after that, my focus will be on improving runtime performance. (It’s also a GitHub priority!)

I’m pretty stumped about how to reduce that, because those objects seem really necessary. You need something to hold the runtime path, for example (which includes list indexes, unlike the static path), and something to hold on to promises while they’re still pending.

My first order of business before hacking on the runtime is to read up a bit. I got a couple of books with my learning budget (Language Implementation Patterns and Essentials of Programming Languages), so I’m hoping some proper book learnin’ will help!

There are some techniques I can think of from programming language optimization, like reusing frames instead of making new ones, and not allocating frames when you can avoid it. I wonder if something like that will come in handy.
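To make the frame-reuse idea concrete, here’s a toy sketch in plain Ruby (it has nothing to do with graphql-ruby’s actual internals): instead of allocating a fresh frame for every field, one frame is mutated in place, and only a resolver that needs to keep the frame around would pay for a dup.

# Toy illustration only.
Frame = Struct.new(:object, :path)

# Naive version: N fields means N Frame allocations.
def resolve_with_fresh_frames(fields, object)
  fields.to_h { |name, resolver| [name, resolver.call(Frame.new(object, name))] }
end

# Frame-reuse version: one Frame is mutated in place for each sibling field.
# Anything that must outlive the loop (a pending promise, say) would dup it,
# so only those cases pay for an extra allocation.
def resolve_reusing_one_frame(fields, object)
  frame = Frame.new(object, nil)
  fields.to_h do |name, resolver|
    frame.path = name
    [name, resolver.call(frame)]
  end
end

fields = {
  "id"   => ->(f) { f.object[:id] },
  "name" => ->(f) { f.object[:name] },
}
p resolve_reusing_one_frame(fields, { id: 1, name: "Thing 1" })
# => {"id"=>1, "name"=>"Thing 1"}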