graphql-engine: Migrations drop and recreate hdb_views dozens of times, causing max_locks_per_transaction to be exceeded
Updating from v1.0.0-beta.6 to v1.0.0-beta.7 results in this error when running sudo docker-compose up -d: ERROR: manifest for hasura/graphql-engine:v1.0.0-beta.7 not found: manifest unknown: manifest unknown
Updating from v1.0.0-beta.6 to v1.0.0-beta.8 results in 502 Bad Gateway in the browser when accessing the console. The UI fails too because all GraphQL calls get a 502 response.
Updating from v1.0.0-beta.6 to v1.0.0-beta.9 or v1.0.0-beta.10 results in the same error.
Reverting to v1.0.0-beta.6 instantly works after sudo docker-compose up -d.
What am I doing wrong?
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 43 (16 by maintainers)
@lexi-lambda please have yourself a beer - you made my day!
I can confirm that my app is working fine with v1.1.0-beta.2, without changing `max_locks_per_transaction`. Thanks a lot for your hard work!
@barbalex To be honest, I am not certain whether the root cause of the issue you’re seeing is this change or something else… but I realized it doesn’t actually matter, because we’re getting rid of `hdb_views` for insert permissions entirely! See #3598.

@lexi-lambda I did:
then replaced `v1.0.0-beta6` with `pull3394-67093178`
then
and it works 😄
Yes, my apologies—I was hoping to leave a comment on this thread yesterday, but one more unexpected issue came up that led me to hold off.
The good news: I have been working on a fix for this in #3394, and I think it basically works. It would be great if either of you could try the experimental build in https://github.com/hasura/graphql-engine/pull/3394#issuecomment-566198192 and let me know if it resolves your problem. I would have liked for this change to go into v1.0.0, but it’s a large change, and there are some outstanding subtleties, so I’ve been hoping to have some people try it out before merging it.
The bad news: the change should work fine, but there are some lingering performance issues that seem to stem primarily from a poor interaction with the parallel GC on machines where the number of cores the OS reports is larger than the number of cores `graphql-engine` should reasonably be using. For example, on a Heroku free dyno, `nproc` reports 8, so `graphql-engine` currently defaults to running on 8 cores. That is not a good choice, however, as Heroku free-tier dynos are shared, and this seems to create a significant performance hit.

I am still looking into the appropriate solution for that, but in the meantime, if you want to try the build, consider restricting the number of cores `graphql-engine` uses manually. The easiest way to do that is to set the `GHCRTS=-N<x>` environment variable, replacing `<x>` with the number of cores you’d like it to run on. Setting `GHCRTS=-N1` is a particularly conservative choice, since it disables parallelism completely, but it will certainly mitigate the pathological behavior.

As a final point of note, the performance of running the migrations is still not good: on the database you sent me, they take 15-20 seconds. However, they do eventually finish, and since migrating is a one-time cost, I haven’t worried about it too much yet. There are ways we can improve that number much further over time; it’s just a matter of work.
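If you run graphql-engine in Docker, the environment variable can be set directly on the container. A minimal sketch, assuming a typical standalone setup from this thread (the image tag and connection string are placeholders; adapt them to your docker-compose.yml):

```shell
# Restrict the GHC runtime to one core before starting the container.
# GHCRTS is read by any GHC-compiled binary, including graphql-engine.
docker run -d \
  -e GHCRTS='-N1' \
  -e HASURA_GRAPHQL_DATABASE_URL='postgres://user:pass@host:5432/db' \
  hasura/graphql-engine:v1.0.0-beta.10
```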
I’m not completely certain. It might help, but it might not, since even squashed migrations could trigger the issue (since they still have individual calls to things like `track_table`, IIUC?). If absolutely necessary, it would probably work to drop the catalog information and reapply the metadata in batches, so that there aren’t too many individual query operations in each `bulk` batch (and therefore not too many query operations in a single transaction).

Hopefully I’ll have a less awkward solution available soon. I’ll update this issue once I have a development build available for testing.
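For anyone attempting the batching workaround described above: the metadata API accepts a `bulk` query type whose `args` run together, so splitting one giant `bulk` into several smaller requests keeps the per-transaction lock count down. A rough sketch, where the endpoint, admin secret, and table names are placeholders:

```shell
# Apply metadata in small bulk batches instead of one large one,
# so each request acquires fewer locks in its transaction.
curl -s http://localhost:8080/v1/query \
  -H 'Content-Type: application/json' \
  -H 'X-Hasura-Admin-Secret: myadminsecret' \
  -d '{
    "type": "bulk",
    "args": [
      { "type": "track_table", "args": { "schema": "public", "name": "authors" } },
      { "type": "track_table", "args": { "schema": "public", "name": "articles" } }
    ]
  }'
```

Repeat with the next batch of operations until all of the metadata has been reapplied.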
Thanks for this very good and transparent information.
So I know that if I want to update, or if I run into any problems before the issue is solved, I will have to migrate the db to a virtual droplet so I can increase `max_locks_per_transaction`.
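For reference, on a self-managed Postgres (e.g. on a droplet) the setting can be raised like this; 128 is an arbitrary example value, and the parameter only takes effect after a server restart:

```shell
# Double the lock-table size (the PostgreSQL default is 64).
psql -U postgres -c "ALTER SYSTEM SET max_locks_per_transaction = 128;"
# The new value is applied only on postmaster restart:
sudo systemctl restart postgresql
```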