cloudquery: Error aws->postgresql unexpected EOF
After upgrading to the latest versions of CQ and the AWS/Postgres plugins, the sync exits after only 7 minutes (it usually runs for close to 3 hours), having gathered credentials and started syncing tables, and I get the following output in the logs:
Error: failed to write for aws->postgresql: failed to CloseAndRecv client: rpc error: code = Internal desc = Context done: context canceled and failed to wait for plugin: failed to execute batch: unexpected EOF
We’ve gone from CQ 1.3.7 to 2.0.31, AWS from v3.5.0 to v9.1.1, and Postgres from v1.3.9 to v2.0.1.
We’re running CQ in a Docker container on AWS ECS with Aurora Serverless v1 Postgres (pre-scaled to 16 ACUs before the sync is called); this has worked fairly well for some time on a daily schedule. We have 350+ AWS accounts with only a few tables in the CQ config ignore list, mainly `aws_inspector2_findings` and all the lightsail tables.
Please let me know what else you’d need to diagnose this.
About this issue
- State: closed
- Created a year ago
- Comments: 25 (14 by maintainers)
Commits related to this issue
- feat: Log received signal when shutting down (#6933) This adds a message to the log like this for SIGTERM: ``` 2023-01-18T13:59:23Z ERR exiting with error error="received terminated signal from O... — committed to cloudquery/cloudquery by hermanschaaf a year ago
- fix(logging): Log more explicit message when OOM and other status codes occur (#659) Related to https://github.com/cloudquery/cloudquery/issues/6431 This change should inform users if either the s... — committed to cloudquery/plugin-sdk by hermanschaaf a year ago
@hermanschaaf Apologies for the length of time it’s taken, but I have now resolved the issue. I initially set up an EC2 instance with a local Postgres DB, with as close to the same CPU/memory as the ECS container, the same Postgres version, and the latest CQ/plugin versions, and I successfully ran a sync across all of our accounts. I then tried again with our original setup, ECS Fargate with Aurora Serverless v1 Postgres 11.16, and this failed as before. I then truncated the DB and tried again; this time the sync worked, however subsequent syncs failed intermittently. I have now migrated to Aurora Serverless v2 Postgres 14.6, and with the latest CQ/plugins the sync has been running successfully each day for a week now. I have come to the conclusion it was either the DB config/infrastructure or (less likely) the data causing the error. Sorry I can’t really offer much more insight, but I’ll close this issue now. Thanks again for your support!
@castaples No worries, thanks for the update, and really glad to hear it’s resolved now!
Hi @castaples, thanks for reporting the experiment. Yes, this will definitely have an effect on the Postgres side of things. The default as far as I can see in the code (we need to add it to the docs) is 10000, so maybe you can try 2000. Be sure to put the `batch_size` in the upper `spec` for the `destination`, not in the `postgresql` spec. Once you try on EC2, if it behaves the same way we will add some retries on the Postgres plugin side, but it will be much easier to debug it that way.
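For reference, a minimal destination spec sketch showing that placement, assuming the postgresql plugin’s `connection_string` option; the version number, batch size and connection string here are placeholders, not taken from this thread:

```yaml
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  version: "v2.0.1"   # illustrative; use your installed plugin version
  batch_size: 2000    # top-level destination option, not inside the plugin spec below
  spec:
    connection_string: "${PG_CONNECTION_STRING}"   # placeholder
```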
So, I’ve been digging into this some more…
I tried to reproduce the issue by creating an Aurora Serverless instance (v2, AWS didn’t allow me to create V1) and connecting to it from EC2. I’m afraid I wasn’t able to reproduce the issue. The sync ran to completion in 11 minutes in my case, though it is testing against a much smaller cloud footprint than yours. I’m using the same plugin versions from when the issue was created (AWS v9.1.1, postgresql v2.0.1) with the most recent CLI.
I also tried setting the source `concurrency` to a lower value so that the sync would take longer, hoping to see if this might result in a connection being interrupted, but even after 20 minutes it was still running.

I then went through the log messages carefully again, and it seems like the `Unexpected EOF` is happening at a point where the destination plugin server is already shutting down (`failed to CloseAndRecv client`). I believe this can only happen if either 1) all resources have been fetched and the channel closed, or 2) the gRPC connection got interrupted for some reason, most probably because the destination plugin process was shut down. TL;DR: I think the `Unexpected EOF` error is a red herring.

@castaples Would you perhaps be up for doing a Zoom debugging session with me and @yevgenypats some time? If so, you can set it up here: https://calendly.com/yevgenyp
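As a side note on the concurrency experiment mentioned above: assuming the top-level `concurrency` field of the source spec, lowering it would look roughly like this sketch (the value is only an example):

```yaml
kind: source
spec:
  name: "aws"
  # ...existing path/version/tables/destinations stay as-is...
  concurrency: 100   # deliberately low example value to slow the sync down
```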
@hermanschaaf Yes reverting to the previous versions means the sync then completes successfully.
Thanks for trying and providing the details @castaples! Too bad it’s not good news; it looks like the issue really isn’t down to any specific resources we added. And sorry, yeah, I see my list of tables included two that were only added a few hours ago and not officially released yet (I was working against the main branch) 😃 I’ll continue digging into potential connection management issues.
I’ve just tried CQ 2.1.0, AWS v10.1.0 and Postgres v2.0.5, initially with your list of skip tables above, including my own. However, I had to remove two (`aws_organization_resource_policies` and `aws_xray_resource_policies`) from your list as these caused an error:
After removing them, I then had to perform a couple of manual fixes during the table migration phase of the sync:
Once it was finally running, it then errored with the “Unexpected EOF” after about 5 minutes; the last few log outputs are:
@castaples While I continue investigating connection management, could you also try running a sync with the new versions, but skipping all the new AWS tables that have been added since v3.5.0? This should give us a more apples-to-apples comparison. I auto-generated a list for you here by comparing the list of tables for the two versions:

By the way, in newer versions you should also be able to use wildcard matching, so you could skip all of lightsail with `aws_lightsail*`.
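A sketch of that wildcard skip in the source spec, assuming the standard `tables`/`skip_tables` fields; the plugin version and the second skipped table are just examples taken from earlier in the thread:

```yaml
kind: source
spec:
  name: "aws"
  path: "cloudquery/aws"
  version: "v10.1.0"    # illustrative
  tables: ["*"]
  skip_tables:
    - "aws_lightsail*"            # wildcard skips every lightsail table
    - "aws_inspector2_findings"   # example from the original ignore list
  destinations: ["postgresql"]
```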
@castaples I’m looking into this… I’ll try and see what changed between these versions, and I notice there have been a few Unexpected EOF errors reported for the pgx library that we use, so there might be something there. My current theory is that it’s related to connection management with Aurora Serverless, which may be different from normal Postgres installations.