cloudquery: Error aws->postgresql unexpected EOF

After upgrading to the latest versions of CQ and the AWS/Postgres plugins, the sync exits after only 7 minutes (it usually runs for close to 3 hours), having gathered credentials and started syncing tables, with the following output in the logs:

Error: failed to write for aws->postgresql: failed to CloseAndRecv client: rpc error: code = Internal desc = Context done: context canceled and failed to wait for plugin: failed to execute batch: unexpected EOF

We’ve gone from CQ 1.3.7 to 2.0.31, AWS from v3.5.0 to v9.1.1, and Postgres from v1.3.9 to v2.0.1.

We’re running CQ in a Docker container on AWS ECS with Aurora Serverless v1 Postgres (pre-scaled to 16 ACU before the sync is called). This has worked fairly well for some time on a daily schedule. We have 350+ AWS accounts with only a few tables in the CQ config ignore list, mainly ‘aws_inspector2_findings’ and all the lightsail tables.

Please let me know what else you’d need to diagnose this.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 25 (14 by maintainers)

Commits related to this issue

Most upvoted comments

@hermanschaaf Apologies for the length of time it’s taken, but I have now resolved the issue. I initially set up an EC2 instance with a local Postgres DB, with as close to the same CPU/memory as the ECS container, the same Postgres version, and the latest CQ/plugin versions, and successfully ran a sync across all of our accounts. I then tried again with our original setup, ECS Fargate with Aurora Serverless v1 Postgres 11.16; this failed as before. I then truncated the DB and tried again, and this time the sync worked, but subsequent syncs failed intermittently. I have now migrated to Aurora Serverless v2 Postgres 14.6, and with the latest CQ/plugins the sync has been running successfully each day for a week now. I have come to the conclusion it was either the DB config/infrastructure or (less likely) the data causing the error. Sorry I can’t really offer much more insight, but I’ll close this issue now. Thanks again for your support

@castaples No worries, thanks for the update, and really glad to hear it’s resolved now!

Hi @castaples, thanks for reporting the experiment. Yes, this will definitely have an effect on the Postgres side of things. The default, as far as I can see in the code (we need to add it to the docs), is 10000, so maybe you can try 2000. Be sure to put batch_size in the top-level destination spec, not in the postgresql-specific spec.

Once you try on EC2, if it behaves the same way we will add some retries on the Postgres plugin side, but it will be much easier to debug it that way.
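To illustrate, here is a sketch of where batch_size belongs in a destination spec; the connection string is a placeholder, and the exact fields of your spec may differ:

```yaml
kind: destination
spec:
  name: postgresql
  path: cloudquery/postgresql
  version: "v2.0.1"
  batch_size: 2000  # top-level destination option (default 10000); NOT inside the nested spec below
  spec:
    # postgresql-specific options only
    connection_string: "postgresql://user:pass@host:5432/db"
```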

So, I’ve been digging into this some more…

I tried to reproduce the issue by creating an Aurora Serverless instance (v2, AWS didn’t allow me to create V1) and connecting to it from EC2. I’m afraid I wasn’t able to reproduce the issue. The sync ran to completion in 11 minutes in my case, though it is testing against a much smaller cloud footprint than yours. I’m using the same plugin versions from when the issue was created (AWS v9.1.1, postgresql v2.0.1) with the most recent CLI.

I also tried setting the source concurrency to a lower value so that the sync would take longer, hoping to see if this might result in a connection being interrupted, but even after 20 minutes it was still running.
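For anyone following along, a sketch of how I lowered the source concurrency (the value here is illustrative, and the rest of the spec is a placeholder):

```yaml
kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v9.1.1"
  destinations: ["postgresql"]
  concurrency: 1000  # illustrative; lower than the default to slow the fetch
  tables: ["*"]
```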

I then went through the log messages carefully again, and it seems like the Unexpected EOF is happening at a point where the destination plugin server is already shutting down (failed to CloseAndRecv client). I believe this can only happen if either 1) all resources have been fetched and the channel closed, or 2) the gRPC connection got interrupted for some reason, most probably because the destination plugin process was shut down. TL;DR I think the Unexpected EOF error is a red herring.

@castaples Would you perhaps be up for doing a Zoom debugging session with me and @yevgenypats some time? If so, you can set it up here https://calendly.com/yevgenyp

@hermanschaaf Yes reverting to the previous versions means the sync then completes successfully.

Thanks for trying and providing the details @castaples! Too bad it’s not good news; it looks like the issue really isn’t down to any specific resources we added. And sorry, yes, I see my list of tables included two that were only added a few hours ago and not officially released yet (I was working against the main branch) 😃 I’ll continue digging into potential connection management issues

I’ve just tried CQ 2.1.0, AWS v10.1.0 and Postgres v2.0.5, initially with your list of skip tables above, including my own:

'aws_inspector2_findings',
'aws_lightsail*',

However, I had to remove two (‘aws_organization_resource_policies’ and ‘aws_xray_resource_policies’) from your list, as these caused an error:

{
    "level": "info",
    "module": "cli",
    "grpc.code": "InvalidArgument",
    "grpc.component": "server",
    "grpc.error": "rpc error: code = InvalidArgument desc = validation failed: failed to filter tables: skip_tables include a pattern aws_organization_resource_policies with no matches",
    "grpc.method": "GetTablesForSpec",
    "grpc.method_type": "unary",
    "grpc.service": "proto.Source",
    "grpc.start_time": "2023-01-12T13:51:13Z",
    "grpc.time_ms": "1.07",
    "message": "finished call",
    "peer.address": "@",
    "protocol": "grpc",
    "time": "2023-01-12T13:51:13Z"
}
{
    "level": "info",
    "module": "cli",
    "grpc.code": "InvalidArgument",
    "grpc.component": "server",
    "grpc.error": "rpc error: code = InvalidArgument desc = validation failed: failed to filter tables: skip_tables include a pattern aws_xray_resource_policies with no matches",
    "grpc.method": "GetTablesForSpec",
    "grpc.method_type": "unary",
    "grpc.service": "proto.Source",
    "grpc.start_time": "2023-01-12T13:59:17Z",
    "grpc.time_ms": "1.922",
    "message": "finished call",
    "peer.address": "@",
    "protocol": "grpc",
    "time": "2023-01-12T13:59:17Z"
}

After removing them, I then had to perform a couple of manual fixes during the table migration phase of the sync:

Error: failed to migrate source aws on destination postgresql : failed to call Migrate: rpc error: code = Unknown desc = the following primary keys were removed from the schema ["id"] for table "aws_ec2_transit_gateways".

alter table "aws_ec2_transit_gateways" drop constraint if exists "aws_ec2_transit_gateways_cqpk";
alter table "aws_ec2_transit_gateways" alter column "id" drop not null;
Error: failed to migrate source aws on destination postgresql : failed to call Migrate: rpc error: code = Unknown desc = the following primary keys were removed from the schema ["id" "account_id" "region"] for table "aws_elasticsearch_domains".

alter table "aws_elasticsearch_domains" drop constraint if exists "aws_elasticsearch_domains_cqpk";
alter table "aws_elasticsearch_domains" alter column "id" drop not null;
alter table "aws_elasticsearch_domains" alter column "account_id" drop not null;
alter table "aws_elasticsearch_domains" alter column "region" drop not null;

Once it was finally running, it errored with the “Unexpected EOF” after about 5 minutes; the last few log outputs are:

{
    "level": "error",
    "module": "cli",
    "grpc.code": "Internal",
    "grpc.component": "server",
    "grpc.error": "rpc error: code = Internal desc = failed to send resource: rpc error: code = Canceled desc = context canceled",
    "grpc.method": "Sync2",
    "grpc.method_type": "server_stream",
    "grpc.service": "proto.Source",
    "grpc.start_time": "2023-01-12T14:13:07Z",
    "grpc.time_ms": "330238.25",
    "message": "finished call",
    "peer.address": "@",
    "protocol": "grpc",
    "time": "2023-01-12T14:18:37Z"
}
{
    "level": "error",
    "module": "aws-src",
    "client": "************:us-west-2",
    "error": "operation error Athena: GetQueryExecution, https response error StatusCode: 0, RequestID: , canceled, context canceled",
    "message": "pre resource resolver failed",
    "table": "aws_athena_work_group_query_executions",
    "time": "2023-01-12T14:18:37Z"
}
{
    "level": "info",
    "module": "cli",
    "source": "aws",
    "destinations": [
        "postgresql"
    ],
    "sync_time": "2023-01-12T14:13:03Z",
    "time": "2023-01-12T14:18:37Z",
    "message": "End sync"
}
{
    "level": "info",
    "module": "cli",
    "address": "/tmp/cq-YiYQYNemPfYPDIRt.sock",
    "message": "Got interrupt. Source plugin server shutting down",
    "time": "2023-01-12T14:18:37Z"
}
Error: failed to write for aws->postgresql: failed to CloseAndRecv client: rpc error: code = Internal desc = Context done: context canceled and failed to wait for plugin: failed to execute batch: unexpected EOF
{
    "level": "error",
    "module": "cli",
    "error": "failed to write for aws->postgresql: failed to CloseAndRecv client: rpc error: code = Internal desc = Context done: context canceled and failed to wait for plugin: failed to execute batch: unexpected EOF",
    "time": "2023-01-12T14:18:37Z",
    "message": "exiting with error"
}

@castaples While I continue investigating connection management, could you also try running a sync with the new versions, but skipping all the new AWS tables that have been added since v3.5.0? This should give us a more apples-to-apples comparison. I auto-generated a list for you here by comparing the list of tables for the two versions:

- aws_account_alternate_contacts
- aws_account_contacts
- aws_amp_rule_groups_namespaces
- aws_amp_workspaces
- aws_apigatewayv2_api_integration_responses
- aws_apigatewayv2_api_route_responses
- aws_apprunner_auto_scaling_configurations
- aws_apprunner_connections
- aws_apprunner_custom_domains
- aws_apprunner_observability_configurations
- aws_apprunner_operations
- aws_apprunner_vpc_connectors
- aws_apprunner_vpc_ingress_connections
- aws_appstream_app_blocks
- aws_appstream_application_fleet_associations
- aws_appstream_applications
- aws_appstream_directory_configs
- aws_appstream_fleets
- aws_appstream_image_builders
- aws_appstream_images
- aws_appstream_stack_entitlements
- aws_appstream_stack_user_associations
- aws_appstream_stacks
- aws_appstream_usage_report_subscriptions
- aws_appstream_users
- aws_athena_data_catalog_database_tables
- aws_cloudwatchlogs_resource_policies
- aws_config_config_rule_compliances
- aws_config_config_rules
- aws_docdb_certificates
- aws_docdb_cluster_parameter_groups
- aws_docdb_cluster_parameters
- aws_docdb_cluster_snapshots
- aws_docdb_clusters
- aws_docdb_engine_versions
- aws_docdb_event_categories
- aws_docdb_event_subscriptions
- aws_docdb_events
- aws_docdb_global_clusters
- aws_docdb_instances
- aws_docdb_orderable_db_instance_options
- aws_docdb_pending_maintenance_actions
- aws_docdb_subnet_groups
- aws_ecr_repository_image_scan_findings
- aws_elasticsearch_packages
- aws_elasticsearch_versions
- aws_elasticsearch_vpc_endpoints
- aws_elastictranscoder_pipeline_jobs
- aws_elastictranscoder_pipelines
- aws_elastictranscoder_presets
- aws_elbv2_listener_certificates
- aws_eventbridge_api_destinations
- aws_eventbridge_archives
- aws_eventbridge_connections
- aws_eventbridge_endpoints
- aws_eventbridge_event_sources
- aws_eventbridge_replays
- aws_frauddetector_batch_imports
- aws_frauddetector_batch_predictions
- aws_frauddetector_detectors
- aws_frauddetector_entity_types
- aws_frauddetector_event_types
- aws_frauddetector_external_models
- aws_frauddetector_labels
- aws_frauddetector_model_versions
- aws_frauddetector_models
- aws_frauddetector_outcomes
- aws_frauddetector_rules
- aws_frauddetector_variables
- aws_fsx_file_caches
- aws_glue_database_table_indexes
- aws_glue_registry_schema_versions
- aws_iam_ssh_public_keys
- aws_identitystore_group_memberships
- aws_identitystore_groups
- aws_identitystore_users
- aws_kafka_cluster_operations
- aws_kafka_clusters
- aws_kafka_configurations
- aws_kafka_nodes
- aws_kms_key_grants
- aws_kms_key_policies
- aws_lambda_layer_version_policies
- aws_mq_broker_configuration_revisions
- aws_mwaa_environments
- aws_neptune_global_clusters
- aws_organization_resource_policies
- aws_organizations
- aws_quicksight_analyses
- aws_quicksight_dashboards
- aws_quicksight_data_sets
- aws_quicksight_data_sources
- aws_quicksight_folders
- aws_quicksight_group_members
- aws_quicksight_groups
- aws_quicksight_ingestions
- aws_quicksight_templates
- aws_quicksight_users
- aws_ram_principals
- aws_ram_resource_share_associations
- aws_ram_resource_share_invitations
- aws_ram_resource_share_permissions
- aws_ram_resource_shares
- aws_ram_resource_types
- aws_ram_resources
- aws_rds_cluster_parameters
- aws_rds_engine_versions
- aws_redshift_cluster_parameters
- aws_savingsplans_plans
- aws_scheduler_schedule_groups
- aws_scheduler_schedules
- aws_servicecatalog_portfolios
- aws_servicecatalog_products
- aws_servicecatalog_provisioned_products
- aws_servicequotas_quotas
- aws_servicequotas_services
- aws_ses_active_receipt_rule_sets
- aws_ses_configuration_set_event_destinations
- aws_ses_configuration_sets
- aws_ses_contact_lists
- aws_ses_custom_verification_email_templates
- aws_ses_identities
- aws_ssm_associations
- aws_ssm_compliance_summary_items
- aws_ssm_instance_patches
- aws_ssm_inventories
- aws_ssm_inventory_schemas
- aws_ssm_patch_baselines
- aws_ssoadmin_account_assignments
- aws_ssoadmin_instances
- aws_ssoadmin_permission_sets
- aws_stepfunctions_state_machines
- aws_timestream_databases
- aws_timestream_tables
- aws_xray_resource_policies

By the way, in newer versions you should also be able to use wildcard matching, so you could skip all of lightsail with aws_lightsail*.
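For example, a source spec using the wildcard might look like this (version and other fields are illustrative):

```yaml
kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v10.1.0"
  destinations: ["postgresql"]
  tables: ["*"]
  skip_tables:
    - aws_inspector2_findings
    - aws_lightsail*  # wildcard skips all lightsail tables in one entry
```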

@castaples I’m looking into this… I’ll try to see what changed between these versions. I notice there have been a few Unexpected EOF errors reported for the pgx library that we use, so there might be something there. My current theory is that it’s related to connection management with Aurora Serverless, which may differ from normal Postgres installations.