risingwave: PG cdc `q3` checksums inconsistent

Describe the bug

pg-cdc q3 checksums failed: https://buildkite.com/risingwave-test/chaos-mesh/builds/552
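For context, q3 here refers to CH-benCHmark query 3, which joins customer, new_order, orders, and order_line. A typical formulation is roughly the following (the exact MV definition used by the test harness may differ):

-- Standard CH-benCHmark q3 (illustrative; the harness MV may differ in details)
SELECT ol_o_id, ol_w_id, ol_d_id, sum(ol_amount) AS revenue, o_entry_d
FROM customer, new_order, orders, order_line
WHERE c_state LIKE 'A%'
  AND c_id = o_c_id AND c_w_id = o_w_id AND c_d_id = o_d_id
  AND no_w_id = o_w_id AND no_d_id = o_d_id AND no_o_id = o_id
  AND ol_w_id = o_w_id AND ol_d_id = o_d_id AND ol_o_id = o_id
  AND o_entry_d > '2007-01-02 00:00:00.000000'
GROUP BY ol_o_id, ol_w_id, ol_d_id, o_entry_d
ORDER BY revenue DESC, o_entry_d;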

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240207

Additional context

No response

About this issue

  • Original URL
  • State: closed
  • Created 5 months ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

I was able to reproduce the issue with the following configuration:

BENCH_TESTBED="medium-arm-all-affinity"
CH_BENCHMARK_QUERY="q3"
TEST_TYPE="ch-benchmark-pg-cdc"
cooldown_after_experiment="3m"
duration="1h"
experiment_recovery_time="1m"
experiments="[{'kind': 'StressChaos', 'actions': ['cpu'], 'duration': '10m', 'cpu': {'workers': 1, 'load': 10}, 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]"
RW_VERSION="nightly-20240207"
ENABLE_KEEP_CLUSTER="true"

Buildkite Grafana

Findings:

  1. The row counts of all source tables (customer, new_order, orders, order_line) referenced by ch_benchmark_q3 match PG.
  2. ch_benchmark_q3 in RW has more rows than in PG.
  3. To dig further into the root cause of the mismatch, I created a new MV called ch_benchmark_q3_2 with the same SQL while there was no source throughput. ch_benchmark_q3_2 has the correct row count, so I cross-checked the internal state table row counts one by one (a sketch follows this list):
    • Only one join state table’s row count mismatches between ch_benchmark_q3 and ch_benchmark_q3_2: __internal_ch_benchmark_q3_99_hashjoinleft_1121.
    • __internal_ch_benchmark_q3_99_hashjoinleft_1121 is the one and only table that has triggered a memtable spill.
  4. I picked one of the rows that should not be present in __internal_ch_benchmark_q3_99_hashjoinleft_1121 and used sst-dump to grep for occurrences of its key in all relevant SSTs. I found only one occurrence, in an L6 SST; there is no other PUT/DELETE for that key in the sst-dump output.
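A minimal sketch of the cross-check described in item 3, assuming the internal state tables can be queried directly over psql (the ch_benchmark_q3_2 internal table name below is hypothetical; the actual name has to be looked up from the catalog):

-- Compare the two MVs built from the same SQL.
SELECT (SELECT count(*) FROM ch_benchmark_q3)   AS q3_rows,
       (SELECT count(*) FROM ch_benchmark_q3_2) AS q3_2_rows;

-- Then compare the corresponding internal join state tables one by one.
SELECT count(*) FROM __internal_ch_benchmark_q3_99_hashjoinleft_1121;
SELECT count(*) FROM __internal_ch_benchmark_q3_2_xxx_hashjoinleft_xxx;  -- hypothetical name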

See more details in debug_notes.txt. The cluster is not cleaned up. For those who are interested, you can psql into the cluster to get more information.

Now I am re-running the same test with the following config. Hopefully we can get the full mutation history for a row in hummock:

RW_CONFIG="{'meta':{'enable_hummock_data_archive': true, 'full_gc_interval_sec': 6048000, 'min_sst_retention_time_sec': 6048000}}"

The source tables are fully synced; the problem is with the query q3 result.

bin/qa consistency compare --upstream-driver postgres --upstream-port 54321 --upstream-user postgres --upstream-password postgres --upstream-database-name postgres --downstream-port 45678 --downstream-user root --downstream-password "" --downstream-database-name dev -t customer,new_order,orders,order_line 
{
    "consistent": true,
    "table-compare-results": [
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "customer",
                    "table-checksum": 3564470369424743317,
                    "table-rows": 30000
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "customer",
                    "table-checksum": 3564470369424743317,
                    "table-rows": 30000
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "new_order",
                    "table-checksum": -7778568713383235460,
                    "table-rows": 20
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "new_order",
                    "table-checksum": -7778568713383235460,
                    "table-rows": 20
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "orders",
                    "table-checksum": -6341453060762824403,
                    "table-rows": 155614
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "orders",
                    "table-checksum": -6341453060762824403,
                    "table-rows": 155614
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "order_line",
                    "table-checksum": 5946274962969133011,
                    "table-rows": 1557351
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "order_line",
                    "table-checksum": 5946274962969133011,
                    "table-rows": 1557351
                }
            ]
        }
    ]
}

q3:

bin/qa consistency compare --upstream-driver postgres --upstream-port 54321 --upstream-user postgres --upstream-password postgres --upstream-database-name postgres --downstream-port 45678 --downstream-user root --downstream-password "" --downstream-database-name dev -t ch_benchmark_q3
{
    "consistent": false,
    "table-compare-results": [
        {
            "consistent": false,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "ch_benchmark_q3",
                    "table-checksum": 7479820059950340793,
                    "table-rows": 20
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "ch_benchmark_q3",
                    "table-checksum": -9219433622567761793,
                    "table-rows": 212
                }
            ]
        }
    ]
}
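For reference, an order-independent per-table checksum of the kind compared above can be computed along the following lines on the PostgreSQL side; this is purely illustrative and is not necessarily how bin/qa derives the values shown:

-- Illustrative only: hash each row's text representation and sum the hashes,
-- so the result does not depend on row order. Not necessarily bin/qa's algorithm.
SELECT count(*) AS table_rows,
       sum(('x' || substr(md5(t::text), 1, 16))::bit(64)::bigint) AS table_checksum
FROM ch_benchmark_q3 AS t;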

Triggered a run of the test with #15232. Let’s wait and see whether the issue is gone.

The test passed.

Note that the CN node restarted 2 times during the test.

pod-failure causes pods to restart.