bigquery-etl: Reprocess stable table days missing sample_id via query

We have 6 days of data that were ingested without sample_id, and we need to reprocess those days to add it back in.

We have a working and reasonably efficient implementation of sample_id as a UDF in https://github.com/mozilla/bigquery-etl/pull/353, and the existing copy_deduplicate task already iterates through all stable tables, running a query to populate each one.
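
Purely to illustrate the shape of the mapping (a deterministic hash of client_id into buckets 0 through 99), a stand-in UDF might look like the sketch below; the hash function here is a placeholder, not the one implemented in the PR.

CREATE TEMP FUNCTION placeholder_sample_id(client_id STRING) AS (
  -- Deterministic bucket in [0, 99]; stands in for the UDF from the PR above.
  MOD(ABS(FARM_FINGERPRINT(client_id)), 100)
);

SELECT
  placeholder_sample_id('00000000-0000-0000-0000-000000000000') AS sample_id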

I propose that we create a variant of the copy_deduplicate script that reads from the stable table rather than the live table and just replaces the sample_id field in the query rather than deduplicating.
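
A minimal sketch of the per-table query such a variant might run follows; the table path and top-level field names are assumptions, the inline hash is the placeholder from the sketch above rather than the UDF from the PR, and the real script would write the result back to the day's partition of the same stable table, mirroring copy_deduplicate.

SELECT
  -- Keep every column as-is but recompute sample_id; the real query would call
  -- the sample_id UDF from the PR rather than this inline placeholder hash.
  * REPLACE (MOD(ABS(FARM_FINGERPRINT(client_id)), 100) AS sample_id)
FROM
  `moz-fx-data-shared-prod.telemetry_stable.main_v4`
WHERE
  DATE(submission_timestamp) = '2019-09-10'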

I ran a small test of this method, and we were able to process a day of main ping data in approximately the same time as the existing copy_deduplicate task (~21 minutes), suggesting that the bottleneck for both jobs is simply the volume of data to be written rather than any per-row processing.

The cost will be identical to the cost of running the existing daily copy_deduplicate queries for the affected days.

cc @whd @relud

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

Done. We’re back in business. Thanks so much for your work here, @relud.

This looks good to me. I think we’re ready to copy these back into the stable table. I have access to do so.

I fixed 2019-09-10 and confirmed that every sample_id value exists and gets a nearly even distribution of rows with this query

WITH
  clusters AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    sample_id,
    COUNT(*) AS row_count
  FROM
    `moz-fx-data-derived-datasets.backfill.main_v4`
  WHERE
    DATE(submission_timestamp) BETWEEN '2019-09-10' AND '2019-09-16'
  GROUP BY
    1,
    2)
SELECT
  submission_date,
  COUNT(*) AS cluster_count,
  (MAX(row_count) - MIN(row_count))/AVG(row_count) AS deviation_percent
FROM
  clusters
GROUP BY
  submission_date
ORDER BY
  submission_date

which returns

submission_date  cluster_count  deviation_percent
2019-09-10       100            0.03508765536536311
2019-09-11       100            0.05274466234570406
2019-09-12       100            0.044740380156198724
2019-09-13       100            0.05225730055741874
2019-09-14       100            0.055302258614994416
2019-09-15       100            0.06897966628383832
2019-09-16       100            0.05240154334473516
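
A related spot check (a sketch only, against the same backfill table) is to confirm that no rows were left with a NULL sample_id for the affected days:

SELECT
  DATE(submission_timestamp) AS submission_date,
  -- Should be 0 for every day once reprocessing is complete.
  COUNTIF(sample_id IS NULL) AS null_sample_id_rows
FROM
  `moz-fx-data-derived-datasets.backfill.main_v4`
WHERE
  DATE(submission_timestamp) BETWEEN '2019-09-10' AND '2019-09-16'
GROUP BY
  submission_date
ORDER BY
  submission_date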

I’ve finished reprocessing for all tables besides main ping for 2019-09-10 through 2019-09-15. Main ping is now processing.

Tomorrow, I’ll need to kick off jobs for 2019-09-16, since that day is also going to be partially populated with records that have no sample_id.