bigquery-etl: Reprocess stable table days missing sample_id via query
We have 6 days of data for which sample_id was not being added, and we need to reprocess those days to add it back in.
We have a working and reasonably efficient implementation of sample_id as a UDF in https://github.com/mozilla/bigquery-etl/pull/353 and we have the existing copy_deduplicate task that can iterate through all stable tables to run a query to populate them.
I propose that we create a variant of the copy_deduplicate script that reads from the stable table rather than the live table and simply replaces the sample_id field in the query rather than deduplicating.
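For context, the sample_id scheme is a hash of client_id reduced modulo 100, so that each bucket holds roughly 1% of clients. A minimal Python sketch of that idea, assuming a CRC32-based hash (the UDF in the PR linked above is the authoritative implementation; details there may differ):

```python
import zlib

def sample_id(client_id: str) -> int:
    """Map a client_id to a stable bucket in [0, 100).

    Assumption: CRC32 of the UTF-8 client_id, modulo 100, approximating
    the sample_id UDF from mozilla/bigquery-etl#353. The real UDF may
    normalize client_id (e.g. strip braces) before hashing.
    """
    return zlib.crc32(client_id.encode("utf-8")) % 100

# The same client_id always maps to the same bucket, so reprocessing a
# stable table only needs to recompute this one derived field per row.
cid = "c0ffee00-1234-5678-9abc-def012345678"
print(sample_id(cid))
```

Because the hash is deterministic, rewriting a partition in place is safe to re-run: recomputing sample_id for the same rows always yields the same values.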
I ran a small test of this method, and we were able to process a day of main ping in approximately the same time as the existing copy_deduplicate task (~21 minutes), suggesting that the bottleneck for both jobs is simply the volume of data to be written rather than any data processing.
The cost for this will be identical to the cost for running the existing daily copy_deduplicate queries for the affected days.
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (10 by maintainers)
Done. We’re back in business. Thanks so much for your work here, @relud.
This looks good to me. I think we’re ready to copy these back into the stable table. I have access to do so.
I fixed 2019-09-10 and confirmed that each sample_id exists and is getting a nearly even distribution with this query
which returns
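The query and its results were not preserved in this thread. Purely as an illustration of the property being checked, the expected behavior, roughly even counts across all 100 sample_id buckets, can be simulated locally in Python (hypothetical sketch assuming a CRC32-based sample_id; not the query actually run):

```python
import uuid
import zlib
from collections import Counter

def sample_id(client_id: str) -> int:
    # Assumed sampling scheme: CRC32 of the client_id modulo 100.
    return zlib.crc32(client_id.encode("utf-8")) % 100

# Simulate many random client_ids and count how many land in each bucket.
counts = Counter(sample_id(str(uuid.uuid4())) for _ in range(100_000))

# Every bucket 0..99 should be present, and each count should be close to
# the expected 1,000 per bucket (here: within 30% as a loose sanity check).
assert set(counts) == set(range(100))
assert all(abs(c - 1000) / 1000 < 0.3 for c in counts.values())
print("buckets:", len(counts),
      "min:", min(counts.values()),
      "max:", max(counts.values()))
```

The real validation would run as SQL against the reprocessed partition, grouping by sample_id and counting rows, but the pass/fail criterion is the same: all 100 values present, with no bucket far from its expected share.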
I’ve finished reprocessing for all tables besides main ping for 2019-09-10 through 2019-09-15. Main ping is now processing.
Tomorrow, I’ll need to kick off jobs for 2019-09-16, since those partitions are also going to be partially populated with records that have no sample_id.