spark-bigquery-connector: Dynamic overwrite of partitions does not work as expected

In Spark, when you set spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") and then insert into a partitioned table in overwrite mode, only the partitions present in the incoming data are overwritten; partitions that are not part of that write are left untouched. When writing to BigQuery with this connector, the entire BigQuery table gets wiped out and only the newly inserted partitions show up. Can the connector be updated to support dynamic partition overwrites?
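For reference, this is the behavior I expect, shown against a plain Spark table (a minimal sketch; the table name and sample rows below are only illustrative, not from my actual job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Assume a Parquet table partitioned by date that already contains
# partitions date=2020-06-13 and date=2020-06-14.
new_rows = spark.createDataFrame(
    [("en.wikipedia", "Apache_Spark", 1234, "2020-06-14")],
    ["wiki_project", "wiki_page", "wiki_page_views", "date"],
)

# With dynamic mode, only the date=2020-06-14 partition is replaced;
# date=2020-06-13 stays untouched. Writing through the BigQuery connector
# in overwrite mode truncates the whole table instead.
new_rows.write.mode("overwrite").insertInto("wiki_page_views_parquet")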

I am testing with gs://spark-lib/bigquery/spark-bigquery-latest.jar.

Thanks!

Example setup of this scenario:

Ran this on BigQuery directly:

CREATE OR REPLACE TABLE `gcp-project.dev.wiki_page_views_spark_write` (
  wiki_project STRING,
  wiki_page STRING,
  wiki_page_views INT64,
  date DATE
)
PARTITION BY date
OPTIONS (
  partition_expiration_days = 999999
)

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

Saving the data to BigQuery:

(wiki.write.format('bigquery')
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write')
    .option('project', 'gcp-project')
    .option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging')
    .mode('overwrite')
    .save())
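After the overwrite, the older partitions are gone. A quick way to check which dates remain (just a sketch of the verification query I run in BigQuery):

SELECT date, COUNT(*) AS row_count
FROM `gcp-project.dev.wiki_page_views_spark_write`
GROUP BY date
ORDER BY date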

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 17 (3 by maintainers)

Most upvoted comments

As a workaround, I set the write mode to "append", so BigQuery adds new partitions to the table without deleting the old ones. If I need to delete a partition, in case of a reset, I can use bq rm 'dataset.table$20200614' to delete that specific partition of the table.
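A minimal sketch of what that looks like end to end, reusing the table from the issue (the staging bucket name here is just a placeholder):

(wiki.write.format('bigquery')
    .option('table', 'gcp-project.dev.wiki_page_views_spark_write')
    .option('temporaryGcsBucket', 'my-staging-bucket')
    .mode('append')
    .save())

and then, to reset a single day, dropping just that partition from the bq CLI:

bq rm 'dev.wiki_page_views_spark_write$20200614'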