spark-bigquery-connector: Dynamic overwrite of partitions does not work as expected
In Spark, when you set spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") and then insert into a partitioned table in overwrite mode, only the partitions present in the incoming data are overwritten; partitions not touched by the write are left intact. When writing to BigQuery with this connector, however, the entire table gets wiped out and only the newly inserted partitions remain. Can the connector be updated to support dynamic partition overwrite?
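To make the expected behavior concrete, here is a minimal plain-Python sketch (the function and variable names are illustrative, not connector code) contrasting static and dynamic partition overwrite semantics:

```python
# Sketch of Spark's partitionOverwriteMode semantics.
# A "table" is modeled as {partition_key: rows}; names are illustrative only.

def overwrite(table, incoming, mode="static"):
    """Return the table state after an INSERT OVERWRITE of `incoming`."""
    if mode == "static":
        # Static mode: the whole table is truncated, then incoming data is written.
        return dict(incoming)
    # Dynamic mode: only partitions present in the incoming data are replaced;
    # all other existing partitions are left untouched.
    result = dict(table)
    result.update(incoming)
    return result

existing = {"2020-06-13": ["row_a"], "2020-06-14": ["row_b"]}
new_data = {"2020-06-14": ["row_c"]}

print(overwrite(existing, new_data, mode="static"))
print(overwrite(existing, new_data, mode="dynamic"))
```

With static overwrite only the 2020-06-14 partition survives (which matches what the connector currently does to the whole BigQuery table); with dynamic overwrite the 2020-06-13 partition is preserved.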
I am testing with gs://spark-lib/bigquery/spark-bigquery-latest.jar.
Thanks!
Example setup of this scenario:
Ran this on BigQuery directly:
CREATE OR REPLACE TABLE gcp-project.dev.wiki_page_views_spark_write
(
wiki_project STRING,
wiki_page STRING,
wiki_page_views INT64,
date DATE
)
PARTITION BY date
OPTIONS (
partition_expiration_days=999999
)
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
Saving the data to BigQuery:
wiki.write.format('bigquery')
.option('table', 'gcp-project.dev.wiki_page_views_spark_write')
.option('project', 'gcp-project')
.option('temporaryGcsBucket', 'gcp-project/tmp/bq_staging')
.mode('overwrite')
.save()
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17 (3 by maintainers)
Commits related to this issue
- Issue #103: Enabling to overwrite data of a single date partition — committed to davidrabinowitz/spark-bigquery-connector by davidrabinowitz 4 years ago
- Issue #103: Enabling to overwrite data of a single date partition (#211) — committed to GoogleCloudDataproc/spark-bigquery-connector by davidrabinowitz 4 years ago
- Issue #103: Support Dynamic Partition Overwrite [DIRECT write, Time Partitioning] (#1087) — committed to GoogleCloudDataproc/spark-bigquery-connector by isha97 8 months ago
As a workaround, I set the write mode to "append", so BigQuery adds new partitions to the table without deleting the old ones. If I need to delete a partition, in case of a reset, I can use
bq rm 'dataset.table$20200614'
to delete a specific partition from the table.
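The `$20200614` suffix in that bq command is a BigQuery partition decorator for a daily partition. A small helper (hypothetical name, just for illustration) for building one from a date:

```python
from datetime import date

def partition_decorator(table, day):
    """Build a 'dataset.table$YYYYMMDD' reference for a daily partition."""
    return f"{table}${day.strftime('%Y%m%d')}"

# The result can be passed to `bq rm` to drop just that partition:
target = partition_decorator("dataset.table", date(2020, 6, 14))
print(target)  # dataset.table$20200614
```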