petastorm: Petastorm fails with big datasets
Using the code from the GitHub repo's main page as a reference, my code looks as follows:
import logging

from pyspark.sql import SparkSession
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row

# MySchema, ids and row_generator are defined elsewhere in my code.
spark = SparkSession.builder.config('spark.driver.memory', '10g').master('local[4]').getOrCreate()
sc = spark.sparkContext

with materialize_dataset(spark=spark, dataset_url='file:///opt/data/hello_world_dataset',
                         schema=MySchema, row_group_size_mb=256):
    logging.info('Building RDD...')
    # row_generator yields lists of examples for each id
    rows_rdd = sc.parallelize(ids) \
        .flatMap(row_generator) \
        .map(lambda x: dict_to_spark_row(MySchema, x))

    logging.info('Creating DataFrame...')
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .coalesce(10) \
        .write \
        .mode('overwrite') \
        .parquet('file:///opt/data/hello_world_dataset')
The RDD code executes successfully, but the .createDataFrame call fails with the following error:
_pickle.PicklingError: Could not serialize broadcast: OverflowError: cannot serialize a string larger than 4GiB
This is my first experience with Spark, so I can’t really tell if this error originates in Spark or Petastorm.
Looking through other solutions to this error (with respect to Spark, not Petastorm), I saw that it might have to do with the pickling protocol, but I couldn't confirm that, nor did I find a way of altering the pickling protocol.
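The 4GiB limit itself can be reproduced with plain pickle, independent of Spark or Petastorm: protocols below 4 refuse to serialize any single string or bytes object larger than 4 GiB, while protocol 4 (Python 3.4+) can handle it. A minimal standalone check (note that it allocates several GiB of RAM):

import pickle

# Protocols below 4 cannot serialize a single object larger than 4 GiB.
big = 'x' * (4 * 1024 ** 3 + 1)

try:
    pickle.dumps(big, protocol=2)  # the protocol older PySpark versions use for broadcasts
except OverflowError as exc:
    print(exc)                     # cannot serialize a string larger than 4GiB

# pickle.dumps(big, protocol=4) succeeds, at the cost of another multi-GiB buffer.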
How could I avoid this error?
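For anyone hitting the same wall: a workaround commonly suggested for the Spark side of this error (a sketch, not an official API; the helper name is arbitrary) is to monkey-patch pyspark.broadcast.Broadcast.dump so broadcasts are pickled with the highest protocol instead of the protocol 2 that older PySpark releases hard-code. Newer releases already use a higher protocol, so the patch is only needed on older ones:

import pickle

import pyspark.broadcast

def _dump_with_highest_protocol(self, value, f):
    # Protocol 4+ lifts the 4 GiB per-object limit that protocol 2 enforces.
    pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
    f.close()
    return f.name  # some PySpark versions expect dump() to return the temp file path

pyspark.broadcast.Broadcast.dump = _dump_with_highest_protocol

With the patch applied before the write is triggered, the pipeline above can run unchanged; the executors can unpickle protocol-4 payloads as long as they run Python 3.4 or newer.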
Got it. I'll update the ticket once we release the new version, which should handle your case well.