arrow: [C++][Parquet] Segmentation fault reading modular encrypted Parquet dataset over 2^15 rows
Describe the bug, including details regarding any error messages, version, and platform.
Version: pyarrow==14.0.2
Platform: Linux 5.15.x x86_64 GNU/Linux
I have been trying out the new capability introduced in Arrow 14 which allows modular encryption of Parquet files to be used alongside the generic Dataset API, but began to experience segmentation faults in C++ Arrow code when working with real, partitioned datasets rather than toy examples.
The error is most often a segmentation fault, but I have seen all of the following:
- Segmentation fault
- OSError: Failed decryption finalization
- OSError: Couldn't set AAD
After some trial and error generating larger and larger toy examples until the error was reliably hit, I discovered that the threshold is reached when the data initially written to the dataset on disk with modular encryption exceeds 2^15 rows. In testing, 2^15 rows never faulted, but 2^15 + 1 rows often faults (occasionally it does not).
The backtrace when triggering the fault or error always ends in the same function:
#9 0x00007f9b726625f2 in parquet::encryption::AesDecryptor::AesDecryptorImpl::GcmDecrypt(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*) [clone .cold] () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#10 0x00007f9b7270923e in parquet::Decryptor::Decrypt(unsigned char const*, int, unsigned char*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#11 0x00007f9b726f569b in void parquet::ThriftDeserializer::DeserializeMessage<parquet::format::ColumnMetaData>(unsigned char const*, unsigned int*, parquet::format::ColumnMetaData*, parquet::Decryptor*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#12 0x00007f9b726f8f34 in parquet::ColumnChunkMetaData::ColumnChunkMetaDataImpl::ColumnChunkMetaDataImpl(parquet::format::ColumnChunk const*, parquet::ColumnDescriptor const*, short, short, parquet::ReaderProperties const&, parquet::ApplicationVersion const*, std::shared_ptr<parquet::InternalFileDecryptor>) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
Corresponding to this source: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/encryption_internal.cc#L453
I found a previous fix in this area whose associated issue describes the exact same symptoms: https://github.com/apache/arrow/commit/88bccab18a4ab818355209e45862cc52f9cf4a0d
However, AesDecryptor::AesDecryptorImpl::GcmDecrypt() and AesDecryptor::AesDecryptorImpl::CtrDecrypt() use the ctx_ member of type EVP_CIPHER_CTX from OpenSSL, which shouldn't be used from multiple threads concurrently.
So I was suspicious that it could be a multi-threading issue during read or write.
Whilst attempting to narrow down the cause (and whether the root cause occurs during writing or reading), I made the following observations:
- Writing 2^15 + 1 rows, deleting half the dataset, and then reading in the full dataset still encounters the error.
- Writing 2^15 + 1 rows and filtering to half the partitions in the dataset when reading still encounters the error.
- Writing 2^15 + 1 rows, then decrypting each individual Parquet file in the dataset with modular encryption and concatenating them never failed; the error only occurs when reading them as a full dataset.
- Writing or reading (or both) with use_threads=False still encounters the error (see the sketch below).
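The report doesn't show the exact calls used for that last experiment; a hypothetical sketch of how threading might be disabled on both the write and read paths, reusing the variable names from the reproduction script further below:

# Hypothetical illustration only (the exact calls are not in the original report).
# Disable threading on the write path...
ds.write_dataset(table, path, format=file_format, file_options=write_options, use_threads=False)
# ...and on the read path; the error was reportedly still observed either way.
new_table = ds.dataset(path, format=file_format).to_table(use_threads=False)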
The fact that the issue still occurs when the dataset on disk is halved and then read again suggested corruption during write, but the fact that all individual Parquet files are still independently readable suggests an issue with modular decryption during Dataset operations. That the issue still occurs with threading disabled was unexpected.
The error is reproducible using any random data, and using a KMS client which actually does no encryption at all (eliminating our custom KMS client as a probable cause), but simply passes the keys used around in (encoded) plaintext. It occurs whether the footer is plaintext or not, or whether envelope encryption is used or not.
Reproduction in Python here (no partitions needed at all, so it produces a single Parquet file, which can be read normally):
import base64
import numpy as np
import tempfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.parquet.encryption as pqe

class NoOpKmsClient(pqe.KmsClient):
    def __init__(self):
        super().__init__()

    def wrap_key(self, key_bytes: bytes, _: str) -> bytes:
        b = base64.b64encode(key_bytes)
        return b

    def unwrap_key(self, wrapped_key: bytes, _: str) -> bytes:
        b = base64.b64decode(wrapped_key)
        return b

row_count = pow(2, 15) + 1
table = pa.Table.from_arrays([pa.array(np.random.rand(row_count), type=pa.float32())], names=["foo"])

kms_config = pqe.KmsConnectionConfig()
crypto_factory = pqe.CryptoFactory(lambda _: NoOpKmsClient())
encryption_config = pqe.EncryptionConfiguration(
    footer_key="UNIMPORTANT_KEY",
    column_keys={"UNIMPORTANT_KEY": ["foo"]},
    double_wrapping=True,
    plaintext_footer=False,
    data_key_length_bits=128,
)
pqe_config = ds.ParquetEncryptionConfig(crypto_factory, kms_config, encryption_config)
pqd_config = ds.ParquetDecryptionConfig(crypto_factory, kms_config, pqe.DecryptionConfiguration())
scan_options = ds.ParquetFragmentScanOptions(decryption_config=pqd_config)
file_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)
write_options = file_format.make_write_options(encryption_config=pqe_config)
file_decryption_properties = crypto_factory.file_decryption_properties(kms_config)

with tempfile.TemporaryDirectory() as tempdir:
    path = tempdir + "/test-dataset"
    ds.write_dataset(table, path, format=file_format, file_options=write_options)

    # Reading the single encrypted Parquet file directly works fine
    file_path = path + "/part-0.parquet"
    new_table = pq.ParquetFile(file_path, decryption_properties=file_decryption_properties).read()
    assert table == new_table

    # Reading via the Dataset API segfaults (or raises one of the OSErrors above)
    dataset = ds.dataset(path, format=file_format)
    new_table = dataset.to_table()
    assert table == new_table
Any help here would be much appreciated: being restricted to 2^15 rows is a roadblock for our use of this feature.
Component(s)
Parquet, Python
About this issue
- State: closed
- Created 6 months ago
- Comments: 21 (16 by maintainers)
Commits related to this issue
- GH-39444: refactor test case — committed to wgtmac/arrow by wgtmac 5 months ago
- [DO NOT MERGE] GH-39444: Try to reproduce the test failure — committed to wgtmac/arrow by wgtmac 5 months ago
- GH-39444: refactor test case — committed to tolleybot/arrow by wgtmac 5 months ago
- GH-39444: [Python] Fix parquet import in encryption test — committed to jorisvandenbossche/arrow by jorisvandenbossche 4 months ago
- GH-39444: [C++][Parquet] Fix crash in Modular Encryption (#39623) **Rationale for this change:** This pull request addresses a critical issue (GH-39444) in the C++/Python components of Parquet, spec... — committed to jorisvandenbossche/arrow by tolleybot 4 months ago
- GH-39444: [Python] Fix parquet import in encryption test (#40505) Small follow-up on https://github.com/apache/arrow/pull/39623 fixing an import in the test. But first testing here if it actually fa... — committed to apache/arrow by jorisvandenbossche 4 months ago
I believe that the reason this issue is only seen with more than 2^15 rows is that this is the value of kMaxBatchSize used in Acero, and when writing the dataset, if there are more rows than this, the Parquet file is split into multiple row groups. As a workaround, I can get the reproduction code from the issue description to work without error if I use ds.write_dataset(table, path, format=file_format, file_options=write_options, min_rows_per_group=row_count), so that Parquet files only ever have one row group.
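For reference, a minimal sketch of that workaround, reusing the table, path, file_format, write_options, and row_count variables from the reproduction script in the issue description:

# Workaround sketch: force each Parquet file to contain a single row group,
# so reading a file never involves decrypting multiple row groups.
ds.write_dataset(
    table,
    path,
    format=file_format,
    file_options=write_options,
    min_rows_per_group=row_count,
)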
The reason this leads to crashes when multi-threading is enabled seems to be that when scanning the dataset, RowGroupGenerator::read_one_row_group is called concurrently from different threads due to using a ReadaheadGenerator, leading to concurrent use of the same AesDecryptor instances, which are not thread safe. Just putting mutexes around use of the AesDecryptors isn't sufficient to fix the problem, because Decryptor::UpdateAad is used to update the AAD value as data pages are read, which then affects the use of the decryptor from other threads. The InternalFileDecryptor::Get*Decryptor methods are also called concurrently but are not thread safe due to modifying std::map members and the all_decryptors_ vector.

@eirki mentioned seeing this issue without using the Dataset API, and I believe this might also happen when FileReaderImpl::DecodeRowGroups is used to decode multiple row groups concurrently when threading is enabled, but I haven't tried reproducing that.
Edit: Actually I don't think this is due to ReadaheadGenerator but rather the readahead logic in RowGroupGenerator. I still get errors when using FragmentReadahead(1), but the errors go away when using BatchReadahead(0). This explains why setting BatchSize to something small fixes the issue: the default batch readahead is 16, so if 16 * BatchSize is less than the row group size then we won't try to read ahead into another row group. So a better workaround is probably to set batch_readahead=0 in Dataset.to_table from Python.
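A minimal sketch of that second workaround, reusing the dataset variable from the reproduction script (assuming batch_readahead is accepted by Dataset.to_table and forwarded to the underlying scanner, as in recent pyarrow versions):

# Workaround sketch: disable batch readahead so the scanner never reads
# ahead into a second row group while a decryptor is still in use.
new_table = dataset.to_table(batch_readahead=0)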
I have submitted a pull request that addresses this issue: #39623. This PR contains the proposed fix for the segmentation fault encountered in modular encryption as discussed here. @wgtmac
Yes, I can. 👌
I changed GetColumnDecryptor so that it always returns a new Decryptor, and I no longer see the issue. I am not sure, without testing, whether I am stomping on something else, or whether this is efficient, since the Decryptor was being cached.
(also, as far as I understand, while it is definitely a serious bug, it’s already present in the previous major release as well, so it’s not a regression (or at least not a new regression), which I think makes it a little bit less blocking)
Hi @tolleybot, I am currently working on the 15.0.0 release. If someone is able to push a fix before I create an RC in the next few days, I am happy to cherry-pick the fix, but I don't think we should hold the release; otherwise it would be difficult to release due to other bugs/issues. If the issue is solved after the release is done, we can also add it to a possible patch release and/or to the scheduled quarterly release. Of course, if there are major concerns and other community members think this should block the release, I will act accordingly.
@tmct I recompiled the C++ build with ARROW_ENABLE_THREADING=OFF and I have no issue.