arrow: [C++][Parquet] Segmentation fault reading modular encrypted Parquet dataset over 2^15 rows
Describe the bug, including details regarding any error messages, version, and platform.
Version: pyarrow==14.0.2
Platform: Linux 5.15.x x86_64 GNU/Linux
I have been trying out the new capability introduced in Arrow 14 which allows modular encryption of Parquet files to be used alongside the generic Dataset API, but began to experience segmentation faults in C++ Arrow code when working with real, partitioned datasets rather than toy examples.
The error is most often a segmentation fault, but I have seen all of the following:
- Segmentation fault
- OSError: Failed decryption finalization
- OSError: Couldn't set AAD
After some trial and error generating larger and larger toy examples until the error was reliably hit, I discovered that the threshold is reached when the data initially written to the dataset on disk with modular encryption exceeds 2^15 rows. In testing, 2^15 rows never faulted, but 2^15 + 1 rows often faults (occasionally it does not).
The backtrace when triggering the fault or error always ends in the same function:
#9 0x00007f9b726625f2 in parquet::encryption::AesDecryptor::AesDecryptorImpl::GcmDecrypt(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*) [clone .cold] () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#10 0x00007f9b7270923e in parquet::Decryptor::Decrypt(unsigned char const*, int, unsigned char*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#11 0x00007f9b726f569b in void parquet::ThriftDeserializer::DeserializeMessage<parquet::format::ColumnMetaData>(unsigned char const*, unsigned int*, parquet::format::ColumnMetaData*, parquet::Decryptor*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#12 0x00007f9b726f8f34 in parquet::ColumnChunkMetaData::ColumnChunkMetaDataImpl::ColumnChunkMetaDataImpl(parquet::format::ColumnChunk const*, parquet::ColumnDescriptor const*, short, short, parquet::ReaderProperties const&, parquet::ApplicationVersion const*, std::shared_ptr<parquet::InternalFileDecryptor>) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
Corresponding to this source: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/encryption_internal.cc#L453
I found a previous fix in this area whose associated issue describes the exact same symptoms: https://github.com/apache/arrow/commit/88bccab18a4ab818355209e45862cc52f9cf4a0d
However, AesDecryptor::AesDecryptorImpl::GcmDecrypt() and AesDecryptor::AesDecryptorImpl::CtrDecrypt() use the ctx_ member of type EVP_CIPHER_CTX from OpenSSL, which shouldn't be used from multiple threads concurrently.
So I was suspicious that it could be a multi-threading issue during read or write.
Whilst attempting to narrow down the cause (and whether the root cause occurs during writing or reading), I made the following observations:
- Writing 2^15 + 1 rows, deleting half the dataset, and then reading in the full dataset still encounters the error.
- Writing 2^15 + 1 rows and filtering to half the partitions in the dataset when reading still encounters the error.
- Writing 2^15 + 1 rows, then decrypting each individual Parquet file in the dataset with modular encryption and concatenating them never failed; the error only occurs when reading them as a full dataset.
- Writing or reading (or both) with use_threads=False still encounters the error (see the sketch below).
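The report doesn't show the exact calls used for that last experiment; a hypothetical sketch of how threading might be disabled on both the write and read paths, reusing the variable names from the reproduction script further below:

# Hypothetical illustration only (the exact calls are not in the original report).
# Disable threading on the write path...
ds.write_dataset(table, path, format=file_format, file_options=write_options, use_threads=False)
# ...and on the read path; the error was reportedly still observed either way.
new_table = ds.dataset(path, format=file_format).to_table(use_threads=False)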
The fact that the issue still occurs when the dataset on disk is halved and then read again suggested corruption during write, but the fact that all individual Parquet files are still independently readable suggests an issue with modular decryption during Dataset operations. That the issue still occurs with threading disabled was unexpected.
The error is reproducible using any random data, and using a KMS client which actually does no encryption at all (eliminating our custom KMS client as a probable cause), but simply passes the keys used around in (encoded) plaintext. It occurs whether the footer is plaintext or not, or whether envelope encryption is used or not.
Reproduction in Python here (no partitions needed at all, so it produces a single Parquet file, which can be read normally):
import base64
import numpy as np
import tempfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.parquet.encryption as pqe

class NoOpKmsClient(pqe.KmsClient):
    def __init__(self):
        super().__init__()

    def wrap_key(self, key_bytes: bytes, _: str) -> bytes:
        b = base64.b64encode(key_bytes)
        return b

    def unwrap_key(self, wrapped_key: bytes, _: str) -> bytes:
        b = base64.b64decode(wrapped_key)
        return b

row_count = pow(2, 15) + 1
table = pa.Table.from_arrays([pa.array(np.random.rand(row_count), type=pa.float32())], names=["foo"])

kms_config = pqe.KmsConnectionConfig()
crypto_factory = pqe.CryptoFactory(lambda _: NoOpKmsClient())
encryption_config = pqe.EncryptionConfiguration(
    footer_key="UNIMPORTANT_KEY",
    column_keys={"UNIMPORTANT_KEY": ["foo"]},
    double_wrapping=True,
    plaintext_footer=False,
    data_key_length_bits=128,
)
pqe_config = ds.ParquetEncryptionConfig(crypto_factory, kms_config, encryption_config)
pqd_config = ds.ParquetDecryptionConfig(crypto_factory, kms_config, pqe.DecryptionConfiguration())
scan_options = ds.ParquetFragmentScanOptions(decryption_config=pqd_config)
file_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)
write_options = file_format.make_write_options(encryption_config=pqe_config)
file_decryption_properties = crypto_factory.file_decryption_properties(kms_config)

with tempfile.TemporaryDirectory() as tempdir:
    path = tempdir + "/test-dataset"
    ds.write_dataset(table, path, format=file_format, file_options=write_options)

    # Reading the single encrypted Parquet file directly works fine
    file_path = path + "/part-0.parquet"
    new_table = pq.ParquetFile(file_path, decryption_properties=file_decryption_properties).read()
    assert table == new_table

    # Reading via the Dataset API segfaults (or raises one of the OSErrors above)
    dataset = ds.dataset(path, format=file_format)
    new_table = dataset.to_table()
    assert table == new_table
Any help here would be much appreciated: being restricted to 2^15 rows is a roadblock for our use of this feature.
Component(s)
Parquet, Python
About this issue
- State: closed
- Created 6 months ago
- Comments: 21 (16 by maintainers)
Commits related to this issue
- GH-39444: refactor test case — committed to wgtmac/arrow by wgtmac 5 months ago
- [DO NOT MERGE] GH-39444: Try to reproduce the test failure — committed to wgtmac/arrow by wgtmac 5 months ago
- GH-39444: refactor test case — committed to tolleybot/arrow by wgtmac 5 months ago
- GH-39444: [Python] Fix parquet import in encryption test — committed to jorisvandenbossche/arrow by jorisvandenbossche 4 months ago
- GH-39444: [C++][Parquet] Fix crash in Modular Encryption (#39623) **Rationale for this change:** This pull request addresses a critical issue (GH-39444) in the C++/Python components of Parquet, spec... — committed to jorisvandenbossche/arrow by tolleybot 4 months ago
- GH-39444: [Python] Fix parquet import in encryption test (#40505) Small follow-up on https://github.com/apache/arrow/pull/39623 fixing an import in the test. But first testing here if it actually fa... — committed to apache/arrow by jorisvandenbossche 4 months ago
I believe that the reason this issue is only seen with more than 2^15 rows is that this is the value of kMaxBatchSize used in Acero, and when writing the dataset, if there are more rows than this, the Parquet file is split into multiple row groups. As a workaround, I can get the reproduction code from the issue description to work without error if I use ds.write_dataset(table, path, format=file_format, file_options=write_options, min_rows_per_group=row_count), so that Parquet files only ever have one row group.
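For reference, a minimal sketch of that workaround, reusing the table, path, file_format, write_options, and row_count variables from the reproduction script in the issue description:

# Workaround sketch: force each Parquet file to contain a single row group,
# so reading a file never involves decrypting multiple row groups.
ds.write_dataset(
    table,
    path,
    format=file_format,
    file_options=write_options,
    min_rows_per_group=row_count,
)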
The reason this leads to crashes when multi-threading is enabled seems to be that when scanning the dataset, RowGroupGenerator::read_one_row_group is called concurrently from different threads due to using a ReadaheadGenerator, leading to concurrent use of the same AesDecryptor instances, which are not thread safe. Just putting mutexes around use of the AesDecryptors isn't sufficient to fix the problem, because Decryptor::UpdateAad is used to update the AAD value as data pages are read, which then affects the use of the decryptor from other threads. The InternalFileDecryptor::Get*Decryptor methods are also called concurrently but are not thread safe due to modifying std::map members and the all_decryptors_ vector.

@eirki mentioned seeing this issue without using the Dataset API, and I believe this might also happen when FileReaderImpl::DecodeRowGroups is used to decode multiple row groups concurrently when threading is enabled, but I haven't tried reproducing that.
Edit: Actually I don't think this is due to ReadaheadGenerator but rather the readahead logic in RowGroupGenerator. I still get errors when using FragmentReadahead(1), but the errors go away when using BatchReadahead(0). This explains why setting BatchSize to something small fixes the issue: the default batch readahead is 16, so if 16 * BatchSize is less than the row group size then we won't try to read ahead into another row group. So a better workaround is probably to set batch_readahead=0 in Dataset.to_table from Python.
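A minimal sketch of that second workaround, reusing the dataset variable from the reproduction script (assuming batch_readahead is accepted by Dataset.to_table and forwarded to the underlying scanner, as in recent pyarrow versions):

# Workaround sketch: disable batch readahead so the scanner never reads
# ahead into a second row group while a decryptor is still in use.
new_table = dataset.to_table(batch_readahead=0)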
I have submitted a pull request that addresses this issue: #39623. This PR contains the proposed fix for the segmentation fault encountered in modular encryption as discussed here. @wgtmac
Yes, I can. 👌
I changed GetColumnDecryptor so that it always returns a new Decryptor, and I no longer see the issue. I am not sure, without testing, whether I am stomping on something else, or whether this is efficient, since the Decryptor was being cached.
(also, as far as I understand, while it is definitely a serious bug, it’s already present in the previous major release as well, so it’s not a regression (or at least not a new regression), which I think makes it a little bit less blocking)
Hi @tolleybot, I am currently working on the 15.0.0 release. If someone is able to push a fix before I create an RC in the next few days, I am happy to cherry-pick the fix, but I don't think we should hold the release; otherwise it would be difficult to release due to other bugs/issues. If the issue is solved after the release is done, we can also add it to a possible patch release and/or to the scheduled quarterly release. Of course, if there are major concerns and other community members think this should block the release, I will act accordingly.
@tmct I recompiled the C++ build with ARROW_ENABLE_THREADING=OFF and I have no issue.