velox: Inconsistencies in FileSystem initialization

Bug description

Expected Behaviour

File system is initialised when the fist invocation to filesystem is made. In order to correctly initialise the file system connector config needs to be passed.

Not all classes have access to the connector config, but they need to still need to invoke the filesystem APIs. As a result in a number of places nullptr is being passed as connector config. See: https://github.com/search?q=repo%3Afacebookincubator%2Fvelox+%3A%3AgetFileSystem&type=code

The side-effect of this being, if this happens to be first invocation of filesystem, the filesystem instance gets initialised incorrectly.

An option is to first retrieve the connector config using the API, velox::connector::getConnector("<connector-id>")->connectorProperties()), but leads to coupling of connector-id and file system mapping in client code.

Example of this workaround looks like:

auto connectors = facebook::velox::connector::getAllConnectors();
  // final connector called prism
  for (auto c : connectors) {
    if (c.first == "<connector-id>") {
      velox::filesystems::getFileSystem(
          "<URI-to-some-resource-in-filesystem>",
          velox::connector::getConnector("<connector-id>")->connectorProperties());
    }
  }

System information

N/A

Relevant logs

No response

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

@mbasmanova that is correct. HDFS already does this today based on the hdfs identity. Each file system must generate a unique identity and a filesystem based on the config. https://github.com/facebookincubator/velox/blob/5e07a684c33cc55631f6681a85f21421b8ab4917/velox/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp#L33-L45

Thanks a lot for the input @mbasmanova @majetideepak . I will close the task, and followup with a change to the internal client.

@vermapratyush @majetideepak Deepak, Pratyush, I took a close look and it looks like the bug is in the implementation of WarmStorageFileSystem. That implementation creates FileSystem on first call and the returns it for all subsequence calls. This is incorrect since the object creation depends on configuration properties. A proper fix could be to cache multiple objects: one per each unique set of config values.

std::function<std::shared_ptr<
    filesystems::FileSystem>(std::shared_ptr<const Config>, std::string_view)>
WarmStorageFileSystem::fileSystemGenerator(
    const std::string& oncall,
    const std::string& user,
    const std::string& svc) {
  return [oncall, user, svc](
             std::shared_ptr<const Config> properties,
             std::string_view /*filePath*/) {
    // One instance of WS FileSystem is sufficient.
    // Initialize on first access and reuse after that.
    static std::shared_ptr<filesystems::FileSystem> wsfs;
    folly::call_once(fsInstantiationFlag, [&properties, oncall, user, svc]() {
      wsfs = std::make_shared<WarmStorageFileSystem>(
          properties, oncall, user, svc);
    });
    return wsfs;
  };
}

Specifying the config during registration will only help those filesystems that have a fixed end-point and are expected to have only a single instance. It will also add a dependency between FS registration and catalog initialization. It will be cleaner and less confusing to avoid this special handling given users still have to use velox::filesystems::getFileSystem( std::string_view filename, std::shared_ptr<const Config> config) API. The caching of the filesystem is only an optimization and I feel it should not impact the API. For LocalFileSystem uses in SSD Cache and spilling, we can add getLocalFileSystem(std::string_view filename) API as the local file system does not require a config. In other places such as tableScan and tableWrite, we have to plumb the config and validate the config for each getFileSystem() invocation and ensure that the cached file system is valid. The simplest validation which is to ensure the config is not a nullptr will catch most of the bugs.

I fixed a similar issue recently for FileSink here https://github.com/facebookincubator/velox/pull/6325. The issue was only discovered when I moved a test to be standalone. As you noticed, the global static initialization of a file system can hide issues. I don’t think the client solution proposed is feasible since the <URI-to-some-resource-in-filesystem> might not be available always. The right solution is to plumb the config and fix the Velox velox::filesystems::getFileSystem API. How about we add a check for config for each call to velox::filesystems::getFileSystem()? The check must ensure the config passed matches the instantiated FS. For example, in S3, we could ensure the Hive S3 configuration matches always.