velox: Inconsistencies in FileSystem initialization
Bug description
Expected Behaviour
File system is initialised when the fist invocation to filesystem is made. In order to correctly initialise the file system connector config needs to be passed.
Not all classes have access to the connector config, but they need to still need to invoke the filesystem APIs. As a result in a number of places nullptr is being passed as connector config. See: https://github.com/search?q=repo%3Afacebookincubator%2Fvelox+%3A%3AgetFileSystem&type=code
The side-effect of this being, if this happens to be first invocation of filesystem, the filesystem instance gets initialised incorrectly.
An option is to first retrieve the connector config using the API, velox::connector::getConnector("<connector-id>")->connectorProperties())
, but leads to coupling of connector-id and file system mapping in client code.
Example of this workaround looks like:
auto connectors = facebook::velox::connector::getAllConnectors();
// final connector called prism
for (auto c : connectors) {
if (c.first == "<connector-id>") {
velox::filesystems::getFileSystem(
"<URI-to-some-resource-in-filesystem>",
velox::connector::getConnector("<connector-id>")->connectorProperties());
}
}
System information
N/A
Relevant logs
No response
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 16 (8 by maintainers)
@mbasmanova that is correct. HDFS already does this today based on the hdfs identity. Each file system must generate a unique identity and a filesystem based on the config. https://github.com/facebookincubator/velox/blob/5e07a684c33cc55631f6681a85f21421b8ab4917/velox/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp#L33-L45
It defaults to
20'000
and is configurable vianum_cached_file_handles
https://github.com/facebookincubator/velox/blob/5e07a684c33cc55631f6681a85f21421b8ab4917/velox/connectors/hive/HiveConnector.cpp#L51-L54Thanks a lot for the input @mbasmanova @majetideepak . I will close the task, and followup with a change to the internal client.
@vermapratyush @majetideepak Deepak, Pratyush, I took a close look and it looks like the bug is in the implementation of WarmStorageFileSystem. That implementation creates FileSystem on first call and the returns it for all subsequence calls. This is incorrect since the object creation depends on configuration properties. A proper fix could be to cache multiple objects: one per each unique set of config values.
Specifying the config during registration will only help those filesystems that have a fixed end-point and are expected to have only a single instance. It will also add a dependency between FS registration and catalog initialization. It will be cleaner and less confusing to avoid this special handling given users still have to use
velox::filesystems::getFileSystem( std::string_view filename, std::shared_ptr<const Config> config)
API. The caching of the filesystem is only an optimization and I feel it should not impact the API. For LocalFileSystem uses in SSD Cache and spilling, we can addgetLocalFileSystem(std::string_view filename)
API as the local file system does not require a config. In other places such astableScan
andtableWrite
, we have to plumb the config and validate the config for each getFileSystem() invocation and ensure that the cached file system is valid. The simplest validation which is to ensure the config is not a nullptr will catch most of the bugs.I fixed a similar issue recently for FileSink here https://github.com/facebookincubator/velox/pull/6325. The issue was only discovered when I moved a test to be standalone. As you noticed, the global static initialization of a file system can hide issues. I don’t think the client solution proposed is feasible since the
<URI-to-some-resource-in-filesystem>
might not be available always. The right solution is to plumb the config and fix the Veloxvelox::filesystems::getFileSystem
API. How about we add a check for config for each call tovelox::filesystems::getFileSystem()
? The check must ensure the config passed matches the instantiated FS. For example, in S3, we could ensure the Hive S3 configuration matches always.