datafusion: DataFusion does not support wasm32-unknown-unknown target

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11615

The Arrow crate successfully compiles to WebAssembly (e.g. https://github.com/domoritz/arrow-wasm) but the DataFusion crate currently does not support the wasm32-unknown-unknown target.

Try out the repository at https://github.com/domoritz/datafusion-wasm/tree/73105fd1b2e3ca6c32ec4652c271fb741bda419a.

{code}
error[E0433]: failed to resolve: could not find unix in os
  --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
   |
41 |     use std::os::unix::ffi::OsStringExt;
   |                  ^^^^ could not find unix in os

error[E0432]: unresolved import unix
 --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
  |
6 | use unix;
  |     ^^^^ no unix in the root

error[E0433]: failed to resolve: use of undeclared crate or module sys
  --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:98:9
   |
98 |         sys::duplicate(self)
   |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:101:9
    |
101 |         sys::allocated_size(self)
    |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:104:9
    |
104 |         sys::allocate(self, len)
    |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:107:9
    |
107 |         sys::lock_shared(self)
    |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:110:9
    |
110 |         sys::lock_exclusive(self)
    |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:113:9
    |
113 |         sys::try_lock_shared(self)
    |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:116:9
    |
116 |         sys::try_lock_exclusive(self)
    |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:119:9
    |
119 |         sys::unlock(self)
    |         ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:126:5
    |
126 |     sys::lock_error()
    |     ^^^ use of undeclared crate or module sys

error[E0433]: failed to resolve: use of undeclared crate or module sys
   --> /Users/dominik/.cargo/registry/src/github.com-1ecc6299db9ec823/fs2-0.4.3/src/lib.rs:169:5
    |
169 |     sys::statvfs(path.as_ref())
    |     ^^^ use of undeclared crate or module sys

   Compiling num-rational v0.3.2
error: aborting due to 10 previous errors
{code}
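
For context, an "undeclared crate or module sys" error is what one would expect when a crate declares its platform layer only for unix and windows, since neither cfg holds on wasm32-unknown-unknown. The sketch below illustrates that pattern with a fallback branch added. The module and function names are hypothetical and are not copied from fs2 or dirs.

```rust
// Hypothetical platform layer, selected at compile time.
#[cfg(unix)]
mod sys {
    pub fn platform() -> &'static str {
        "unix"
    }
}

#[cfg(windows)]
mod sys {
    pub fn platform() -> &'static str {
        "windows"
    }
}

// Without this branch, `sys` does not exist on targets that are neither
// unix nor windows (such as wasm32-unknown-unknown), and every `sys::...`
// call fails with E0433.
#[cfg(not(any(unix, windows)))]
mod sys {
    pub fn platform() -> &'static str {
        "unsupported"
    }
}

/// Reports which platform layer was compiled in.
pub fn platform_name() -> &'static str {
    sys::platform()
}
```

A dependency that lacks such a fallback has to be patched, made optional, or removed before the dependent crate can build for the wasm target.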

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Good news, fellow WebAssembly enthusiasts! It looks like the stars are finally aligning, and with relatively minimal patching, I successfully compiled the code from the gist (create, insert and query a MemTable) to wasm32-wasi and wasm32-unknown-unknown, and ran it in wasmedge and the browser (via wasm-pack):

❯ docker run --rm -it -v $(pwd)/target/wasm32-wasi/debug:/app wasmedge/slim:0.11.2-rc.1 wasmedge --reactor dfwasm.wasm _start
+---+----+
| a | b  |
+---+----+
| b | 10 |
| c | 10 |
+---+----+
0

I pushed the proof-of-concept to a public repository at splitgraph/experimental-datafusion-webassembly. There are two branches:

  • wasm32-wasi
    • This is the target I got working first. The readme on this branch contains all the details and you should be able to reproduce it yourself.
  • wasm32-unknown-unknown
    • This is branched from wasm32-wasi and the diff of wasm32-wasi..wasm32-unknown-unknown shows the changes
    • The top of the readme includes instructions for running this in the browser, but the patch is still very messy and might not be easily reproducible. Make sure you check Cargo.toml for any patched crates that you need to have checked out at a local path.

In the near future, I intend to clean up these changes and submit a PR to DataFusion feature-flagging WebAssembly support.

In general, the summary of requirements for wasm32-wasi:

For wasm32-unknown-unknown, in addition to all those requirements, it was also necessary to:

  • Replace usage of std::time with Instant, in both datafusion and arrow
  • Make sure every library that calls getrandom is also passing it the js feature flag, which I did by just patching getrandom and making that the default
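
One way to picture the std::time change is a small clock module that is swapped out per target, since calling std::time::Instant::now() on wasm32-unknown-unknown fails at runtime. This is only a sketch under stated assumptions: the clock module, the timed helper and the host_now_ms import are hypothetical names, not the actual patch applied to datafusion or arrow.

```rust
// Native targets: std::time::Instant works as usual.
#[cfg(not(all(target_arch = "wasm32", target_os = "unknown")))]
mod clock {
    use std::sync::OnceLock;
    use std::time::Instant;

    static ORIGIN: OnceLock<Instant> = OnceLock::new();

    /// Milliseconds since the first call to this function.
    pub fn now_ms() -> f64 {
        ORIGIN.get_or_init(Instant::now).elapsed().as_secs_f64() * 1000.0
    }
}

// wasm32-unknown-unknown: ask the host for the time instead.
#[cfg(all(target_arch = "wasm32", target_os = "unknown"))]
mod clock {
    extern "C" {
        // ASSUMPTION: hypothetical import that the embedder must provide,
        // for example bound to performance.now() in a browser.
        fn host_now_ms() -> f64;
    }

    pub fn now_ms() -> f64 {
        // SAFETY: relies on the host supplying a function with this signature.
        unsafe { host_now_ms() }
    }
}

/// Runs `f` and returns its result together with the elapsed milliseconds.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, f64) {
    let start = clock::now_ms();
    let value = f();
    (value, clock::now_ms() - start)
}
```

Code that only calls timed (or clock::now_ms) then compiles unchanged for both native and wasm targets, which keeps the cfg attributes in one place.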

To get it to run (without a runtime error related to std::time being unreachable), a few more changes were made:

  • Don’t run the demo code in a Tokio main runtime, even with flavor = current-thread. Instead, use wasm-bindgen-futures to await a future that performs the asynchronous task that calls datafusion

This is all very messy. I will clean it up and submit a PR to DataFusion once I have a better sense of the most minimal changes required and the proper way to feature flag them. Also, general disclaimer that I’m new to Rust and YMMV, especially on the wasm32-unknown-unknown patch - after all, I barely got it to run. But it does compile and create and query a small in-memory table, which is pretty good!

Hello, folks.

I’m trying to add WASM support to DataFusion’s dependencies. I started with bzip2-rs: https://github.com/alexcrichton/bzip2-rs/pull/93

This sounds very cool @milesrichardson - DataFusion should be upgraded to arrow 26.0.0 shortly: https://github.com/apache/arrow-datafusion/pull/4039. I think @Jimexist is in the process of making bzip support optional https://github.com/apache/arrow-datafusion/pull/3993

In terms of being messy / submitting a PR – if it is possible I suggest trying to do it incrementally – like for example we can probably sort out the calls to spawn_blocking in a separate PR

But all in all this is pretty exciting

Thanks @alamb. I will do some experiments, but it seems like a good solution.

@REASY In my experiment (the one linked above), I put bzip behind a configuration flag and disabled it for the wasm targets. Datafusion still compiled. I don’t know enough about DF to say how important bzip is, or which parts of DF would be broken without it, however. It seemed limited in scope, since it should only affect files that are encoded with bzip.
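
A rough illustration of what putting bzip behind a flag can look like: the codec exists only when a Cargo feature is enabled, and requests for it fail with a clear error otherwise. The feature name "bzip2", the Codec enum and codec_for_extension are hypothetical here and are not DataFusion's actual API.

```rust
/// Compression formats known to this sketch.
#[derive(Debug, PartialEq)]
pub enum Codec {
    Uncompressed,
    // Only compiled in when the (hypothetical) "bzip2" feature is enabled.
    #[cfg(feature = "bzip2")]
    Bzip2,
}

/// Picks a codec from a file extension.
pub fn codec_for_extension(ext: &str) -> Result<Codec, String> {
    match ext {
        "csv" | "json" => Ok(Codec::Uncompressed),
        #[cfg(feature = "bzip2")]
        "bz2" => Ok(Codec::Bzip2),
        // With the feature off (e.g. for wasm builds), only bzip-encoded
        // files are rejected; everything else keeps working.
        #[cfg(not(feature = "bzip2"))]
        "bz2" => Err("bzip2 support was not compiled in".to_string()),
        other => Err(format!("unknown extension: {other}")),
    }
}
```

This matches the observation above that the impact should be limited to files encoded with bzip.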

@seddonm1 compile and run?

I experimented with that yesterday. I tried wasm32-wasi first and a simple sample works in single-threaded mode after disabling some parquet features. See this gist for the example: https://gist.github.com/roee88/91f2b67c3e180fa0dfb688ba8d923dae

For wasm32-unknown-unknown, adding getrandom with js as a dependency of the sample makes it compile IIRC, but actually running it is a different story. I tried to get a sample working with wasm-pack and it stops execution on the datafusion context creation. I suspect that it uses some sync primitives that are unsupported in wasm32-unknown-unknown, but I didn’t investigate further.

I didn’t try wasm32-unknown-emscripten yet since my local rust version is incompatible with my installed emcc version (both latest at the time of this writing).

Edit: regarding tokio, the sample above worked on wasm32-wasi with other executors in single-threaded mode, including futures 0.3, https://github.com/richardanaya/executor, and async-global-executor. As long as you don’t hit code paths that use things like tokio::spawn (used in hash aggregate), it might be fine to use another executor. I’m not sure what the best approach is for library code that needs to spawn tasks. I have seen opinions that 1) a library should never spawn, 2) futures should be universally supported, and 3) a library should accept an executor trait (as implemented in https://github.com/najamelan/async_executors). I haven’t checked the state of futures and WebAssembly recently. I didn’t try wasm-bindgen-futures because it’s officially no longer compatible with wasi and emscripten and, as I said, I couldn’t get anything running with wasm32-unknown-unknown.
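
To make option 3 concrete, here is a minimal std-only sketch of a library that accepts a spawner instead of calling tokio::spawn, paired with a toy single-threaded executor. The Spawner trait, LocalExecutor and spawn_partitions are hypothetical names for illustration; this is not the async_executors API and not how DataFusion is written.

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Hypothetical trait a library could accept instead of spawning directly.
pub trait Spawner {
    fn spawn(&self, task: Task);
}

/// Waker that does nothing; the executor below simply re-polls tasks.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor that runs every task on the current thread.
#[derive(Default)]
pub struct LocalExecutor {
    queue: RefCell<VecDeque<Task>>,
}

impl Spawner for LocalExecutor {
    fn spawn(&self, task: Task) {
        self.queue.borrow_mut().push_back(task);
    }
}

impl LocalExecutor {
    /// Polls queued tasks round-robin until all have completed.
    /// Pending tasks are re-queued and polled again (busy polling).
    pub fn run(&self) {
        let waker = Waker::from(Arc::new(NoopWaker));
        let mut cx = Context::from_waker(&waker);
        loop {
            // The borrow ends with this statement, so tasks may call spawn.
            let next = self.queue.borrow_mut().pop_front();
            let Some(mut task) = next else { break };
            if task.as_mut().poll(&mut cx).is_pending() {
                self.queue.borrow_mut().push_back(task);
            }
        }
    }
}

/// Stand-in for library code: it hands work to whatever spawner it is given.
pub fn spawn_partitions(spawner: &dyn Spawner, partitions: usize, done: Rc<Cell<usize>>) {
    for _ in 0..partitions {
        let done = Rc::clone(&done);
        spawner.spawn(Box::pin(async move { done.set(done.get() + 1) }));
    }
}
```

The library side depends only on the trait, so the caller can decide whether tasks go to tokio, a browser event loop, or a simple local queue like this one.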

dirs-rs:

Note that https://github.com/apache/arrow-rs/pull/656 from @PsiACE has removed the pretty-table dependency in arrow-rs upstream. This will be included in the 6.0 arrow release (in 2ish months); I am not sure if/how this affects your decision

lz4:

I think lz4 is an optional dependency of parquet: https://github.com/apache/arrow-rs/blob/master/parquet/Cargo.toml#L40 thus perhaps we could just have a lz4 feature flag for datafusion?

Polars proof of concept (shows that arrow-rs and datafusion like API can work): https://github.com/ritchie46/polars/blob/master/js-polars/app.js