wgpu: Dynamic dispatch layer breaks on web
Description
The web backend currently produces an error like the following when running the hello-triangle example with cargo run-wasm --example=hello-triangle. The browser console output shows:
panicked at 'assertion failed: wasm.is_instance_of::<T>()', wgpu/src/backend/web.rs:53:9
Stack:
getImports/imports.wbg.__wbg_new_abda76e883ba8a5f@http://localhost:8000/hello-triangle.js:340:21
console_error_panic_hook::Error::new::h09e78aee7fd991ae@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[4182]:0xf15ec
console_error_panic_hook::hook_impl::h81c8bf5494034336@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[717]:0x66f7d
console_error_panic_hook::hook::hc9dd7d1d176aa725@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[4719]:0xf8b1e
core::ops::function::Fn::call::hc3d05c66054772b6@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[3865]:0xec751
std::panicking::rust_panic_with_hook::hb42d415afcc11f2f@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[1721]:0xaeb05
std::panicking::begin_panic_handler::{{closure}}::h57dd84c078404aa2@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[2168]:0xc3482
std::sys_common::backtrace::__rust_end_short_backtrace::hef94dcc71d3b37f5@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[5052]:0xfcef3
rust_begin_unwind@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[3238]:0xe0f9f
core::panicking::panic_fmt::h32de9c76c9d5eb0c@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[4581]:0xf6f1d
core::panicking::panic::hff08c7cf90edd9fe@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[3862]:0xec692
<wgpu::backend::web::Identified<T> as core::convert::From<wgpu::context::ObjectId>>::from::h09e99484006351c1@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[1664]:0xaba02
<T as wgpu::context::DynContext>::adapter_request_device::he51a22a86469d596@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[668]:0x61f45
wgpu::Adapter::request_device::hdaaa471e221d3c58@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[1073]:0x856fc
hello_triangle::run::{{closure}}::h19b0046d2265909e@http://localhost:8000/hello-triangle_bg.wasm:wasm-function[255]:0xb1d7
... Rest of call stack trimmed for brevity
Now, this is not an error caused by mismatched types. By inserting some debug log statements, we see the following console output before the panic:
(0) Upcast(id = 74): wgpu::backend::web::Identified<web_sys::features::gen_GpuCanvasContext::GpuCanvasContext> [hello-triangle.js:489:17](http://localhost:8000/hello-triangle.js)
(1) Upcast(id = 72): wgpu::backend::web::Identified<web_sys::features::gen_GpuAdapter::GpuAdapter> [hello-triangle.js:489:17](http://localhost:8000/hello-triangle.js)
(2) Downcasting(id = 72): wgpu::backend::web::Identified<web_sys::features::gen_GpuAdapter::GpuAdapter> [hello-triangle.js:489:17](http://localhost:8000/hello-triangle.js)
(3) Downcasting(id = 72): wgpu::backend::web::Identified<web_sys::features::gen_GpuAdapter::GpuAdapter> [hello-triangle.js:489:17](http://localhost:8000/hello-triangle.js)
panicked at 'assertion failed: wasm.is_instance_of::<T>()', wgpu/src/backend/web.rs:61:9
...
What we see here is that the failure occurs on the second downcast of the GpuAdapter. This happens because of how the web backend currently converts an ObjectId into whatever associated type is desired. In particular, the web backend takes the ObjectId and converts the raw number directly into the desired wasm_bindgen type (such as GpuAdapter), using the IntoWasmAbi and FromWasmAbi traits. However, these traits have a lesser-known side effect: FromWasmAbi and IntoWasmAbi transfer ownership of the JS object. When the Rust value is dropped, the object is destroyed on the JS side. Any further use of the object after downcasting it again will of course fail, since the actual object on the JS side was destroyed. With strict asserts enabled, this just happens to be caught easily by the is_instance_of check.
When running in release mode (to disable strict asserts) a different error shows up:
Uncaught (in promise) TypeError: getObject(...).requestDevice is not a function
__wbg_requestDevice_7dbb93ae9dc5edc1 http://localhost:8000/hello-triangle.js:500
__wbg_adapter_49 http://localhost:8000/hello-triangle.js:229
real http://localhost:8000/hello-triangle.js:202
promise callback*getImports/imports.wbg.__wbg_then_11f7a54d67b4bfad http://localhost:8000/hello-triangle.js:1128
__wbg_adapter_49 http://localhost:8000/hello-triangle.js:229
real http://localhost:8000/hello-triangle.js:202
promise callback*getImports/imports.wbg.__wbg_then_cedad20fbbd9418a http://localhost:8000/hello-triangle.js:1132
__wbg_adapter_49 http://localhost:8000/hello-triangle.js:229
real http://localhost:8000/hello-triangle.js:202
promise callback*getImports/imports.wbg.__wbg_then_11f7a54d67b4bfad http://localhost:8000/hello-triangle.js:1128
finalizeInit http://localhost:8000/hello-triangle.js:1240
init http://localhost:8000/hello-triangle.js:1272
async* http://localhost:8000/:17
EventListener.handleEvent* http://localhost:8000/:16
The fix is to avoid taking ownership of the object when downcasting. This is easier said than done, however: you need to make sure the object is destroyed on drop to prevent memory leaks. The cast itself can be done, but we can only receive a ManuallyDrop<T>, and returning a reference to a value created inside a function is of course not allowed. This leaves one of two solutions:
- Replace ObjectId with a Box<dyn Whatever>. This would be an easy solution, but it will cause a definite increase in memory allocations. That may not be a huge concern on either web or native, since the largest allocation would essentially be a pointer or two. Profiling would make sense here, but it is of course best to avoid an allocation-heavy pattern in the first place if possible.
- The web backend should store the underlying Gpu* objects in the XXXXData associated types. This does mean the ObjectId on web becomes effectively inert (unless the expose-ids feature is enabled). It also automatically deals with the memory-leak concerns around downcasting.
This solution, however, calls into question whether the dynamic dispatch layer should even have separate Data and Id types (consolidating instead on a Data type only). But the current native implementation uses some Data associated types, and future out-of-tree backends could use both.
Other notes
I tested this on Firefox Nightly (WebGPU does not work on Linux at all on Chrome Dev; there is no Canary). The same error does appear in Chrome Dev, even though the WebGPU implementation in Chrome is broken there.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 3
- Comments: 28 (13 by maintainers)
Commits related to this issue
- upgrade wgpu now that gfx-rs/wgpu#3430 is fixed — committed to compute-toys/wgpu-compute-toy by davidar a year ago
I wasn’t working on it as I thought @i509VCB was working on it - I can prioritize it as it seems to be blocking.
I have a PR that should fix this at #3657 - would be great if some larger projects could test against my branch to see if I missed anything. I found it really easy to make mistakes doing the downcasting (would be great to improve this somehow), so it’s possible there are some other paths that I didn’t hit while testing.
So there’s a way to work around this massive (UAF-shaped) unsoundness issue. But it’s not a good way to do it. Basically you just comment out the dropObject function body in the resulting .js file. No more free, no more problems (okay, you’re now leaking, which could be a problem for any transient objects). After about a minute or two it doesn’t look too bad though (this starts out around 200 entries).
The risk here, I suppose, would be large objects being kept around, not necessarily many objects.
However, we can limit this to just WebGPU objects (i.e. the ones wgpu::backend::web is constantly freeing due to ObjectId being Copy, etc.). After triggering significantly more redraws than with the simpler method: while Chrome itself is using 1.6 GiB, the JS heap size is shown as 3 MiB, with less than 1 MiB leaked by that heap array (based on the contents, if I had to guess it’s mostly GPURenderPassEncoder). I’m also getting some lag spikes, but I think those are from XWayland (as I couldn’t get WebGPU working natively).
Speaking of, this works with stable Chrome (Chrome/111.0.5563.64), and just as well with google-chrome-unstable (Chrome/113.0.5638.0, the version I did most of my experimenting with, since I wasn’t sure if the issue was older Chrome; but no, both work just as well).
- that’s the version I did most experimenting with, since I wasn’t sure if the issue was older Chrome, but nope, both work just as well).~I started working on this in https://github.com/grovesNL/wgpu/tree/dynamic-dispatch-workaround but it’s not clear how we could make this work without bigger changes to how
ObjectId
is used. I might be doing something wrong and misunderstood the approach you suggested though.~~For example,
Buffer
is defined as~~A lot of functions just work with e.g.,
buffer.id
, so we’d ideally have a way to get back toContext::BufferData
from theObjectId
. I don’t see a great way we could do that without going through the wasm ABI conversions again. In my branch you can remove theimpl<T: FromWasmAbi<Abi = u32> + JsCast> From<ObjectId> for Sendable<T>
to see most of the places I’m not sure about.~We also have some places accepting iterators likefn queue_submit<I: Iterator<Item = Self::CommandBufferId>>
where we’d like to accessSelf::CommandBufferData
instead. I’m not sure how we can deal with those either.Edit: I think I’m past the problems with types like buffer now - still not sure about iteratorsI’d like to change signatures likefn queue_submit<I: Iterator<Item = Self::CommandBufferId>>
to include data with something likefn queue_submit<I: Iterator<Item = (Self::CommandBufferId, Self::CommandBufferData)>>
but not sure how to make this work because of the type erasure.Edit2: Figured out how to get past the iterator problem I was running into. Still working on filling in some other places now, but I have hello-triangle working again
I’ll keep updating https://github.com/grovesNL/wgpu/tree/dynamic-dispatch-workaround as I work through it and open a PR once it’s in slightly better shape.
What is the progress on solving this? It seems to completely break wgpu 0.15 on the web for me and others.
Is there any suggested fix or workaround, or what is the plan for solving this issue?
I think the issue needs to be reopened, or am I wrong?
I will try to reduce the code to get a better understanding of the problem. I thought the same about the offscreen canvas part, but the integration test with wasm-pack uses a plain canvas, because we can’t test in a worker for now.
https://github.com/codeart1st/wgpu-layers/blob/main/tests/utils.rs#L33
More of the stack:
If I don’t rely on get_capabilities, I get the same error when calling configure on my swapchain surface.
I can confirm that this works with maplibre-rs.
@grovesNL I just tested https://compute.toys/ against your branch and can confirm it resolves this issue (and didn’t notice any breakages)
So the Context trait defines a bunch of associated types. My idea was to no longer downcast from ObjectId (making it just a global id) and instead store, for example, the GPUTexture inside the TextureData associated type, and the same with every other type that holds a web_sys::GPU type.