web-llm: Chat demo does not work on Android because of maxStorageBufferBindingSize

On my Pixel 7 Android device, the maxStorageBufferBindingSize limit is only 128 MB. Sadly, https://webllm.mlc.ai/#chat-demo requires 1024 MB for the Llama-2-7b-chat-hf-q4f16_1 model, as seen below when run on Chrome for Android.
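
For context, this is how a page can inspect the limit and ask for a larger one at device creation time (a minimal sketch that assumes an adapter is available; the 1 GiB value mirrors what the model needs and is illustrative):

const adapter = await navigator.gpu.requestAdapter();
console.log(adapter.limits.maxStorageBufferBindingSize); // 134217728 (128 MB) on this device

// requestDevice() rejects with an OperationError when the adapter
// cannot satisfy a requested limit.
const device = await adapter.requestDevice({
  requiredLimits: { maxStorageBufferBindingSize: 1024 * 1024 * 1024 },
});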

Since this is already possible with a native Android app (https://llm.mlc.ai/#android), would it be possible to lower this limit so that more Android devices can access and play with https://webllm.mlc.ai/#chat-demo? Note that WebGPU support is coming to Android soon. See https://groups.google.com/a/chromium.org/g/blink-dev/c/YFWuDlCKTP4/m/97C4LCBUBgAJ

[Screenshot: maxStorageBufferBindingSize error reported by the chat demo on Chrome for Android]

About this issue

  • State: open
  • Created 8 months ago
  • Comments: 57 (56 by maintainers)

Most upvoted comments

WebGPU is available without a flag in Chrome Canary for Android.
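
For anyone trying it out, a basic availability check looks like this (a minimal sketch):

if (!('gpu' in navigator)) {
  throw new Error('WebGPU is not supported in this browser.');
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
  // navigator.gpu can exist while no suitable adapter is available.
  throw new Error('WebGPU is supported, but no GPU adapter was found.');
}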

Thanks in advance! Also, I might have limited availability until next week because of Thanksgiving.

Enjoy 🦃

@toji may have insights. From what I understand, there's not much you can do from the web app side.

@beaufortfrancois https://github.com/mlc-ai/web-llm/pull/256 is my attempt to address this issue. We tried various ways of catching the crash, including https://github.com/apache/tvm/pull/16330 and using device.lost.then. However, none of them seem to be reliable, and Chrome keeps asking for memory until it crashes (with the "Aw, Snap!" crash page).
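
For reference, the device.lost pattern mentioned above looks roughly like this (a sketch of the standard API; as noted, it does not seem to fire reliably here before the tab crashes):

device.lost.then((info) => {
  // info.reason is 'destroyed' if device.destroy() was called,
  // otherwise 'unknown' (e.g. after an out-of-memory device loss).
  console.error(`WebGPU device lost (${info.reason}): ${info.message}`);
});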

Thank you @CharlieFRuan, I'll have a look at the PR. @toji, who added Android support to WebGPU in Chrome, may be interested in your findings as well.

Hmm, btw the latest Chrome Canary 122.0.6237.0, released yesterday, seems to break WebGPU on my Android device: requestAdapter fails on webgpureport.org.

I cannot reproduce this on https://webgpureport.org with Chrome Canary 122.0.6237.0 on my Pixel 7 device (Android 14). @toji, is this expected?

In the meantime, it is possible to react to “out-of-memory” GPU errors. See https://gpuweb.github.io/gpuweb/#error-scopes

device.pushErrorScope('out-of-memory');
// ...
device.popErrorScope().then((error) => {
  // popErrorScope() resolves with null when no error was captured.
  if (error) {
    console.error(error.message);
  }
});

It is possible that the model crashes in Llama: we can hit the VRAM limit with Llama-2 models (which go beyond 4 GB) when using 4-bit quantization, and on iOS we had to rely on 3-bit quantization to stay within the memory budget. On Android we can sometimes get a 4-bit Llama-2 model working, but that also requires going beyond the 4 GB VRAM limit, and depending on the system limit and the phone's capability it can crash.
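
As a rough back-of-the-envelope estimate (weights only, ignoring the KV cache and activations): 7B parameters × 4 bits ≈ 3.5 GB, while 7B × 3 bits ≈ 2.6 GB. So a 4-bit Llama-2-7B can plausibly exceed a 4 GB budget once runtime buffers are added, whereas 3-bit leaves more headroom.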

🥳 I was finally able to get logs on my Pixel 7 Android device!

[FATAL] /ssd1/cfruan/packages/tvm-unity/src/runtime/memory/memory_manager.cc:108:
InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: 
storage allocation failure, attempted to allocate 8396800 at offset 0 in region that is 8388608bytes
@ worker.360c8a9c.js:669

Uncaught (in promise) DOMException: Failed to execute 'mapAsync' on 'GPUBuffer': Device is lost
[Screenshot: 2023-11-13 at 15:42]
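
(For reference, the failed allocation of 8396800 bytes overshoots the 8388608-byte, i.e. 8 MiB, region by just 8192 bytes; the device is then lost, which is why the subsequent mapAsync call fails.)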

Hopefully this will help.

So 3B models are likely a good choice for demoing LLMs on Android/WebGPU, or 2-bit model variants if we get them later. Glad that RedPajama-INCITE-Chat-3B-v1-q4f32_1-1k works; perhaps you can also try RedPajama-INCITE-Chat-3B-v1-q4f16_1-1k to test f16 support.

[Screenshot: 2023-11-13 at 15:36]

Glad that the 3B model works; this is the first running example of a WebGPU-native LLM on a mobile phone AFAIK. Thank you @beaufortfrancois for pushing this. We'd love to share this with the broader community, and let us know how we can help.

I’m also really happy 😄 to see WebLLM running on Android without any flag 😉

That model depends on the shader-f16 feature, which only exists in Chrome Canary (and not Chrome stable AFAIK), so I'm not sure if it works on Android. If it is possible to get the console output, it would also help us see what is going on.
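
A quick way to confirm is to feature-detect shader-f16 before requesting the device (a minimal sketch):

const adapter = await navigator.gpu.requestAdapter();
if (adapter.features.has('shader-f16')) {
  // The feature must be requested explicitly; adapter support alone
  // does not enable it on the device.
  const device = await adapter.requestDevice({ requiredFeatures: ['shader-f16'] });
} else {
  console.warn('shader-f16 is not supported on this adapter.');
}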

shader-f16 is shipping in Chrome 120. See https://groups.google.com/a/chromium.org/g/blink-dev/c/AsKn-UwMYAE/m/4FKB-x_QAQAJ

I’m unable to get logs as Chrome is now crashing with this model. Sorry 😭 I’ve tried several times.

Do you mind trying out RedPajama-INCITE-Chat-3B-v1-q4f32_1-1k?

It works great with RedPajama-INCITE-Chat-3B-v1-q4f32_1-1k.

[Screenshot: 2023-11-13 at 15:06]