candle: Explicit panic on Falcon

Hi,

I have gotten most models working; thanks again for your guidance. But I hit one other issue, this time with Falcon. Hopefully someone knows about this.

codespace@codespaces-f226cf:/workspaces/rust-candle-demos/candle$ RUST_BACKTRACE=1 && cargo run --example falcon --release -- --prompt "which 100m sprinter won the 1984 olympics"?
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
    Finished release [optimized] target(s) in 0.35s
     Running `target/release/examples/falcon --prompt 'which 100m sprinter won the 1984 olympics?'`
Running on CPU, to run on GPU, build this example with `--features cuda`
retrieved the files in 226.6µs
loaded the model in 8.0565019s
starting the inference loop
thread 'main' panicked at 'explicit panic', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-gemm-0.15.6/src/gemm.rs:143:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
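
Side note: the resolver warning at the top is unrelated to the panic. Per cargo's notes in the log, it goes away once the resolver is pinned in the workspace root's manifest. A minimal sketch (the members list here is hypothetical; keep whatever the workspace already declares):

# Cargo.toml at the workspace root
[workspace]
resolver = "2"   # opt in to the edition-2021 resolver explicitly
members = ["candle-core", "candle-examples"]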

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 19

Most upvoted comments

That’s pretty unexpected, it seems that the cuda kernel is_u32_bf16 is not found. Maybe it’s some caching issue somehow, could you try running make clean-ptx and giving it another go? My guess is that you already built the example with a lower compute cap and the build system didn’t have a way to detect that this has changed.

This was it! Thanks again for all of the help; Candle is pretty cool! I started with the p4, then switched the machine to a G5, which has a better cost profile and better performance for development, and is actually supported by Candle! I simply “switched machine type”, which I should have thought about: that is what created the subtle compilation issue.

Here is the workflow that verifies Falcon works on a g5.16xlarge. Later this week I will open another ticket on a related but different topic, machine compatibility for development, and put some of my notes in there.

ubuntu@ip-172-31-21-63:~/candle$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6
ubuntu@ip-172-31-21-63:~/candle$ make clean-ptx
find target -name "*.ptx" -type f -delete
echo "" > candle-kernels/src/lib.rs
touch candle-kernels/build.rs
touch candle-examples/build.rs
touch candle-flash-attn/build.rs
ubuntu@ip-172-31-21-63:~/candle$ RUST_BACKTRACE=1 cargo run --features cuda --example falcon --release -- --prompt "What is the best type of apple to eat?"
   Compiling candle-examples v0.1.3 (/home/ubuntu/candle/candle-examples)
   Compiling candle-kernels v0.1.3 (/home/ubuntu/candle/candle-kernels)
   Compiling candle-core v0.1.3 (/home/ubuntu/candle/candle-core)
   Compiling candle-nn v0.1.3 (/home/ubuntu/candle/candle-nn)
   Compiling candle-transformers v0.1.3 (/home/ubuntu/candle/candle-transformers)
   Compiling candle-datasets v0.1.3 (/home/ubuntu/candle/candle-datasets)
    Finished release [optimized] target(s) in 41.51s
     Running `target/release/examples/falcon --prompt 'What is the best type of apple to eat?'`
retrieved the files in 22.997771ms
loaded the model in 158.647230995s
starting the inference loop
> 9.624751977s
1 token: 193 '
'
> 102.677149ms
2 token: 487 'The'
> 43.732627ms
3 token: 935 ' best'
> 43.633475ms
4 token: 19088 ' apples'
> 43.604565ms
5 token: 271 ' to'
> 43.646475ms
6 token: 3781 ' eat'
> 68.119076ms
7 token: 362 ' are'
> 43.602295ms
8 token: 248 ' the'
> 43.662716ms
9 token: 3040 ' ones'
> 43.624095ms
10 token: 325 ' that'
> 43.707497ms
11 token: 362 ' are'
> 43.652905ms
12 token: 272 ' in'
> 43.697466ms
13 token: 2120 ' season'
> 43.602755ms
14 token: 25 '.'
[... tokens 15-98 elided: the same sentence repeats, one token every ~43-44 ms ...]
> 43.913241ms
99 token: 193 '
'
> 43.911771ms
100 token: 487 'The'
100 tokens generated (7.115549321467834 token/s)
----

The best apples to eat are the ones that are in season.
The best apples to eat are the ones that are in season.
The best apples to eat are the ones that are in season.
The best apples to eat are the ones that are in season.
The best apples to eat are the ones that are in season.
The best apples to eat are the ones that are in season.
The best apples to eat are the ones that are in season.
The
----
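
Sanity check on the reported rate: the 9.62 s first token dominates the average. The remaining 99 tokens at roughly 43.8 ms each take about 99 × 0.0438 ≈ 4.3 s, so 100 / (9.62 + 4.3) ≈ 7.2 token/s, consistent with the reported 7.12 token/s; steady-state decode alone is about 1 / 0.0438 ≈ 23 token/s.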

You probably already tried it, but using --features cuda might well work out of the box if CUDA is installed in one of the default locations. E.g.

RUST_BACKTRACE=1 cargo run --example falcon --features cuda --release -- --prompt ...

Or you can try --use-f32 if you have lots of memory 😃
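
For the f32 route, the invocation would presumably look like the following, assuming the falcon example accepts --use-f32 as a plain flag after the -- separator:

RUST_BACKTRACE=1 cargo run --example falcon --release -- --use-f32 --prompt "What is the best type of apple to eat?"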

That’s because the model uses bf16, for which there is no CPU implementation. I’ll make the error message more explicit; if you have a GPU, maybe you want to use --features cuda?
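
To make the failure mode concrete: a bf16 matmul is the kind of op that panics on the CPU backend, and casting to f32 first is presumably what --use-f32 achieves at the model level. A minimal standalone sketch against candle-core's public Tensor/DType API (not the falcon example's actual code, and the crate version is assumed compatible):

use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;
    // Small bf16 operands: multiplying these directly is what trips
    // the explicit panic on CPU in this candle version.
    let a = Tensor::randn(0f32, 1f32, (2, 3), &device)?.to_dtype(DType::BF16)?;
    let b = Tensor::randn(0f32, 1f32, (3, 2), &device)?.to_dtype(DType::BF16)?;
    // Workaround: cast up to f32 before running the op on CPU.
    let c = a.to_dtype(DType::F32)?.matmul(&b.to_dtype(DType::F32)?)?;
    println!("result dtype: {:?}, shape: {:?}", c.dtype(), c.shape());
    Ok(())
}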