exllama: Performance degradation
I ran a test comparing the latest commit (77545c) against bec6c9 on an H100 with a 30B model, and I can see a stable performance degradation.
| Latest (77545c) | bec6c9 |
| --- | --- |
| 25 t/s | 34 t/s |
Thoughts?
About this issue
- State: open
- Created a year ago
- Comments: 20 (6 by maintainers)
Typo is fixed. Thanks. But attention probably isn’t the issue anyway. I guess I’ll have to add a profiling mode to time the CUDA kernel launches, since the performance profiles are so different across architectures. I really have no idea right now why it’s slower on Hopper than on Ada, or why the recent version is slower than the older one.
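A profiling mode like the one described could be built as a thin wrapper that accumulates time per kernel label. This is only a minimal sketch of the pattern (the `LaunchProfiler` name and fields are illustrative, not exllama's code), using wall-clock timing; a real CUDA version would bracket each launch with `cudaEventRecord`/`cudaEventElapsedTime`, or synchronize before reading the clock, since kernel launches are asynchronous.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <map>
#include <string>

// Hypothetical profiling-mode sketch: accumulate elapsed time per label.
struct LaunchProfiler {
    bool enabled = false;
    std::map<std::string, double> total_ms;
    std::map<std::string, int> calls;

    // Run `launch` (the kernel launch), timing it when profiling is on.
    void timed(const std::string& label, const std::function<void()>& launch) {
        if (!enabled) { launch(); return; }
        auto t0 = std::chrono::steady_clock::now();
        launch();  // in a real GPU build: launch + event/sync before timing
        auto t1 = std::chrono::steady_clock::now();
        total_ms[label] +=
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        calls[label] += 1;
    }

    // Print accumulated per-kernel totals.
    void report() const {
        for (const auto& [label, ms] : total_ms)
            std::printf("%-24s %8.3f ms over %d calls\n",
                        label.c_str(), ms, calls.at(label));
    }
};
```

With `enabled = false` the wrapper adds essentially no overhead, so it can stay compiled in and be toggled at runtime to compare architectures.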
I didn't actually change much in the CUDA code; I mostly just moved stuff from Python to C++, and mostly trivial stuff too. E.g. instead of passing five separate PyTorch tensors to every C++ function, it now passes a pointer to a C++ object that references the five underlying storages. There's also strictly less initialization than before.
There is something fishy going on for sure. Higher SM utilization is usually a good thing, but here it apparently means the GPU is doing extra work for some reason…? Higher GPU power consumption too. I'm very baffled.
But I'm going to be adding CUDA graphs soon, and that may completely change how the H100 reacts to the model. So it's probably best to wait and see what happens with that first.