llama-cpp-python: Subsequent prompts are around 10x to 12x slower than on llama.cpp "main" example.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I am creating a simple clone of the “main” example from the llama.cpp repo, which involves interactive mode with really fast inference of around 36 ms per token.
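For reference, a minimal sketch of such a clone using llama-cpp-python's high-level streaming API (the model path and prompt handling here are placeholders of my own, not taken from the original report):

```python
def interactive_loop(model_path: str) -> None:
    """Rough equivalent of llama.cpp's "main" interactive mode:
    read a prompt, then stream tokens back as they are generated."""
    # Imported inside the function so the sketch reads standalone
    # even without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=512)
    while True:
        prompt = input("> ")
        if not prompt:
            break
        # stream=True yields completion chunks token by token.
        for chunk in llm(prompt, max_tokens=256, stream=True):
            print(chunk["choices"][0]["text"], end="", flush=True)
        print()
```

Calling `interactive_loop("models/7B/model.bin")` (with a real model path) should then behave like `./main -i` in llama.cpp.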
Current Behavior
Generating the first token takes around 10–12 seconds, and subsequent tokens take around 200–300 ms each. It should match the speed of the “main” example from the llama.cpp repo.
Environment and Context
I am using a context size of 512, a prediction length of 256, and a batch size of 1024. The rest of the settings are default. I am also using CLBlast, which on llama.cpp gives me a 2.5x boost in performance.
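For clarity, here is how those settings map onto llama-cpp-python's constructor and call arguments — a sketch assuming the reporter passes them this way (the model path is hypothetical):

```python
# Settings from the report, expressed as llama-cpp-python arguments.
MODEL_SETTINGS = {
    "n_ctx": 512,     # context size  (llama.cpp: -c 512)
    "n_batch": 1024,  # batch size    (llama.cpp: -b 1024)
}
GENERATION_SETTINGS = {
    "max_tokens": 256,  # prediction length (llama.cpp: -n 256)
}

def build_llm(model_path: str):
    """Construct the model with the settings above; all other options default."""
    # Imported lazily so the sketch stays importable without the package.
    from llama_cpp import Llama
    return Llama(model_path=model_path, **MODEL_SETTINGS)
```

Generation would then be `build_llm(path)(prompt, **GENERATION_SETTINGS)`, which should correspond to `./main -c 512 -b 1024 -n 256` on the llama.cpp side.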
- AMD Ryzen 5 3600 6-Core Processor + RX 580 4 GB
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 3600 6-Core Processor
CPU family: 23
Model: 113
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 94%
CPU max MHz: 4208,2031
CPU min MHz: 2200,0000
BogoMIPS: 7186,94
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Number of devices 1
Device Name gfx803
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 1.2
Driver Version 3513.0 (HSA1.1,LC)
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Board Name (AMD) AMD Radeon RX 580 Series
- Linux:
Linux bober-desktop 6.3.1-x64v1-xanmod1-2 #1 SMP PREEMPT_DYNAMIC Sun, 07 May 2023 10:32:57 +0000 x86_64 GNU/Linux
- Versions:
Python 3.11.3
GNU Make 4.4.1
Built for x86_64-pc-linux-gnu
g++ (GCC) 13.1.1 20230429
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (3 by maintainers)
Commits related to this issue
- Add --ignore-eos parameter (#181) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> — committed to xaptronic/llama-cpp-python by slaren a year ago
@abetlen Yeah. I need to do some stuff, but then I will more formally test llama.cpp vs llama-cpp-python with the profiles like that ^, including tokens/sec for the same prompt, and post an issue.
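A small helper along those lines, for measuring first-token latency and steady-state tokens/sec from any token stream (the function name and shape are my own sketch, not from the issue):

```python
import time

def profile_stream(stream):
    """Return (first_token_latency_s, tokens_per_sec_after_first_token)
    for any iterable that yields one item per generated token."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _ in stream:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if count <= 1 or total <= first_latency:
        return first_latency, 0.0
    return first_latency, (count - 1) / (total - first_latency)
```

Feeding it the same prompt through llama.cpp's `main` binary output and through llama-cpp-python's `stream=True` generator would give directly comparable numbers for both the first-token delay and the per-token rate reported above.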
@AlphaAtlas not sure if they’re related as this issue was from before the CUDA offloading merge.
Do you mind opening a new issue here? If there's a performance discrepancy and you don't mind giving me a hand getting to the bottom of it, I'm very interested in fixing it.