llama-cpp-python: Subsequent prompts are around 10x–12x slower than llama.cpp's "main" example.

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am creating a simple clone of the "main" example from the llama.cpp repo, which runs in interactive mode with fast inference of around 36 ms per token.
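
For reference, a minimal sketch of the interactive loop I have in mind, using llama-cpp-python's high-level API (the model path and sampling settings here are illustrative placeholders, not my exact setup):

```python
from llama_cpp import Llama

# Placeholder model path; substitute whatever ggml model you are testing with.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512)

while True:
    user_input = input("> ")
    # Stream the completion token by token, like llama.cpp's "main" does.
    for chunk in llm(user_input, max_tokens=256, stream=True):
        print(chunk["choices"][0]["text"], end="", flush=True)
    print()
```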

Current Behavior

Generating the first token takes around 10–12 seconds, and subsequent tokens take around 200–300 ms each. It should match the speed of the "main" example from the llama.cpp repo.
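
To measure this, something like the following times the first streamed token (which includes prompt evaluation) separately from the later ones; the model path is again a placeholder:

```python
import time

from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512)

prev = time.perf_counter()
for i, chunk in enumerate(llm("The quick brown fox", max_tokens=32, stream=True)):
    now = time.perf_counter()
    # Iteration 0 includes prompt evaluation, so it reflects first-token
    # latency; later iterations are per-token generation latency.
    print(f"token {i}: {(now - prev) * 1000:.1f} ms")
    prev = now
```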

Environment and Context

I am using a context size of 512, predicting 256 tokens, and a batch size of 1024; the rest of the settings are default. I am also using CLBlast, which gives me a 2.5x performance boost on llama.cpp.
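
For clarity, these settings map onto llama-cpp-python roughly as follows (a sketch; the model path is a placeholder, and CLBlast itself is a compile-time option of the underlying library rather than a constructor argument):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=512,     # context size (llama.cpp's -c 512)
    n_batch=1024,  # prompt-processing batch size (llama.cpp's -b 1024)
)
output = llm("Hello", max_tokens=256)  # tokens to predict (llama.cpp's -n 256)
```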

  • AMD Ryzen 5 3600 6-Core Processor + RX 580 4 GB
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 3600 6-Core Processor
    CPU family:          23
    Model:               113
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  94%
    CPU max MHz:         4208,2031
    CPU min MHz:         2200,0000
    BogoMIPS:            7186,94
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                         cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx
                         mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good
                         nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni
                         pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe
                         popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm
                         extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
                         osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb
                         bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd
                         mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm
                         rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt
                         xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total
                         cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat
                         npt lbrv svm_lock nrip_save tsc_scale vmcb_clean
                         flushbyasid decodeassists pausefilter pfthreshold avic
                         v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov
                         succor smca sev sev_es

Number of devices                                 1
  Device Name                                     gfx803
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 
  Driver Version                                  3513.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Radeon RX 580 Series
  • Linux:

Linux bober-desktop 6.3.1-x64v1-xanmod1-2 #1 SMP PREEMPT_DYNAMIC Sun, 07 May 2023 10:32:57 +0000 x86_64 GNU/Linux

  • Versions:
Python 3.11.3

GNU Make 4.4.1
Built for x86_64-pc-linux-gnu

g++ (GCC) 13.1.1 20230429

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

@abetlen Yeah. I need to do some stuff, but then I will more formally test llama.cpp vs llama-cpp-python with the profiles like that ^, including tokens/sec for the same prompt, and post an issue.

@AlphaAtlas not sure if they're related, as this issue was from before the CUDA offloading merge.

Do you mind opening a new issue here? If there's a performance discrepancy and you don't mind giving me a hand getting to the bottom of it, I'm very interested in fixing it.