llm: MPT-7B error during inference: Assertion `ggml_nelements(src1) == 3' failed
Getting this error when trying to run inference on any of these models: https://huggingface.co/rustformers/mpt-7b-ggml
Windows:
Loaded hyperparameters
ggml ctx size = 3567.88 MB
Loaded tensor 8/194
Loaded tensor 16/194
Loaded tensor 24/194
Loaded tensor 32/194
Loaded tensor 40/194
Loaded tensor 48/194
Loaded tensor 56/194
Loaded tensor 64/194
Loaded tensor 72/194
Loaded tensor 80/194
Loaded tensor 88/194
Loaded tensor 96/194
Loaded tensor 104/194
Loaded tensor 112/194
Loaded tensor 120/194
Loaded tensor 128/194
Loaded tensor 136/194
Loaded tensor 144/194
Loaded tensor 152/194
Loaded tensor 160/194
Loaded tensor 168/194
Loaded tensor 176/194
Loaded tensor 184/194
Loaded tensor 192/194
Loading of model complete
Model size = 3568.33 MB / num tensors = 194
Assertion failed: ggml_nelements(src1) == 3, file ggml/src/ggml.c, line 10838
error: process didn't exit successfully (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)
WSL (Ubuntu VM):
Loaded hyperparameters
ggml ctx size = 3567.88 MB
Loaded tensor 8/194
Loaded tensor 16/194
Loaded tensor 24/194
Loaded tensor 32/194
Loaded tensor 40/194
Loaded tensor 48/194
Loaded tensor 56/194
Loaded tensor 64/194
Loaded tensor 72/194
Loaded tensor 80/194
Loaded tensor 88/194
Loaded tensor 96/194
Loaded tensor 104/194
Loaded tensor 112/194
Loaded tensor 120/194
Loaded tensor 128/194
Loaded tensor 136/194
Loaded tensor 144/194
Loaded tensor 152/194
Loaded tensor 160/194
Loaded tensor 168/194
Loaded tensor 176/194
Loaded tensor 184/194
Loaded tensor 192/194
Loading of model complete
Model size = 3568.33 MB / num tensors = 194
ggml/src/ggml.c:10838: ggml_compute_forward_alibi_f32: Assertion `ggml_nelements(src1) == 3' failed.
Aborted
Code to reproduce:
use llm::{Model, InferenceParameters};
use std::io::Write;

fn main() {
    // Load the MPT model from disk, printing load progress to stdout.
    let model = llm::load::<llm::models::Mpt>(
        std::path::Path::new("./models/mpt-7b-instruct-q4_0.bin"),
        Default::default(),
        None,
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));

    // Start a session and run inference, streaming tokens to stdout as they arrive.
    let mut session = model.start_session(Default::default());
    let res = session.infer::<std::convert::Infallible>(
        &model,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: "### Human: Rust is a cool programming language because
### Assistant: ".into(),
            parameters: &llm::InferenceParameters::default(),
            play_back_previous_tokens: false,
            maximum_token_count: None,
        },
        // OutputRequest
        &mut Default::default(),
        |r| match r {
            llm::InferenceResponse::PromptToken(t) | llm::InferenceResponse::InferredToken(t) => {
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(llm::InferenceFeedback::Continue)
            }
            _ => Ok(llm::InferenceFeedback::Continue),
        },
    );

    match res {
        Ok(result) => println!("\n\nInference stats:\n{result}"),
        Err(err) => println!("\n{err}"),
    }
}
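Side note for anyone adapting the reproduction: the prompt string is just mpt-7b-instruct's "### Human: … ### Assistant:" chat format. If you want to vary the input, a small helper along these lines works; it's plain standard library, not part of the llm crate, and the name build_prompt is made up here:

// Hypothetical helper, not an llm crate API: wrap arbitrary user input in the
// instruct prompt format used in the reproduction above.
fn build_prompt(user_input: &str) -> String {
    format!("### Human: {user_input}\n### Assistant: ")
}

Bind the returned String to a local and borrow it for the request's prompt field so the temporary outlives the InferenceRequest.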
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (3 by maintainers)
Maybe it’s just my VS Code terminal because of the ANSI escape sequences? I just tried in Windows Terminal / PowerShell with mpt-7b-q4_0.bin and the output seems to make more sense. The output is weirdly relevant to me even though I haven't given the LLM any context so far, since I helped with Flecs (which is written in C?); very odd/coincidental output, haha.
@vilhelmbergsoe No, I haven’t played around with GPTQ, as I'm mostly focused on CPU inference. But AutoGPTQ looks like a good starting point if you want to look into it. I would expect similar performance to GGML's quantization techniques.
@troyedwardsjr OK, I'm uploading it; it should be done in about 30 minutes, but I won't update the README today. You can find the model in the Files section.