llm: MPT-7B error during inference: Assertion `ggml_nelements(src1) == 3' failed
Getting this error when trying to run inference on any of these models: https://huggingface.co/rustformers/mpt-7b-ggml
Windows:
Loaded hyperparameters
ggml ctx size = 3567.88 MB
Loaded tensor 8/194
Loaded tensor 16/194
Loaded tensor 24/194
Loaded tensor 32/194
Loaded tensor 40/194
Loaded tensor 48/194
Loaded tensor 56/194
Loaded tensor 64/194
Loaded tensor 72/194
Loaded tensor 80/194
Loaded tensor 88/194
Loaded tensor 96/194
Loaded tensor 104/194
Loaded tensor 112/194
Loaded tensor 120/194
Loaded tensor 128/194
Loaded tensor 136/194
Loaded tensor 144/194
Loaded tensor 152/194
Loaded tensor 160/194
Loaded tensor 168/194
Loaded tensor 176/194
Loaded tensor 184/194
Loaded tensor 192/194
Loading of model complete
Model size = 3568.33 MB / num tensors = 194
Assertion failed: ggml_nelements(src1) == 3, file ggml/src/ggml.c, line 10838
error: process didn't exit successfully (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)
WSL (Ubuntu VM):
Loaded hyperparameters
ggml ctx size = 3567.88 MB
Loaded tensor 8/194
Loaded tensor 16/194
Loaded tensor 24/194
Loaded tensor 32/194
Loaded tensor 40/194
Loaded tensor 48/194
Loaded tensor 56/194
Loaded tensor 64/194
Loaded tensor 72/194
Loaded tensor 80/194
Loaded tensor 88/194
Loaded tensor 96/194
Loaded tensor 104/194
Loaded tensor 112/194
Loaded tensor 120/194
Loaded tensor 128/194
Loaded tensor 136/194
Loaded tensor 144/194
Loaded tensor 152/194
Loaded tensor 160/194
Loaded tensor 168/194
Loaded tensor 176/194
Loaded tensor 184/194
Loaded tensor 192/194
Loading of model complete
Model size = 3568.33 MB / num tensors = 194
ggml/src/ggml.c:10838: ggml_compute_forward_alibi_f32: Assertion `ggml_nelements(src1) == 3' failed.
Aborted
Code to reproduce:
use llm::{Model, InferenceParameters};
use std::io::Write;

fn main() {
    // Load the MPT model from disk, printing load progress to stdout.
    let model = llm::load::<llm::models::Mpt>(
        std::path::Path::new("./models/mpt-7b-instruct-q4_0.bin"),
        Default::default(),
        None,
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));

    // Start a session and run inference, streaming tokens to stdout as they arrive.
    let mut session = model.start_session(Default::default());
    let res = session.infer::<std::convert::Infallible>(
        &model,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: "### Human: Rust is a cool programming language because
### Assistant: ".into(),
            parameters: &llm::InferenceParameters::default(),
            play_back_previous_tokens: false,
            maximum_token_count: None,
        },
        // OutputRequest
        &mut Default::default(),
        |r| match r {
            llm::InferenceResponse::PromptToken(t) | llm::InferenceResponse::InferredToken(t) => {
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(llm::InferenceFeedback::Continue)
            }
            _ => Ok(llm::InferenceFeedback::Continue),
        },
    );

    match res {
        Ok(result) => println!("\n\nInference stats:\n{result}"),
        Err(err) => println!("\n{err}"),
    }
}
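Side note for anyone adapting the reproduction: the prompt string is just mpt-7b-instruct's "### Human: … ### Assistant:" chat format. If you want to vary the input, a small helper along these lines works; it's plain standard library, not part of the llm crate, and the name build_prompt is made up here:

// Hypothetical helper, not an llm crate API: wrap arbitrary user input in the
// instruct prompt format used in the reproduction above.
fn build_prompt(user_input: &str) -> String {
    format!("### Human: {user_input}\n### Assistant: ")
}

Bind the returned String to a local and borrow it for the request's prompt field so the temporary outlives the InferenceRequest.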
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (3 by maintainers)
Maybe it’s just my VS Code terminal because of the ANSI escape sequences? I just tried in Windows Terminal / PowerShell with mpt-7b-q4_0.bin and the output seems to make more sense. The output is weirdly relevant to me even though I haven't given the LLM any context so far, since I helped with Flecs (which is written in C?); very odd/coincidental output, haha.
@vilhelmbergsoe No, I haven’t played around with GPTQ, as I'm mostly focused on CPU inference. But AutoGPTQ looks like a good starting point if you want to look into it. I would expect similar performance to GGML's quantization techniques.
@troyedwardsjr OK, I'm uploading it; it should be done in about 30 minutes, but I won't update the README today. You can find the model in the Files section.