LLamaSharp: Kernel Memory is broken with the latest NuGet packages

Using the 0.8 release of LlamaSharp and Kernel Memory with the samples, there is an error because LlamaSharpTextEmbeddingGeneration doesn’t implement the Attributes property.

I took the source, created my own copy, and added this:

public IReadOnlyDictionary<string, string> Attributes => new Dictionary<string, string>();

So it wouldn’t error.

But no matter what model I use, I get "INFO NOT FOUND." (I’ve tried kai-7b-instruct.Q5_K_M.gguf, llama-2-7b-32k-instruct.Q6_K.gguf, llama-2-7b-chat.Q6_K.gguf, and a few others.)

I’ve tried loading just text, an html file, and a web page to no avail.

About this issue

  • State: open
  • Created 7 months ago
  • Comments: 34

Most upvoted comments

Update: LLamaSharp 0.8.1 is now integrated into KernelMemory; here’s an example: https://github.com/microsoft/kernel-memory/blob/main/examples/105-dotnet-serverless-llamasharp/Program.cs

There’s probably some work to do for users, e.g. customizing prompts for LLama and identifying which model works best. KM should be sufficiently configurable to allow that.
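For reference, here’s a condensed sketch of what that wiring can look like in serverless mode. It is based on the LLamaSharp.KernelMemory and Microsoft.KernelMemory packages from around the 0.8.1 timeframe, so treat the builder and method names as assumptions and check the linked Program.cs for the authoritative version.

// Minimal sketch: KernelMemory (serverless) backed by LLamaSharp.
// Names follow LLamaSharp.KernelMemory ~0.8.1; see the linked example for the real thing.
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

var modelPath = "path/to/llama-2-7b-chat.Q6_K.gguf";  // any local GGUF model

var memory = new KernelMemoryBuilder()
    .WithLLamaSharpDefaults(new LLamaSharpConfig(modelPath)) // registers LLama text generation + embeddings
    .Build<MemoryServerless>();

await memory.ImportTextAsync("The first moon landing happened in 1969.");
var answer = await memory.AskAsync("When did the first moon landing happen?");
Console.WriteLine(answer.Result);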

KernelMemory author here, let me know if there’s something I can do to make the integration better, more powerful, easier, etc 😃

Thanks for the feedback. We merged a PR today that allows configuring and/or replacing the search logic, e.g. defining token limits.

And this PR https://github.com/microsoft/kernel-memory/pull/189 allows customizing token settings and tokenization logic. I’d appreciate it if someone could take a look and let us know if it helps.

This snippet shows how we could add LLama to KernelMemory:

public class LLamaConfig
{
    public string ModelPath { get; set; } = "";

    public int MaxTokenTotal { get; set; } = 4096;
}

public class LLamaTextGenerator : ITextGenerator, IDisposable
{
    private readonly string _modelPath;

    public LLamaTextGenerator(LLamaConfig config)
    {
        this._modelPath = config.ModelPath;
        this.MaxTokenTotal = config.MaxTokenTotal;
    }

    /// <inheritdoc/>
    public int MaxTokenTotal { get; }

    /// <inheritdoc/>
    public int CountTokens(string text)
    {
        // ... count tokens using LLama tokenizer ...
        // ... which can be injected via ctor as usual ...
    }

    /// <inheritdoc/>
    public IAsyncEnumerable<string> GenerateTextAsync(
        string prompt,
        TextGenerationOptions options,
        CancellationToken cancellationToken = default)
    {
        // ... use LLama backend to generate text ...
    }
}
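To make the sketch above more concrete, here is one way the two method bodies could be filled in with LLamaSharp’s stateless executor. This is a minimal illustration, not the actual integration: it skips the KernelMemory ITextGenerator/TextGenerationOptions plumbing and uses the ~0.8-era LLamaSharp API (LLamaWeights.LoadFromFile, LLamaContext.Tokenize, StatelessExecutor.InferAsync), which may differ in other versions.

using System;
using System.Collections.Generic;
using System.Threading;
using LLama;
using LLama.Common;

// Illustrative stand-in for the LLamaTextGenerator above; the mapping from
// TextGenerationOptions to InferenceParams is intentionally left out.
public sealed class LLamaTextGeneratorSketch : IDisposable
{
    private readonly LLamaWeights _weights;
    private readonly LLamaContext _context;   // used only for token counting
    private readonly StatelessExecutor _executor;

    public LLamaTextGeneratorSketch(LLamaConfig config)
    {
        var modelParams = new ModelParams(config.ModelPath)
        {
            ContextSize = (uint)config.MaxTokenTotal
        };
        _weights = LLamaWeights.LoadFromFile(modelParams);
        _context = _weights.CreateContext(modelParams);
        _executor = new StatelessExecutor(_weights, modelParams);
        MaxTokenTotal = config.MaxTokenTotal;
    }

    public int MaxTokenTotal { get; }

    // Count tokens with the model's own tokenizer so budget math matches the backend.
    public int CountTokens(string text) => _context.Tokenize(text).Length;

    // Stream generated text; maxTokens caps the response length.
    public IAsyncEnumerable<string> GenerateTextAsync(
        string prompt, int maxTokens, CancellationToken cancellationToken = default)
    {
        var inferenceParams = new InferenceParams { MaxTokens = maxTokens };
        return _executor.InferAsync(prompt, inferenceParams, cancellationToken);
    }

    public void Dispose()
    {
        _context.Dispose();
        _weights.Dispose();
    }
}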

dluc: Generally I see this when:

  1. The tokens-to-keep setting doesn’t include the original prompt and the question.
  2. The context_length is too short for the total of everything.
  3. The max_tokens (i.e. the maximum response length) is too short.

Generally you want context_length to match the model’s context length. And you want max_tokens either to be short, but long enough for the answer you’re expecting (because LLAMA has a bad habit of repeating itself), or to satisfy: system message + all user messages + assistant responses (including max_tokens) <= context_length.

I use TikSharp to calculate the number of tokens for all prompts, add 10 or so just to be safe, subtract that from context_length, and use the result as the max token length. Then I set the antiprompts to AntiPrompts = ["\n\n\n\n", "\t\t\t\t"], which eliminates two of the cases where the model keeps repeating itself instead of ending (especially when generating JSON with a grammar file).
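Here is a small sketch of that budget math, using LLamaSharp’s own tokenizer for the count (TikSharp or any other tokenizer works the same way). The 10-token safety margin mirrors the suggestion above; the helper name and everything else is an assumption for illustration.

// Requires: using System; using System.Collections.Generic; using LLama; using LLama.Common;
// `context` is an existing LLamaContext; `fullPrompt` is the system message plus all user messages.
static InferenceParams BuildBudgetedParams(LLamaContext context, string fullPrompt, int contextLength)
{
    int promptTokens = context.Tokenize(fullPrompt).Length;  // count with the model's tokenizer
    int safetyMargin = 10;                                   // small buffer, as suggested above
    int maxTokens = Math.Max(1, contextLength - (promptTokens + safetyMargin));

    return new InferenceParams
    {
        MaxTokens = maxTokens,
        // Stop on runs of blank lines/tabs so generation ends instead of repeating.
        AntiPrompts = new List<string> { "\n\n\n\n", "\t\t\t\t" }
    };
}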

This technique also works when using ChatGPT 3.5+, so you don’t get errors: it hard-refuses requests that would produce more than the context_length (and still costs you money), so you have to do this math or risk the call blowing up and running up your bill.

@xbotter I think we should inspect the prompt that is fed to our model in the second run, with the context size reduced to 4096. Could you please take a look? I’m on duty this weekend.

This can only be resolved on the Kernel Memory side. I have already submitted an issue, https://github.com/microsoft/kernel-memory/issues/164, and am waiting for further updates.

For the first issue, I was able to take your code and manually add a grammar to it. Is there a way we could just expose Grammar in LLamaSharpConfig for now, so it gets passed through? Same with MainGPU and TOP_K?

👍 Good idea; I seem to have a solution to issue #289. 😃 Thank you for your suggestion.
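For anyone who wants to try the manual grammar approach mentioned above before it is exposed through LLamaSharpConfig, here is a rough sketch against the ~0.8 LLamaSharp API. The type and property names (Grammar.Parse, CreateInstance, InferenceParams.Grammar, TopK) are my best reading of that version and should be double-checked against the release you use.

// Requires: using System.IO; using LLama.Common; using LLama.Grammars;
// Constrain generation with a GBNF grammar (e.g. the json.gbnf file shipped with llama.cpp).
string gbnf = File.ReadAllText("json.gbnf");
var grammar = Grammar.Parse(gbnf, "root");   // "root" is the grammar's start rule

var inferenceParams = new InferenceParams
{
    Grammar = grammar.CreateInstance(),      // restricts sampling to the grammar
    TopK = 40                                // another setting one might want exposed in the config
};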