continue: JSONDecodeError

Describe the bug I’ve set up GGML in config.py like so:

        default=GGML(
            max_context_length=2048,
            server_url="http://192.168.1.19:8000"
        )
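
(For reference, the full config.py in this style would look roughly like the sketch below. The ContinueConfig/Models wrapper follows the maintainer’s example later in this thread, and the import paths are inferred from the module layout visible in the traceback, so treat them as assumptions.)

    # Sketch only: the ContinueConfig/Models wrapper is taken from the maintainer's
    # config later in the thread, and these import paths are guesses based on the
    # repo layout shown in the traceback below.
    from continuedev.src.continuedev.core.config import ContinueConfig
    from continuedev.src.continuedev.core.models import Models
    from continuedev.src.continuedev.libs.llm.ggml import GGML

    config = ContinueConfig(
        models=Models(
            default=GGML(
                max_context_length=2048,
                server_url="http://192.168.1.19:8000",
            )
        )
    )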

To Reproduce Steps to reproduce the behavior:

  1. Write a prompt and click on continue
  2. See error

Expected behavior Expected a response (API is confirmed to be working)

Environment

  • GGUF model file
  • Operating System: Linux and macOS
  • Python Version: recent
  • Continue Version: latest

Console logs

Traceback (most recent call last):

  File "continuedev/src/continuedev/core/autopilot.py", line 367, in _run_singular_step
    observation = await step(self.continue_sdk)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "continuedev/src/continuedev/core/main.py", line 359, in __call__
    return await self.run(sdk)
           ^^^^^^^^^^^^^^^^^^^

  File "continuedev/src/continuedev/plugins/steps/chat.py", line 85, in run
    async for chunk in generator:

  File "continuedev/src/continuedev/libs/llm/ggml.py", line 112, in stream_chat
    async for chunk in generator():

  File "continuedev/src/continuedev/libs/llm/ggml.py", line 103, in generator
    yield json.loads(chunk[6:])["choices"][0][
          ^^^^^^^^^^^^^^^^^^^^^

  File "json/__init__.py", line 346, in loads

  File "json/decoder.py", line 337, in decode

  File "json/decoder.py", line 355, in raw_decode

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
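
The failure is in the streaming parser: ggml.py slices six characters off each chunk (the "data: " prefix of a server-sent-events line) and feeds the rest straight to json.loads, so any blank keep-alive line, a "[DONE]" sentinel, or a non-JSON error body from the server produces exactly this exception. A more defensive version of that parsing step might look like the following sketch (the function name and exact handling are illustrative, not the project’s actual code):

    import json

    def parse_sse_chunk(chunk: str):
        # Defensively parse one server-sent-events line from an OpenAI-style
        # streaming endpoint. Returns the decoded JSON payload, or None for
        # keep-alive blanks, the "[DONE]" sentinel, or non-JSON error bodies.
        line = chunk.strip()
        if not line or not line.startswith("data:"):
            return None  # blank keep-alive line or unexpected framing
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return None  # end-of-stream sentinel used by OpenAI-style servers
        try:
            return json.loads(payload)
        except json.JSONDecodeError:
            return None  # e.g. a plain-text or HTML error page from the server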

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 31 (14 by maintainers)

Most upvoted comments

@corv89 yes, that was the intention. I’m working right now, though, on support for the direct llama.cpp server like the one you were running. This has several benefits, including tighter control of parameters like grammar.

@redtier100x I will share again here once I push the update tomorrow, assuming you’d still prefer to use llama.cpp directly

@corv89 Turns out I accidentally chose the wrong chat templating function as the default. You can actually change it like this:

from continuedev.src.continuedev.libs.llm.prompts.chat import llama2_template_messages

LlamaCpp(..., template_messages=llama2_template_messages)

And I’ll update the default in the next version I release today
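
For reference, a llama2-style template renders the chat messages into the [INST] … [/INST] format instead of a plain "User:/Assistant:" transcript. A rough sketch of the idea (the real llama2_template_messages in continuedev will differ in details such as system-prompt handling):

    def llama2_style_template(messages):
        # Sketch: wrap each user turn in [INST] ... [/INST]; assistant turns follow
        # as plain text. System-message handling and spacing differ in the real code.
        prompt = ""
        for msg in messages:
            if msg["role"] == "user":
                prompt += f"[INST] {msg['content']} [/INST]"
            elif msg["role"] == "assistant":
                prompt += f" {msg['content']} "
        return prompt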

lol tic-tac-toe is hilarious

@corv89 If you are using codellama-instruct, I would actually expect these nonsense responses, because it’s being fed the plain "User: ... / Assistant: ..." prompt template, which it was not trained on.

You are correct that this is a great choice of model, if not the best, but it’s a limitation of the llama.cpp OpenAI server, as in the code snippet link I sent above. Assuming you still want to use this model, I think the options would be:

  a) Manually alter this code to give the right prompt template format
  b) Use llama.cpp ./server directly with the LlamaCpp class, which would by default use the correct prompt template
  c) Use TextGenUI as another way to run the model locally. I’ve heard users having success with this, and it provides more flexibility. It has its own OpenAI-compatible server extension, so you could still use the GGML class.
  d) If you are on Mac, use Ollama, which is quite straightforward and reliable

Never mind the above message; I just recalled you are using phind-codellama. I have a couple of hypotheses.

a) Phind is trained on a different prompting pattern than llama2/codellama, the same as for alpaca/vicuna. I wouldn’t expect the OpenAI server on top of llama.cpp to use this pattern… except that, looking into it now, it appears that it does: https://github.com/ggerganov/llama.cpp/blob/92d0b751a77a089e650983e9f1564ef4d31b32b9/examples/server/api_like_OAI.py#L34C1-L53C18

b) There are two user inputs in a row because of the way we include context in the prompt. It might be that this confuses the model. Do you ever get nonsense responses when no code is highlighted and you follow a strict “user, assistant, user, assistant, etc…” conversation pattern? I want to be careful about changing such a fundamental part of how we prompt, but if this solves your problem I’ll make the change.
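
To make hypothesis (b) concrete, the messages array in that situation would contain two user turns back to back, roughly like this (the contents are made up purely for illustration):

    # Hypothetical illustration of the pattern described above: the highlighted code
    # is sent as its own user message, so the model sees two user turns in a row.
    messages = [
        {"role": "user", "content": "```python\ndef add(a, b):\n    return a + b\n```"},  # highlighted code as context
        {"role": "user", "content": "Explain what this function does."},  # the actual question
    ]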

@corv89 You can see the prompt being sent if you hover the response message and click on the magnifying glass icon.

I would actually stick with the GGML class for this, I think, because you already have a server using the OpenAI format. The LlamaCpp class will have to deal with raw prompts: fine if that’s what you want, but I also think there might be something wrong in llama.cpp’s ./server, because I tried it for a while yesterday and didn’t have any luck: lots of hallucinations, like you said.

If you’re using the OpenAI-compatible server for llama.cpp, there isn’t a lot of room for Continue to mess up the prompts, unless the highlighted code just isn’t being included, because we just pass the array of JSON messages: [{"role": "user", "content": "..."}, ...].
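
Concretely, the request to an OpenAI-compatible chat endpoint is just that JSON body; a minimal sketch of what such a request looks like (field values are illustrative, and the optional stop word mirrors the [INST] note below):

    import json

    # Illustrative chat-completions request body for an OpenAI-compatible server;
    # the messages array is forwarded essentially as-is, so there is little room
    # for the client to mangle the prompt.
    body = {
        "model": "phind-codellama",  # local servers often ignore or loosely use this
        "messages": [
            {"role": "user", "content": "```js\nconsole.log('hi')\n```"},  # highlighted code
            {"role": "user", "content": "What does this do?"},
        ],
        "stream": True,
        "stop": ["[INST]"],  # optional; see the note about [INST] below
    }
    print(json.dumps(body, indent=2))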

I used [INST] without a slash because this indicates the start of a user block, which should mean that the assistant is done generating.

@corv89 What was the import before you rewrote it?

One option is to set the context length to the max integer if you really want that, but it’s useful to set a max_context_length because Continue will use it to determine how much context to keep and what to throw out. Even if Code Llama can handle a whole 50-message conversation, the previous messages might not be useful and will just confuse it / increase time to first token. So I would recommend just using 100k as the max_context_length, or lowering it if you notice really big issues.
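
The trimming described here can be pictured as dropping the oldest turns until the conversation fits the configured limit; a minimal sketch (the word-count tokenizer is only a stand-in, not what Continue actually uses):

    def trim_to_context(messages, max_context_length,
                        count_tokens=lambda m: len(m["content"].split())):
        # Sketch of context trimming: drop the oldest messages until the remaining
        # conversation fits within max_context_length "tokens". The word-count
        # tokenizer above is a placeholder for a real tokenizer.
        kept = list(messages)
        while kept and sum(count_tokens(m) for m in kept) > max_context_length:
            kept.pop(0)  # discard the oldest turn first
        return kept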

Definitely—I think the naming of the classes is a bit problematic in general right now. The GGML class probably actually doesn’t need to exist since it’s just connecting to any OpenAI-shaped API endpoint. Will probably remove entirely at some point

It just updated! One thought… Given that llama.cpp has now moved to GGUF and, I believe, has dropped GGML, you might consider having that as a class going forward.

It’s also notable that the llama.cpp server seems to give lower quality outputs than just running ./main in llama.cpp. I haven’t yet pinned down the difference, but there are tons of alternatives for getting a server set up: https://github.com/oobabooga/text-generation-webui is one option, llama-cpp-python is another, Ollama if you’re on Mac, LocalAI, which I’ve recently heard of, and maybe Kobold. There’s lots of discussion going on in the Discord about this if you’re interested.

@changeling Version 0.0.344 is the latest version if that’s helpful

I’m setting it up by running ./server -m models/codellama-7b.Q4_K_M.gguf -c 2048 from the root of the llama.cpp repo, and then my config file looks like this:

config=ContinueConfig(
    ...
    models=Models(
        default=LlamaCpp(
            prompt_templates={},
            max_context_length=2048,  # matches the -c 2048 passed to ./server
            server_url="http://localhost:8080",  # llama.cpp ./server's default port
            llama_cpp_args={"stop": ["[INST]"]},  # stop generating when a new user block begins
        )
    )
)

I also have this problem when I run Llama on Together AI (screenshot attached).