langroid: prompt doesn't seem to make it to the llm via an agent or task with Mixtral (MoE) local model

Hey there!

So I’m playing with the example scripts from the docs, specifically the two-agent collaboration example, and have run into a problem with Mixtral-instruct-v1-based models using the Oobabooga text-generation-webui server.

The problem is when the agents are set up, for whatever reason, the prompt doesn’t seem to make it to the LLM.

Here’s how I set up the LLM. llm_url is set to an HTTP link to the /v1 endpoint at port 5000, and api_key is set to “sk-111111111111111111111111111111111111111111111111”, which is how Ooba likes it. Then, per the example, I did this:

import random

from langroid.cachedb.redis_cachedb import RedisCacheConfig
from langroid.language_models.openai_gpt import (
    OpenAICompletionModel,
    OpenAIGPT,
    OpenAIGPTConfig,
)

# create the (Pydantic-derived) config class: allows setting params via MYLLM_XXX env vars
MyLLMConfig = OpenAIGPTConfig.create(prefix="myllm")

# instantiate the class, with the model name and context length
my_llm_config = MyLLMConfig(
    api_base=llm_url,
    chat_context_length=2048,
    api_key=api_key,
    litellm=False,
    max_output_tokens=2048,
    min_output_tokens=64,
    chat_model="local_mixtral",
    completion_model=OpenAICompletionModel.TEXT_DA_VINCI_003,  # tried lots of settings here, including GPT3, GPT4 Turbo, etc.
    timeout=60,  # increased this as I was experiencing timeouts; dunno why, but this fixed it
    seed=random.randint(0, 9999999),  # tried 42 but wanted to see if this would change anything
    cache_config=RedisCacheConfig(fake=True),  # get rid of annoying warning
)
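
(Aside: if I understand the prefix="myllm" mechanism correctly, the same fields can apparently also be supplied via MYLLM_-prefixed environment variables instead of constructor arguments. A rough sketch of what I mean, though I haven’t verified the exact variable names:)

import os

# hypothetical variable names: field name upper-cased and prefixed with MYLLM_
os.environ["MYLLM_API_BASE"] = llm_url
os.environ["MYLLM_CHAT_CONTEXT_LENGTH"] = "2048"

my_llm_config_from_env = MyLLMConfig()  # fields presumably picked up from the env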

At this point, the following works fine:

mdl = OpenAIGPT(my_llm_config)
response = mdl.chat("Is New York in America?", max_tokens=30)

RESPONSE: Yes, New York is a state in the United States of America.

Great. So let’s try it with multiple messages:

from langroid.language_models.base import LLMMessage, Role

messages = [
    LLMMessage(content="You are a helpful assistant", role=Role.SYSTEM),
    LLMMessage(content="Is New York in America?", role=Role.USER),
]
response = mdl.chat(messages, max_tokens=50)

RESPONSE: Yes, New York is a state in the United States of America.

However, setting it up with an agent, like this:

from langroid.agent.chat_agent import ChatAgent, ChatAgentConfig

agent_config = ChatAgentConfig(llm=my_llm_config, name="my-llm-agent")
agent = ChatAgent(agent_config)
response = agent.llm_response("Is New York in America?")

Results in a very long ramble on random topics (how to use Python, a long paragraph in French, etc.) that is completely unrelated to the prompt, and looks like what happens when no prompt reaches the LLM at all. I suspect it’s processing a blank prompt and just spewing randomness.
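
(For what it’s worth, to see what the agent is actually accumulating, I’ve been dumping its conversation after the call. I’m assuming here that ChatAgent keeps the running conversation in a message_history attribute of LLMMessage objects, which is what it looks like from a quick skim of the code:)

agent = ChatAgent(agent_config)
response = agent.llm_response("Is New York in America?")

# assumption: ChatAgent stores the conversation so far as a list of LLMMessage
for msg in agent.message_history:
    print(msg.role, repr(msg.content)[:200])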

Similarly, trying it with a Task:

from langroid.agent.task import Task

agent = ChatAgent(agent_config)
task = Task(
    agent,
    system_message="You are a helpful assistant",
    single_round=True,
)
task.run("Is New York in America?")

This also results in total garbage out.

Again, a non-MoE Mistral appears to work (although it didn’t quite follow the prompts very well, which is why I was hoping Mixtral would work better), but Mixtral doesn’t seem to receive the prompt through an agent: it’s prompt in, garbage out.

Anyone else experiencing this?

Without examining the code in too much detail, I wonder why the prompt would make it to the LLM directly but not via an agent. Does this maybe have something to do with the instruction-template setting or something?

I tried playing with various settings in MyLLMConfig, some of which you can see above, but nothing seemed to work. I also tried changing instruction templates in Oobabooga itself, but no dice. And I tried moving the prompts from system_message to user_message, and from the task to the agent… but it wouldn’t “take”.

Any thoughts? Why would using an agent “block” the prompt? 🤔

Using langroid v0.1.157 w/litellm FWIW.

Thanks - this looks like a fun and interesting project!


Most upvoted comments

@fat-tire @tozimaru Thank you for all the feedback. I will take all of this into account and rationalize some of the local-model setups and write an updated doc page on that. And for task-orchestration, I am planning to write up a detailed doc that presents a mental model that people should have when designing multi-agent workflows, and what the various task settings mean – even I run into issues setting these up!

Meanwhile I will point to a couple places that may be helpful, specifically for multi-agent task workflow design:

  • there are several workflows illustrated in test_task.py. Note that this and any other test can be run using pytest with an optional --m arg for the local (or hosted) model, e.g.
pytest -s -x tests/main/test_task.py --m local/x.y.z:5000

This arg globally overrides the chat_model setting, so you can easily run any test against models other than the default GPT4. (The -s shows output, and -x quits on the first test failure).

  • an example of a realistic 3-agent workflow for filtered RAG using LanceDB, consisting of agents for query planning, query feedback and RAG. This setup illustrates a few different ways to control task loops, including overriding ChatAgent methods. [UPDATE:] To see this in action, you can run the tests in test_lance_doc_chat_agent.py
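
To give a flavor of the kind of wiring those tests illustrate, here is a rough sketch of a two-agent setup, from memory, reusing the config and the Task parameters already shown earlier in this thread (the exact kwargs used in test_task.py may differ):

teacher_agent = ChatAgent(ChatAgentConfig(llm=my_llm_config, name="Teacher"))
student_agent = ChatAgent(ChatAgentConfig(llm=my_llm_config, name="Student"))

teacher_task = Task(
    teacher_agent,
    system_message="Ask a series of simple arithmetic questions, one at a time.",
)
student_task = Task(
    student_agent,
    system_message="Concisely answer the arithmetic question you receive.",
    single_round=True,  # sub-task hands control back after one exchange
)
teacher_task.add_sub_task(student_task)  # Student responds to Teacher's questions
teacher_task.run()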

Just to chime in, I too want to say how much fun it is using this project to tinker with agents. I’d also appreciate a better explanation of done_if_response and done_if_no_response. Currently my biggest challenge with the repo is that my agents don’t know when a certain task is DONE when using RecipientTool in combination with several other tools.

Does a Task always run synchronously?

Ah yes, @nilspalumbo is working on async task spawning; glad to see interest in that.

…will one day be obvious, clear, standardized, and easily accessible to everyone, so it’s really fun to see it develop. Terrific stuff.

Thank you for the interest! I’m thinking of putting down a definitive “Laws of Langroid” doc, stay tuned – it will address what is a step, what is a valid response, when is a task done, what is the result of a task, when is a responder eligible to respond, etc. All of these are in the code but there’s a real need to bring it out conceptually, and also show diagrammatically how each step evolves.

I’m not sure if langroid would need to somehow set that to make sure it’s in the right mode, but I haven’t looked too carefully at the API.

Langroid doesn’t have these; it simply assumes the endpoint is OpenAI-compatible and that the chat-formatting is handled by the endpoint.
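
One way to sanity-check the endpoint independently of Langroid is to hit it directly with a plain OpenAI-style chat request (adjust host, port, and path to your setup; the api_key here is the one from earlier in this thread):

import requests

# plain OpenAI-style chat request; the endpoint is expected to apply
# the model's chat template itself
resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Is New York in America?"},
        ],
        "max_tokens": 50,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])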

Okay, update!

As a test, I’m using this model: https://huggingface.co/TheBloke/Starling-LM-alpha-8x7B-MoE-GGUF – it’s based on Mistral’s MoE model.

Here’s my simplified LLM config, which uses the “local/#.#.#.#:5000/v2” formulation as you recommended, assigned this time to chat_model rather than api_base. (I previously used api_base because it successfully connected locally and didn’t try OpenAI’s servers. I didn’t realize that I could do local/#.#.#.#…, but it does seem that the “local/” prefix is pretty important.)

my_llm_config = MyLLMConfig(
    chat_context_length=2048,  # adjust based on model
    api_key=api_key,
    litellm=False,  # use litellm api?
    max_output_tokens=2048,
    min_output_tokens=64,
    chat_model=llm_url,
    timeout=60,
    seed=random.randint(0, 9999999),
    cache_config=RedisCacheConfig(fake=True),  # get rid of annoying warning
)

So now the agent responds correctly with:

agent = ChatAgent(agent_config)
response = agent.llm_response("Is New York in America?")

It responds correctly that yes, New York is a state.

Unfortunately, when I try the two-agent chat (adding the numbers together), I’m getting some weird timeout issues, but it does appear to work eventually, and I see some communication between agents now. It still isn’t following the prompts perfectly, but at least it sees them! 😄

Thanks for the help!

A couple quick thoughts/suggestions:

  • Maybe offer an example in the docs (and in the example code) where a non-OpenAI server is accessed by IP address or domain name instead of only “localhost”, just to demonstrate the difference between “local/” and “localhost” and to show that non-OpenAI addresses are also prepended with “local/”. That is, I think I took “local/” as synonymous with “localhost”, meaning I thought “local” told langroid the model was on the same machine, when really it signifies “not OpenAI’s official server”. E.g., I presume you could do “local/ServerSomewhereFarAway.com:5000/v1” so long as it’s compatible with OpenAI’s API, right? (BTW, would this handle SSL okay?)

  • To that end, maybe “local” isn’t the right word, since in the future people may connect to anywhere for LLM services: it could be a local open-source model, sure, but it could be somewhere else, right? Perhaps something like “private/” or “hosted/” or “external/”, etc.? Specifically referencing litellm might be too narrow, since it should be compatible with lots of services, right? (Or maybe you treat OpenAI as a special case where tools/plugins or whatever it’s called become available automatically if it recognizes an OpenAI model, in which case do you need the “local/” at all? Ideally it’s totally agnostic to the LLM endpoint and works the same with providers, be they Mistral, Gemini, Cohere, etc., although OpenAI’s API is the primary citizen here and so far everyone needs to conform to it.)

  • It was not immediately clear in the docs that “DO-NOT-KNOW” has a special NO_ANSWER meaning/function and is treated differently than other responses; I only realized this when I got to the three-agent example, which comes well after the delegate concept is introduced. I think it would be clearer to explain how a delegate agent decides when it’s done and when to move on, what a pass is, etc.

  • Related: the single_round and llm_delegate configs seem to have been deprecated and replaced with done_if_response and done_if_no_response. I didn’t see docs on those options, and I’m not quite clear conceptually on how they are drop-in replacements, especially as they’ve gone from a bool to something of the form done_if_response=[Entity.LLM]. What’s Entity.LLM and what does this do? And what are the various options for the config? I’m sure it was deprecated for a great reason, but could the documentation be updated to help frame the new way of thinking about agent behavior? Does DO-NOT-KNOW constitute a response?

  • I had to add use_functions_api=False to the ChatAgent’s config to avoid the warning: “You have enabled use_functions_api but the LLM does not support it. So we will enable use_tools instead, so we can use Langroid’s ToolMessage mechanism.” Maybe either suggest this in the docs to make the warning go away, or just silently switch to use_tools in the case of a “local” LLM, since OpenAI’s functions will never be available there. (A rough sketch of what I ended up with is below, after this list.)

  • Similarly, when using a “local” model, I was getting a warning about fakeredis until I put cache_config=RedisCacheConfig(fake=True) in the config. Similar to the above, could this warning include the “solution” if we’re fine with the default behavior?
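
For reference, here is roughly the shape of the agent/task setup I ended up with to silence the functions-API warning, plus the new-style done_if_response form as I (possibly incorrectly) understand it. The Entity import location is a guess on my part:

from langroid.mytypes import Entity  # guessing this is where Entity lives

agent_config = ChatAgentConfig(
    llm=my_llm_config,
    name="my-llm-agent",
    use_functions_api=False,  # avoid the functions-API warning with a local model
)
agent = ChatAgent(agent_config)

task = Task(
    agent,
    system_message="You are a helpful assistant",
    done_if_response=[Entity.LLM],  # new-style replacement for single_round=True (?)
)
task.run("Is New York in America?")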

Again, I just want to stress how absolutely cool and fun this project is -- I can easily see a future of pre-written agents and tasks that you can download and snap together to do all kinds of cool things. A modular, node-based graphical system a la Blender or invokeai or comfyui to follow? Heh.