langchain: GPT4All chat error with async calls

Hi, I believe this issue is related to this one: #1372

I’m using the GPT4All integration and get the following error after running ConversationalRetrievalChain with AsyncCallbackManager: ERROR:root:Async generation not implemented for this LLM. Switching to CallbackManager does not fix anything.

The issue is model-agnostic; I have tried both ggml-gpt4all-j-v1.3-groovy.bin and ggml-mpt-7b-base.bin. The LangChain version I’m using is 0.0.179. Any ideas how this could be solved, or should we just wait for a new release that fixes it?
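For reference, the error does not seem to require the retrieval chain at all: a minimal sketch like the following (the model path is a placeholder) already hits the unimplemented async path.

# Minimal sketch that reproduces the error without the retrieval chain;
# the model path is a placeholder. GPT4All only implements the synchronous
# _call, so the async path falls back to the base class and raises
# NotImplementedError("Async generation not implemented for this LLM.").
import asyncio

from langchain.llms import GPT4All


async def main() -> None:
    llm = GPT4All(model="./ggml-gpt4all-j-v1.3-groovy.bin")
    await llm.agenerate(["What is LangChain?"])


asyncio.run(main())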

Suggestion:

Release a fix, similar to the one in #1372.

About this issue

  • State: open
  • Created a year ago
  • Reactions: 6
  • Comments: 26

Most upvoted comments

I am currently working through the requirements for my PR to be ready for review (formatting, linting, testing). I will finish them ASAP. This is my first contribution ever to an open source project 😃

What I did was implement _acall with async/await support. Should I add just one async test for this? Is that enough for it to be accepted? https://github.com/khaledadrani/langchain/blob/32a041b8a2a5a8a6db36592b501e4ce9d54c219b/tests/unit_tests/llms/fake_llm.py

Edit: I also need to add a test here: https://github.com/khaledadrani/langchain/blob/32a041b8a2a5a8a6db36592b501e4ce9d54c219b/tests/integration_tests/llms/test_gpt4all.py
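For illustration, here is a self-contained sketch of the kind of async test I have in mind (the real test will use the repository's existing FakeLLM and may look different):

# Sketch only: a throwaway fake LLM with both sync and async paths,
# plus a test asserting the async path returns the expected text.
from typing import Any, List, Optional

import pytest

from langchain.callbacks.manager import (
    AsyncCallbackManagerForLLMRun,
    CallbackManagerForLLMRun,
)
from langchain.llms.base import LLM


class EchoLLM(LLM):
    """Fake LLM that echoes the prompt on both the sync and async paths."""

    @property
    def _llm_type(self) -> str:
        return "echo"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        return prompt

    async def _acall(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        return prompt


@pytest.mark.asyncio
async def test_acall_echoes_prompt() -> None:
    llm = EchoLLM()
    result = await llm.agenerate(["hello"])
    assert result.generations[0][0].text == "hello"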

@auxon I use it in the following way:


import asyncio
from datetime import datetime

from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

# modelpath, properties, threads and data are defined elsewhere in my app;
# this code lives inside an async generator (e.g. a streaming endpoint).
history = ConversationBufferMemory(ai_prefix="### Assistant", human_prefix="### Human")
template = """
{history}
### Human: {input}
### Assistant:"""
prompt = PromptTemplate(template=template, input_variables=["history", "input"])

streaminCallback = AsyncIteratorCallbackHandler()
llmObj = AGPT4All(
    model=modelpath,
    verbose=False,
    allow_download=True,
    temp=properties["temp"],
    top_k=properties["top_k"],
    top_p=properties["top_p"],
    repeat_penalty=properties["repeat_penalty"],
    repeat_last_n=properties["repeat_last_n"],
    n_predict=properties["n_predict"],
    n_batch=properties["n_batch"],
    callbacks=[streaminCallback],
    n_threads=threads,
    streaming=True,
)

history.load_memory_variables({})
chain = ConversationChain(prompt=prompt, llm=llmObj, memory=history)

# Kick off generation in the background and stream tokens as they arrive.
asyncio.create_task(chain.apredict(input=data["prompt"]))
start = datetime.now()
tokenCount = 0
compResp = ""
async for respEntry in streaminCallback.aiter():
    diff = datetime.now() - start
    tokenCount += 1
    print("Tokens per Second: " + str(tokenCount / diff.total_seconds()))
    compResp = compResp + respEntry
    yield respEntry

For GPT4All, you can use this class in your projects:


from functools import partial
from typing import Any, List, Optional

from langchain.callbacks.manager import AsyncCallbackManagerForLLMRun
from langchain.llms import GPT4All
from langchain.llms.utils import enforce_stop_tokens


class AGPT4All(GPT4All):
    """GPT4All subclass that adds an async _acall with token streaming."""

    async def _acall(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        text_callback = None
        if run_manager:
            # Forward every new token to the async callback manager.
            text_callback = partial(run_manager.on_llm_new_token, verbose=self.verbose)
        text = ""
        params = {**self._default_params(), **kwargs}
        for token in self.client.generate(prompt, streaming=True, **params):
            if text_callback:
                await text_callback(token)
            text += token
        if stop is not None:
            text = enforce_stop_tokens(text, stop)
        return text
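As a rough example (not my exact code), the subclass can be plugged into the chain from the original issue like this; the model path is a placeholder and vectorstore is assumed to already exist:

# Hedged sketch: using AGPT4All with ConversationalRetrievalChain async calls.
from langchain.chains import ConversationalRetrievalChain

llm = AGPT4All(model="./ggml-gpt4all-j-v1.3-groovy.bin", streaming=True)  # placeholder path
chain = ConversationalRetrievalChain.from_llm(llm, retriever=vectorstore.as_retriever())

# Inside an async function:
result = await chain.acall({"question": "What is LangChain?", "chat_history": []})
print(result["answer"])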

@khaledadrani eagerly waiting for it.

@charlod is the current implementation of ainvoke, or acall (which is going to be deprecated), not working for you?

model_response = await qa.ainvoke(input={"query": "what is Python?"})
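For reference, a rough sketch of how qa could be set up; the model path is a placeholder and vectorstore is assumed to already exist:

# Hypothetical construction of the qa object used above.
from langchain.chains import RetrievalQA
from langchain.llms import GPT4All

llm = GPT4All(model="./ggml-gpt4all-j-v1.3-groovy.bin")  # placeholder path
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# Inside an async function:
model_response = await qa.ainvoke(input={"query": "what is Python?"})
print(model_response["result"])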

@baskaryan Could you please help @PiotrPmr with the issue they mentioned? They are still experiencing an error with async calls in the GPT4All chat integration and would appreciate your assistance. Thank you!

A workaround for using ConversationalRetrievalChain with LlamaCpp is to implement the _acall function. This is not tested extensively.


from langchain.llms import LlamaCpp
from typing import Any, AsyncGenerator, Dict, List, Optional
from langchain.callbacks.manager import AsyncCallbackManagerForLLMRun

class LlamaCppAsync(LlamaCpp):
    async def _acall(
            self,
            prompt: str,
            stop: Optional[List[str]] = None,
            run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
            **kwargs: Any,
    ) -> str:
        """Asynchronous Call the Llama model and return the output.

        Args:
            prompt: The prompt to use for generation.
            stop: A list of strings to stop generation when encountered.

        Returns:
            The generated text.

        Example:
            .. code-block:: python

                llm = LlamaCppAsync(model_path="/path/to/local/llama/model.bin")
                await llm.agenerate(["This is a prompt."])
        """
        if self.streaming:
            # If streaming is enabled, use the stream_async method, which
            # yields chunks as they are generated, and return the combined
            # text from each chunk's first choice:
            combined_text_output = ""
            stream = self.stream_async(prompt=prompt, stop=stop, run_manager=run_manager)

            async for chunk in stream:
                combined_text_output += chunk["choices"][0]["text"]
            return combined_text_output
        else:
            params = self._get_parameters(stop)
            params = {**params, **kwargs}
            result = self.client(prompt=prompt, **params)
            return result["choices"][0]["text"]

    async def stream_async(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
    ) -> AsyncGenerator[Dict, None]:
        """Yields results objects as they are generated in real time.

        BETA: this is a beta feature while we figure out the right abstraction.
        Once that happens, this interface could change.

        It also calls the callback manager's on_llm_new_token event with
        similar parameters to the OpenAI LLM class method of the same name.

        Args:
            prompt: The prompts to pass into the model.
            stop: Optional list of stop words to use when generating.

        Returns:
            An async generator representing the stream of chunks being generated.

        Yields:
            Dictionary-like objects containing a string token and metadata.
            See the llama-cpp-python docs and below for more.

        Example:
            .. code-block:: python

                llm = LlamaCppAsync(
                    model_path="/path/to/local/model.bin",
                    temperature=0.5,
                )
                async for chunk in llm.stream_async("Ask 'Hi, how are you?' like a pirate:'",
                        stop=["'", "\n"]):
                    result = chunk["choices"][0]
                    print(result["text"], end='', flush=True)

        """
        params = self._get_parameters(stop)
        result = self.client(prompt=prompt, stream=True, **params)
        for chunk in result:
            token = chunk["choices"][0]["text"]
            log_probs = chunk["choices"][0].get("logprobs", None)
            if run_manager:
                await run_manager.on_llm_new_token(
                    token=token, verbose=self.verbose, log_probs=log_probs
                )
            yield chunk

Then change


    question_gen_llm = LlamaCpp(
        model_path=LLM_MODEL_PATH,
        n_ctx=2048,
        streaming=True,
        callback_manager=question_manager,
        verbose=True,
    )

To


    question_gen_llm = LlamaCppAsync(
        model_path=LLM_MODEL_PATH,
        n_ctx=2048,
        streaming=True,
        callback_manager=question_manager,
        verbose=True,
    )
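For completeness, a rough end-to-end sketch of driving the async chain with streaming; LLM_MODEL_PATH and vectorstore are placeholders, and again this is not tested extensively:

# Rough sketch: run the chain as a background task and consume tokens
# from the async iterator callback as they stream in.
import asyncio

from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.chains import ConversationalRetrievalChain

stream_handler = AsyncIteratorCallbackHandler()
question_manager = AsyncCallbackManager([stream_handler])

question_gen_llm = LlamaCppAsync(
    model_path=LLM_MODEL_PATH,
    n_ctx=2048,
    streaming=True,
    callback_manager=question_manager,
    verbose=True,
)

chain = ConversationalRetrievalChain.from_llm(
    question_gen_llm, retriever=vectorstore.as_retriever()
)


async def run() -> None:
    task = asyncio.create_task(
        chain.acall({"question": "What is Python?", "chat_history": []})
    )
    async for token in stream_handler.aiter():
        print(token, end="", flush=True)
    await task


asyncio.run(run())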