langchain: RuntimeError: Failed to tokenize (LlamaCpp and QAWithSourcesChain)

Hi there, getting the following error when attempting to run a QAWithSourcesChain using a local GPT4All model. The code works fine with OpenAI but seems to break if I swap in a local LLM model for the response. Embeddings work fine in the VectorStore (using OpenSearch).

import os
from pprint import pprint

from langchain.chains import QAWithSourcesChain
from langchain.llms import LlamaCpp
from langchain.schema import Document

# `settings`, `db` and `logger` are provided elsewhere in the project.


def query_datastore(
    query: str,
    print_output: bool = True,
    temperature: float = settings.models.DEFAULT_TEMP,
) -> dict:
    """Uses LangChain's `get_relevant_documents` to pull matching Documents from the vector DB, then runs a QA-with-sources chain over them.

    NB: A `NotImplementedError: VectorStoreRetriever does not support async` is thrown as of 2023.04.04 so we need to run this in a synchronous fashion.

    Args:
        query: string representing the question we want to use as a prompt for the QA chain.
        print_output: whether to pretty print the returned answer to stdout. Default is True.
        temperature: float controlling how deterministic the model is. Zero is fully deterministic; 2 gives it artistic licence.

    Returns:
        The chain's output dictionary, primarily an `answer` string and a `sources` string identifying the documents used.
    """
    retriever = db().as_retriever()  # use our existing persisted document repo in opensearch
    docs: list[Document] = retriever.get_relevant_documents(query)
    llm = LlamaCpp(
        model_path=os.path.join(settings.models.DIRECTORY, settings.models.LLM),
        n_batch=8192,
        temperature=temperature,
        max_tokens=20480,
    )
    chain: QAWithSourcesChain = QAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff")
    answer: dict = chain({"docs": docs, "question": query}, return_only_outputs=True)
    logger.info(answer)
    if print_output:
        pprint(answer)
    return answer

Exception as below.

RuntimeError: Failed to tokenize: text="b' Given the following extracted parts of a long document and a question, create a final answer with 
references ("SOURCES"). \nIf you don\'t know the answer, just say that you don\'t know. Don\'t try to make up an answer.\nALWAYS return a "SOURCES" 
part in your answer.\n\nQUESTION: Which state/country\'s law governs the interpretation of the contract?\n=========\nContent: This Agreement is 
governed by English law and the parties submit to the exclusive jurisdiction of the English courts in  relation to any dispute (contractual or 
non-contractual) concerning this Agreement save that either party may apply to any court for an  injunction or other relief to protect its 
Intellectual Property Rights.\nSource: 28-pl\nContent: No Waiver. Failure or delay in exercising any right or remedy under this Agreement shall not 
constitute a waiver of such (or any other)  right or remedy.\n\n11.7 Severability. The invalidity, illegality or unenforceability of any term (or 
part of a term) of this Agreement shall not affect the continuation  in force of the remainder of the term (if any) and this Agreement.\n\n11.8 No 
Agency. Except as expressly stated otherwise, nothing in this Agreement shall create an agency, partnership or joint venture of any  kind between the
parties.\n\n11.9 No Third-Party Beneficiaries.\nSource: 30-pl\nContent: (b) if Google believes, in good faith, that the Distributor has violated or 
caused Google to violate any Anti-Bribery Laws (as  defined in Clause 8.5) or that such a violation is reasonably likely to occur,\nSource: 
4-pl\n=========\nFINAL ANSWER: This Agreement is governed by English law.\nSOURCES: 28-pl\n\nQUESTION: What did the president say about Michael 
Jackson?\n=========\nContent: Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices
of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as 
Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution.
\n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia\xe2\x80\x99s Vladimir Putin sought to 
shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll
into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President 
Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citizens blocking tanks with 
their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.\nSource: 0-pl\nContent: And we won\xe2\x80\x99t 
stop. \n\nWe have lost so much to COVID-19. Time with one another. And worst of all, so much loss of life. \n\nLet\xe2\x80\x99s use this moment to 
reset. Let\xe2\x80\x99s stop looking at COVID-19 as a partisan dividing line and see it for what it is: A God-awful disease.  \n\nLet\xe2\x80\x99s 
stop seeing each other as enemies, and start seeing each other for who we really are: Fellow Americans.

From what I can tell, the model is struggling to interpret the prompt template that’s being passed to it?
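
One way to check whether it is simply a length problem is to count roughly how many tokens the "stuff" chain is about to send in a single call and compare that to the model's context window (LangChain's LlamaCpp wrapper defaults to n_ctx=512). A rough sketch reusing the llm, docs and query objects from the snippet above - get_num_tokens comes from LangChain's base LLM class and is only an approximation, since it may not use llama.cpp's own tokenizer:

# Approximate token count of the stuffed prompt (excludes the prompt template boilerplate).
stuffed_text = query + "\n\n" + "\n\n".join(doc.page_content for doc in docs)
approx_tokens = llm.get_num_tokens(stuffed_text)
logger.info("~%d prompt tokens vs. a context window of %d", approx_tokens, llm.n_ctx)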

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 5
  • Comments: 15

Most upvoted comments

I am facing the same issue. I am getting embeddings from LlamaCppEmbeddings and storing them in Chroma. I also noticed it’s very slow, and after hours of running I got that error!
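
Roughly, that setup looks like the following (a sketch with a placeholder model path; split_docs stands in for whatever list of Document chunks is actually being indexed):

from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma

# Embed the chunks with llama.cpp and persist them in a Chroma collection.
embeddings = LlamaCppEmbeddings(model_path="models/my-model.bin")  # n_ctx also defaults to 512 here
vectordb = Chroma.from_documents(documents=split_docs, embedding=embeddings)
retriever = vectordb.as_retriever()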

Now I am guessing the next step, as you suggested, is to increase the context tokens to combat "Requested tokens exceed context window of 512". Is there any documentation you can point me at so that I can hopefully sort this out?

Haven’t got access currently, but it’s something you should be able to change when initialising the model. With llama.cpp there might be a limit of 2048; can’t remember off the top of my head.

Something along the lines of this could be worth a try.

llm = LlamaCpp(model_path="models/my-model.bin", n_ctx=2048)
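
Applied back to the original snippet, something like the following might be worth trying - treat the exact numbers as guesses rather than recommendations, and note that the original max_tokens=20480 and n_batch=8192 are both far larger than any context window llama.cpp will give you, so they want reining in at the same time:

llm = LlamaCpp(
    model_path=os.path.join(settings.models.DIRECTORY, settings.models.LLM),
    n_ctx=2048,       # raise the context window from the 512 default
    n_batch=512,      # keep the batch size within the window
    max_tokens=256,   # leave most of the window for the prompt
    temperature=temperature,
)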

It is really bizarre that I can’t use publicly available LLMs to create a very simple use case of interrogating a text file. Am I missing something?

You’ll have the same issue - read my responses above (particularly the second section). Either change from chain_type="stuff" to chain_type="refine", or increase the context tokens, etc.

As you can see, if you pass only a single document it’ll work fine. Your use of "stuff" means that you’re not chunking the input to the LLM in a way it can process within the token buffer available to it, in my experience.

Try passing in n_ctx=2048 as a parameter.
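
Putting the two suggestions together - a larger context window plus a chain type that doesn’t stuff every document into one prompt - a minimal sketch reusing the names from the original snippet might look like:

chain = QAWithSourcesChain.from_chain_type(llm=llm, chain_type="refine")  # or "map_reduce"
answer = chain({"docs": docs, "question": query}, return_only_outputs=True)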