ragas: Invalid response format. Expected a list of dictionaries with keys 'verdict'

Ragas version: 0.1.1
Python version: 3.8

Describe the bug

I am getting all NaN scores. I am using an open-source LLM hosted with vLLM's OpenAI-compatible server, HuggingfaceEmbeddings, and a custom sample dataset.

Code to Reproduce

I successfully started the vLLM OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --tensor-parallel-size 4 --port 8000 --enforce-eager --gpu-memory-utilization 0.99

from langchain_openai.chat_models import ChatOpenAI
from datasets import Dataset
inference_server_url = "http://localhost:8000/v1"

# Create a LangChain ChatOpenAI instance that points at the vLLM server
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=5,
    temperature=0,
)
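
A quick sanity check like the following (a minimal addition, assuming the server above is still running on port 8000) confirms the wrapper can reach the vLLM endpoint before any metrics are involved:

# Minimal connectivity check: if this call fails or returns nothing useful,
# the evaluate() call below cannot produce scores either.
print(chat.invoke("Return the word OK.").content)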

from ragas.embeddings import HuggingfaceEmbeddings
hf_embeddings = HuggingfaceEmbeddings(model_name="BAAI/bge-small-en")

# My dataset looks like this:
data = {
    'question': ['What did the president say about gun violence?'],
    'answer': ['I do not have access to the most recent statements made by the president. please provide me with the specific date or context of the statement you are referring to.'],
    'contexts': [['The president asked Congress to pass proven measures to reduce gun violence.']],
    'ground_truth': ['The president asked Congress to pass proven measures to reduce gun violence.'],
}
# Convert dict to dataset
dataset = Dataset.from_dict(data)

from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

from ragas import evaluate
result = evaluate(
    dataset=dataset,
    llm=chat,
    embeddings=hf_embeddings,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)
df = result.to_pandas()

Error trace

Invalid response format. Expected a list of dictionaries with keys 'verdict'
Invalid JSON response. Expected dictionary with key 'Attributed'
Invalid JSON response. Expected dictionary with key 'question'
/usr/local/lib/python3.8/dist-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])

About this issue

  • State: open
  • Created 5 months ago
  • Reactions: 3
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Same issue with Bedrock models, e.g. "anthropic.claude-v2"

We are using Mixtral.

The issue is that getting other models to emit the expected JSON exactly is very fragile. That's not a RAGAS issue, but a model issue.

Llama didn't work for us; only Mixtral 8x7B had even limited success.

For RAGAS with Mixtral we had to fix issues in the prompts ourselves, such as context_recall expecting the non-standard capitalized JSON key "Attributed", or the few-shot examples presenting "classification" sometimes as an array and sometimes as a dictionary. A sketch of how such a prompt can be adjusted is shown below.
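
For example, a minimal sketch of tightening the context_recall prompt (assuming the metric exposes its prompt through a context_recall_prompt attribute; the attribute name and prompt fields differ across ragas versions, so inspect the metric object first):

from ragas.metrics import context_recall

# Locate the prompt attribute first; its name varies across ragas releases.
print([a for a in vars(context_recall) if "prompt" in a])

# Assuming the attribute is `context_recall_prompt`, append a stricter
# instruction so a smaller model is told exactly how to format its output.
context_recall.context_recall_prompt.instruction += (
    "\nRespond with ONLY valid JSON, using exactly the keys shown in the "
    'examples (including the capitalized "Attributed"), and no other text.'
)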

@jjmachan While other models should technically work, in practice they just don't. I wonder whether anyone in the community is actually using another model successfully. The JSON issue is more profound than you think: other LLMs do not respond well to your prompts, and the metric code needs to be adapted to each LLM separately. A rough way to probe this is sketched below.
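
As an illustration, the model can be probed directly with a ragas-style JSON request (this is not the actual ragas prompt, just an approximation based on the error messages above) to check whether its output even parses:

import json
from langchain_openai.chat_models import ChatOpenAI

chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base="http://localhost:8000/v1",
    max_tokens=256,  # leave enough room for a full JSON response
    temperature=0,
)

probe = (
    "Classify the statement against the context and respond with ONLY a JSON "
    'list of objects with the keys "statement", "reason" and "Attributed".\n'
    "Context: The president asked Congress to pass measures to reduce gun violence.\n"
    "Statement: The president talked about gun violence."
)

raw = chat.invoke(probe).content
try:
    print("parsed OK:", json.loads(raw))
except json.JSONDecodeError:
    print("model did not return valid JSON:\n", raw)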