ragas: Invalid response format. Expected a list of dictionaries with keys 'verdict'
Ragas version: 0.1.1
Python version: 3.8
Describe the bug
Getting all NaN scores. I am using an open-source LLM served with vLLM's OpenAI-compatible API server, HuggingfaceEmbeddings, and a custom sample dataset.
Code to Reproduce
Successfully ran the vLLM OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --tensor-parallel-size 4 --port 8000 --enforce-eager --gpu-memory-utilization 0.99
from langchain_openai.chat_models import ChatOpenAI
from datasets import Dataset
inference_server_url = "http://localhost:8000/v1"
# Create vLLM LangChain instance
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=5,
    temperature=0,
)
from ragas.embeddings import HuggingfaceEmbeddings
hf_embeddings = HuggingfaceEmbeddings(model_name="BAAI/bge-small-en")
# My dataset looks like this
data = {
    'question': ['What did the president say about gun violence?'],
    'answer': ['I do not have access to the most recent statements made by the president. please provide me with the specific date or context of the statement you are referring to.'],
    'contexts': [['The president asked Congress to pass proven measures to reduce gun violence.']],
    'ground_truth': ['The president asked Congress to pass proven measures to reduce gun violence.']
}
# Convert dict to dataset
dataset = Dataset.from_dict(data)
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate
result = evaluate(
    dataset=dataset,
    llm=chat,
    embeddings=hf_embeddings,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)
df = result.to_pandas()
Error trace
Invalid response format. Expected a list of dictionaries with keys 'verdict'
Invalid JSON response. Expected dictionary with key 'Attributed'
Invalid JSON response. Expected dictionary with key 'question'
/usr/local/lib/python3.8/dist-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])
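To see what the served model actually returns for a JSON-style instruction, a quick probe like the one below can help (sketch only: the instruction text is illustrative, not ragas' actual prompt, and max_tokens is raised because 5 tokens cannot hold a JSON list).

import json
from langchain_openai.chat_models import ChatOpenAI

# Same vLLM server as above; larger max_tokens so the JSON is not truncated.
probe = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base="http://localhost:8000/v1",
    max_tokens=512,
    temperature=0,
)

# Illustrative instruction in the spirit of ragas' verdict prompts (not the exact prompt text).
reply = probe.invoke(
    'Given the context "The president asked Congress to pass proven measures to reduce gun violence." '
    'and the statement "The president discussed gun violence.", answer ONLY with a JSON list of '
    'dictionaries, each with the key "verdict" set to 1 or 0.'
)
print(reply.content)

try:
    print(json.loads(reply.content))
except json.JSONDecodeError as exc:
    # If the model wraps the JSON in prose (or never emits it), ragas discards the sample,
    # the per-metric score list stays empty, and np.nanmean over it yields the NaN seen above.
    print("Not valid JSON:", exc)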
About this issue
- State: open
- Created 5 months ago
- Reactions: 3
- Comments: 15 (2 by maintainers)
Same issue with Bedrock models, e.g. "anthropic.claude-v2".

We are using Mixtral. The issue is that emitting the expected JSON perfectly is super fragile for other models. That's not a RAGAS issue, but a model issue.
LLaMA didn't work for us; only Mixtral 8x7B seems to have limited success.
For RAGAS with Mixtral we had to fix issues in the prompts, such as context_recall expecting the capitalized JSON key "Attributed" (which is non-standard), or few-shot examples that give a "classification" array OR a dictionary.
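Something along these lines (sketch only: the context_recall_prompt attribute name is an assumption for ragas 0.1.x, and the appended wording is just an example, so print the prompt from your installed version first):

from ragas.metrics import context_recall

# Assumption: in ragas 0.1.x the metric exposes its prompt object as `context_recall_prompt`.
# Inspect the instruction and few-shot examples to see the exact schema the parser expects.
print(context_recall.context_recall_prompt.instruction)
print(context_recall.context_recall_prompt.examples)

# Then reinforce that schema for models that drift from it, e.g.:
context_recall.context_recall_prompt.instruction += (
    '\nRespond with valid JSON only, using exactly the keys shown in the examples '
    '(including the capitalized "Attributed" key). Do not add any text around the JSON.'
)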
@jjmachan While other models should technically work, in practice it just does not work. I am wondering whether anyone in the community is actually using another model. The JSON issue is more profound than you think: other LLMs are not responding well to your prompts, and the code in the metrics needs to be adapted to each LLM separately.
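Another direction worth sketching: if the vLLM build behind the OpenAI-compatible server supports guided (schema-constrained) decoding, the JSON shape can be enforced at generation time instead of hoping the model follows the prompt. The guided_json field below is a vLLM-specific extension that may not exist in the version being run here, and the schema is only an example.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="no-key")

# Example JSON schema for a list of {"verdict": 0|1} objects; adjust to the metric's real schema.
verdict_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {"verdict": {"type": "integer", "enum": [0, 1]}},
        "required": ["verdict"],
    },
}

resp = client.chat.completions.create(
    model="HuggingFaceH4/zephyr-7b-alpha",
    messages=[{"role": "user", "content": "Return the verdicts as JSON."}],
    max_tokens=512,
    temperature=0,
    # vLLM-specific extension passed through the OpenAI client's extra_body hook.
    extra_body={"guided_json": verdict_schema},
)
print(resp.choices[0].message.content)

Wiring this through LangChain's ChatOpenAI so ragas picks it up would be a separate exercise; this only shows that the server can be asked to constrain its output.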