Major LLM Evaluation Frameworks

20-Nov-24

| Benchmark/Framework | Description | Metrics |
| --- | --- | --- |
| MMLU (Massive Multitask Language Understanding) | Evaluates multitask knowledge and reasoning across 57 subjects spanning diverse domains. | Accuracy |
| GLUE (General Language Understanding Evaluation) | Tests language understanding across nine tasks such as sentiment analysis and question answering. | Accuracy, F1 score |
| SuperGLUE | Extension of GLUE with more difficult reasoning and comprehension tasks. | Accuracy, F1 score |
| TruthfulQA | Evaluates truthfulness by testing whether models avoid reproducing common falsehoods. | Truthfulness percentage, informativeness |
| HellaSwag | Tests commonsense reasoning by asking models to pick the most plausible continuation of a scenario. | Accuracy |
| DeepEval | Open-source framework with 14+ evaluation metrics, including summarization and hallucination detection (sketch below). | Answer relevancy, contextual recall/precision, hallucination detection |
| RAGAS (Retrieval-Augmented Generation Assessment) | Evaluates RAG pipelines with a focus on faithfulness and contextual relevancy (sketch below). | Faithfulness, contextual precision/recall |
| EleutherAI LM Evaluation Harness | Supports 200+ evaluation tasks and powers Hugging Face's Open LLM Leaderboard (sketch below). | Task-specific accuracy |
| MLflow LLM Evaluate | Modular framework for evaluating LLMs, with support for RAG pipelines and QA tasks (sketch below). | QA correctness, answer relevancy |
| HumanEval (OpenAI) | Assesses code generation by running generated code against predefined unit tests (sketch below). | Functional correctness (pass@k) |
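
To make the programmatic frameworks above more concrete, here is a minimal DeepEval sketch that scores a single response with the answer-relevancy metric. The example question, answer, and 0.7 threshold are illustrative assumptions, and the metric relies on an LLM judge (OpenAI by default) at runtime.

```python
# Minimal DeepEval sketch: score one response for answer relevancy.
# The question/answer pair and the 0.7 threshold are illustrative;
# the metric calls an LLM judge under the hood (OpenAI by default).
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Why does the sky appear blue?",
    actual_output="Sunlight scatters off air molecules, and blue light scatters the most.",
)

metric = AnswerRelevancyMetric(threshold=0.7)  # pass if the relevancy score is >= 0.7
evaluate(test_cases=[test_case], metrics=[metric])
```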
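
Similarly, a minimal Ragas sketch, assuming the question/answer/contexts/ground_truth column convention; exact metric imports vary by library version, and an LLM judge plus embeddings (OpenAI by default) are required at runtime.

```python
# Minimal Ragas sketch: evaluate a one-row RAG sample for faithfulness
# and contextual precision/recall. The data is a toy example.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, context_recall

data = Dataset.from_dict({
    "question": ["Who wrote Pride and Prejudice?"],
    "answer": ["Pride and Prejudice was written by Jane Austen."],
    "contexts": [["Pride and Prejudice is an 1813 novel by Jane Austen."]],
    "ground_truth": ["Jane Austen"],
})

result = evaluate(data, metrics=[faithfulness, context_precision, context_recall])
print(result)
```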
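
The EleutherAI LM Evaluation Harness is also the usual way to run benchmark rows such as MMLU and HellaSwag locally. A minimal sketch, assuming lm-eval 0.4+ and its Hugging Face backend; the model name and task list are placeholders.

```python
# Minimal lm-eval sketch: run HellaSwag on a small Hugging Face model.
# "gpt2" and the task list are placeholders; add "mmlu" to run the MMLU suite.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any causal LM on the Hub
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```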
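
For MLflow LLM Evaluate, a minimal sketch using mlflow.evaluate() in question-answering mode; the model URI, the one-row dataset, and the column names are placeholders.

```python
# Minimal MLflow sketch: evaluate a registered QA model on a tiny dataset.
# The model URI and columns are placeholders; model_type="question-answering"
# enables MLflow's built-in QA metrics.
import mlflow
import pandas as pd

eval_data = pd.DataFrame({
    "inputs": ["What is MLflow?"],
    "ground_truth": ["MLflow is an open-source platform for managing the ML lifecycle."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my-qa-model/1",   # placeholder model URI
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(results.metrics)
```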
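
Finally, HumanEval-style scoring reduces to executing each generated completion against the problem's unit tests and aggregating with the pass@k estimator from the HumanEval paper. The tiny harness below is an illustrative sketch; real harnesses sandbox the execution rather than calling exec() directly.

```python
# HumanEval-style sketch: check a candidate completion against unit tests,
# then estimate pass@k with the unbiased estimator 1 - C(n-c, k) / C(n, k).
from math import comb

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate code runs and satisfies the test asserts."""
    env: dict = {}
    try:
        exec(candidate_code, env)  # define the candidate function
        exec(test_code, env)       # run the asserts against it
        return True
    except Exception:
        return False

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples drawn, c of them correct; probability at least one of k passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests), pass_at_k(n=10, c=4, k=1))
```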