Major LLM Evaluation Frameworks
November 20, 2024
| Benchmark/Framework | Description | Metrics |
| --- | --- | --- |
| MMLU (Massive Multitask Language Understanding) | Evaluates multitask knowledge and reasoning across 57 tasks spanning diverse domains, from elementary math to law. | Multiple-choice accuracy |
| GLUE (General Language Understanding Evaluation) | Tests language understanding through nine tasks such as sentiment analysis and natural language inference. | Accuracy, F1 score, Matthews/Pearson correlation |
| SuperGLUE | Extension of GLUE with more complex reasoning and comprehension tasks. | Accuracy, F1 score |
| TruthfulQA | Measures whether a model avoids reproducing common falsehoods and human misconceptions. | Truthfulness percentage, informativeness |
| HellaSwag | Tests commonsense reasoning by asking models to pick the most plausible continuation of a scenario from adversarially filtered choices. | Accuracy |
| DeepEval | Open-source framework with 14+ evaluation metrics, including summarization quality and hallucination detection (usage sketch below). | Answer relevancy, contextual recall/precision, hallucination detection |
| RAGAs (Retrieval-Augmented Generation Assessment) | Evaluates RAG pipelines with a focus on faithfulness and contextual relevancy (usage sketch below). | Faithfulness, contextual precision/recall |
| EleutherAI LM Evaluation Harness | Open-source harness supporting 200+ evaluation tasks; powers Hugging Face’s Open LLM Leaderboard (usage sketch below). | Task-specific accuracy |
| MLflow LLM Evaluate | Modular evaluation API built around mlflow.evaluate(), with support for RAG pipelines and QA tasks (usage sketch below). | QA correctness, answer relevancy |
| HumanEval (OpenAI) | Assesses code generation by running model-written programs against predefined unit tests. | Functional correctness (pass@k; sketch below) |
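HumanEval reports pass@k, estimated with the unbiased formula from the original Codex paper: draw n samples per problem, count the c samples that pass all unit tests, and average 1 - C(n-c, k)/C(n, k) over problems. A minimal Python sketch of the per-problem estimator, using the numerically stable product form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total samples generated, c: samples that passed all unit tests,
    k: evaluation budget.
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

Averaging pass_at_k over HumanEval's 164 problems gives the reported benchmark score.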
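Usage sketches for the open-source frameworks in the table follow. First, DeepEval's test-case/metric API: a minimal sketch, assuming deepeval is installed and its default LLM judge (an OpenAI key) is configured; class names follow recent releases and may differ between versions.

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One evaluation case: the prompt and the model's actual answer.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
)

# LLM-as-judge metric; threshold sets the pass/fail cutoff.
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```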
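RAGAs scores a RAG trace (question, retrieved contexts, generated answer, reference) with its built-in metrics. A minimal sketch, assuming the classic Ragas column schema (newer releases rename some fields) and an API key for the LLM judge that both metrics call:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision

# A single RAG trace: question, retrieved contexts, answer, and reference.
data = Dataset.from_dict({
    "question": ["Who wrote Dune?"],
    "contexts": [["Dune is a 1965 novel by Frank Herbert."]],
    "answer": ["Dune was written by Frank Herbert."],
    "ground_truth": ["Frank Herbert"],
})

result = evaluate(data, metrics=[faithfulness, context_precision])
print(result)  # dataset-level score per metric
```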
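The EleutherAI LM Evaluation Harness exposes both a CLI and a Python entry point. A minimal sketch using the simple_evaluate API available in recent lm-eval releases; the model and task names here are illustrative:

```python
import lm_eval

# Evaluate a Hugging Face causal LM on one benchmark task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])  # per-task metric dict
```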
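MLflow's evaluate API can also score a static table of pre-computed predictions without wrapping a model. A minimal sketch, assuming MLflow 2.x with the question-answering model type; the example data is illustrative:

```python
import mlflow
import pandas as pd

# Static evaluation: pre-computed predictions scored against references.
eval_df = pd.DataFrame({
    "inputs": ["What is MLflow?"],
    "predictions": ["MLflow is an open-source MLOps platform."],
    "ground_truth": ["MLflow is an open-source platform for the ML lifecycle."],
})

results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",
    targets="ground_truth",
    model_type="question-answering",
)
print(results.metrics)
```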