Evaluation of LLMs and LLM-based Systems
Compendium of LLM Evaluation methods
Introduction
The aim of this compendium is to assist academics and industry professionals in creating effective evaluation suites tailored to their specific needs. It does so by reviewing the top industry practices for assessing large language models (LLMs) and their applications. This work goes beyond merely cataloging benchmarks and evaluation studies; it encompasses a comprehensive overview of all effective and practical evaluation techniques, including those embedded within papers that primarily introduce new LLM methodologies and tasks. I plan to periodically update this survey with any noteworthy and shareable evaluation methods that I come across.
I aim to create a resource that will enable anyone with queries—whether it’s about evaluating a large language model (LLM) or an LLM application for specific tasks, determining the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain—to easily find all the relevant information needed for these tasks. Additionally, I want to highlight various methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align effectively with business or academic objectives.
My view on LLM Evaluation: Deck, SF Big Analytics and AICamp video, Analytics Vidhya (Data Phoenix, Mar 5) (by Andrei Lopatenko)
Table of contents
- Reviews and Surveys
- Leaderboards and Arenas
- Evaluation Software
- LLM Evaluation articles in tech media and blog posts from companies
- Large benchmarks
- Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation
- Long Comprehensive Studies
- HITL (Human in the Loop)
- LLM as Judge
- LLM Evaluation
- LLM Systems
- Other collections
Reviews and Surveys
- Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv
- A Survey on Evaluation of Large Language Models, Jul 2023, arxiv
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv
Leaderboards and Arenas
- New Hard Leaderboard by HuggingFace leaderboard description, blog post
- LMSys Arena (explanation:)
- Salesforce’s Contextual Bench leaderboard, hugging face: an overview of how different LLMs perform across a variety of contextual tasks
- OpenLLM Leaderboard
- MTEB
- SWE Bench
- AlpacaEval leaderboard Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv code
- Open Medical LLM Leaderboard from HF Explanation
- Gorilla, Berkeley function calling Leaderboard Explanation
- WildBench, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- Enterprise Scenarios, Patronus
- Vectara Hallucination Leaderboard
- Ray/Anyscale’s LLM Performance Leaderboard (explanation:)
- Hugging Face LLM Performance hugging face leaderboard
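Arena-style leaderboards such as LMSys Arena aggregate pairwise human preference votes into a ranking, commonly with Elo or Bradley-Terry style models. The sketch below is a minimal, illustrative online Elo update over hypothetical battle records; the K-factor, initial rating, and data format are assumptions, not any arena's actual pipeline.

```python
# Minimal Elo aggregation over pairwise "battles" (illustrative only).
# Each battle is (model_a, model_b, winner), with winner in {"a", "b", "tie"}.
from collections import defaultdict

def elo_ratings(battles, k=32, init=1000.0):
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

battles = [("model-x", "model-y", "a"), ("model-y", "model-x", "tie"),
           ("model-x", "model-z", "b")]
print(sorted(elo_ratings(battles).items(), key=lambda kv: -kv[1]))
```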
Evaluation Software
- EleutherAI LLM Evaluation Harness
- Eureka, Microsoft, A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. github Sep 2024 arxiv
- OpenAI Evals
- ConfidentAI DeepEval
- MTEB
- OpenICL Framework
- RAGAS
- ML Flow Evaluate
- MosaicML Composer
- Toolkit from Mozilla AI for LLM-as-judge evaluation: lm-buddy eval tool, model: Prometheus
- TruLens
- Promptfoo
- BigCode Evaluation Harness
- LangFuse
- LLMeBench see LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
- ChainForge
- Ironclad Rivet
- LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models, arxiv pdf github repository
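To give a sense of how such harnesses are driven, here is a minimal sketch using the EleutherAI LM Evaluation Harness Python entry point; the entry-point name, argument names, example model, and task names reflect lm-eval 0.4.x and are assumptions that may differ in other versions, so consult the harness documentation for the exact interface.

```python
# Minimal sketch: running a few benchmark tasks with the EleutherAI harness.
# Assumes `pip install lm-eval` (0.4.x) and hardware able to host the model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # example model, swap for yours
    tasks=["hellaswag", "arc_easy"],                 # task names from the harness registry
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. accuracy
```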
LLM Evaluation articles in tech media and blog posts from companies
Long Comprehensive Studies
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
HITL (Human in the Loop)
Multi Turn
Instruction Following
- Evaluating Large Language Models at Evaluating Instruction Following Oct 2023, arxiv
- Instruction-Following Evaluation for Large Language Models, IFEval, Nov 2023, arxiv
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, Jul 2023, arxiv , FLASK dataset
- DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation, Mar 2024, aaai, pdf
- LongForm: Effective Instruction Tuning with Reverse Instructions, Apr 2023, arxiv dataset
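Benchmarks such as IFEval rely on instructions whose satisfaction can be verified programmatically rather than judged by another model. The sketch below illustrates that idea with a few rule-based checks; it is not the IFEval implementation, and the specific rules and thresholds are illustrative assumptions.

```python
# Illustrative rule-based instruction-following checks (IFEval-style idea).
import re

def check_word_count(response: str, max_words: int) -> bool:
    """Instruction: 'answer in at most N words'."""
    return len(response.split()) <= max_words

def check_contains_keyword(response: str, keyword: str) -> bool:
    """Instruction: 'mention the word X'."""
    return keyword.lower() in response.lower()

def check_bullet_count(response: str, n_bullets: int) -> bool:
    """Instruction: 'give exactly N bullet points'."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n_bullets

response = "- Paris is the capital of France.\n- It lies on the Seine."
checks = [check_word_count(response, 30),
          check_contains_keyword(response, "Paris"),
          check_bullet_count(response, 2)]
print(f"instruction-level accuracy: {sum(checks) / len(checks):.2f}")
```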
Ethical AI
- Evaluating the Moral Beliefs Encoded in LLMs, Jul 23 arxiv
- AI Deception: A Survey of Examples, Risks, and Potential Solutions Aug 23 arxiv
- Aligning AI With Shared Human Values, Aug 2020 - Feb 2023, arxiv, re: the ETHICS benchmark
- What are human values, and how do we align AI to them?, Mar 2024, pdf
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Helpfulness, Honesty, Harmlessness (HHH) framework from Anthropic, introduced in A General Language Assistant as a Laboratory for Alignment, 2021, arxiv; it is now part of BigBench bigbench
- WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models, April 2024, arxiv
- Chapter 19 in The Ethics of Advanced AI Assistants, Apr 2024, Google DeepMind, pdf at google
- BEHONEST: Benchmarking Honesty of Large Language Models, June 2024, arxiv
Biases
- FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations, Apr 2024 arxiv
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, 2021, arxiv, dataset
- “I’m fully who I am”: Towards centering transgender and non-binary voices to measure biases in open language generation, ACM FAcct 2023, amazon science
- This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models, May 2023, arxiv
Safe AI
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, Oct 2024, arxiv
- LLMSecCode: Evaluating Large Language Models for Secure Coding, Aug 2024, arxiv
- Attack Atlas: A Practitioner’s Perspective on Challenges and Pitfalls in Red Teaming GenAI, Sep 2024, arxiv
- DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection, Sep 2024, arxiv
- Purple Llama, an umbrella project from Meta, Purple Llama repository
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Jun 2023, arxiv
- Rethinking Backdoor Detection Evaluation for Language Models, Aug 2024, arxiv pdf
- Gradient-Based Language Model Red Teaming, Jan 24, arxiv
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, Mar 2024, arxiv
- Announcing a Benchmark to Improve AI Safety MLCommons has made benchmarks for AI performance—now it’s time to measure safety, Apr 2024 IEEE Spectrum
- Model evaluation for extreme risks, May 2023, arxiv
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Jan 2024, arxiv
Cybersecurity
- CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models, Jul 2024, Meta, arxiv
- CYBERSECEVAL 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models, Apr 2024, Meta arxiv
- Benchmarking OpenAI o1 in Cyber Security, Oct 2024, arxiv
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models, Aug 2024, arxiv
Code Generating LLMs
- Evaluating Large Language Models Trained on Code (HumanEval), Jul 2021, arxiv
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation Feb 21 arxiv
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, Oct 2024, arxiv
- LLMSecCode: Evaluating Large Language Models for Secure Coding, Aug 2024, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
- SWE Bench, SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, Feb 2024, arxiv, Tech Report
- Gorilla Function Calling Leaderboard, Berkeley Leaderboard
- DevBench: A Comprehensive Benchmark for Software Development, Mar 2024,arxiv
- MBPP (Mostly Basic Python Programming) benchmark, introduced in Program Synthesis with Large Language Models, 2021, papers with code data
- CodeMind: A Framework to Challenge Large Language Models for Code Reasoning, Feb 2024, arxiv
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, Jan 2024, arxiv
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, Jul 2022, arxiv code at salesforce github
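Most of the code benchmarks above (HumanEval, MBPP, and others) report pass@k, estimated from n sampled completions per problem of which c pass the unit tests. Below is a minimal sketch of the unbiased estimator from the HumanEval paper, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems; the per-problem counts are hypothetical.

```python
# Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples passing the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (n, c) pairs from unit-test execution.
per_problem = [(20, 3), (20, 0), (20, 12)]
for k in (1, 5, 10):
    score = sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
    print(f"pass@{k}: {score:.3f}")
```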
Summarization
- Human-like Summarization Evaluation with ChatGPT, Apr 2023, arxiv
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization, 2021, Transactions ACL dataset
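Work such as "Human-like Summarization Evaluation with ChatGPT" scores summaries with an LLM acting as judge along dimensions like coherence and faithfulness. The sketch below shows one possible way to do this with the OpenAI Python SDK; the prompt wording, model name, and 1-5 scale are assumptions, not the paper's exact protocol.

```python
# Illustrative LLM-as-judge scoring of a summary (not the paper's exact prompts).
# Requires `pip install openai` and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def judge_summary(source: str, summary: str, dimension: str = "faithfulness") -> str:
    prompt = (
        f"Rate the {dimension} of the summary with respect to the source "
        "on a 1-5 scale. Reply with the number only.\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model, swap for your own
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(judge_summary("The report covers Q3 revenue growth of 12%...",
                    "Revenue grew 12% in Q3."))
```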
LLM quality (generic methods: overfitting, redundant layers, etc.)
LLM Systems
RAG Evaluation
- Google Frames Dataset for evaluation of RAG systems, Sep 2024, arxiv paper: [Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation](https://arxiv.org/abs/2409.12941), Hugging Face, dataset
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv Answer Engine (RAG) Evaluation Repository
- RAGAS: Automated Evaluation of Retrieval Augmented Generation Jul 23, arxiv
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems Nov 23, arxiv
- Evaluating Retrieval Quality in Retrieval-Augmented Generation, Apr 2024, arxiv
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios, Sep 2024, arxiv
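As a concrete example of automated RAG evaluation, here is a minimal sketch using RAGAS (also listed under Evaluation Software above); the column names and metric imports follow older ragas releases (0.1.x) and may have changed, and an LLM API key is needed because the metrics are themselves LLM-based.

```python
# Minimal RAGAS evaluation sketch (API as in ragas 0.1.x; may differ in newer versions).
# Requires `pip install ragas datasets` and an OPENAI_API_KEY for the judge LLM.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

data = {
    "question": ["What is the capital of France?"],
    "contexts": [["Paris is the capital and largest city of France."]],
    "answer": ["The capital of France is Paris."],
    "ground_truth": ["Paris"],
}
result = evaluate(Dataset.from_dict(data),
                  metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores between 0 and 1
```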
Conversational and Dialog Systems
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI Feb 24, Nature
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems, Jun 2024, arxiv
- Simulated user feedback for LLM production, TDS
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis Feb 2024 arxiv
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs, Apr 2024, arxiv
- A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4, Jun 2024, arxiv
Copilots
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
Search and Recommendation Engines
- Is ChatGPT a Good Recommender? A Preliminary Study Apr 2023 arxiv
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios, Sep 2024, arxiv
- LaMP: When Large Language Models Meet Personalization, Apr 2023, arxiv
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv Answer Engine (RAG) Evaluation Repository
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives, Feb 2024, arxiv
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, Apr 2023, arxiv
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Oct 2021, arxiv
- Benchmark: LoTTE, Long-Tail Topic-stratified Evaluation for IR that features 12 domain-specific search tests, spanning StackExchange communities and using queries from GooAQ, ColBERT repository with the benchmark data
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM,
- STARK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases, Apr 2024, arxiv code github
- Constitutional AI: Harmlessness from AI Feedback, Sep 2022, arxiv (see Appendix B, Identifying and Classifying Harmful Conversations, and other parts)
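Many of the IR benchmarks above (BEIR, LoTTE, BIRCO, LongEmbed) ultimately reduce to ranking metrics such as recall@k and nDCG@k computed against relevance judgments. A minimal, self-contained sketch of these two metrics over hypothetical ranked results and qrels:

```python
# Minimal recall@k and (binary-relevance) nDCG@k over one ranked result list.
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["d3", "d7", "d1", "d9"]   # hypothetical system ranking for one query
relevant = {"d1", "d7"}             # hypothetical qrels for that query
print(recall_at_k(ranked, relevant, 3), round(ndcg_at_k(ranked, relevant, 3), 3))
```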
Task Utility
- Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications, Feb 2024, arxiv
Verticals
Healthcare and medicine
- Evaluation and mitigation of cognitive biases in medical language models, Oct 2024 Nature
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI Feb 24, Nature
- Evaluating Generative AI Responses to Real-world Drug-Related Questions, June 2024, Psychiatry Research
- Clinical Insights: A Comprehensive Review of Language Models in Medicine, Aug 2024, arxiv See table 2 for evaluation
- Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data Jan 2024 arxiv
- Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis, Jan 2024, arxiv
- MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022, PMLR
- What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams, MedQA benchmark, Sep 2020, arxiv
- PubMedQA: A Dataset for Biomedical Research Question Answering, 2019, acl
- Open Medical LLM Leaderboard from HF Explanation
- Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics, Apr 2023, arxiv
- Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery, Apr 2023, pub med
- Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today, Jun 2023, arxiv
- Evaluating the use of large language model in identifying top research questions in gastroenterology, Mar 2023, nature
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- MedDialog: Two Large-scale Medical Dialogue Datasets, Apr 2020, arxiv
- An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, article html
- DrugBank 5.0: a major update to the DrugBank database for 2018, 2018, paper html
- A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models, May 2024, nature, dataset
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, Aug 2023, arxiv
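Several of the medical benchmarks above (MedQA, MedMCQA, PubMedQA, the Open Medical LLM Leaderboard) are scored as multiple-choice accuracy. The sketch below shows one way to extract an option letter from free-form model output and compute accuracy; the regex and answer format are assumptions rather than any benchmark's official scorer.

```python
# Illustrative multiple-choice accuracy scoring (MedQA/MedMCQA-style A-D options).
import re

def extract_choice(output: str) -> str | None:
    """Pull the first standalone option letter A-D from the model output."""
    match = re.search(r"\b([A-D])\b", output.strip().upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold: list[str]) -> float:
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["The answer is B.", "C", "I would pick (A) because..."]
gold = ["B", "C", "D"]
print(f"accuracy: {accuracy(preds, gold):.2f}")  # 2 of 3 correct
```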
Law
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, NeurIPS 2023
- LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain, EMNLP 2023
- Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities NeurIPS 2022
Science
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, 2022, arxiv
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Nov 2023, arxiv
- MATH Mathematics Aptitude Test of Heuristics, Measuring Mathematical Problem Solving With the MATH Dataset, Nov 2021 arxiv
Math
- How well do large language models perform in arithmetic tasks?, Mar 2023, arxiv
- FrontierMath at EpochAI, FrontierAI page, FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, Nov 2024, arxiv
- CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?, Jun 2023, arxiv
- GSM8K, papers with code, github repository
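GSM8K and similar math word-problem sets are usually scored by exact match on the final numeric answer; the GSM8K reference solutions end with a line of the form "#### <number>". Below is a minimal sketch of that extraction and comparison; the regex and normalization are assumptions rather than the official evaluation script.

```python
# Illustrative GSM8K-style scoring: compare the last number in the model output
# with the gold answer that follows the "####" marker in the reference solution.
import re

def gold_answer(reference: str) -> str:
    return reference.split("####")[-1].strip().replace(",", "")

def predicted_answer(output: str) -> str | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(output: str, reference: str) -> bool:
    pred = predicted_answer(output)
    return pred is not None and float(pred) == float(gold_answer(reference))

reference = "She sells 16 - 3 - 4 = 9 eggs, earning 9 * 2 = 18 dollars.\n#### 18"
output = "Janet makes 9 * $2 = $18 every day at the market. The answer is 18."
print(is_correct(output, reference))  # True
```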
Financial
- Evaluating LLMs’ Mathematical Reasoning in Financial Document Question Answering, Feb 24, arxiv
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance, Jun 2023, arxiv
- BloombergGPT: A Large Language Model for Finance (see Chapter 5 Evaluation), Mar 2023, arxiv
- FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Oct 2023, arxiv
Other
- Understanding the Capabilities of Large Language Models for Automated Planning, May 2023, arxiv
Other Collections
- LLM/VLM Benchmarks by Aman Chadha
- Awesome LLMs Evaluation Papers, a list of papers mentioned in the Evaluating Large Language Models: A Comprehensive Survey, Nov 2023
Citation
@article{Lopatenko2024CompendiumLLMEvaluation,
title = {Compendium of LLM Evaluation methods},
author = {Lopatenko, Andrei},
year = {2024},
note = {\url{https://github.com/alopatenko/LLMEvaluation}}
}