LLMEvaluation

Evaluation of LLMs and LLM-based Systems

Compendium of LLM Evaluation methods


Introduction

The aim of this compendium is to help academics and industry professionals create effective evaluation suites tailored to their specific needs by reviewing the best industry practices for assessing large language models (LLMs) and LLM-based applications. This work goes beyond merely cataloging benchmarks and evaluation studies: it covers effective and practical evaluation techniques, including those embedded in papers whose primary focus is introducing new LLM methodologies and tasks. I plan to update this survey periodically with noteworthy and shareable evaluation methods as I encounter them.

My goal is a resource that lets anyone with an evaluation question, whether it is how to evaluate an LLM or an LLM application for a specific task, which methods best measure LLM effectiveness, or how well an LLM performs in a particular domain, quickly find the relevant information. I also want to highlight methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align with business or academic objectives.
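To make the idea of a task-specific evaluation suite concrete, here is a minimal sketch of an evaluation harness in Python. Everything in it is illustrative: the `call_model` stub, the toy QA examples, and the exact-match metric are assumptions made for the sketch, not methods prescribed by this compendium; in practice you would plug in your own model client, task data, and metrics drawn from the sections below.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalExample:
    """A single evaluation item: an input prompt and its reference answer."""
    prompt: str
    reference: str


def call_model(prompt: str) -> str:
    """Hypothetical model client; replace with a call to your LLM or LLM application."""
    return "Paris" if "France" in prompt else ""


def exact_match(prediction: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def run_eval(examples: List[EvalExample],
             model: Callable[[str], str],
             metric: Callable[[str, str], float]) -> float:
    """Run every example through the model and return the average metric score."""
    scores = [metric(model(ex.prompt), ex.reference) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Illustrative two-item QA task; a real suite would load a benchmark or domain dataset.
    suite = [
        EvalExample("What is the capital of France?", "Paris"),
        EvalExample("What is the capital of Japan?", "Tokyo"),
    ]
    print(f"accuracy = {run_eval(suite, call_model, exact_match):.2f}")
```

In a real suite the metric is usually the interesting part: exact match can be swapped for task-appropriate scores such as semantic similarity or an LLM-as-judge rubric, many of which are catalogued in the sections below.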

My view on LLM Evaluation: Deck (2024); SF Big Analytics and AICamp video (2024); Analytics Vidhya; Data Phoenix (Mar 5, 2024) (by Andrei Lopatenko)

Adjacent compendium on LLM, Search and Recommender engines

The GitHub repository

Evals are surprisingly often all you need

Table of contents


Leaderboards and Arenas


Evaluation Software


LLM Evaluation articles

in tech media, company blog posts, and podcasts


Large benchmarks


Evaluation of evaluation, evaluation theory, evaluation methods, analysis of evaluation


Long Comprehensive Studies


HITL (Human in the Loop)


LLM as Judge


LLM Evaluation

Embeddings


In Context Learning


Hallucinations


Question answering

QA is used in many vertical domains; see the Verticals section below


Multi-Turn


Reasoning


Multi-Lingual


Multi-Lingual Embedding tasks


Multi-Modal


Ethical AI


Biases


Safe AI


Cybersecurity


Code Generating LLMs

and other software copilot tasks


Summarization


LLM quality (generic methods: overfitting, redundant layers, etc.)


Inference Performance


Agent LLM Architectures


AGI Evaluation

AGI (Artificial General Intelligence) evaluation refers to the process of assessing whether an AI system possesses or approaches general intelligence—the ability to perform any intellectual task that a human can.


Long Text Generation


Graph understanding


Reward Models


Various unclassified tasks

(TODO: when a class contains more than three papers, make it a separate chapter of this compendium)


LLM Systems

RAG Evaluation

and knowledge-assistant and information-seeking LLM-based systems


Conversational systems

and dialog systems


Copilots


Search and Recommendation Engines


Verticals

Healthcare and medicine


Law


Science


Math


Financial


Other


Other Collections


Citation

@article{Lopatenko2024CompendiumLLMEvaluation,
  title   = {Compendium of LLM Evaluation methods},
  author  = {Lopatenko, Andrei},
  year    = {2024},
  note    = {\url{https://github.com/alopatenko/LLMEvaluation}}
}