LLMEvaluation

Evaluation of LLMs and LLM-based Systems

Compendium of LLM Evaluation methods


Introduction

The aim of this compendium is to help academics and industry professionals build effective evaluation suites tailored to their specific needs, by reviewing top industry practices for assessing large language models (LLMs) and their applications. It goes beyond cataloging benchmarks and evaluation studies: it covers effective and practical evaluation techniques broadly, including those embedded in papers whose primary focus is introducing new LLM methodologies and tasks. I plan to update this survey periodically as I come across noteworthy, shareable evaluation methods.

The goal is a resource where anyone with a question about LLM evaluation, whether it concerns evaluating an LLM or an LLM application on a specific task, choosing the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain, can easily find the relevant information. The compendium also highlights methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align with business or academic objectives.
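This survey contains no code of its own, but a minimal sketch may help make the target concrete. Most evaluation suites share the same shape: a set of task cases, a metric, and a harness that aggregates scores. The sketch below is purely illustrative; every name in it (EvalCase, exact_match, run_suite, toy_model) is hypothetical and not taken from any work referenced here. A real suite would replace the toy model with an actual LLM call and choose task-appropriate metrics.

import statistics
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """A single evaluation example: a prompt and its reference answer."""
    prompt: str
    reference: str

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_suite(model: Callable[[str], str],
              cases: list[EvalCase],
              metric: Callable[[str, str], float] = exact_match) -> float:
    """Run every case through the model and return the mean metric score."""
    return statistics.mean(metric(model(c.prompt), c.reference) for c in cases)

if __name__ == "__main__":
    # Toy stand-in for a real LLM call, used only to make the sketch runnable.
    def toy_model(prompt: str) -> str:
        return "Paris" if "France" in prompt else "unknown"

    suite = [
        EvalCase("What is the capital of France?", "Paris"),
        EvalCase("What is the capital of Japan?", "Tokyo"),
    ]
    print(f"accuracy = {run_suite(toy_model, suite):.2f}")  # prints: accuracy = 0.50

The same structure extends to the meta-evaluation mentioned above: swap exact_match for an LLM-as-judge metric, run the harness over cases with known human verdicts, and measure how often the judge agrees with them.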

My view on LLM Evaluation (by Andrei Lopatenko): Deck; SF Big Analytics and AICamp video; Analytics Vidhya; Data Phoenix (Mar 5)

Main repository

Evals are surprisingly often all you need

Table of contents


LLM Evaluation articles in tech media and blog posts from companies

Long Comprehensive Studies


HITL (Human in the Loop)

Multi-Turn


Instruction Following

Summarization


LLM quality (generic methods: overfitting, redundant layers, etc.)



LLM Systems

RAG Evaluation


Conversational and dialog systems

Law


Math

Citation

@misc{Lopatenko2024CompendiumLLMEvaluation,
  title        = {Compendium of LLM Evaluation methods},
  author       = {Lopatenko, Andrei},
  year         = {2024},
  howpublished = {\url{https://github.com/alopatenko/LLMEvaluation}}
}