LLM — Evaluation Metrics

Manish Poddar
4 min read · Jan 25, 2024


Large language models are difficult to evaluate because their outputs are non-deterministic and expressed in natural language. Metrics like ROUGE and BLEU provide a structured way to assess LLMs by comparing model outputs to human-written references. ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, is used to evaluate summarization tasks; it measures the overlap between model and reference summaries at the unigram (ROUGE-1) or bigram (ROUGE-2) level. BLEU, or Bilingual Evaluation Understudy, evaluates machine translation quality by calculating precision over multiple n-gram sizes between the model output and reference translations. While useful for diagnostics, ROUGE and BLEU alone are not enough for final LLM evaluation; that requires more robust benchmarks built specifically to test large language models along multiple axes such as correctness, fluency, and factual consistency. Still, ROUGE and BLEU provide a simple way to iterate on and compare models during development.

Unigram: A unigram refers to a single word. For example, in the sentence “It is cold outside”, the unigrams would be:
It
is
cold
outside

Bigram: A bigram refers to a group of two consecutive words. For example, in the sentence “It is cold outside”, the bigrams would be:
It is
is cold
cold outside

N-gram: An n-gram refers to a group of n consecutive words.
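To make these definitions concrete, here is a minimal Python sketch that extracts the unigrams and bigrams from the example sentence; the `ngrams` helper is illustrative, not from any particular library:

```python
# Minimal sketch of n-gram extraction; the `ngrams` helper name is illustrative.
def ngrams(text, n):
    tokens = text.lower().split()  # naive whitespace tokenization
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "It is cold outside"
print(ngrams(sentence, 1))  # unigrams: [('it',), ('is',), ('cold',), ('outside',)]
print(ngrams(sentence, 2))  # bigrams:  [('it', 'is'), ('is', 'cold'), ('cold', 'outside')]
```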

ROUGE: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is an evaluation metric used to measure the performance of automatic text summarization systems by comparing machine-generated summaries to ideal human-written summaries. It focuses on recall and measures things like unigram (ROUGE-1) and bigram (ROUGE-2) overlap between the generated and reference summaries.

Fig 1: ROUGE Example (Image Source: https://arxiv.org/pdf/1803.01937.pdf)
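As a rough illustration of the idea, the sketch below computes ROUGE-1 and ROUGE-2 recall by hand under a naive whitespace tokenization. Dedicated packages (for example Google's rouge-score) also report precision and F1, so treat this only as a simplified approximation:

```python
from collections import Counter

def ngram_counts(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())         # n-gram matches, clipped per n-gram
    return overlap / max(sum(ref.values()), 1)   # recall: matches / n-grams in the reference

reference = "It is cold outside"
candidate = "It is very cold outside"
print(rouge_n_recall(candidate, reference, 1))   # ROUGE-1 recall = 4/4 = 1.0
print(rouge_n_recall(candidate, reference, 2))   # ROUGE-2 recall = 2/3 ≈ 0.67
```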

BLEU: BLEU (Bilingual Evaluation Understudy) is an algorithm developed to assess the quality of machine-translated text by comparing it to human-generated reference translations. It works by counting n-gram matches between the machine translation output and the reference translations, and calculating a precision score that is averaged across different n-gram sizes. Higher BLEU scores indicate better translation quality, with a score of 1 meaning a perfect match to the reference. Because BLEU focuses on translation quality rather than summarization or other language generation tasks, it serves as a specialized automated metric for evaluating machine translation systems.
Example:
Candidate sentence: The the cat.
Reference sentence: The cat is on the mat.

Fig 2: BLEU Example (Image Source: https://en.wikipedia.org/wiki/BLEU)

The unigram score judges the candidate translation “the the cat” to be an excellent match with the reference, assigning it a perfect score. However, when we calculate the bigram score, we find that the translation does not match the reference nearly as well as the unigram score suggests. To compute an overall BLEU score for a corpus of candidate and reference translations, a simple approach is to calculate a BLEU score for each candidate sentence paired with its corresponding reference sentence(s) and average the results; the standard corpus-level BLEU instead pools the clipped n-gram counts across all sentences before computing the precisions.
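The sketch below reproduces that calculation as modified (clipped) n-gram precision, the building block of BLEU; full BLEU additionally combines these precisions with a geometric mean and a brevity penalty, which libraries such as NLTK's sentence_bleu handle for you. The helper names here are illustrative:

```python
from collections import Counter

def ngram_counts(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    clipped = sum((cand & ref).values())          # candidate counts clipped by the reference
    return clipped / max(sum(cand.values()), 1)   # precision: matches / n-grams in the candidate

candidate = "the the cat"
reference = "the cat is on the mat"
print(modified_precision(candidate, reference, 1))  # 3/3 = 1.0 -> looks like a perfect match
print(modified_precision(candidate, reference, 2))  # 1/2 = 0.5 -> bigrams reveal the mismatch
```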

How benchmark datasets help in LLM evaluation: Benchmarks are standardized datasets and tasks that allow for the objective evaluation and comparison of large language models (LLMs). Since metrics like ROUGE and BLEU are relatively simple and only suited to specific tasks like summarization and translation, benchmarks provide a more comprehensive way to assess LLMs.

Researchers have developed benchmarks covering a diverse range of tasks and data to test different aspects of language understanding and generation. Using these benchmarks provides a standardized process for evaluating models on common ground, making it easier to determine whether improvements from model tweaks like fine-tuning actually lead to genuine advances.

Additionally, benchmark evaluations better indicate real-world performance versus just optimization on a narrow metric. Relying solely on simple metrics could result in models that score highly on those metrics but fail to generalize well. Comparative benchmarking guards against this by testing models more rigorously.

Examples of benchmark datasets:
1. GLUE (General Language Understanding Evaluation): A collection of natural language tasks like sentiment analysis and question answering to measure models’ ability to generalize across different tasks.
2. SuperGLUE: A more challenging benchmark than GLUE that tests abilities like multi-sentence reasoning and reading comprehension.
3. MMLU (Massive Multitask Language Understanding): Tests world knowledge and problem solving on tasks like elementary math, history, law, etc. beyond just language.
4. BIG-bench: Consists of over 200 diverse tasks testing capabilities in linguistics, common sense reasoning, science domains, social biases, etc.
5. HELM (Holistic Evaluation of Language Models): Uses multiple metrics to measure accuracy as well as problematic behaviors across scenarios, focusing on transparency and guidance on model selection.
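As a sketch of how such benchmarks are consumed in practice, one GLUE task, SST-2 sentiment analysis, can be loaded and inspected before any model is scored on it. This assumes the Hugging Face datasets package is installed:

```python
# Illustrative sketch, assuming the Hugging Face `datasets` package is available.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")           # download the SST-2 task from the GLUE benchmark
example = sst2["validation"][0]               # look at one held-out example
print(example["sentence"], example["label"])  # input text and gold label (0 = negative, 1 = positive)
```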

Overall, while metrics like ROUGE and BLEU have uses for development, benchmarks are necessary for robustly assessing and comparing LLMs to determine state-of-the-art performance. They lead to more meaningful evaluation than can be achieved by metrics alone.
