How do you measure the performance of an LLM?

Question

How do you measure the performance of LLM models?

Answer

To measure the performance of a Large Language Model (LLM), the following metrics are commonly used:

Perplexity: Measures how well the model predicts a sample, commonly used in language modeling tasks.

Accuracy: Used for tasks like text classification to measure the proportion of correct predictions.

F1 Score: A harmonic mean of precision and recall, used for tasks like named entity recognition.

BLEU (Bilingual Evaluation Understudy) score: Measures the quality of machine-generated text against reference translations, commonly used in machine translation.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics that evaluate the overlap between generated text and reference text, often used in summarization tasks.

Together, these metrics quantify the model's effectiveness and guide further improvements.

Explanation

Evaluating the performance of Large Language Models (LLMs) is crucial to ensure they deliver accurate and valuable outputs.

Each metric is defined below with its formula and a short Python example.

  1. Perplexity: Perplexity is defined as the inverse probability of the test set, normalized by the number of words:

$$\text{Perplexity} = 2^{H(p)} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i)}$$

import numpy as np

def perplexity(probabilities):
    # Assuming probabilities is a list of predicted probabilities for each word
    N = len(probabilities)
    log_prob = np.sum(np.log2(probabilities))
    return 2 ** (-log_prob / N)

probabilities = [0.1, 0.3, 0.4, 0.2]  # Example predicted probabilities
print("Perplexity:", perplexity(probabilities))

  2. Accuracy: Accuracy is the proportion of correct predictions to total predictions:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$

where:

  • $TP$ = True Positives
  • $TN$ = True Negatives
  • $FP$ = False Positives
  • $FN$ = False Negatives

def accuracy(true_labels, predicted_labels):
    correct_predictions = sum([1 if true == pred else 0 for true, pred in zip(true_labels, predicted_labels)])
    return correct_predictions / len(true_labels)

true_labels = [1, 0, 1, 1, 0]  # Example true labels
predicted_labels = [1, 0, 0, 1, 1]  # Example predicted labels
print("Accuracy:", accuracy(true_labels, predicted_labels))

  3. F1 Score: The F1 score is the harmonic mean of precision and recall:

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where Precision is $\frac{TP}{TP + FP}$ and Recall is $\frac{TP}{TP + FN}$.

from sklearn.metrics import f1_score

# Example usage
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 1]
print("F1 Score:", f1_score(true_labels, predicted_labels))

  4. BLEU Score: The BLEU score is a metric for evaluating the quality of machine-generated text by comparing it with reference translations:

$$\text{BLEU} = BP \times \exp \left( \sum_{n=1}^{N} w_n \log p_n \right)$$

where:

  • $BP$ is the brevity penalty, equal to 1 when the candidate is at least as long as the reference and $e^{1 - r/c}$ otherwise (with $c$ the candidate length and $r$ the reference length),
  • $p_n$ is the precision of $n$-grams in the generated text,
  • $w_n$ is the weight for each $n$-gram order (typically $1/N$ with $N = 4$).

from nltk.translate.bleu_score import sentence_bleu

# Example usage
reference = [['this', 'is', 'a', 'test']]  # List of reference translations, each a list of tokens
candidate = ['this', 'is', 'test']  # Machine-generated translation (list of tokens)
print("BLEU Score:", sentence_bleu(reference, candidate))

  5. ROUGE Score: The ROUGE score is used for evaluating the recall of n-grams, word sequences, and word pairs in automatic summarization:

$$\text{ROUGE-N} = \frac{\sum_{\text{gram}_n \in \text{Reference}} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{\text{gram}_n \in \text{Reference}} \text{Count}(\text{gram}_n)}$$

where:

  • $\text{Count}_{\text{match}}(\text{gram}_n)$ is the number of $n$-grams in the candidate summary that also appear in the reference summary,
  • $\text{Count}(\text{gram}_n)$ is the total number of $n$-grams in the reference summary.

from rouge_score import rouge_scorer

def rouge_score_summary(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return scores

# Example usage
reference = "The quick brown fox jumps over the lazy dog."
candidate = "A fast brown fox jumps over the sleepy dog."
print("ROUGE Score:", rouge_score_summary(reference, candidate))
