Evaluation Framework
Automatic Metrics
1. Perplexity: Measures model confidence. Lower = better.
import math
loss = model.evaluate(test_dataset)["eval_loss"]
perplexity = math.exp(loss)
2. BLEU/ROUGE: Text similarity scores. Good for summarization.
3. Task-Specific Accuracy: Custom metrics for your use case.
LLM-as-Judge
Use stronger model to evaluate outputs:
judge_prompt = """
Rate the following response from 1-5:
Query: {query}
Response: {response}
Criteria: Accuracy, Helpfulness, Harmlessness
Score and reasoning:
"""
score = gpt4.invoke(judge_prompt)
A/B Testing Framework
class ABTest:
def __init__(self, model_a, model_b):
self.models = {"A": model_a, "B": model_b}
self.results = {"A": [], "B": []}
def evaluate(self, test_cases: list[dict]):
for case in test_cases:
for name, model in self.models.items():
output = model.generate(case["input"])
score = self.judge(case["expected"], output)
self.results[name].append(score)
return {
"A": np.mean(self.results["A"]),
"B": np.mean(self.results["B"]),
"p_value": ttest_ind(self.results["A"], self.results["B"]).pvalue
}
Benchmark Suites
| Benchmark | Measures | Use For |
|---|---|---|
| MMLU | Knowledge | General capability |
| HumanEval | Code | Coding models |
| MT-Bench | Chat | Conversational |
| Custom | Domain | Your specific use case |
Building Custom Eval Set
- Collect 100-500 examples từ real usage
- Label ground truth hoặc expected behavior
- Define scoring rubric (1-5 scale)
- Automate với LLM judge cho scalability
- Spot-check manual review 10%
Pro Tips
- Regression testing: Compare với base model trên general tasks
- Adversarial examples: Test edge cases, jailbreaks
- Human eval: Final validation trước production
🔥 No single metric tells whole story. Use multiple: auto metrics + LLM judge + human eval.
