CongDongVibeCode - AI Coder Vietnam

Evaluation Framework

Automatic Metrics

1. Perplexity: Measures model confidence. Lower = better.

import math
loss = model.evaluate(test_dataset)["eval_loss"]
perplexity = math.exp(loss)

2. BLEU/ROUGE: Text similarity scores. Good for summarization.

3. Task-Specific Accuracy: Custom metrics for your use case.

LLM-as-Judge

Use stronger model to evaluate outputs:

judge_prompt = """
Rate the following response from 1-5:
Query: {query}
Response: {response}
Criteria: Accuracy, Helpfulness, Harmlessness
Score and reasoning:
"""

score = gpt4.invoke(judge_prompt)

A/B Testing Framework

class ABTest:
    def __init__(self, model_a, model_b):
        self.models = {"A": model_a, "B": model_b}
        self.results = {"A": [], "B": []}
    
    def evaluate(self, test_cases: list[dict]):
        for case in test_cases:
            for name, model in self.models.items():
                output = model.generate(case["input"])
                score = self.judge(case["expected"], output)
                self.results[name].append(score)
        
        return {
            "A": np.mean(self.results["A"]),
            "B": np.mean(self.results["B"]),
            "p_value": ttest_ind(self.results["A"], self.results["B"]).pvalue
        }

Benchmark Suites

Benchmark	Measures	Use For
MMLU	Knowledge	General capability
HumanEval	Code	Coding models
MT-Bench	Chat	Conversational
Custom	Domain	Your specific use case

Building Custom Eval Set

Collect 100-500 examples từ real usage
Label ground truth hoặc expected behavior
Define scoring rubric (1-5 scale)
Automate với LLM judge cho scalability
Spot-check manual review 10%

Pro Tips

Regression testing: Compare với base model trên general tasks
Adversarial examples: Test edge cases, jailbreaks
Human eval: Final validation trước production

🔥 No single metric tells whole story. Use multiple: auto metrics + LLM judge + human eval.

AI Agent 101

AI Agent 101

Advanced RAG Architecture & LLM Fine-tuning

Tiến độ khoá học

Advanced RAG Architecture & LLM Fine-tuning

Tiến độ khoá học

Evaluation Metrics và Benchmarking

Evaluation Framework

Automatic Metrics

LLM-as-Judge

A/B Testing Framework

Benchmark Suites

Building Custom Eval Set

Pro Tips

AI Agent 101

AI Agent 101

Advanced RAG Architecture & LLM Fine-tuning

Tiến độ khoá học

CHƯƠNG 1Module 1: Nền Tảng RAG Nâng Cao

CHƯƠNG 2Module 2: Kỹ Thuật Retrieval Nâng Cao

CHƯƠNG 3Module 3: Fine-tuning LLM Căn Bản

CHƯƠNG 4Module 4: Production Fine-tuning & Evaluation

Advanced RAG Architecture & LLM Fine-tuning

Tiến độ khoá học

CHƯƠNG 1Module 1: Nền Tảng RAG Nâng Cao

CHƯƠNG 2Module 2: Kỹ Thuật Retrieval Nâng Cao

CHƯƠNG 3Module 3: Fine-tuning LLM Căn Bản

CHƯƠNG 4Module 4: Production Fine-tuning & Evaluation

Evaluation Metrics và Benchmarking

Evaluation Framework

Automatic Metrics

LLM-as-Judge

A/B Testing Framework

Benchmark Suites

Building Custom Eval Set

Pro Tips