CongDongVibeCode - AI Coder Vietnam

The Problem: How Do You Evaluate an Agent?

Traditional metrics do not transfer cleanly to agents:

Accuracy: Agent output is free-form text, not a number
F1 Score: There are no ground-truth labels
BLEU/ROUGE: Compare text overlap but do not evaluate reasoning
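To make the BLEU/ROUGE point concrete, here is a minimal sketch that uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1 (it is not the official metric, and the function name `unigram_f1` and the example sentences are made up for illustration). A factually wrong answer that copies the reference wording outscores a correct but terse one.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1: a simplified stand-in for ROUGE-1, not the official metric."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # tokens shared by both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The capital of Australia is Canberra"
wrong = "The capital of Australia is Sydney"  # fluent, high overlap, factually wrong
right = "Canberra"                             # correct, but low overlap

wrong_score = unigram_f1(reference, wrong)
right_score = unigram_f1(reference, right)

print(f"wrong answer: {wrong_score:.2f}")  # prints "wrong answer: 0.83"
print(f"right answer: {right_score:.2f}")  # prints "right answer: 0.29"
```

The overlap metric rewards surface similarity, which is exactly the gap a judge model is meant to close.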

Solution: Use a different LLM as the judge.

LLM-as-a-Judge Architecture

Agent (GPT-4) → Output → Judge (Claude) → Score + Reasoning

Implementation

The LLMJudge class takes a task, an agent_output, and a set of criteria. It returns a score per criterion, the judge's reasoning, an overall score, and a pass/fail verdict.
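The lesson does not show the class itself, so the following is only a rough sketch of what an LLMJudge could look like. Everything beyond the inputs and outputs named above is an assumption: the `call_llm` callable (any function that sends a prompt to the judge model and returns its text), the JSON reply format, the 0.0 to 1.0 scale, the unweighted mean, and the 0.7 pass threshold. A fake judge replaces the real model call so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class JudgeResult:
    scores: Dict[str, float]  # one score per criterion
    reasoning: str
    overall: float
    passed: bool


class LLMJudge:
    def __init__(self, call_llm: Callable[[str], str], pass_threshold: float = 0.7):
        # call_llm is a hypothetical hook: prompt in, raw model text out.
        self.call_llm = call_llm
        self.pass_threshold = pass_threshold  # assumed cut-off, tune per project

    def build_prompt(self, task: str, agent_output: str, criteria: Dict[str, str]) -> str:
        rubric = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
        return (
            "You are grading an AI agent.\n"
            f"Task:\n{task}\n\nAgent output:\n{agent_output}\n\n"
            f"Score each criterion from 0.0 to 1.0:\n{rubric}\n\n"
            'Reply with JSON only: {"scores": {"<criterion>": <float>}, "reasoning": "<text>"}'
        )

    def evaluate(self, task: str, agent_output: str, criteria: Dict[str, str]) -> JudgeResult:
        raw = self.call_llm(self.build_prompt(task, agent_output, criteria))
        data = json.loads(raw)
        missing = set(criteria) - set(data["scores"])
        if missing:
            raise ValueError(f"judge skipped criteria: {sorted(missing)}")
        scores = {name: float(data["scores"][name]) for name in criteria}
        overall = sum(scores.values()) / len(scores)  # unweighted mean (assumption)
        return JudgeResult(scores, data.get("reasoning", ""), overall,
                           overall >= self.pass_threshold)


def fake_judge(prompt: str) -> str:
    # Stand-in for a real judge model so the sketch runs without network access.
    return json.dumps({
        "scores": {"Correctness": 1.0, "Grounding": 0.5},
        "reasoning": "The answer is right; one claim lacks a source.",
    })


judge = LLMJudge(fake_judge)
result = judge.evaluate(
    task="What is 2 + 2?",
    agent_output="4",
    criteria={"Correctness": "Final answer is correct", "Grounding": "No hallucination"},
)
print(result.overall, result.passed)  # prints "0.75 True"
```

Injecting the model call as a plain function keeps the judge testable and makes it easy to swap the judge model without touching the scoring logic.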

Evaluation Criteria Examples

For ReAct Agents

Tool Selection: Chooses the right tool
Reasoning Quality: Logical and coherent
Efficiency: Minimal number of steps
Correctness: Final answer is correct
Grounding: No hallucination

For Multi-Agent Systems

Coordination: Agents cooperate effectively
Role Adherence: Each agent stays in its role
Information Flow: Information is passed on correctly
Conflict Resolution: Disagreements are handled
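One way to keep the two criteria sets above manageable is to store them as plain data and render them into the judge prompt. This is a sketch only: the dictionary names, the `RUBRICS` keys, and the `render_rubric` helper are hypothetical, and the descriptions are paraphrased from the lists above.

```python
from typing import Dict

# Criteria for single-agent ReAct loops.
REACT_CRITERIA: Dict[str, str] = {
    "Tool Selection": "Chooses the right tool for each step",
    "Reasoning Quality": "Reasoning is logical and coherent",
    "Efficiency": "Uses the minimal number of steps",
    "Correctness": "Final answer is correct",
    "Grounding": "Makes no hallucinated claims",
}

# Criteria for systems where several agents collaborate.
MULTI_AGENT_CRITERIA: Dict[str, str] = {
    "Coordination": "Agents cooperate effectively",
    "Role Adherence": "Each agent stays in its assigned role",
    "Information Flow": "Information is passed on correctly",
    "Conflict Resolution": "Disagreements are handled",
}

# Hypothetical lookup keyed by agent type.
RUBRICS = {"react": REACT_CRITERIA, "multi_agent": MULTI_AGENT_CRITERIA}


def render_rubric(criteria: Dict[str, str]) -> str:
    """Turn a criteria dict into a numbered block for the judge prompt."""
    lines = [
        f"{i}. {name}: {desc}"
        for i, (name, desc) in enumerate(criteria.items(), start=1)
    ]
    return "\n".join(lines)


rubric_text = render_rubric(RUBRICS["react"])
print(rubric_text)
```

Keeping the rubric in data rather than in the prompt string means a criterion can be added or reworded in one place, and the same names can be reused as keys when aggregating scores.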

Analyzing Results

Overall pass rate
Per-criterion analysis
Find failure patterns
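The three analysis steps above can be sketched with the standard library alone. The result records below are invented sample data, and the record shape (`id`, `passed`, `scores`) and the 0.5 "low score" cut-off are assumptions rather than anything the lesson prescribes.

```python
from collections import Counter, defaultdict
from statistics import mean

# Invented sample results; a real run would collect these from the judge.
results = [
    {"id": "case-1", "passed": True,  "scores": {"Tool Selection": 0.9, "Correctness": 1.0}},
    {"id": "case-2", "passed": False, "scores": {"Tool Selection": 0.3, "Correctness": 0.8}},
    {"id": "case-3", "passed": False, "scores": {"Tool Selection": 0.4, "Correctness": 0.2}},
    {"id": "case-4", "passed": True,  "scores": {"Tool Selection": 0.8, "Correctness": 0.9}},
]


def pass_rate(results):
    """Fraction of cases the judge marked as passed."""
    return sum(r["passed"] for r in results) / len(results)


def per_criterion_mean(results):
    """Average score for each criterion across all cases."""
    buckets = defaultdict(list)
    for r in results:
        for name, score in r["scores"].items():
            buckets[name].append(score)
    return {name: mean(values) for name, values in buckets.items()}


def failure_patterns(results, low=0.5):
    """Among failed cases, count how often each criterion scored below `low`."""
    counts = Counter()
    for r in results:
        if not r["passed"]:
            counts.update(name for name, score in r["scores"].items() if score < low)
    return counts.most_common()


print(pass_rate(results))           # prints 0.5
print(per_criterion_mean(results))
print(failure_patterns(results))    # prints [('Tool Selection', 2), ('Correctness', 1)]
```

In this sample, Tool Selection is the weak criterion in both failed cases, which points at the tool-choice step rather than the final answer as the first thing to investigate.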

🔥 Run the evaluation on every PR/commit to catch regressions early. Treat agent quality like test coverage.
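A regression gate for CI can be as small as a comparison against a stored baseline. The sketch below is an assumption about how such a gate might look: the `gate` function, the baseline value, and the 0.02 tolerance are all hypothetical, and how the pass rates are produced and stored is left to the pipeline.

```python
def gate(current_pass_rate: float, baseline_pass_rate: float, tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if quality held, 1 if it regressed.

    `tolerance` absorbs run-to-run noise from the judge model (assumed value).
    """
    if current_pass_rate >= baseline_pass_rate - tolerance:
        return 0
    print(
        f"Agent quality regressed: pass rate {current_pass_rate:.2f} "
        f"is below baseline {baseline_pass_rate:.2f} (tolerance {tolerance:.2f})"
    )
    return 1


# Example values only; in CI these would come from the evaluation run.
ok_code = gate(current_pass_rate=0.84, baseline_pass_rate=0.85)
fail_code = gate(current_pass_rate=0.81, baseline_pass_rate=0.85)
```

In a CI job, the script would end with `sys.exit(gate(current, baseline))` so that a nonzero code fails the build, the same way a drop in test coverage would.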

AI Agent 101

Production-Grade Agentic GenAI Systems

Evaluation: LLM-as-a-Judge Framework

Chapter 1. Module 1: The Agentic Shift – A Change in Mindset

Chapter 2. Module 2: State & Memory with LangGraph

Chapter 3. Module 3: Multi-Agent Orchestration with AutoGen

Chapter 4. Module 4: Production Challenges – Challenges in Practice
