Data Quality > Quantity
Fine-tuning with 1,000 high-quality examples often beats 10,000 noisy ones.
Data Formats
1. Instruction Format
```json
{
  "instruction": "Summarize this article",
  "input": "Article text here...",
  "output": "Summary here..."
}
```
2. Chat Format (Preferred)
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "User message"},
    {"role": "assistant", "content": "Assistant response"}
  ]
}
```
3. Completion Format (Legacy)
```json
{
  "prompt": "Human: Question\n\nAssistant:",
  "completion": " Response here"
}
```
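Since most fine-tuning stacks now expect the chat format, older instruction-format data usually gets converted up front. A minimal sketch of that conversion (the `instruction_to_chat` helper and its default system prompt are illustrative, not a standard API):

```python
def instruction_to_chat(example, system_prompt="You are a helpful assistant."):
    """Convert an instruction-format record into the chat format.

    The instruction and optional input are merged into a single user turn;
    the output becomes the assistant turn.
    """
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }
```

Applied to the instruction-format record above, this yields a three-message chat record ready for a chat template.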
Data Collection Strategies
1. Human Annotation
- Gold standard quality
- Expensive, slow
- Use for evaluation set
2. Synthetic Data
- Generate with a stronger model (e.g., GPT-4)
- Fast, cheap
- Quality depends on generation prompt
3. Distillation
- Student learns from teacher model outputs
- Good balance of quality/cost
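The distillation loop is simple in outline: send prompts to the teacher, record its responses as training targets for the student. A sketch, where `teacher_generate` is a placeholder for whatever call actually produces the teacher's response (a hosted-model API, a local model, etc.):

```python
def distill(prompts, teacher_generate):
    """Build a chat-format training set by labeling prompts with a teacher.

    `teacher_generate` is any callable mapping a prompt string to the
    teacher model's response string -- it stands in for a real API call.
    """
    records = []
    for prompt in prompts:
        records.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": teacher_generate(prompt)},
            ]
        })
    return records
```

The quality of the resulting set is bounded by the teacher and by the prompt distribution you feed it, which is why the generation prompt matters so much for synthetic data.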
Data Cleaning Checklist
- Remove duplicates
- Filter short/empty examples
- Check encoding issues
- Validate format consistency
- Balance class distribution
- Split train/val/test (80/10/10)
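Several of the checklist items (dedup, short-example filtering, the 80/10/10 split) can be wired into one small pass. A minimal sketch, assuming records carry a `"text"` field; thresholds and field names are illustrative:

```python
import hashlib
import random

def clean_and_split(records, min_len=10, seed=42):
    """Dedupe by content hash, drop short/empty examples, split 80/10/10."""
    seen, cleaned = set(), []
    for r in records:
        text = r.get("text", "").strip()
        if len(text) < min_len:          # filter short/empty examples
            continue
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if h in seen:                    # remove exact duplicates
            continue
        seen.add(h)
        cleaned.append({"text": text})
    rng = random.Random(seed)            # fixed seed => reproducible split
    rng.shuffle(cleaned)
    n = len(cleaned)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (cleaned[:n_train],
            cleaned[n_train:n_train + n_val],
            cleaned[n_train + n_val:])
```

Note this only catches exact duplicates; near-duplicate detection (e.g., MinHash) needs a separate pass.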
Implementation
```python
import json

from datasets import Dataset
from transformers import AutoTokenizer

# Any tokenizer that ships with a chat template works; this checkpoint is
# just an example.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load data
with open("data.jsonl") as f:
    data = [json.loads(line) for line in f]

# Convert to HF dataset
dataset = Dataset.from_list(data)

# Apply chat template, storing the rendered string in a "text" column
def format_chat(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tokenize=False
    )}

dataset = dataset.map(format_chat)
```
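Before rendering templates, it pays to validate format consistency (the checklist item above), so one malformed record doesn't crash the whole `map`. A sketch of such a check for chat-format records (the helper name and exact rules are this document's convention, not a library API):

```python
def validate_chat_record(record):
    """Return True if a record is a well-formed chat example:
    a non-empty list of messages with known roles, non-empty string
    content, and a final assistant turn to learn from."""
    msgs = record.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    valid_roles = {"system", "user", "assistant"}
    for m in msgs:
        if m.get("role") not in valid_roles:
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    return msgs[-1]["role"] == "assistant"
```

Run it as a filter (`dataset.filter(validate_chat_record)`) and log what gets dropped; silent drops hide data bugs.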
Pro Tips
- Diversify examples: Cover edge cases, different phrasings
- Include negative examples: What NOT to do
- Version your data: Track changes, enable rollback
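For lightweight data versioning, a content hash of the dataset is often enough to detect drift and enable rollback to a known-good snapshot. A minimal sketch (the `dataset_version` helper is illustrative; tools like DVC do this more thoroughly):

```python
import hashlib
import json

def dataset_version(records):
    """Deterministic short content hash of a list of JSON-able records.

    Same records in the same order => same version string, so any edit,
    addition, or reorder produces a new version.
    """
    h = hashlib.sha256()
    for r in records:
        h.update(json.dumps(r, sort_keys=True, ensure_ascii=False)
                 .encode("utf-8"))
    return h.hexdigest()[:12]
```

Store the version string alongside each training run's config so every checkpoint can be traced back to the exact data that produced it.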
🔥 Rule: Spend 70% of your time on data, 30% on training. Data quality is everything.
