Two-Stage Retrieval Pipeline
Stage 1: Bi-Encoder (Fast) - Retrieve top-100 candidates
Stage 2: Cross-Encoder (Accurate) - Re-rank to get top-10
Why Do We Need Re-ranking?
Bi-Encoder limitations:
- Pre-computed embeddings = no interaction between query and document
- Limited by embedding quality
- Fast, but accuracy has a ceiling
Cross-Encoder advantages:
- Query and document are encoded together
- Full attention between all tokens
- Much higher accuracy
How Cross-Encoders Work
Input: [CLS] query [SEP] document [SEP]
Output: a single relevance score (higher = more relevant)
Model sees both texts together → understands relationships better.
Popular Re-rankers
| Model | Speed | Accuracy | Notes |
|---|---|---|---|
| Cohere Rerank | Fast | Very High | API-based |
| BGE-reranker-v2 | Medium | High | Open-source |
| cross-encoder/ms-marco | Slow | High | Classic choice |
| Jina Reranker | Fast | High | Multilingual |
Implementation
```python
from sentence_transformers import CrossEncoder

# Load the re-ranking model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Get initial candidates (vector_search is your stage-1 retriever)
candidates = vector_search(query, top_k=100)

# Score each (query, document) pair jointly
pairs = [(query, doc.content) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by score, keep the best 10
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_10 = [doc for doc, score in ranked[:10]]
```
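Score ranges vary by model and library version: some re-rankers emit 0-1 probabilities, others raw logits. If yours returns logits and you want scores in (0, 1), a sigmoid is the usual fix (sketch; `to_probability` is our helper name):

```python
import math

def to_probability(logit):
    # Squash an unbounded relevance logit into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-logit))
```

This only rescales scores; it never changes the ranking order.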
With the Cohere API
```python
import cohere

co = cohere.Client(api_key)

# The API returns results sorted by relevance; each result carries
# the index of the original document.
results = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[doc.content for doc in candidates],
    top_n=10,
)
top_10 = [candidates[r.index] for r in results.results]
```
Pro Tips
- Limit candidates: Re-rank the top 50-100 results, not the entire corpus
- Batching: Cross-encoders are slow, so batch your predictions
- Caching: Cache re-rank results for frequent queries
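The batching and caching tips can be sketched together. Here `score_fn` stands in for whatever scoring call your re-ranker exposes (e.g. `CrossEncoder.predict`), and the module-level dict cache is illustrative only:

```python
# Sketch of the batching + caching tips; a production system would use
# an LRU cache or Redis instead of a plain module-level dict.
_cache = {}

def rerank_cached(score_fn, query, docs, batch_size=32):
    key = (query, tuple(docs))
    if key in _cache:  # serve repeated queries from the cache
        return _cache[key]
    pairs = [(query, d) for d in docs]
    scores = []
    # Cross-encoders are slow: score in fixed-size batches
    for i in range(0, len(pairs), batch_size):
        scores.extend(score_fn(pairs[i:i + batch_size]))
    ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
    _cache[key] = ranked
    return ranked
```

The cache key includes the candidate set, so a query is only re-scored when its candidates change.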
💡 Production: use Cohere Rerank for simplicity, or self-host BGE-reranker if you need privacy.
