What Is a Vector Database?
A vector database is a system that stores vectors and searches them by similarity (cosine, dot product, Euclidean). Unlike traditional databases, which look for exact matches, vector databases look for approximate nearest neighbors (ANN).
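To make the three similarity measures concrete, here is a minimal pure-Python sketch. The function names and the sample vectors are illustrative assumptions, not part of any particular database's API.

```python
import math

def dot(a, b):
    """Dot product: larger means more similar (sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-dimensional "embeddings" for illustration only.
query = [1.0, 0.0]
doc_same_direction = [2.0, 0.0]
doc_orthogonal = [0.0, 3.0]

print(cosine_similarity(query, doc_same_direction))
print(cosine_similarity(query, doc_orthogonal))
print(euclidean_distance(query, doc_same_direction))
```

Note that cosine similarity ignores vector length while dot product and Euclidean distance do not, which is why the choice of metric should match how the embeddings were trained.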
Why Do You Need a Vector Database?
1. Scale: Brute-force search is O(n) per query, which is not feasible with millions of vectors. ANN indexes make search sublinear; graph-based indexes such as HNSW scale roughly as O(log n).
2. Hybrid Search: Combine vector similarity with metadata filtering.
3. Production Features: Replication, sharding, monitoring, backups.
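As a baseline for the points above, the following sketch shows an exact brute-force search with an optional metadata filter. It scores every candidate, which is the O(n) behavior that ANN indexes are designed to avoid. The record layout (`id`, `vector`, `metadata`) and the function names are assumptions made for illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, records, k=2, metadata_filter=None):
    """Exact top-k search: every candidate is scored, so cost grows linearly."""
    candidates = records
    if metadata_filter is not None:
        # Hybrid search: keep only records whose metadata matches every key.
        candidates = [
            r for r in records
            if all(r["metadata"].get(key) == value
                   for key, value in metadata_filter.items())
        ]
    scored = ((cosine_similarity(query, r["vector"]), r["id"]) for r in candidates)
    return heapq.nlargest(k, scored)

# Hypothetical records for illustration only.
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"lang": "en"}},
    {"id": "b", "vector": [0.0, 1.0], "metadata": {"lang": "vi"}},
    {"id": "c", "vector": [1.0, 1.0], "metadata": {"lang": "vi"}},
]

top = brute_force_search([1.0, 0.0], records, k=2)
top_vi = brute_force_search([1.0, 0.0], records, k=1, metadata_filter={"lang": "vi"})
print([record_id for _, record_id in top])
print([record_id for _, record_id in top_vi])
```

This is fine for a few thousand vectors. Beyond that, an index-backed search is usually the better choice.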
Comparing the Options
| Database | Type | Strengths | Best For |
|---|---|---|---|
| Pinecone | Managed | Easy setup, scale | Startups, fast MVP |
| Weaviate | Open-source | Hybrid search, GraphQL | Complex queries |
| Qdrant | Open-source | Performance, Rust | High throughput |
| Chroma | Embedded | Simple, local | Prototyping |
| pgvector | Extension | PostgreSQL integration | Existing Postgres |
| Supabase Vector | Managed | Full-stack integration | Supabase users |
Indexing Algorithms
HNSW (Hierarchical Navigable Small World)
- Pros: Great recall, fast search
- Cons: High memory, slow rebuild
- Use: Default choice for most use cases
IVF (Inverted File Index)
- Pros: Low memory, fast build
- Cons: Lower recall
- Use: When you need to trade accuracy for lower memory
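The IVF idea can be sketched in a few lines: partition the vectors into inverted lists by nearest centroid, then scan only the lists closest to the query. This is a simplified sketch that assumes the centroids are already given (real systems typically learn them with k-means), and `nprobe` is used here as the name for the number of lists scanned. All names and data are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroids(vector, centroids, n):
    """Return the indices of the n centroids closest to vector."""
    order = sorted(range(len(centroids)), key=lambda i: euclidean(vector, centroids[i]))
    return order[:n]

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        lists[nearest_centroids(vec, centroids, 1)[0]].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe closest inverted lists instead of every vector."""
    candidate_ids = [
        vec_id
        for c in nearest_centroids(query, centroids, nprobe)
        for vec_id in lists[c]
    ]
    ranked = sorted(candidate_ids, key=lambda vec_id: euclidean(query, vectors[vec_id]))
    return ranked[:k]

# Assumed, hand-picked centroids and toy data for illustration only.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {
    "a": [1.0, 1.0],
    "b": [3.0, 3.0],
    "c": [5.5, 5.5],
    "d": [9.0, 9.0],
}
lists = build_ivf(vectors, centroids)

query = [4.9, 4.9]
fast_result = ivf_search(query, vectors, centroids, lists, k=1, nprobe=1)
wide_result = ivf_search(query, vectors, centroids, lists, k=1, nprobe=2)
print(fast_result, wide_result)
```

With `nprobe=1` the search misses the true nearest neighbor `c`, because it sits in the other list. That is the lower-recall trade-off in miniature. Raising `nprobe` recovers it at the cost of scanning more vectors.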
Product Quantization (PQ)
- Pros: Massive compression (on the order of 32x, depending on configuration)
- Cons: Reduced accuracy
- Use: Billions of vectors
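A quick calculation shows where a figure like 32x can come from. The sketch below assumes 128-dimensional float32 vectors split into 16 sub-vectors with one 8-bit code each. These parameters are illustrative assumptions, and the small fixed cost of storing the codebooks is ignored.

```python
def pq_compression_ratio(dim, num_subvectors, bits_per_code=8, bytes_per_float=4):
    """Ratio of raw vector size to PQ code size (codebook storage ignored)."""
    raw_bytes = dim * bytes_per_float              # e.g. 128 * 4 = 512 bytes
    code_bytes = num_subvectors * bits_per_code / 8  # e.g. 16 * 8 / 8 = 16 bytes
    return raw_bytes / code_bytes

ratio = pq_compression_ratio(dim=128, num_subvectors=16)

# Rough memory estimate for one billion vectors under the same assumptions.
num_vectors = 1_000_000_000
raw_gb = num_vectors * 128 * 4 / 1e9
pq_gb = raw_gb / ratio
print(ratio, raw_gb, pq_gb)
```

Under these assumptions, a billion vectors shrink from about 512 GB to about 16 GB, which is why PQ is attractive at that scale despite the accuracy loss.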
Pro Tips
- Tune HNSW params: ef_construction and M strongly affect the recall vs. latency trade-off (and also build time and memory)
- Pre-filtering: Filter on metadata before the vector search to reduce the search space
- Batch upserts: Upsert in batches of 100-1000 vectors
🔥 Production: Use self-hosted Qdrant or Weaviate for control; use Pinecone for managed simplicity.
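The batching tip can be sketched with a small helper. Here `upsert_fn` is a hypothetical stand-in for whatever upsert call your client library provides, and the point layout is illustrative.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def upsert_in_batches(points, upsert_fn, batch_size=500):
    """Call upsert_fn once per batch and return the number of calls made."""
    calls = 0
    for batch in batched(points, batch_size):
        upsert_fn(batch)  # hypothetical: replace with your client's upsert call
        calls += 1
    return calls

# Demo: collect the batches in a list instead of sending them to a database.
received = []
points = [{"id": i, "vector": [float(i), 0.0]} for i in range(1200)]
calls = upsert_in_batches(points, received.append, batch_size=500)
print(calls, [len(batch) for batch in received])
```

Batching keeps each request reasonably small while avoiding the overhead of one round trip per vector.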
