Inference Optimization
Training is done. Now, how do you serve the model efficiently?
Serving Frameworks
1. vLLM
- PagedAttention: block-based KV-cache management (virtual-memory-style paging) that cuts memory fragmentation
- Continuous batching: requests join and leave the running batch at each decode step, so nothing waits on the slowest sequence
- Best for: High-throughput production
from vllm import LLM, SamplingParams

llm = LLM(model="./my-finetuned-model")  # loads the weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain KV caching in one paragraph."]  # any list of strings
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
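The snippet above is offline batch inference. For online serving, vLLM also ships an OpenAI-compatible HTTP server; a minimal sketch (the port is an example):

python -m vllm.entrypoints.openai.api_server \
  --model ./my-finetuned-model --port 8000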
2. Text Generation Inference (TGI)
- Hugging Face's official inference server
- Docker-based, easy deploy
- Good balance of features
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $(pwd)/model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
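Once the container is up, a quick smoke test against TGI's generate endpoint (port as mapped above):

curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is quantization?", "parameters": {"max_new_tokens": 64}}'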
3. Ollama
- Local-first
- Super easy setup
- Good for dev/testing
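Serving your own fine-tune locally means importing it via a Modelfile; a minimal sketch, assuming the weights have already been converted to GGUF (file and model names here are hypothetical):

# Modelfile: point Ollama at local GGUF weights
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.7

Then build and chat with it:

ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Summarize PagedAttention in one sentence."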
Quantization for Deployment
| Method | Size (vs FP32) | Quality Loss | Speed (vs FP16) |
|---|---|---|---|
| FP16 | 2x smaller | None | 1x (baseline) |
| INT8 | 4x smaller | Minimal | ~1.2x |
| INT4 (GPTQ) | 8x smaller | Small | ~1.5x |
| INT4 (AWQ) | 8x smaller | Minimal | ~1.8x |
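The size column is easy to sanity-check with weights-only arithmetic (the KV cache and activations add memory on top); a back-of-the-envelope script for an assumed 7B-parameter model:

params = 7e9  # assumed 7B-parameter model
for dtype, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.1f} GB")
# FP16 -> ~14.0 GB, INT4 -> ~3.5 GB (before runtime overhead)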
AWQ Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./my-finetuned-model"
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)  # needed for calibration
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./model-awq")
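The quantized folder can then be served directly; for example, vLLM accepts AWQ checkpoints (a sketch reusing the path saved above):

from vllm import LLM
llm = LLM(model="./model-awq", quantization="awq")  # load the 4-bit checkpoint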
Deployment Architecture
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│Instance │ │Instance │ │Instance │
└─────────┘ └─────────┘ └─────────┘
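How requests reach the instances depends on your balancer. As a minimal client-side stand-in, here is a round-robin sketch assuming each vLLM instance exposes its OpenAI-compatible /v1/completions endpoint (the addresses and model name are hypothetical):

import itertools
import requests

# Hypothetical instance addresses; in production the load balancer owns this list.
INSTANCES = itertools.cycle([
    "http://10.0.0.1:8000", "http://10.0.0.2:8000", "http://10.0.0.3:8000",
])

def generate(prompt: str) -> str:
    base = next(INSTANCES)  # naive round-robin across the three instances
    resp = requests.post(f"{base}/v1/completions", json={
        "model": "./my-finetuned-model",
        "prompt": prompt,
        "max_tokens": 256,
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]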
Scaling Considerations
- Horizontal: add more GPU instances behind the load balancer
- Batching: group concurrent requests to raise throughput
- Caching: cache frequent prompts/responses (and reuse shared prompt prefixes where the server supports it)
- Streaming: start returning tokens before generation finishes (see the sketch below)
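Streaming is cheap to adopt because the OpenAI-compatible servers above support it natively; a minimal sketch, assuming a vLLM instance on localhost:8000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
stream = client.completions.create(
    model="./my-finetuned-model",
    prompt="Explain continuous batching:",
    max_tokens=128,
    stream=True,  # chunks arrive as tokens are generated
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)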
Pro Tips
- AWQ > GPTQ for deployment (faster inference kernels)
- vLLM > TGI for throughput-critical workloads
- Start on a single GPU, then scale horizontally
💡 Profile first: Measure latency, throughput, memory. Then optimize bottlenecks.
