Inference Optimization
Training is done. Now, how do you serve the model efficiently?
Serving Frameworks
1. vLLM
- PagedAttention: block-based KV-cache management (virtual-memory-style paging) that cuts memory fragmentation
- Continuous batching: requests join and leave the running batch at each decode step, so nothing waits on the slowest sequence
- Best for: High-throughput production
from vllm import LLM, SamplingParams

llm = LLM(model="./my-finetuned-model")  # loads the weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain KV caching in one paragraph."]  # any list of strings
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
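The snippet above is offline batch inference. For online serving, vLLM also ships an OpenAI-compatible HTTP server; a minimal sketch (the port is an example):

python -m vllm.entrypoints.openai.api_server \
  --model ./my-finetuned-model --port 8000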
2. Text Generation Inference (TGI)
- Hugging Face's official inference server
- Docker-based, easy deploy
- Good balance of features
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $(pwd)/model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
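Once the container is up, a quick smoke test against TGI's generate endpoint (port as mapped above):

curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is quantization?", "parameters": {"max_new_tokens": 64}}'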
3. Ollama
- Local-first
- Super easy setup
- Good for dev/testing
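Serving your own fine-tune locally means importing it via a Modelfile; a minimal sketch, assuming the weights have already been converted to GGUF (file and model names here are hypothetical):

# Modelfile: point Ollama at local GGUF weights
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.7

Then build and chat with it:

ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Summarize PagedAttention in one sentence."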
Quantization for Deployment
| Method | Size (vs FP32) | Quality Loss | Speed (vs FP16) |
|---|---|---|---|
| FP16 | 2x smaller | None | 1x (baseline) |
| INT8 | 4x smaller | Minimal | ~1.2x |
| INT4 (GPTQ) | 8x smaller | Small | ~1.5x |
| INT4 (AWQ) | 8x smaller | Minimal | ~1.8x |
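The size column is easy to sanity-check with weights-only arithmetic (the KV cache and activations add memory on top); a back-of-the-envelope script for an assumed 7B-parameter model:

params = 7e9  # assumed 7B-parameter model
for dtype, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.1f} GB")
# FP16 -> ~14.0 GB, INT4 -> ~3.5 GB (before runtime overhead)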
AWQ Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./my-finetuned-model"
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)  # needed for calibration
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./model-awq")
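The quantized folder can then be served directly; for example, vLLM accepts AWQ checkpoints (a sketch reusing the path saved above):

from vllm import LLM
llm = LLM(model="./model-awq", quantization="awq")  # load the 4-bit checkpoint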
Deployment Architecture
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│Instance │ │Instance │ │Instance │
└─────────┘ └─────────┘ └─────────┘
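How requests reach the instances depends on your balancer. As a minimal client-side stand-in, here is a round-robin sketch assuming each vLLM instance exposes its OpenAI-compatible /v1/completions endpoint (the addresses and model name are hypothetical):

import itertools
import requests

# Hypothetical instance addresses; in production the load balancer owns this list.
INSTANCES = itertools.cycle([
    "http://10.0.0.1:8000", "http://10.0.0.2:8000", "http://10.0.0.3:8000",
])

def generate(prompt: str) -> str:
    base = next(INSTANCES)  # naive round-robin across the three instances
    resp = requests.post(f"{base}/v1/completions", json={
        "model": "./my-finetuned-model",
        "prompt": prompt,
        "max_tokens": 256,
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]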
Scaling Considerations
- Horizontal: add more GPU instances behind the load balancer
- Batching: group concurrent requests to raise throughput
- Caching: cache frequent prompts/responses (and reuse shared prompt prefixes where the server supports it)
- Streaming: start returning tokens before generation finishes (see the sketch below)
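Streaming is cheap to adopt because the OpenAI-compatible servers above support it natively; a minimal sketch, assuming a vLLM instance on localhost:8000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
stream = client.completions.create(
    model="./my-finetuned-model",
    prompt="Explain continuous batching:",
    max_tokens=128,
    stream=True,  # chunks arrive as tokens are generated
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)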
Pro Tips
- AWQ > GPTQ for deployment (faster inference kernels)
- vLLM > TGI for throughput-critical workloads
- Start on a single GPU, then scale horizontally
💡 Profile first: Measure latency, throughput, memory. Then optimize bottlenecks.
