10) Deployment & costs (RAG)
Dev/Staging/Prod environments, API and workers, document and vector databases, cache, performance tuning, budgeting, and IaC/CI-CD.
Tools implementation
Terraform/Ansible, Docker/K8s, CI/CD, autoscaling and cache, cost ops (metrics/alerts), runbook.
10) Deployment & costs (Days 24–30)
Dev / Staging / Prod
- Isolation: separate VPCs, distinct databases and indexes, different quotas per environment.
- Secrets: secured variables (Vault/KMS, rotated every 90 days), no secrets in code.
- Flags: RAG features (rerank, hybrid, prompt vN) can be toggled per environment.
- Approvals: promotion from Staging to Prod requires sign-off from both the business owner and SRE.
- Migrations: idempotent scripts (DB, index/embedding schemas).
Config matrix (example)
| Param | Dev | Staging | Prod |
|---|---|---|---|
| LLM | gpt-5-mini | gpt-5-mini | gpt-5-pro |
| top_k retrieve | 20 | 30 | 50 |
| reranker | OFF | ON | ON |
| cache_ttl | 60s | 120s | 300s |
| trace sampling | 50% | 25% | 10% |
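The matrix above can be kept as typed, per-environment settings so that a typo in an environment name fails at startup rather than in production. The following is a minimal sketch; `EnvConfig`, `CONFIGS`, and `load_config` are hypothetical names, and the values are copied from the example matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    llm: str
    top_k: int
    reranker: bool
    cache_ttl_s: int
    trace_sampling: float  # fraction of requests traced, 0.0 to 1.0


# Values copied from the example matrix; adjust to the real deployment.
CONFIGS = {
    "dev": EnvConfig("gpt-5-mini", 20, False, 60, 0.50),
    "staging": EnvConfig("gpt-5-mini", 30, True, 120, 0.25),
    "prod": EnvConfig("gpt-5-pro", 50, True, 300, 0.10),
}


def load_config(env: str) -> EnvConfig:
    """Return the settings for one environment, failing loudly on unknown names."""
    try:
        return CONFIGS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None


if __name__ == "__main__":
    print(load_config("prod"))
```

Freezing the dataclass keeps a request handler from mutating shared settings by accident.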
Components
- API: FastAPI or a Django view (ASGI), 30 s timeout, rate limit per IP and per business unit.
- Workers: Celery (queues: ingest, embed, eval), with autoscaling and retries.
- Storage: documents in Postgres, vectors in pgvector/OpenSearch/Weaviate, blobs in S3.
- Cache: Redis (prompt cache, passage cache, rate limiter).
- Observability: OTel + Prometheus + Grafana + Alertmanager.
Separate the **online** plane (latency) from the **batch** plane (ingestion/embeddings).
Key parameters
- DB pool: 10–50 connections per pod; exponential backoff on connection errors.
- Index: 2× replication, daily snapshots.
- Celery: `worker_prefetch_multiplier=1` (lower latency), `task_acks_late=True`.
- API: gzip/br compression for passages, keep-alive, HTTP/2.
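The exponential backoff mentioned for the DB pool can be sketched as below. This is an illustrative helper, not part of any library: `backoff_delays` and `call_with_backoff` are hypothetical names, and the base delay, cap, and retry count are assumed values.

```python
import random
import time


def backoff_delays(retries=5, base=0.1, cap=5.0):
    """Yield exponential delays base, 2*base, 4*base, ... capped at `cap` seconds."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt)


def call_with_backoff(fn, retries=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on failure wait a jittered exponential delay and retry.

    Makes at most retries + 1 attempts and re-raises the last exception.
    """
    delays = list(backoff_delays(retries, base, cap))
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:  # narrow this to connection errors in real code
            if attempt == retries:
                raise
            sleep(delays[attempt] * rng())  # jitter: uniform in [0, delay)
```

The jitter spreads retries from many pods over time, which avoids a synchronized reconnect storm against the database after an outage.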
Runtime optimizations
- Batch embeddings: batch size B = 64 or 128, dedicated workers, buffer queue.
- Warm pools: initialize LLM/reranker clients at boot.
- Truncation: cap input/output tokens; prune passages by token count.
- Streaming: stream the response to reduce TTFB.
- Cache: key = (question_norm, lang, topk, model_ver).
Scaling matrix (example)
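Batching embeddings, as listed above, amounts to grouping passages into fixed-size chunks and making one call per chunk. A minimal sketch follows; `embed_batch` stands for whatever embedding client the project uses and is a hypothetical callable taking a list of strings and returning one vector per string.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def embed_all(passages: Iterable[str],
              embed_batch: Callable[[List[str]], List[List[float]]],
              size: int = 64) -> List[List[float]]:
    """Embed all passages with one embed_batch call per batch, preserving order."""
    vectors: List[List[float]] = []
    for batch in batched(passages, size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With B = 64, embedding 130 passages takes three calls (64, 64, and 2 passages) instead of 130.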
| RPS | API pods | Workers | Redis | Index |
|---|---|---|---|---|
| 10 | 2 × 0.5 vCPU | 2 | cache S | 1 shard |
| 50 | 4 × 1 vCPU | 6 | cache M | 2 shards |
| 200 | 8 × 2 vCPU | 16 | cache L | 3 shards |
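A sizing lookup built on the example matrix might look like the sketch below. `Sizing`, `SCALING`, and `sizing_for` are hypothetical names; the rule "pick the first tier whose RPS ceiling covers the load" is an assumption about how the matrix is meant to be read.

```python
from typing import NamedTuple


class Sizing(NamedTuple):
    max_rps: int
    api_pods: int
    vcpu_per_pod: float
    workers: int
    redis_tier: str
    index_shards: int


# Rows copied from the example scaling matrix.
SCALING = [
    Sizing(10, 2, 0.5, 2, "S", 1),
    Sizing(50, 4, 1.0, 6, "M", 2),
    Sizing(200, 8, 2.0, 16, "L", 3),
]


def sizing_for(rps: float) -> Sizing:
    """Return the smallest tier whose max_rps covers the expected load."""
    for tier in SCALING:
        if rps <= tier.max_rps:
            return tier
    raise ValueError(f"{rps} rps exceeds the largest tier; extend the matrix")


if __name__ == "__main__":
    print(sizing_for(30))
```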
Cost model (per 1k requests)
- C_emb = 1000 × nb_passages × embedding cost per passage.
- C_rerank = 1000 × k_in × reranker cost per passage (optional).
- C_gen = 1000 × (tok_in + tok_out) × LLM cost per token.
- C_infra (amortized) = compute + storage + traffic, per 1k requests.
- Total = C_emb + C_rerank + C_gen + C_infra ≤ X €.
Objectives
- < X €/1k requests in Prod (set X according to the contract).
- Monthly budget = requests/day × 30 × cost/request.
- Guardrail: automatic shutdown if cost/request exceeds the threshold over a 1 h window.
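The monthly budget formula and the cost guardrail can be sketched as two small functions. The function names are hypothetical, and the guardrail only returns a decision; what "automatic shutdown" does in practice (disable a flag, scale to zero, page someone) is left to the deployment.

```python
def monthly_budget(req_per_day: float, cost_per_req: float, days: int = 30) -> float:
    """Monthly budget in EUR = requests/day x days x cost/request."""
    return req_per_day * days * cost_per_req


def guardrail_tripped(cost_eur_1h: float, requests_1h: int,
                      threshold_eur_per_req: float) -> bool:
    """True if the average cost per request over the last hour exceeds the threshold."""
    if requests_1h == 0:
        return False  # no traffic, nothing to judge
    return cost_eur_1h / requests_1h > threshold_eur_per_req


if __name__ == "__main__":
    # Assumed traffic of 10,000 requests/day at 0.0165 EUR/request.
    print(round(monthly_budget(10_000, 0.0165), 2))
    print(guardrail_tripped(25.0, 1000, 0.02))
```

At 0.0165 €/request (16.50 €/1k) and an assumed 10,000 requests/day, the monthly budget comes to about 4,950 €.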
Example table
| Item | Assumption | Cost/1k |
|---|---|---|
| Embeddings | 20 passages × 0.0001 €/passage | 2.00 € |
| Reranker | 20 passages in, 5 kept; 0.00005 € per input passage | 1.00 € |
| LLM | tok_in 900, tok_out 250, 0.00001 €/token | 11.50 € |
| Infra | amortized (CPU/RAM/storage) | 2.00 € |
| Total | 2.00 + 1.00 + 11.50 + 2.00 | 16.50 € |
Methods
- Canary: 5% → 25% → 100% of traffic, with automatic rollback on an SLO breach.
- Blue/Green: switch at the DNS/ingress level; cold (offline) migration.
- Rollback: versioned images plus an index snapshot.
- Compatibility: backward-compatible index schemas (versioned fields/mappings).
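The canary progression above can be expressed as a short control loop. This is a sketch under stated assumptions: `set_traffic` and `slo_ok` are hypothetical callbacks standing in for the ingress/mesh API and the SLO check, and rollback is modeled as sending 0% of traffic to the new version.

```python
from typing import Callable, Sequence


def run_canary(set_traffic: Callable[[int], None],
               slo_ok: Callable[[int], bool],
               stages: Sequence[int] = (5, 25, 100)) -> bool:
    """Shift traffic stage by stage; roll back at the first SLO breach.

    Returns True if the new version reached the final stage.
    """
    for pct in stages:
        set_traffic(pct)        # route pct% of traffic to the new version
        if not slo_ok(pct):     # e.g. latency, error rate, cost/request
            set_traffic(0)      # rollback: all traffic back to version N-1
            return False
    return True


if __name__ == "__main__":
    history = []
    promoted = run_canary(history.append, lambda pct: pct < 100)
    print(promoted, history)
```

In a real pipeline the SLO check would wait for an observation window at each stage before deciding.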
Post-deployment checks
- Q/A smoke tests (golden set), P50/P95 latency, hit@5, cost/request over 15 min.
- HTTP error rate < 1%, CPU/RAM saturation < 70%.
- Alerts quiet (no spam), budgets OK.
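A few of these checks reduce to threshold comparisons on collected metrics, as in the sketch below. The 1% error and 70% CPU limits come from the list above; `p95_budget_ms` is a hypothetical parameter, since the source does not state a latency target, and the percentile uses the simple nearest-rank method.

```python
import math
from typing import Dict, Sequence, Tuple


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (p in 1..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def post_deploy_ok(latencies_ms: Sequence[float], errors: int, requests: int,
                   cpu_util: float, p95_budget_ms: float) -> Tuple[bool, Dict[str, bool]]:
    """Return (all checks passed, per-check results)."""
    checks = {
        "p95_latency": percentile(latencies_ms, 95) <= p95_budget_ms,
        "error_rate": requests > 0 and errors / requests < 0.01,
        "cpu": cpu_util < 0.70,
    }
    return all(checks.values()), checks
```

Returning the per-check dictionary alongside the verdict makes it easy to log which gate failed before triggering a rollback.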
Deliverables
- IaC (Terraform/Ansible) + infrastructure diagrams.
- Runbook (incidents, scaling, cost, rollback).
- Matrices: sizing and autoscaling.
- Budget sheet and cost alerts (PromQL).
Deployment contract (exportable JSON)
{
"envs":{"dev":{}, "staging":{}, "prod":{"flags":["rerank","hybrid"],"trace_sampling":0.1}},
"infra":{"api":"fastapi","workers":"celery","db":"postgres","vector":"pgvector","cache":"redis"},
"perf":{"batch_emb":128,"streaming":true,"tok_in_max":1200,"tok_out_max":400},
"costs":{"target_eur_per_1k":16.5,"guardrail_eur_per_req":0.02},
"deploy":{"strategy":"canary","stages":[5,25,100],"rollback_on":"slo_breach"},
"deliverables":["iac","runbook","scaling_matrix","budget_alerts"]
}
Tools implementation: Deployment & costs
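Before the tooling consumes it, the JSON deployment contract can be sanity-checked in CI. The sketch below is illustrative only: `validate_contract` is a hypothetical helper, and the three rules it enforces (all sections present, canary stages increasing and ending at 100, target cost per request not above the guardrail) are assumptions rather than requirements stated in the source.

```python
import json

REQUIRED = ("envs", "infra", "perf", "costs", "deploy", "deliverables")


def validate_contract(raw: str) -> dict:
    """Parse the contract and raise ValueError on the first inconsistency."""
    contract = json.loads(raw)
    missing = [key for key in REQUIRED if key not in contract]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    stages = contract["deploy"]["stages"]
    if not stages or stages != sorted(stages) or stages[-1] != 100:
        raise ValueError("canary stages must increase and end at 100")
    costs = contract["costs"]
    # Target is per 1k requests, guardrail is per request.
    if costs["target_eur_per_1k"] / 1000 > costs["guardrail_eur_per_req"]:
        raise ValueError("target cost per request exceeds the guardrail")
    return contract
```

With the values shown in the contract, the target of 16.5 €/1k is 0.0165 €/request, which sits below the 0.02 €/request guardrail, so the contract passes.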
# terraform/main.tf (excerpt)
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "rag"
  # ...
}
module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "rag-eks"
  # ...
}
module "rds" {
  source = "terraform-aws-modules/rds/aws"
  engine = "postgres"
  # ...
}
module "redis" {
  source = "cloudposse/elasticache-redis/aws"
  # ...
}
# outputs: endpoints + secrets (stored in KMS/Vault)

# ansible/deploy_api.yaml (excerpt)
- hosts: api
  tasks:
    - name: Pull image
      community.docker.docker_image:
        name: "registry/rag-api:"  # append the image tag to deploy
        source: pull
    - name: Run container
      community.docker.docker_container:
        name: rag-api
        image: "registry/rag-api:"  # same tag as above
        env: { DB_URL: "", REDIS_URL: "" }
        ports: ["8080:8080"]

# Dockerfile (FastAPI)
FROM python:3.11-slim
WORKDIR /app
# Copy and install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ .
CMD ["uvicorn","app:api","--host","0.0.0.0","--port","8080","--workers","2"]

# k8s/deploy.yaml (excerpt)
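apiVersion: apps/v1
kind: Deployment
metadata: {name: rag-api}
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector: {matchLabels: {app: rag-api}}
  template:
    metadata: {labels: {app: rag-api}}
    spec:
      containers:
        - name: api
          image: "registry/rag-api:"  # append the image tag to deploy
          envFrom: [{secretRef: {name: rag-secrets}}, {configMapRef: {name: rag-config}}]
          resources: {requests: {cpu: "500m", memory: "512Mi"}, limits: {cpu: "1", memory: "1Gi"}}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: {name: rag-api}
spec:
  # the HPA must name the workload it scales
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: rag-api}
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource: {name: cpu, target: {type: Utilization, averageUtilization: 60}}

# .github/workflows/deploy.yml (excerpt)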
name: deploy
on: {workflow_dispatch: {}}
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry/rag-api:$ .
      - run: docker push registry/rag-api:$
  canary:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # needed for ./scripts
      - run: kubectl set image deploy/rag-api api=registry/rag-api:$
      - run: kubectl -n default rollout status deploy/rag-api --timeout=120s
      - run: ./scripts/canary_check.sh # SLO/errors/cost
  promote:
    needs: canary  # run only after the canary job
    if: success()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/promote_canary.sh

# KEDA: scale on Celery queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: {name: rag-worker}
spec:
  scaleTargetRef: {name: rag-worker}
  triggers:
    - type: redis
      metadata: {address: REDIS_ADDR, listName: "celery", listLength: "100"}

# cache.py (cache key)
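import hashlib

def cache_key(q, lang, topk, model):
    # Built-in hash() is salted per process, so its keys differ between pods;
    # a SHA-256 digest of the normalized question is stable everywhere.
    q_norm = " ".join(q.lower().split())
    return f"{hashlib.sha256(q_norm.encode('utf-8')).hexdigest()}:{lang}:{topk}:{model}"
# TTL per env: dev=60s, stage=120s, prod=300s

# cost_estimator.py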
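def cost_per_1k(passages=20, c_emb=0.0001, rerank=1.0, tok_in=900, tok_out=250, c_tok=1e-5, infra=2.0):
    """Estimated cost in EUR per 1,000 requests; rerank and infra are already per-1k amounts."""
    C_emb = passages * c_emb * 1000             # per-request embedding cost scaled to 1k requests
    C_rer = rerank
    C_gen = (tok_in + tok_out) * c_tok * 1000   # per-request token cost scaled to 1k requests
    return round(C_emb + C_rer + C_gen + infra, 2)  # defaults give 16.5

# PromQL budget (excerpts)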
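sum(rate(rag_cost_eur_total[10m])) / sum(rate(rag_requests_total[10m]))  # cost/request
# rate() is per second; increase() gives the EUR spent over the hour, comparable to an hourly budget
sum(increase(rag_cost_eur_total[1h])) > BUDGET_HOURLY  # alert (replace BUDGET_HOURLY with a number)

Incidents (excerpts)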
- Latency ↑: check index/rerank/LLM; scale API and workers; enable the cache.
- Cost ↑: reduce top_k, truncate tokens, switch to a smaller model, raise the cache TTL.
- Errors: timeouts, LLM quotas, DB/Redis saturation, recent deployment → rollback.
Rollback procedure
- Switch to image N-1 and disable risky flags.
- Restore the index snapshot (if necessary).
- Communicate status, then the RCA at T+24h.
