10) Deployment & Costs (RAG)

Dev/Staging/Prod environments, API and workers, document and vector stores, caching, performance tuning, budgeting, and IaC/CI-CD.

Environments, infrastructure, performance, budget, canary/blue-green rollouts, rollbacks, and deliverables, with a JSON export.

Tooling Implementation

Terraform/Ansible, Docker/K8s, CI/CD, autoscaling and caching, cost ops (metrics/alerts), runbook.

IaC modules, K8s manifests, pipelines, scaling matrices, cost estimator, and alerts.

Dev / Staging / Prod

Isolation: separate VPCs, distinct databases and indexes, different quotas per environment.
Secrets: stored in a secret manager (Vault/KMS), rotated every 90 days; no secrets in code.
Flags: RAG features (rerank, hybrid, prompt vN) can be toggled per environment.
Approvals: Staging → Prod promotion requires business and SRE sign-off.
Migrations: idempotent scripts (DB, index/embedding schemas).

Config matrix (example)

Param	Dev	Staging	Prod
LLM	gpt-5-mini	gpt-5-mini	gpt-5-pro
top_k retrieve	20	30	50
reranker	OFF	ON	ON
cache_ttl	60s	120s	300s
trace sampling	50%	25%	10%
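
The matrix above can be encoded as typed per-environment settings so that a misspelled environment name fails fast. A minimal sketch: `EnvConfig`, `CONFIGS`, and `load_config` are hypothetical names, not taken from the project, and the values simply mirror the example matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    llm: str
    top_k: int
    reranker: bool
    cache_ttl_s: int
    trace_sampling: float  # fraction of requests traced, between 0.0 and 1.0


# Values copied from the example matrix; the structure itself is an assumption.
CONFIGS = {
    "dev": EnvConfig("gpt-5-mini", 20, False, 60, 0.50),
    "staging": EnvConfig("gpt-5-mini", 30, True, 120, 0.25),
    "prod": EnvConfig("gpt-5-pro", 50, True, 300, 0.10),
}


def load_config(env: str) -> EnvConfig:
    """Return the settings for `env`, failing loudly on unknown names."""
    try:
        return CONFIGS[env]
    except KeyError:
        raise ValueError(
            f"unknown environment {env!r}; expected one of {sorted(CONFIGS)}"
        ) from None
```

Freezing the dataclass keeps settings from being mutated at runtime, which makes a Staging → Prod diff easier to audit.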

Components

API: FastAPI or Django view (ASGI), 30s timeout, rate limit per IP/BU.
Workers: Celery (queues: ingest, embed, eval), autoscaling and retries.
Storage: document DB (Postgres), vectors (pgvector/OpenSearch/Weaviate), blobs (S3).
Cache: Redis (prompt cache, passage cache, rate limiter).
Observability: OTel + Prometheus + Grafana + Alertmanager.

Separate the **online** plane (latency-sensitive) from the **batch** plane (ingestion/embeddings).

Key parameters

DB pool: 10–50 connections per pod; exponential backoff on retries.
Index: 2× replication, daily snapshots.
Celery: worker_prefetch_multiplier=1 (latency), task_acks_late=True.
API: gzip/br compression for passages, keep-alive, HTTP/2.
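
The exponential backoff mentioned for the DB pool can be sketched as a delay generator with full jitter. This is an illustration under stated assumptions: `backoff_delays` is a hypothetical helper, and the base delay (0.1s) and cap (5s) are placeholder values, not figures from the project.

```python
import random
from typing import Callable, Iterator


def backoff_delays(
    retries: int,
    base_s: float = 0.1,
    cap_s: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield one sleep duration per retry.

    The window doubles on every attempt up to `cap_s`; the actual delay is a
    random fraction of that window (full jitter) so that pods retrying at the
    same time do not hit the database in lockstep.
    """
    for attempt in range(retries):
        window = min(cap_s, base_s * (2 ** attempt))
        yield rng() * window
```

A caller would typically loop over the generator, `time.sleep(delay)` between connection attempts, and give up once the generator is exhausted.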

Runtime optimizations

Batch embeddings: batch size B=64/128, dedicated workers, buffer queue.
Warm pools: initialize LLM/reranker clients at boot.
Truncation: cap input/output tokens (max-tokens); prune passages by token count.
Streaming: stream the response to lower TTFB.
Cache: keys = (question_norm, lang, topk, model_ver).
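
Two of the optimizations above, batching and passage pruning, are easy to sketch with the standard library. The helper names (`batched`, `approx_tokens`, `prune_passages`) are hypothetical, and the whitespace token count is a crude stand-in for the model's real tokenizer.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def approx_tokens(text: str) -> int:
    """Assumption: approximate tokens by whitespace-separated words."""
    return len(text.split())


def prune_passages(
    passages: Sequence[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep passages, in the given order, until the token budget would be exceeded.

    Passages are assumed to be sorted by relevance already.
    """
    kept: List[str] = []
    used = 0
    for passage in passages:
        n = count_tokens(passage)
        if used + n > max_tokens:
            break
        kept.append(passage)
        used += n
    return kept
```

With B=64, 130 texts produce batches of 64, 64, and 2; each batch would then go to the embedding workers as a single call.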

Scaling matrix (example)

rps	API pods	workers	Redis	Index
10	2 × 0.5 vCPU	2	cache S	1 shard
50	4 × 1 vCPU	6	cache M	2 shards
200	8 × 2 vCPU	16	cache L	3 shards
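
The matrix can double as a lookup table for capacity planning. A minimal sketch, assuming each row is read as "up to this many rps"; `Sizing`, `SCALING_MATRIX`, and `sizing_for` are hypothetical names.

```python
from typing import NamedTuple


class Sizing(NamedTuple):
    max_rps: int
    api_pods: int
    vcpu_per_pod: float
    workers: int
    redis_tier: str
    index_shards: int


# Rows copied from the example matrix, ordered by increasing load.
SCALING_MATRIX = [
    Sizing(10, 2, 0.5, 2, "S", 1),
    Sizing(50, 4, 1.0, 6, "M", 2),
    Sizing(200, 8, 2.0, 16, "L", 3),
]


def sizing_for(rps: float) -> Sizing:
    """Return the smallest tier whose ceiling covers the expected load."""
    for row in SCALING_MATRIX:
        if rps <= row.max_rps:
            return row
    raise ValueError(f"{rps} rps exceeds the largest tier; extend the matrix")
```

Raising on loads beyond the last row, instead of silently returning the largest tier, forces an explicit sizing decision.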

Cost model (per 1k requests)

C_emb = 1000 × nb_passages × embedding cost per passage.
C_rerank = 1000 × k_in × reranker cost per passage (optional).
C_gen = 1000 × (tok_in + tok_out) × LLM cost per token.
C_infra (amortized) = compute + storage + traffic.
Total = C_emb + C_rerank + C_gen + C_infra ≤ X €.

Targets

< X €/1k requests in Prod (set X according to the contract).
Monthly budget = req/day × 30 × cost/req.
Guardrails: automatic shutoff if cost/req exceeds the threshold over a 1h window.
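
The budget formula and the guardrail can be sketched as two small functions. The names are hypothetical, and the guardrail assumes that cost and request counts are already aggregated over the 1h window (for example from the metrics backend). As an illustration, 10,000 req/day at 0.0165 €/req comes to about 4,950 € per month.

```python
def monthly_budget_eur(req_per_day: float, cost_per_req_eur: float, days: int = 30) -> float:
    """Monthly budget = req/day x days x cost/req."""
    return req_per_day * days * cost_per_req_eur


def guardrail_tripped(
    cost_eur_window: float,
    requests_window: int,
    threshold_eur_per_req: float,
) -> bool:
    """True when the average cost per request over the window exceeds the threshold.

    An empty window never trips the guardrail, which avoids a division by zero
    during quiet periods.
    """
    if requests_window == 0:
        return False
    return cost_eur_window / requests_window > threshold_eur_per_req
```

What "automatic shutoff" does when the guardrail trips (disable the reranker, fall back to a smaller model, reject traffic) is a product decision that this sketch leaves open.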

Example table

Item	Assumption	Cost/1k
Embeddings	20 passages × 0.0001 €	2.00 €
Reranker	20 → 5, 0.00005 €/passage	1.00 €
LLM	tok_in 900, tok_out 250, 0.00001 €/tok	11.50 €
Infra	amortized (CPU/RAM/storage)	2.00 €
Total	—	16.50 €

Methods

Canary: 5% → 25% → 100% (automatic rollback on SLO breach).
Blue/Green: DNS/ingress switch; cold migration.
Rollback: versioned images + index snapshot.
Compatibility: backward-compatible index schemas (versioned fields/mappings).
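
The canary progression with automatic rollback can be sketched as a loop over traffic stages. This is an illustration only: `run_canary` is a hypothetical helper, and `set_traffic` and `slo_ok` stand in for whatever the ingress controller and monitoring stack actually expose.

```python
from typing import Callable, Sequence


def run_canary(
    stages: Sequence[int],
    set_traffic: Callable[[int], None],
    slo_ok: Callable[[int], bool],
) -> bool:
    """Shift traffic to the new version stage by stage.

    Returns True when every stage passed its SLO check. On the first breach,
    traffic to the new version is set back to 0% and False is returned.
    """
    for pct in stages:
        set_traffic(pct)
        if not slo_ok(pct):
            set_traffic(0)  # rollback: all traffic returns to the stable version
            return False
    return True
```

In practice `slo_ok` would wait for an observation window before answering, so that each stage sees enough traffic to be judged.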

Post-deployment checks

Q/A smoke tests (golden set), P50/P95 latency, hit@5, cost/req over 15 min.
HTTP errors < 1%, CPU/RAM saturation < 70%.
Alerts quiet (no spam), budgets OK.
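
The numeric checks above can be computed from a golden-set run. A minimal sketch with hypothetical helper names; the percentile uses the nearest-rank method, which is one of several common definitions, and hit@k is taken here as the share of queries with at least one relevant document in the top k.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of `values` for 0 < p <= 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def hit_at_k(ranked: List[List[str]], relevant: List[List[str]], k: int = 5) -> float:
    """Share of queries whose top-k results contain at least one relevant id."""
    if not ranked:
        raise ValueError("no queries")
    hits = sum(1 for r, rel in zip(ranked, relevant) if set(r[:k]) & set(rel))
    return hits / len(ranked)


def health_ok(error_rate: float, cpu_util: float, mem_util: float) -> bool:
    """Thresholds from the checklist: HTTP errors < 1%, CPU/RAM < 70%."""
    return error_rate < 0.01 and cpu_util < 0.70 and mem_util < 0.70
```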

Deliverables

IaC (Terraform/Ansible) + infrastructure diagrams.
Runbook (incidents, scaling, cost, rollback).
Matrices: sizing and autoscaling.
Budget sheet and cost alerts (PromQL).

Deployment contract (exportable JSON)

{
  "envs":{"dev":{}, "staging":{}, "prod":{"flags":["rerank","hybrid"],"trace_sampling":0.1}},
  "infra":{"api":"fastapi","workers":"celery","db":"postgres","vector":"pgvector","cache":"redis"},
  "perf":{"batch_emb":128,"streaming":true,"tok_in_max":1200,"tok_out_max":400},
  "costs":{"target_eur_per_1k":16.5,"guardrail_eur_per_req":0.02},
  "deploy":{"strategy":"canary","stages":[5,25,100],"rollback_on":"slo_breach"},
  "deliverables":["iac","runbook","scaling_matrix","budget_alerts"]
}
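
Before exporting, the contract can be sanity-checked in a few lines. The rules below are assumptions chosen for illustration (stages strictly increasing and ending at 100, guardrail not below the per-request target, sampling within [0, 1]); `validate_contract` is a hypothetical helper, not part of the project.

```python
import json
from typing import List


def validate_contract(raw: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the contract passes."""
    contract = json.loads(raw)
    problems: List[str] = []

    stages = contract["deploy"]["stages"]
    if not stages or stages != sorted(set(stages)) or stages[-1] != 100:
        problems.append("deploy.stages must be strictly increasing and end at 100")

    costs = contract["costs"]
    if costs["guardrail_eur_per_req"] < costs["target_eur_per_1k"] / 1000:
        problems.append("costs.guardrail_eur_per_req is below the per-request target")

    for env, settings in contract["envs"].items():
        sampling = settings.get("trace_sampling")
        if sampling is not None and not 0 <= sampling <= 1:
            problems.append(f"envs.{env}.trace_sampling must be within [0, 1]")

    return problems
```

With the example contract, the target of 16.5 €/1k corresponds to 0.0165 €/req, which sits below the 0.02 €/req guardrail, so no problem is reported.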

# terraform/main.tf (excerpt; "# ..." marks arguments elided in the source)
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "rag" # ...
}
module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "rag-eks" # ...
}
module "rds" {
  source = "terraform-aws-modules/rds/aws"
  engine = "postgres" # ...
}
module "redis" { source = "cloudposse/elasticache-redis/aws" } # ...
# outputs: endpoints + secrets (stored in KMS/Vault)

# ansible/deploy_api.yaml (excerpt)
- hosts: api
  tasks:
    - name: Pull image
      community.docker.docker_image:
        name: "registry/rag-api:"  # image tag missing in the source
        source: pull
    - name: Run container
      community.docker.docker_container:
        name: rag-api
        image: "registry/rag-api:"  # image tag missing in the source
        env: { DB_URL: "", REDIS_URL: "" }  # values left empty in the source; inject from Vault/KMS
        ports: ["8080:8080"]

# Dockerfile (FastAPI)
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so this layer is cached across source changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ .
CMD ["uvicorn","app:api","--host","0.0.0.0","--port","8080","--workers","2"]

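# k8s/deploy.yaml (excerpt; the "app: rag-api" label and the HPA name are assumed)
apiVersion: apps/v1
kind: Deployment
metadata: {name: rag-api}
spec:
  replicas: 3
  selector: {matchLabels: {app: rag-api}}  # required by apps/v1
  template:
    metadata: {labels: {app: rag-api}}
    spec:
      containers:
      - name: api
        image: "registry/rag-api:"  # image tag missing in the source
        envFrom: [{secretRef: {name: rag-secrets}}, {configMapRef: {name: rag-config}}]
        resources: {requests: {cpu: "500m", memory: "512Mi"}, limits: {cpu: "1", memory: "1Gi"}}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: {name: rag-api}
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: rag-api}
  minReplicas: 2
  maxReplicas: 10
  metrics: [{type: Resource, resource: {name: cpu, target: {type: Utilization, averageUtilization: 60}}}]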

# .github/workflows/deploy.yml (excerpt)
name: deploy
on: {workflow_dispatch: {}}
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # "$" marks an image-tag expression that was lost in the source
      - run: docker build -t registry/rag-api:$ .
      - run: docker push registry/rag-api:$
  canary:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # needed for ./scripts
      - run: kubectl set image deploy/rag-api api=registry/rag-api:$
      - run: kubectl -n default rollout status deploy/rag-api --timeout=120s
      - run: ./scripts/canary_check.sh  # SLO/errors/cost
  promote:
    needs: canary  # runs only if the canary job succeeded
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # needed for ./scripts
      - run: ./scripts/promote_canary.sh

# KEDA: scale on Celery queue depth (REDIS_ADDR is a host:port placeholder)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: {name: rag-worker}
spec:
  scaleTargetRef: {name: rag-worker}
  triggers:
  - type: redis
    metadata: { address: REDIS_ADDR, listName: "celery", listLength: "100" }

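# cache.py (cache key)
import hashlib

def cache_key(q, lang, topk, model):
    # hashlib instead of built-in hash(): str hashes are salted per process, so keys would differ across pods.
    digest = hashlib.sha256(q.strip().lower().encode("utf-8")).hexdigest()  # normalization here is an assumption
    return f"{digest}:{lang}:{topk}:{model}"
# TTL per env: dev=60s, stage=120s, prod=300s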

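# cost_estimator.py
def cost_per_1k(passages=20, c_emb=0.0001, rerank=1.0, tok_in=900, tok_out=250, c_tok=1e-5, infra=2.0):
    """Estimated cost in € per 1,000 requests; `rerank` and `infra` are already per-1k amounts."""
    C_emb = passages * c_emb * 1000            # embeddings: per-request cost × 1000
    C_rer = rerank                             # reranker: flat per-1k cost
    C_gen = (tok_in + tok_out) * c_tok * 1000  # generation: tokens × cost/token × 1000
    return round(C_emb + C_rer + C_gen + infra, 2)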

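# PromQL budget (excerpts; BUDGET_HOURLY is a placeholder for the hourly budget in €)
sum(rate(rag_cost_eur_total[10m])) / sum(rate(rag_requests_total[10m]))  # cost/req
sum(increase(rag_cost_eur_total[1h])) > BUDGET_HOURLY                    # alert: spend over the last hour exceeds budget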

Incidents (excerpts)

Latency ↑: check index/rerank/LLM; scale API and workers; enable caching.
Cost ↑: reduce top_k, truncate tokens, switch to a smaller model, raise cache TTL.
Errors: timeouts, LLM quotas, DB/Redis saturation, recent deployment → rollback.

Rollback procedure

Switch to image N-1 and disable risky flags.
Restore the index snapshot (if necessary).
Communicate status and RCA at T+24h.