7) Offline evaluation & benchmarks (RAG)
Gold evaluation sets, metrics, ablations, a reproducible protocol, and deliverables.
Tooling implementation
Evaluation scripts, runner, metric computation, ablation grid, reporting, CI/CD, and traceability.
7) Offline evaluation & benchmarks (Days 18–24)
Building the evaluation sets
- Gold set: 100–300 validated Q/A pairs (v1…vn), covering the target domains and question types.
- Hard negatives: passages that look close to the answer but are wrong, used to test robustness.
- Splits: train/dev/test (or a holdout) with a fixed seed.
- Traceability: source_id, version, date, author.
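The fixed-seed split mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual `split` function: the name `split_qids`, the 80/10/10 ratios, and the generated question ids are assumptions made for the example.

```python
import random

def split_qids(qids, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Deterministically split question ids into train/dev/test (ratios are an assumption)."""
    ordered = sorted(qids)        # sort first so the input order does not affect the result
    rng = random.Random(seed)     # local RNG: leaves the global random state untouched
    rng.shuffle(ordered)
    n_train = int(len(ordered) * ratios[0])
    n_dev = int(len(ordered) * ratios[1])
    return {
        "train": ordered[:n_train],
        "dev": ordered[n_train:n_train + n_dev],
        "test": ordered[n_train + n_dev:],   # the remainder goes to test
    }

# Hypothetical ids following the "Q-001" pattern used in this document.
qids = [f"Q-{i:03d}" for i in range(1, 201)]
splits = split_qids(qids)
print({name: len(ids) for name, ids in splits.items()})
```

Sorting before shuffling means the same seed yields the same split even if the annotations are loaded in a different order, which is the property a frozen test set needs.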
JSON format (reference)
{
  "qid": "Q-001",
  "question": "Comment réinitialiser mon SSO ?",
  "answers": ["Ouvrir le portail…"],
  "metadata": {"domain": "IT", "lang": "fr", "difficulty": "M"},
  "positives": [{"chunk_id": "C123", "source_id": "S1"}],
  "hard_negatives": [{"chunk_id": "C987", "source_id": "S9"}]
}
Retrieval
| Metric | Target threshold | Notes |
|---|---|---|
| hit@k | ≥ 0.85 (k=5) | Recall: at least one relevant passage in the top k |
| MRR | ≥ 0.65 | Rank of the first relevant passage |
| nDCG@10 | ≥ 0.80 | Overall ranking quality |
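To make the three retrieval measures concrete, here is a minimal per-question sketch using binary relevance. It assumes each result is a ranked list of unique chunk ids and that the relevant ids come from the `positives` field of the gold record; the function names are illustrative. Averaging each value over all questions gives hit@k, MRR, and nDCG@k for a run.

```python
import math

def hit_at_k(ranked, relevant, k=5):
    """1.0 if at least one relevant chunk appears in the top k, else 0.0."""
    return 1.0 if any(c in relevant for c in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant chunk; 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, c in enumerate(ranked[:k], start=1)
        if c in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# The hard negative C987 is ranked above the positive C123.
ranked = ["C987", "C123", "C555"]
relevant = {"C123"}
print(hit_at_k(ranked, relevant), reciprocal_rank(ranked, relevant), ndcg_at_k(ranked, relevant))
```

In this example the hit still counts at k=5, but the reciprocal rank drops to 0.5 because a hard negative took the first position, which is the kind of degradation MRR is meant to expose.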
Q/A & performance
| Metric | Threshold | Notes |
|---|---|---|
| Exact Match | ≥ 0.75 | Factual answers |
| F1 | ≥ 0.80 | Short free-form answers |
| Faithfulness | ≥ 0.95 | LLM-as-a-judge + n-gram control |
| Latency P50/P95 | ≤ 0.7 s / ≤ 1.5 s | Retrieval + generation |
| Cost per request | ≤ €0.01 | Compact model + caching |
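Exact Match and F1 are commonly computed on normalized text, with F1 measured over shared tokens. The sketch below follows that convention. The normalization rules (lowercasing, ASCII punctuation removal) and the choice to score against the best of several gold answers are assumptions, not something this plan specifies.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(pred, golds):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return 1.0 if any(normalize(pred) == normalize(g) for g in golds) else 0.0

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between two strings."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)   # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def f1(pred, golds):
    """Best token-level F1 over all gold answers."""
    return max(token_f1(pred, g) for g in golds)

print(exact_match("Open the portal.", ["open the portal"]))
print(round(f1("open the portal now", ["open the portal"]), 3))
```

For French gold answers, accent handling and article stripping would need to be decided explicitly, since they change both scores.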
Ablation plan
- Embeddings A/B: e5 / bge / TE3.
- Chunk size: 400/600/800/1000; overlap: 0/80/120.
- Reranker ON/OFF; top-k: 20/50/100.
- Hybrid (BM25 + dense) vs. dense-only.
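The plan above defines a grid, and it helps to see how large it is before launching anything. The sketch below enumerates the full Cartesian product. The helper name `sweep` and the in-code dictionary are illustrative, with values taken from the plan.

```python
from itertools import product

ABLATION_SPACE = {
    "embeddings": ["e5-base", "bge-small", "te3"],
    "chunk_size": [400, 600, 800, 1000],
    "overlap": [0, 80, 120],
    "reranker": [True, False],
    "top_k": [20, 50, 100],
    "hybrid": [True, False],
}

def sweep(space):
    """Yield one config dict per combination of the ablation axes."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(sweep(ABLATION_SPACE))
print(len(configs))  # 432 = 3 * 4 * 3 * 2 * 3 * 2
print(configs[0])
```

With 432 combinations, and several runs each, a full sweep may be too expensive. Varying one axis at a time around a baseline config is a common way to cut this down, at the cost of missing interactions between axes.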
Results matrix (example)
| Config | hit@5 | MRR | nDCG@10 | Cost |
|---|---|---|---|---|
| e5-base • 800/100 • rerank ON | 0.90 | 0.69 | 0.83 | ++ |
| bge-small • 600/80 • rerank OFF | 0.86 | 0.62 | 0.80 | + |
| dense-only (e5) • 800/100 | 0.81 | 0.57 | 0.75 | + |
Standard steps
- Freeze the dataset version and the seed.
- Snapshot the indexes, configs, and embeddings.
- Runner: N runs, then report means and standard deviations, with logs.
- Compute the metrics and bootstrap them (significance).
- Produce scorecards and a decision report.
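The bootstrap step can be sketched in two parts: a percentile confidence interval for one config, and a paired comparison between two configs scored on the same questions. This is one common way to do it, not necessarily the procedure intended here. The function names, the 2000 resamples, and the sample scores are assumptions.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lo, hi

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=42):
    """Fraction of resamples in which config A beats config B on the same resampled questions."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both configs must be scored on the same questions")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample question indices with replacement
        if sum(scores_a[i] - scores_b[i] for i in idx) > 0:
            wins += 1
    return wins / n_resamples

# Illustrative per-question hit@5 values (1.0 = hit, 0.0 = miss), not real results.
scores = [1.0] * 80 + [0.0] * 20
print(bootstrap_ci(scores))
```

A win rate close to 1.0 from `paired_bootstrap` suggests the difference is unlikely to be noise. Pairing on the same question indices matters because question difficulty varies far more than config quality does.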
CLI (examples)
$ rag eval --dataset GOLD_V1 --cfg configs/eval/base.yaml --seed 42 --runs 3
$ rag ablation --space configs/ablation.yaml --dataset GOLD_V1 --export results/
$ rag scorecard --input results/base.json --thresholds configs/thresholds.yaml --out report.html
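The threshold check behind a scorecard could look like the sketch below. The structure of `configs/thresholds.yaml` is not shown in this plan, so the in-code `THRESHOLDS` mapping, its comparison operators, and the sample metrics are assumptions. Only the target values come from the metric tables.

```python
# Targets from the metric tables; ">=" for quality metrics, "<=" for latency and cost.
THRESHOLDS = {
    "hit@5": (">=", 0.85), "mrr": (">=", 0.65), "ndcg@10": (">=", 0.80),
    "em": (">=", 0.75), "f1": (">=", 0.80), "faithfulness": (">=", 0.95),
    "p50_ms": ("<=", 700), "p95_ms": ("<=", 1500), "cost_eur_per_req": ("<=", 0.01),
}

def check_thresholds(metrics, thresholds=THRESHOLDS):
    """Return {metric: (value, target, passed)}; a missing metric counts as a failure."""
    report = {}
    for name, (op, target) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            passed = False
        elif op == ">=":
            passed = value >= target
        else:
            passed = value <= target
        report[name] = (value, target, passed)
    return report

# Illustrative values only, not measured results.
metrics = {"hit@5": 0.90, "mrr": 0.69, "ndcg@10": 0.83, "em": 0.78, "f1": 0.82,
           "faithfulness": 0.93, "p50_ms": 650, "p95_ms": 1400, "cost_eur_per_req": 0.008}
failed = [name for name, (_, _, ok) in check_thresholds(metrics).items() if not ok]
print(failed)  # ['faithfulness']
```

Treating a missing metric as a failure keeps a partially broken run from passing the gate silently.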
Deliverables
- Evaluation dashboard (scorecards per domain and version).
- Ablation report (technical choices, with justification).
- Versioned artifacts: datasets, indexes, configs, results.
Evaluation plan (exportable JSON)
{
  "dataset_version": "GOLD_V1",
  "sizes": {"gold": 200, "hard_neg": 400},
  "metrics": {
    "retrieval": {"hit@5": 0.85, "mrr": 0.65, "ndcg@10": 0.80},
    "qa": {"em": 0.75, "f1": 0.80, "faithfulness": 0.95},
    "perf": {"p50_ms": 700, "p95_ms": 1500, "cost_eur_per_req": 0.01}
  },
  "ablation_space": {
    "embeddings": ["e5-base", "bge-small", "te3"],
    "chunk_size": [400, 600, 800, 1000],
    "overlap": [0, 80, 120],
    "reranker": [true, false],
    "top_k": [20, 50, 100],
    "hybrid": [true, false]
  },
  "procedure": {"runs": 3, "seed": 42, "significance": "bootstrap"}
}
Tooling implementation: evaluation
Storage & schema (SQL)
-- evaluation tables
CREATE TABLE eval_questions(
  qid TEXT PRIMARY KEY,
  question TEXT, answers JSONB, meta JSONB, version TEXT
);
CREATE TABLE eval_gold_links(
  qid TEXT, chunk_id TEXT, label TEXT CHECK (label IN ('positive','hard_negative'))
);
CREATE TABLE eval_runs(
  run_id TEXT PRIMARY KEY, cfg JSONB, dataset_version TEXT,
  seed INT, started_at TIMESTAMP, finished_at TIMESTAMP
);
Preparation (Python)
# prepare_datasets.py
# load_annotations and save_sql are project helpers assumed to be defined elsewhere.
def build_gold(version):
    gold = load_annotations(version)
    save_sql(gold.questions, gold.links)  # into the tables above
def split(seed=42):
    # return train/dev/test or a holdout
    ...
if __name__ == "__main__":
    build_gold("GOLD_V1")
    split()
Runner (pseudo-code)
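# eval_runner.py
# retrieve, generate, measure_perf_cost and load_cfg are project helpers assumed to be defined elsewhere.
def run(cfg, dataset, seed=42, runs=3):
    # seed and runs are accepted for reproducibility; the repetition loop is omitted in this pseudo-code
    recs = retrieve(cfg, dataset)    # hit@k / MRR / nDCG
    answers = generate(cfg, recs)    # EM / F1 / faithfulness
    perf = measure_perf_cost(cfg)    # P50/P95 latency, cost per request
    return {**recs, **answers, **perf}
if __name__ == "__main__":
    log = run(load_cfg("configs/eval/base.yaml"), "GOLD_V1")
Results storage (SQL)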
CREATE TABLE eval_results(
  run_id TEXT, metric TEXT, value DOUBLE PRECISION, domain TEXT,
  PRIMARY KEY(run_id, metric, domain)
);
-- inserted by the runner, then aggregated for the dashboards
Metric computation (Python)
# metrics.py (excerpts)
def exact_match(pred, gold): ...
def f1(pred, gold): ...
def faithfulness(answer, passages): ...  # LLM-as-a-judge + n-gram control
def hit_at_k(results, k=5): ...
def mrr(results): ...
def ndcg(results, k=10): ...
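The n-gram control paired with the LLM judge is not specified further. One plausible reading is a lexical support score: the share of the answer's word n-grams that also appear in the retrieved passages. The sketch below implements that reading. The name `ngram_support`, the trigram default, and whitespace tokenization are assumptions.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_support(answer, passages, n=3):
    """Share of the answer's word n-grams found in at least one retrieved passage."""
    answer_grams = ngrams(answer.lower().split(), n)
    if not answer_grams:
        return 0.0   # answer shorter than n tokens: no evidence either way
    passage_grams = set()
    for passage in passages:
        passage_grams |= ngrams(passage.lower().split(), n)
    return len(answer_grams & passage_grams) / len(answer_grams)

passages = ["to reset sso open the portal and click reset password"]
print(ngram_support("open the portal and click reset", passages))  # 1.0
print(ngram_support("call the help desk now", passages))           # 0.0
```

A low score flags answers that share little wording with their sources. A correct paraphrase also scores low, so this is a cheap cross-check on the judge rather than a faithfulness measure in its own right.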
Scorecard (SQL)
-- example: fetch the key metrics of a run
SELECT metric, ROUND(AVG(value)::numeric, 3) AS score
FROM eval_results
WHERE run_id = :run
GROUP BY metric
ORDER BY metric;
Build a **dashboard** (Grafana/Metabase) on top of eval_results.
Search space
# configs/ablation.yaml (example)
embeddings: ["e5-base","bge-small","te3"]
chunk_size: [400,600,800,1000]
overlap: [0,80,120]
reranker: [true,false]
top_k: [20,50,100]
hybrid: [true,false]
Launcher
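# ablation.py
# sweep and save_run are project helpers assumed to be defined elsewhere; run comes from eval_runner.py.
from eval_runner import run
for cfg in sweep("configs/ablation.yaml"):
    res = run(cfg, "GOLD_V1", seed=42)
    save_run(res)  # into eval_runs / eval_results
# then select the best config under cost/latency constraints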
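The final selection step, picking the best config under cost and latency constraints, can be sketched as a filter followed by a maximum. The function name, the choice of nDCG@10 as the objective, and the sample rows are assumptions for illustration. The budgets reuse the P95 and cost thresholds from the evaluation plan.

```python
def select_best(results, max_p95_ms=1500, max_cost_eur=0.01, objective="ndcg@10"):
    """Return the highest-scoring run that fits both budgets, or None if none does."""
    eligible = [
        r for r in results
        if r["p95_ms"] <= max_p95_ms and r["cost_eur_per_req"] <= max_cost_eur
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[objective])

# Illustrative rows only, not measured results.
runs = [
    {"config": "A", "ndcg@10": 0.83, "p95_ms": 1400, "cost_eur_per_req": 0.012},
    {"config": "B", "ndcg@10": 0.80, "p95_ms": 900, "cost_eur_per_req": 0.006},
    {"config": "C", "ndcg@10": 0.75, "p95_ms": 700, "cost_eur_per_req": 0.004},
]
best = select_best(runs)
print(best["config"])  # B: A scores higher but exceeds the cost budget
```

Filtering on the budgets first and ranking second keeps the constraints hard, so a config cannot buy its way past the cost limit with a slightly better score.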
Traceability
- Version datasets, indexes, configs, and results.
- Snapshot the embeddings and indexes before each run.
- Git tags: eval-GOLD_V1-YYYYMMDD.
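To tie a result back to the exact config that produced it, a run can carry a short fingerprint of its config next to the git tag. The sketch below is one way to do that. The helper names and the 12-character truncation are assumptions, and the tag format follows the pattern above.

```python
import hashlib
import json
from datetime import date

def config_fingerprint(cfg):
    """Short SHA-256 digest of the config, stable under key reordering."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def eval_tag(dataset_version, day=None):
    """Build a tag name of the form eval-<dataset_version>-YYYYMMDD."""
    day = day or date.today()
    return f"eval-{dataset_version}-{day:%Y%m%d}"

cfg = {"embeddings": "e5-base", "chunk_size": 800, "reranker": True}
print(config_fingerprint(cfg))
print(eval_tag("GOLD_V1", date(2025, 1, 31)))  # eval-GOLD_V1-20250131
```

Storing the fingerprint alongside `run_id` makes it easy to detect two runs that claim the same config but were launched with different settings.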
CI/CD (e.g., GitHub Actions)
name: eval
on: [workflow_dispatch]
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python prepare_datasets.py
      - run: python eval_runner.py --cfg configs/eval/base.yaml --seed 42 --runs 3
      - run: python scorecard.py --input results/base.json --out report.html
      - uses: actions/upload-artifact@v4
        with: {name: "report", path: "report.html"}