4) Indexing & vectors: RAG

Multilingual embeddings (dimension and L2 normalization), choice of index (FAISS / pgvector / OpenSearch / Weaviate / Milvus), metadata filters, queries, and recall tests.



Choosing an embedding model, dimension and L2 normalization, vector store, schema, and filters.

Tool Implementation

(Re)build scripts, SQL/NoSQL mappings and indexes, filtered queries, sizing and performance, recall tests, CI/CD.


FAISS/pgvector/OpenSearch/Weaviate/Milvus • Cosine via L2 normalization • Language/BU/freshness/type filters • recall@k and sizing.

Recommended models (multilingual)

Model	Dim	Notes
bge-m3 (or bge-base-m3)	1024 / 768	Good recall, multilingual
e5-large	1024	High quality, heavier (use the multilingual-e5-large variant for non-English text)
text-embedding-3-large	3072	Very high quality, expensive

Pick one dimension (dim) and keep it across the whole pipeline.
Apply L2 normalization for cosine similarity (on L2-normalized vectors, cosine similarity equals the dot product).
Store meta.lang and meta.tags so you can filter on them.
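
The claim that cosine and dot product coincide on L2-normalized vectors can be checked numerically. The sketch below uses random toy vectors (not real embeddings); the helper name `l2_normalize` is an assumption made for illustration. It also checks the related identity that explains why an L2 index ranks like a cosine index on unit vectors.

```python
import numpy as np

def l2_normalize(vecs: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.where(norms == 0, 1.0, norms)

rng = np.random.default_rng(0)
docs = l2_normalize(rng.normal(size=(5, 8)))       # 5 toy "documents", dim 8
query = l2_normalize(rng.normal(size=(1, 8)))[0]   # 1 toy "query"

dot = docs @ query
cosine = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
sq_l2 = np.sum((docs - query) ** 2, axis=1)

# On unit vectors: cosine == dot, and squared L2 distance == 2 - 2 * cosine,
# so sorting by ascending L2 distance gives the same order as descending cosine.
assert np.allclose(dot, cosine)
assert np.allclose(sq_l2, 2 - 2 * cosine)
assert list(np.argsort(sq_l2)) == list(np.argsort(-cosine))
```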

Rule of thumb: prefer a dimension of 768–1024 for a balanced trade-off between cost, latency, and quality.

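Embedding generation (Python sketch)

# embed.py
from sentence_transformers import SentenceTransformer  # assumption: a sentence-transformers model

DIM = 1024
model = SentenceTransformer("BAAI/bge-m3")

def embed_texts(texts):
    vecs = model.encode(texts, normalize_embeddings=True)  # L2 normalization happens here
    assert vecs.shape[1] == DIM, "embedding dim must match the dim expected by the index"
    return vecs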

YAML config (embeddings + index)

embedding:
  model: bge-m3          # e5-large | text-embedding-3-large
  dim: 1024
  normalize_l2: true
index:
  store: pgvector        # faiss | pgvector | opensearch | weaviate | milvus
  metric: cosine         # cosine | l2
  ann: ivfflat           # flat | ivfflat | hnsw
  params:
    ivf_lists: 200
    hnsw_m: 16
    hnsw_efc: 80
filters:
  lang: ["fr","en"]
  bu: ["IT","HR"]
  freshness_days: 90
  type: ["documentation","policy"]
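
A config like this is easy to get subtly wrong (for example `metric: cosine` with normalization switched off). The following is a minimal validation sketch over the already-parsed config, represented here as a plain dict so that no YAML library is needed. The function name `validate_config` and the exact rules are assumptions derived from the options listed in the comments above, not part of any library.

```python
ALLOWED = {
    "store": {"faiss", "pgvector", "opensearch", "weaviate", "milvus"},
    "metric": {"cosine", "l2"},
    "ann": {"flat", "ivfflat", "hnsw"},
}

def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems (empty list means OK)."""
    errors = []
    emb, idx = cfg.get("embedding", {}), cfg.get("index", {})
    dim = emb.get("dim")
    if not isinstance(dim, int) or dim <= 0:
        errors.append("embedding.dim must be a positive integer")
    for key, allowed in ALLOWED.items():
        if idx.get(key) not in allowed:
            errors.append(f"index.{key} must be one of {sorted(allowed)}")
    # Cosine similarity is obtained through L2 normalization in this pipeline.
    if idx.get("metric") == "cosine" and not emb.get("normalize_l2"):
        errors.append("metric=cosine expects embedding.normalize_l2: true")
    return errors

cfg = {
    "embedding": {"model": "bge-m3", "dim": 1024, "normalize_l2": True},
    "index": {"store": "pgvector", "metric": "cosine", "ann": "ivfflat"},
}
print(validate_config(cfg))  # []
```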

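FAISS (quick PoC)

# faiss_build.py
import faiss
import numpy as np

DIM = 1024
# `embeddings` comes from embed_texts(); vectors are already L2-normalized
xb = np.ascontiguousarray(embeddings, dtype="float32")
assert xb.shape[1] == DIM

# Exact (flat) index. On L2-normalized vectors, L2 distance ranks like cosine;
# IndexFlatIP (inner product) then returns the cosine similarity directly.
index = faiss.IndexFlatL2(DIM)
index.add(xb)

# IVF with an HNSW coarse quantizer (optional)
quantizer = faiss.IndexHNSWFlat(DIM, 32)
ivf = faiss.IndexIVFFlat(quantizer, DIM, 200)   # lists=200
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 16                                 # lists probed per query (default 1); tune against recall
# search: `query_vecs` is a float32 array of shape (n_queries, DIM)
D, I = ivf.search(query_vecs, 24)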

Do not mix dimensions (DIM) or normalization settings between build and search.

pgvector (SQL-first production)

-- schema.sql (PostgreSQL)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE rag_chunks(
  id TEXT PRIMARY KEY,
  text TEXT,
  meta JSONB,
  embedding VECTOR(1024)
);
-- IVFFlat
CREATE INDEX rag_chunks_ivf ON rag_chunks USING ivfflat (embedding vector_l2_ops) WITH (lists=200);
-- HNSW (pgvector 0.5.0 or later)
-- CREATE INDEX rag_chunks_hnsw ON rag_chunks USING hnsw (embedding vector_l2_ops) WITH (m=16, ef_construction=64);

-- query: kNN + filters
-- :qvec = query embedding (VECTOR(1024))
SELECT id, text, meta
FROM rag_chunks
WHERE (meta->>'lang') = 'fr'
  AND (meta->>'bu') = 'IT'
  AND (meta->>'type') IN ('documentation','policy')
  AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
ORDER BY embedding <-> :qvec   -- L2 distance; on L2-normalized vectors this gives the same ranking as cosine
LIMIT 24;
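
To run this query from Python, the query embedding has to be passed as a bound parameter. The sketch below assumes psycopg 3 without the pgvector adapter, so the vector is sent in pgvector's text form (`[x1,x2,...]`) and cast with `::vector`. The names `to_pgvector_literal`, `KNN_SQL` and `knn_params` are hypothetical helpers, and only two of the four filters are shown to keep it short.

```python
def to_pgvector_literal(vec) -> str:
    """Format a sequence of numbers as pgvector text input, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

# Named placeholders keep values out of the SQL string (no string concatenation).
KNN_SQL = """
SELECT id, text, meta
FROM rag_chunks
WHERE (meta->>'lang') = %(lang)s
  AND (meta->>'bu') = %(bu)s
ORDER BY embedding <-> %(qvec)s::vector
LIMIT %(k)s
"""

def knn_params(qvec, lang="fr", bu="IT", k=24) -> dict:
    """Build the parameter dict expected by KNN_SQL."""
    return {"qvec": to_pgvector_literal(qvec), "lang": lang, "bu": bu, "k": k}

# With an open psycopg 3 connection `conn` (not created here):
#   rows = conn.execute(KNN_SQL, knn_params(qvec)).fetchall()
print(to_pgvector_literal([0.5, 1, -2]))  # [0.5,1.0,-2.0]
```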

OpenSearch / Elasticsearch (dense vectors)

// OpenSearch (KNN plugin)
PUT /rag_chunks
{
  "settings": { "index": { "knn": true } },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "lang": { "type": "keyword" },
      "bu":   { "type": "keyword" },
      "type": { "type": "keyword" },
      "last_modified": { "type": "date" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {"name":"hnsw","space_type":"cosinesimil","engine":"nmslib"}
      }
    }
  }
}

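// search (kNN + bool filters)
// note: in this form the bool filter is applied to the kNN hits, so fewer than `size` results can survive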
POST /rag_chunks/_search
{
  "size": 24,
  "query": {
    "bool": {
      "filter": [
        {"term": {"lang":"fr"}},
        {"term": {"bu":"IT"}},
        {"terms": {"type":["documentation","policy"]}},
        {"range":{"last_modified":{"gte":"now-90d"}}}
      ],
      "must": {
        "knn": {"embedding": {"vector": [..qvec..], "k": 24}}
      }
    }
  }
}

Weaviate / Milvus (high volume)

# Weaviate (schema)
{
  "class": "RagChunk",
  "vectorizer": "none",
  "vectorIndexConfig": {"distance":"cosine","efConstruction":64,"maxConnections":16}
}

# Milvus (pymilvus)
import time
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

connections.connect(host="localhost", port="19530")  # assumption: local Milvus instance
dim = 1024
fields = [
  FieldSchema("id", DataType.VARCHAR, is_primary=True, max_length=128),
  FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
  FieldSchema("lang", DataType.VARCHAR, max_length=8),
  FieldSchema("bu", DataType.VARCHAR, max_length=32),
  FieldSchema("type", DataType.VARCHAR, max_length=32),
  FieldSchema("last_modified", DataType.INT64)  # Unix epoch seconds
]
schema = CollectionSchema(fields, "rag chunks")
col = Collection("rag_chunks", schema)
col.create_index("embedding", {"index_type":"HNSW","metric_type":"COSINE","params":{"M":16,"efConstruction":64}})
# search: filter expressions have no now(), so the cutoff is computed client-side
col.load()
cutoff = int(time.time()) - 90 * 86400
res = col.search([qvec], "embedding", {"metric_type":"COSINE","params":{"ef":80}}, limit=24,
                 expr=f"lang == 'fr' and bu == 'IT' and type in ['documentation','policy'] and last_modified >= {cutoff}")

Metadata schema (reminder)

lang: fr/en…
bu: IT/HR/Finance…
type: documentation/policy/faq…
last_modified: ISO datetime
+ source_id, version, hash

Push filters down to the indexing engine (pre-filtering) before vector scoring whenever possible.
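
Why pre-filtering matters can be seen on a toy corpus: if the k nearest neighbours are computed first and filtered afterwards, the filter may discard all of them. The sketch below uses exact brute-force search with NumPy rather than a real vector store; the function names `knn`, `post_filter` and `pre_filter` are illustrative assumptions.

```python
import numpy as np

def knn(vectors, query, k, candidates=None):
    """Exact top-k by dot product over `candidates` (all rows if None)."""
    ids = np.arange(len(vectors)) if candidates is None else np.asarray(candidates, dtype=int)
    scores = vectors[ids] @ query
    order = np.argsort(-scores)[:k]
    return [int(i) for i in ids[order]]

def post_filter(vectors, langs, query, k, lang):
    # kNN first, then drop rows that fail the filter: may return fewer than k results.
    return [i for i in knn(vectors, query, k) if langs[i] == lang]

def pre_filter(vectors, langs, query, k, lang):
    # Restrict the candidate set first, then take the k nearest among them.
    allowed = [i for i, l in enumerate(langs) if l == lang]
    return knn(vectors, query, k, candidates=allowed)

# Toy corpus: 2-d unit vectors; the rows closest to the query are English.
vectors = np.array([[1.0, 0.0], [0.98, 0.2], [0.6, 0.8], [0.0, 1.0]])
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
langs = ["en", "en", "fr", "fr"]
query = np.array([1.0, 0.0])

print(post_filter(vectors, langs, query, k=2, lang="fr"))  # []
print(pre_filter(vectors, langs, query, k=2, lang="fr"))   # [2, 3]
```

With post-filtering the French query gets no results even though two French chunks exist; pre-filtering returns both.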

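Hybrid query (pgvector + full-text)

-- Combined example (dense + full-text rank) with filters
-- Note: ts_rank_cd is PostgreSQL's cover-density rank, not true BM25
WITH dense AS (
  SELECT id, text, meta, 1 - (embedding <=> :qvec) AS s_d   -- cosine similarity
  FROM rag_chunks
  WHERE (meta->>'lang')='fr' AND (meta->>'bu')='IT'
    AND (meta->>'type') IN ('documentation','policy')
    AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
  ORDER BY embedding <-> :qvec   -- L2 order: same ranking on L2-normalized vectors, matches the vector_l2_ops index
  LIMIT 60
),
sparse AS (
  SELECT id, text, meta, ts_rank_cd(to_tsvector('french', text), plainto_tsquery('french', :q)) AS s_s
  FROM rag_chunks
  WHERE to_tsvector('french', text) @@ plainto_tsquery('french', :q)
    AND (meta->>'lang')='fr' AND (meta->>'bu')='IT'
    AND (meta->>'type') IN ('documentation','policy')
    AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
  ORDER BY s_s DESC
  LIMIT 60
)
-- Merge per id so a chunk found by both branches gets a single combined score.
-- Weights are illustrative; s_d and s_s are on different scales, so rescale them if one branch dominates.
SELECT id,
       COALESCE(d.text, s.text) AS text,
       COALESCE(d.meta, s.meta) AS meta,
       0.6 * COALESCE(d.s_d, 0) + 0.4 * COALESCE(s.s_s, 0) AS score
FROM dense d
FULL OUTER JOIN sparse s USING (id)
ORDER BY score DESC
LIMIT 24;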
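
Dense similarities and full-text ranks live on different scales, so a weighted sum is only meaningful after rescaling. The sketch below shows one possible client-side fusion: min-max normalization per branch, then a weighted sum per id. The 0.6 weight for the dense branch is taken from the query above; the complementary sparse weight, the min-max choice, and the names `min_max` and `fuse` are assumptions for illustration.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(dense: dict, sparse: dict, w_dense: float = 0.6, top_k: int = 24) -> list:
    """Weighted sum of normalized scores; an id missing from a branch scores 0 there."""
    d, s = min_max(dense), min_max(sparse)
    fused = {i: w_dense * d.get(i, 0.0) + (1 - w_dense) * s.get(i, 0.0)
             for i in set(d) | set(s)}
    # Sort by score descending, then by id for a deterministic order on ties.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

dense = {"a": 0.92, "b": 0.80, "c": 0.71}   # id -> cosine similarity
sparse = {"b": 3.1, "d": 1.2}               # id -> full-text rank
print([doc_id for doc_id, _ in fuse(dense, sparse)])  # ['b', 'a', 'c', 'd']
```

Chunk `b` wins because both branches retrieved it, which is the behaviour a hybrid query is meant to reward.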

Expected deliverables

Index created (documented config: model, dim, metric, ann, params).
Build/rebuild scripts (full and incremental) + README.
Recall tests (recall@k) on a golden set.

Delivery contract (JSON, which is also valid YAML)

{
  "embedding":{"model":"bge-m3","dim":1024,"normalize_l2":true},
  "index":{"store":"pgvector","metric":"cosine","ann":"ivfflat","params":{"lists":200}},
  "filters":{"lang":["fr","en"],"bu":["IT","HR"],"freshness_days":90,"type":["documentation","policy"]},
  "artifacts":["schema.sql","build_index.py","rebuild_index.py","recall_report.json"]
}

Build scripts

# build_index.py
from db import upsert_batch, fetch_chunks  # project-local helpers
from embed import embed_texts

BATCH_SIZE = 256

def flush(batch):
    """Embed one batch and upsert it into FAISS/pgvector/..."""
    vecs = embed_texts([t for t, _ in batch])
    upsert_batch(vecs, [m for _, m in batch])

def build(full=False):
    batch = []
    for text, meta in fetch_chunks(full=full):  # reads chunks.jsonl or the DB
        batch.append((text, meta))
        if len(batch) == BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:  # last partial batch
        flush(batch)

if __name__ == "__main__":
    build(full=True)

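Incremental rebuild (hash/last_modified)

# rebuild_index.py
from db import changed_chunks, upsert_one, vacuum_index  # project-local helpers (assumed to live in db)
from embed import embed_texts

def rebuild_incremental():
    # changed_chunks yields chunks whose hash differs or whose last_modified is newer
    for ch in changed_chunks(since_last_run=True):
        vec = embed_texts([ch.text])[0]
        upsert_one(vec, ch.meta)  # upsert into the index + update metadata
    vacuum_index()  # recompact/reload if needed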

Integrate these scripts into CI/CD; trigger a rebuild once ingestion has been validated.
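
The incremental rebuild depends on knowing which chunks changed since the last run. One common way to decide this is a content hash per chunk, sketched below with the standard library. The function names `content_hash` and `detect_changed_chunks`, and the idea of keeping a `known_hashes` mapping from the previous run, are assumptions; the `changed_chunks` helper used in rebuild_index.py may well work differently.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable SHA-256 fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changed_chunks(chunks, known_hashes: dict):
    """Yield (chunk_id, text, new_hash) for chunks that are new or modified.

    `chunks` is an iterable of (chunk_id, text) pairs;
    `known_hashes` maps chunk_id -> hash recorded at the previous run.
    """
    for chunk_id, text in chunks:
        new_hash = content_hash(text)
        if known_hashes.get(chunk_id) != new_hash:
            yield chunk_id, text, new_hash

known = {"c1": content_hash("VPN setup guide"), "c2": content_hash("Leave policy v1")}
current = [("c1", "VPN setup guide"), ("c2", "Leave policy v2"), ("c3", "New FAQ")]
print([cid for cid, _, _ in detect_changed_chunks(current, known)])  # ['c2', 'c3']
```

Only `c2` (modified) and `c3` (new) would be re-embedded; deletions need a separate pass comparing the two id sets.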

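FAISS query

# search_faiss.py
import numpy as np
# `index`, `qvec` and `corpus` come from the build step
D, I = index.search(np.asarray([qvec], dtype="float32"), 24)
results = [corpus[idx] for idx in I[0] if idx != -1]  # FAISS pads missing hits with -1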

pgvector query (SQL)

SELECT id, text, meta
FROM rag_chunks
WHERE (meta->>'lang')='fr' AND (meta->>'bu')='IT'
  AND (meta->>'type') IN ('documentation','policy')
  AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
ORDER BY embedding <-> :qvec
LIMIT 24;

OpenSearch query (kNN + filters)

POST /rag_chunks/_search
{
  "size": 24,
  "query": {
    "bool": {
      "filter": [
        {"term":{"lang":"fr"}},{"term":{"bu":"IT"}},
        {"terms":{"type":["documentation","policy"]}},
        {"range":{"last_modified":{"gte":"now-90d"}}}
      ],
      "must": {"knn":{"embedding":{"vector":[..qvec..],"k":24}}}
    }
  }
}

Sizing rules

Storage ≈ nb_chunks × dim × 4 bytes (float32) + index overhead.
IVFFlat: lists ≈ sqrt(nb_vectors) (tune based on measurements).
HNSW: M 12–24; ef_search 50–150 (quality vs. latency).
Upsert batches of 128–512 to saturate CPU/IO.
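
As a worked example of the first two rules, the sketch below computes the raw float32 vector storage and a starting value for the IVFFlat `lists` parameter. Index overhead is deliberately not modelled because it depends on the store and index type; the function name `estimate_sizing` is an assumption.

```python
import math

def estimate_sizing(n_chunks: int, dim: int, bytes_per_value: int = 4) -> dict:
    """Raw vector storage (no index overhead) and a sqrt-based IVF lists starting point."""
    raw_bytes = n_chunks * dim * bytes_per_value
    return {
        "raw_vector_bytes": raw_bytes,
        "raw_vector_gib": round(raw_bytes / 2**30, 2),
        "ivf_lists": max(1, round(math.sqrt(n_chunks))),
    }

print(estimate_sizing(1_000_000, 1024))
# {'raw_vector_bytes': 4096000000, 'raw_vector_gib': 3.81, 'ivf_lists': 1000}
```

One million 1024-dimensional chunks therefore need about 3.8 GiB for the vectors alone, before any index structure is added.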

Measure P50/P95 latency, throughput (QPS), memory, and recall against the golden set.
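
A minimal sketch of the latency part of that measurement: it times any search callable over a non-empty list of queries and reports P50, P95 and single-client throughput. `search_fn` stands for whatever query function your store exposes; the name `measure_latency` and the warm-up count are assumptions. Memory and concurrent load are not covered here.

```python
import time
import numpy as np

def measure_latency(search_fn, queries, warmup: int = 3) -> dict:
    """Time search_fn(q) for each query; `queries` must be non-empty."""
    for q in queries[:warmup]:          # warm caches before timing
        search_fn(q)
    latencies_ms = []
    start_all = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start_all
    return {
        "p50_ms": float(np.percentile(latencies_ms, 50)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
        "qps": len(queries) / elapsed,  # sequential, single client
    }

# Usage with a stand-in search function (replace with a real index query):
stats = measure_latency(lambda q: sorted(range(1000), key=lambda x: -x), list(range(50)))
print(sorted(stats))  # ['p50_ms', 'p95_ms', 'qps']
```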

Integrity checks

Check that dim is constant (assert on vector length).
Check that the L2 norm is 1 when using cosine similarity.
Count gaps/deletions → periodic vacuum.

# checks.py
import numpy as np
assert all(len(v) == DIM for v in vecs)                       # constant dimension
assert all(abs(np.linalg.norm(v) - 1) < 1e-3 for v in vecs)   # unit L2 norm
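
In a build pipeline it is often more useful to collect problems than to stop at the first failed assert. The sketch below is one way to do that for the dimension and normalization checks, plus a NaN/inf check that is an addition of this example. The function name `check_vectors` and its return format are assumptions; it expects equal-length vectors (a ragged input would fail at array conversion).

```python
import numpy as np

def check_vectors(vecs, dim: int, cosine: bool = True, tol: float = 1e-3) -> list:
    """Return a list of problems found in a batch of vectors (empty list means OK)."""
    arr = np.asarray(vecs, dtype="float32")
    if arr.ndim != 2 or arr.shape[1] != dim:
        return [f"expected shape (n, {dim}), got {arr.shape}"]
    problems = []
    if not np.isfinite(arr).all():
        problems.append("NaN or inf values present")
    if cosine:
        # Rows whose L2 norm deviates from 1 by tol or more.
        bad = np.flatnonzero(np.abs(np.linalg.norm(arr, axis=1) - 1.0) >= tol)
        if bad.size:
            problems.append(f"{bad.size} vector(s) not L2-normalized, first at row {int(bad[0])}")
    return problems

print(check_vectors([[1.0, 0.0], [0.0, 1.0]], dim=2))  # []
print(check_vectors([[1.0, 0.0], [3.0, 4.0]], dim=2))
# ['1 vector(s) not L2-normalized, first at row 1']
```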

recall@k protocol

Golden set: (question → ids of relevant passages).
Evaluate for k ∈ {5,10,20}: recall@k = % of questions for which at least one relevant passage is in the top-k.
Compare variants: model, dim, ann, params, filters.
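
A tiny worked example makes the definition concrete. The sketch below applies it to precomputed ranked id lists instead of a live index, so it runs without any model or store. The data and the name `recall_from_rankings` are made up for illustration, and small k values are used only to keep the example readable.

```python
def recall_from_rankings(ranked: dict, gold: dict, k: int) -> float:
    """Share of questions with at least one relevant id among the first k results."""
    hits = sum(1 for qid, ids in ranked.items() if set(ids[:k]) & set(gold[qid]))
    return hits / len(ranked)

# question id -> ids returned by the index, best first
ranked = {"q1": ["d3", "d1", "d9"], "q2": ["d4", "d5", "d6"], "q3": ["d7", "d2", "d8"]}
# question id -> ids of relevant passages (golden set)
gold = {"q1": ["d1"], "q2": ["d6"], "q3": ["d0"]}

for k in (1, 2, 3):
    print(k, round(recall_from_rankings(ranked, gold, k), 3))
# 1 0.0
# 2 0.333
# 3 0.667
```

`q3` never hits because its relevant passage `d0` is not retrieved at all, so recall stays below 1 whatever the k.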

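Script (sketch)

# recall.py
import json
from embed import embed_texts
# `search(qvec, k)` is the store-specific query function and returns a list of ids;
# `queries` (objects with .id and .text) and `gold` (id -> relevant ids) come from the golden set

def recall_at_k(queries, gold, k=10):
    ok = 0
    for q in queries:
        qvec = embed_texts([q.text])[0]
        top = search(qvec, k=k)
        if set(top) & set(gold[q.id]):  # at least one relevant passage in the top-k
            ok += 1
    return ok / len(queries)

report = {
  "k5":  recall_at_k(queries, gold, 5),
  "k10": recall_at_k(queries, gold, 10),
  "k20": recall_at_k(queries, gold, 20),
}
with open("recall_report.json", "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2)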