4) Indexing & vectors: RAG
Multilingual embeddings (dimension & L2 normalization), index choice (FAISS / pgvector / OpenSearch / Weaviate / Milvus), metadata filters, queries & recall tests.
Embeddings Index Filters Recall@k Deliverables
Tool implementation
Build/rebuild scripts, SQL/NoSQL mapping/index, filtered queries, sizing & performance, recall tests, CI/CD.
4) Indexing & vectors (Days 10-15)
Recommended models (multilingual)
| Model | Dim | Notes |
|---|---|---|
| bge-m3 (or bge-base-m3) | 1024 / 768 | Good recall, multilingual |
| e5-large | 1024 | High quality, heavier |
| text-embedding-3-large | 3072 | Very high quality, costly |
- Pick one dimension (dim) and keep it across the whole pipeline.
- Apply L2 normalization for cosine similarity (cosine equals the dot product on L2-normalized vectors).
- Store meta.lang & meta.tags for filtering.
Rule of thumb: prefer a dimension of 768-1024 for a balanced cost / latency / quality trade-off.
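To make the normalization point concrete, here is a minimal NumPy sketch (toy 8-dimensional random vectors, not real embeddings; the helper names `l2_normalize` and `cosine` are illustrative) showing that the dot product of L2-normalized vectors matches the cosine similarity of the raw vectors.

```python
import numpy as np

def l2_normalize(vecs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity computed from raw (unnormalized) vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
raw = rng.normal(size=(2, 8))          # toy vectors standing in for embeddings
unit = l2_normalize(raw)

cos_raw = cosine(raw[0], raw[1])       # cosine on the raw vectors
dot_unit = float(unit[0] @ unit[1])    # dot product on the normalized vectors
print(cos_raw, dot_unit)               # the two values agree up to rounding
```

This is why an index configured for inner product can serve cosine queries, provided every vector (corpus and query) goes through the same normalization.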
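Generating embeddings (pseudo-Python)
# embed.py
from sentence_transformers import SentenceTransformer  # assumption: any encoder exposing .encode() works
DIM = 1024  # must match the dimension declared in the index
model = SentenceTransformer("BAAI/bge-m3")  # assumption: Hugging Face id for bge-m3
def embed_texts(texts):
    """Return one L2-normalized vector per input text."""
    vecs = model.encode(texts, normalize_embeddings=True)  # L2 normalization happens here
    assert vecs.shape[1] == DIM, f"expected dim {DIM}, got {vecs.shape[1]}"
    return vecs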
YAML config (embeddings + index)
embedding:
  model: bge-m3          # e5-large | text-embedding-3-large
  dim: 1024
  normalize_l2: true
index:
  store: pgvector        # faiss | pgvector | opensearch | weaviate | milvus
  metric: cosine         # cosine | l2
  ann: ivfflat           # flat | ivfflat | hnsw
  params:
    ivf_lists: 200
    hnsw_m: 16
    hnsw_efc: 80
filters:
  language: ["fr","en"]
  bu: ["IT","HR"]
  freshness_days: 90
  type: ["documentation","policy"]
FAISS (quick PoC)
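# faiss_build.py
import faiss, numpy as np
DIM = 1024
# embeddings / query_vecs: output of embed_texts(), already L2-normalized
xb = np.ascontiguousarray(embeddings, dtype="float32")
assert xb.shape[1] == DIM
# Flat (exact) index; on L2-normalized vectors, L2 and inner product give the same ranking
index = faiss.IndexFlatL2(DIM)  # or faiss.IndexFlatIP(DIM): inner product = cosine on normalized vectors
index.add(xb)
# IVF with an HNSW coarse quantizer (optional)
quantizer = faiss.IndexHNSWFlat(DIM, 32)
ivf = faiss.IndexIVFFlat(quantizer, DIM, 200)  # nlist=200
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 16  # lists probed per query (FAISS default is 1); tune against recall
# search
xq = np.ascontiguousarray(query_vecs, dtype="float32")
D, I = ivf.search(xq, 24)  # top-24 distances and ids
Do not mix dimensions (DIM) or normalization between build and search.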
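The reason an L2 index can stand in for cosine here is the identity ||a - b||² = 2 - 2 (a·b) for unit vectors. The following NumPy sketch (random toy data, brute force, no FAISS dependency) checks that identity and shows that both metrics return the same top-k once corpus and query are normalized the same way.

```python
import numpy as np

rng = np.random.default_rng(42)
xb = rng.normal(size=(100, 16))                     # toy corpus
xb /= np.linalg.norm(xb, axis=1, keepdims=True)     # L2-normalize at build time
q = rng.normal(size=16)
q /= np.linalg.norm(q)                              # same normalization at query time

sq_l2 = ((xb - q) ** 2).sum(axis=1)                 # squared L2 distance to the query
ip = xb @ q                                         # inner product = cosine on unit vectors

top_l2 = np.argsort(sq_l2)[:10]                     # smallest distance first
top_ip = np.argsort(-ip)[:10]                       # largest similarity first
print(np.array_equal(top_l2, top_ip))
```

If the query were left unnormalized while the corpus is normalized (or the reverse), the identity no longer holds and the two rankings can diverge, which is the failure the note above warns about.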
pgvector (production, SQL-first)
-- schema.sql (PostgreSQL)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE rag_chunks(
  id TEXT PRIMARY KEY,
  text TEXT,
  meta JSONB,
  embedding VECTOR(1024)
);
-- IVFFlat
CREATE INDEX rag_chunks_ivf ON rag_chunks USING ivfflat (embedding vector_l2_ops) WITH (lists=200);
-- HNSW (if available)
-- CREATE INDEX rag_chunks_hnsw ON rag_chunks USING hnsw (embedding vector_l2_ops) WITH (m=16, ef_construction=64);
-- query: kNN + filters
-- :qvec = query embedding (VECTOR(1024))
SELECT id, text, meta
FROM rag_chunks
WHERE (meta->>'lang') = 'fr'
  AND (meta->>'bu') = 'IT'
  AND (meta->>'type') IN ('documentation','policy')
  AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
ORDER BY embedding <-> :qvec -- L2 distance; on L2-normalized vectors the ranking matches cosine
LIMIT 24;
OpenSearch / Elasticsearch (dense vectors)
// OpenSearch (KNN plugin)
PUT /rag_chunks
{
  "settings": { "index": { "knn": true } },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "lang": { "type": "keyword" },
      "bu": { "type": "keyword" },
      "type": { "type": "keyword" },
      "last_modified": { "type": "date" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {"name":"hnsw","space_type":"cosinesimil","engine":"nmslib"}
      }
    }
  }
}
// search (kNN + bool filters)
POST /rag_chunks/_search
{
  "size": 24,
  "query": {
    "bool": {
      "filter": [
        {"term": {"lang":"fr"}},
        {"term": {"bu":"IT"}},
        {"terms": {"type":["documentation","policy"]}},
        {"range":{"last_modified":{"gte":"now-90d"}}}
      ],
      "must": {
        "knn": {"embedding": {"vector": [..qvec..], "k": 24}}
      }
    }
  }
}
Weaviate / Milvus (high volume)
# Weaviate (schema)
{
  "class": "RagChunk",
  "vectorizer": "none",
  "vectorIndexConfig": {"distance":"cosine","efConstruction":64,"maxConnections":16}
}
# Milvus (pymilvus)
import time
from pymilvus import Collection, FieldSchema, CollectionSchema, DataType, connections, utility
connections.connect(host="localhost", port="19530")  # assumption: local Milvus on the default port
dim = 1024
fields = [
    FieldSchema("id", DataType.VARCHAR, is_primary=True, max_length=128),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
    FieldSchema("lang", DataType.VARCHAR, max_length=8),
    FieldSchema("bu", DataType.VARCHAR, max_length=32),
    FieldSchema("type", DataType.VARCHAR, max_length=32),
    FieldSchema("last_modified", DataType.INT64)  # Unix timestamp in seconds
]
schema = CollectionSchema(fields, "rag chunks")
col = Collection("rag_chunks", schema)
col.create_index("embedding", {"index_type":"HNSW","metric_type":"COSINE","params":{"M":16,"efConstruction":64}})
# search
col.load()
cutoff = int(time.time()) - 90 * 86400  # filter expressions have no now(); compute the cutoff client-side
res = col.search([qvec], "embedding", {"metric_type":"COSINE","params":{"ef":80}}, limit=24,
    expr=f"lang == 'fr' and bu == 'IT' and type in ['documentation','policy'] and last_modified >= {cutoff}")
Metadata schema (recap)
- lang: fr/en…
- bu: IT/HR/Finance…
- type: documentation/policy/faq…
- last_modified: ISO datetime (stored as an INT64 epoch in the Milvus schema)
- plus source_id, version, hash
Push filters down to the indexing engine (pre-filtering) before vector scoring whenever possible.
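The difference between pre-filtering and post-filtering can be seen on a toy brute-force search. The sketch below is illustrative only (the function names `knn_prefilter` and `knn_postfilter` are hypothetical, and real engines implement this inside the index): filtering first always fills the top-k from matching documents, while filtering after the kNN step can return fewer than k hits when the filter is selective.

```python
import numpy as np

def knn_prefilter(query, vecs, metas, k, predicate):
    """Filter on metadata first, then score only the surviving vectors."""
    keep = np.array([i for i, m in enumerate(metas) if predicate(m)], dtype=int)
    if keep.size == 0:
        return []
    scores = vecs[keep] @ query                  # dot product = cosine on unit vectors
    order = np.argsort(-scores)[:k]
    return [int(keep[i]) for i in order]

def knn_postfilter(query, vecs, metas, k, predicate):
    """Take the global top-k first, then drop hits that fail the filter."""
    order = np.argsort(-(vecs @ query))[:k]
    return [int(i) for i in order if predicate(metas[i])]

rng = np.random.default_rng(1)
vecs = rng.normal(size=(200, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
metas = [{"lang": "fr" if i % 10 == 0 else "en"} for i in range(200)]  # 10% French
query = vecs[0]                                  # a French document used as the query
is_fr = lambda m: m["lang"] == "fr"

pre = knn_prefilter(query, vecs, metas, 5, is_fr)
post = knn_postfilter(query, vecs, metas, 5, is_fr)
print(len(pre), len(post))                       # pre is always full; post may be shorter
```

With post-filtering, a common workaround is to over-fetch (ask for several times k) and trim after filtering, at the cost of extra latency.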
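Hybrid query (pgvector + full-text)
-- Combined example (dense + full-text ranking) with filters
-- ts_rank_cd is PostgreSQL's built-in full-text rank, used here in place of true BM25
WITH dense AS (
  SELECT id, 1 - (embedding <=> :qvec) AS s_d -- cosine similarity (<=> is cosine distance)
  FROM rag_chunks
  WHERE (meta->>'lang')='fr' AND (meta->>'bu')='IT'
    AND (meta->>'type') IN ('documentation','policy')
    AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
  ORDER BY embedding <-> :qvec -- same ranking on L2-normalized vectors; uses the vector_l2_ops index
  LIMIT 60
),
sparse AS (
  SELECT id, ts_rank_cd(to_tsvector('french', text), plainto_tsquery('french', :q)) AS s_s
  FROM rag_chunks
  WHERE to_tsvector('french', text) @@ plainto_tsquery('french', :q)
    AND (meta->>'lang')='fr' AND (meta->>'bu')='IT'
    AND (meta->>'type') IN ('documentation','policy')
    AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
  ORDER BY s_s DESC
  LIMIT 60
)
-- weighted fusion: a chunk found by both branches gets both contributions
SELECT c.id, c.text, c.meta,
       0.6 * COALESCE(d.s_d, 0) + 0.4 * COALESCE(s.s_s, 0) AS score -- weights are a starting point to tune
FROM dense d
FULL OUTER JOIN sparse s USING (id)
JOIN rag_chunks c USING (id)
ORDER BY score DESC
LIMIT 24;
Expected deliverables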
- Index created (documented config: model, dim, metric, ann, params).
- Build/rebuild scripts (full & incremental) + README.
- Recall tests (recall@k) on a golden set.
YAML delivery contract (JSON syntax)
{
  "embedding":{"model":"bge-m3","dim":1024,"normalize_l2":true},
  "index":{"store":"pgvector","metric":"cosine","ann":"ivfflat","params":{"lists":200}},
  "filters":{"lang":["fr","en"],"bu":["IT","HR"],"freshness_days":90,"type":["documentation","policy"]},
  "artifacts":["schema.sql","build_index.py","rebuild_index.py","recall_report.json"]
}
Technical implementation: Indexing & vectors
Build scripts
# build_index.py
from db import upsert_batch, fetch_chunks  # project-local helpers (not shown)
from embed import embed_texts
BATCH_SIZE = 256
def flush(batch):
    """Embed one batch and upsert the vectors with their metadata."""
    vecs = embed_texts([t for t, _ in batch])
    upsert_batch(vecs, [m for _, m in batch])  # to FAISS/pgvector/...
def build(full=False):
    batch = []
    for text, meta in fetch_chunks(full=full):  # reads chunks.jsonl or the DB
        batch.append((text, meta))
        if len(batch) == BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:  # last partial batch
        flush(batch)
if __name__ == "__main__":
    build(full=True)
Incremental rebuild (hash/last_modified)
# rebuild_index.py
from db import changed_chunks, upsert_one, vacuum_index  # project-local helpers (not shown)
from embed import embed_texts
def rebuild_incremental():
    for ch in changed_chunks(since_last_run=True):  # hash differs or last_modified is newer
        vec = embed_texts([ch.text])[0]
        upsert_one(vec, ch.meta)  # upsert into the index + update metadata
    vacuum_index()  # compact/reload if needed
Integrate these scripts into CI/CD; trigger a rebuild after each validated ingestion.
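One possible way to decide which chunks have changed is to compare a content hash of each chunk with the hash stored at the previous build. The sketch below is an assumption about how such a helper could look, not the project's actual implementation; `content_hash` and `diff_chunks` are hypothetical names, and SHA-256 is chosen only for illustration.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_chunks(current: dict, indexed_hashes: dict):
    """Compare current chunk texts with the hashes stored at the last build.

    current:        chunk id -> text (what ingestion produced now)
    indexed_hashes: chunk id -> hash (what the index already holds)
    Returns (to_upsert, to_delete) as sorted lists of chunk ids.
    """
    to_upsert = sorted(
        cid for cid, text in current.items()
        if indexed_hashes.get(cid) != content_hash(text)   # new or modified chunk
    )
    to_delete = sorted(cid for cid in indexed_hashes if cid not in current)
    return to_upsert, to_delete

indexed = {
    "a": content_hash("old text"),
    "b": content_hash("same"),
    "c": content_hash("gone"),
}
current = {"a": "new text", "b": "same", "d": "brand new"}

to_upsert, to_delete = diff_chunks(current, indexed)
print(to_upsert, to_delete)   # ['a', 'd'] ['c']
```

Only the ids in `to_upsert` need re-embedding, which keeps the incremental run proportional to the size of the change rather than to the size of the corpus.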
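FAISS query
# search_faiss.py
import numpy as np
# index: FAISS index built above; corpus: chunks in insertion order; qvec: L2-normalized query embedding
xq = np.asarray([qvec], dtype="float32")
D, I = index.search(xq, 24)
results = [corpus[idx] for idx in I[0] if idx != -1]  # FAISS pads with -1 when fewer than k hits
pgvector query (SQL)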
SELECT id, text, meta
FROM rag_chunks
WHERE (meta->>'lang')='fr' AND (meta->>'bu')='IT'
  AND (meta->>'type') IN ('documentation','policy')
  AND (meta->>'last_modified')::date >= (now()::date - interval '90 days')
ORDER BY embedding <-> :qvec
LIMIT 24;
OpenSearch query (kNN + filters)
POST /rag_chunks/_search
{
  "size": 24,
  "query": {
    "bool": {
      "filter": [
        {"term":{"lang":"fr"}},{"term":{"bu":"IT"}},
        {"terms":{"type":["documentation","policy"]}},
        {"range":{"last_modified":{"gte":"now-90d"}}}
      ],
      "must": {"knn":{"embedding":{"vector":[..qvec..],"k":24}}}
    }
  }
}
Sizing rules
- Storage ≈ nb_chunks × dim × 4 bytes (float32) + index overhead.
- IVFFlat: lists ≈ sqrt(nb_vectors) (tune by measurement).
- HNSW: M 12-24; ef_search 50-150 (quality vs. latency).
- Upsert batches of 128-512 to saturate CPU/IO.
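As a rough sketch of the first two rules, the helper below (the name `estimate_sizing` is hypothetical) computes raw float32 vector storage and a starting `lists` value. It deliberately ignores index overhead, which depends on the engine and must be measured.

```python
import math

def estimate_sizing(nb_vectors: int, dim: int, bytes_per_value: int = 4) -> dict:
    """Raw vector storage (no index overhead) and a starting IVFFlat `lists` value."""
    raw_bytes = nb_vectors * dim * bytes_per_value
    return {
        "raw_bytes": raw_bytes,
        "raw_gib": raw_bytes / 2**30,
        "ivf_lists": max(1, round(math.sqrt(nb_vectors))),
    }

est = estimate_sizing(nb_vectors=1_000_000, dim=1024)
print(est["raw_bytes"])          # 4096000000
print(round(est["raw_gib"], 2))  # 3.81
print(est["ivf_lists"])          # 1000
```

Under the same rule, the lists=200 value used in this section corresponds to roughly 40,000 vectors.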
Measure P50/P95 latency, throughput (QPS), memory, and recall against the golden set.
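A minimal way to collect the latency figures is to time each query individually and take percentiles. The sketch below uses a brute-force NumPy search as a stand-in for the real index call; `measure_latency` is a hypothetical helper, and the QPS it reports is single-threaded.

```python
import time
import numpy as np

def measure_latency(search_fn, queries, warmup: int = 3) -> dict:
    """Time search_fn(query) per query and summarize P50/P95 (ms) and QPS."""
    for q in queries[:warmup]:                 # warm caches before timing
        search_fn(q)
    timings_ms = []
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
    total_s = sum(timings_ms) / 1000.0
    return {
        "p50_ms": float(np.percentile(timings_ms, 50)),
        "p95_ms": float(np.percentile(timings_ms, 95)),
        "qps": len(queries) / total_s if total_s > 0 else float("inf"),
    }

# Toy brute-force search standing in for the real index call.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5000, 64)).astype("float32")
queries = list(rng.normal(size=(50, 64)).astype("float32"))

stats = measure_latency(lambda q: np.argsort(-(corpus @ q))[:24], queries)
print(stats)
```

Running the same harness against each candidate configuration (ann type, lists, ef_search) gives comparable numbers to put next to the recall figures.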
Integrity checks
- Check that dim is constant (assert on vector length).
- Check that the L2 norm equals 1 when using cosine similarity.
- Count gaps/deletions → periodic vacuum.
# checks.py
import numpy as np
assert all(len(v) == DIM for v in vecs)  # DIM: dimension declared in the index config
assert all(abs(np.linalg.norm(v) - 1) < 1e-3 for v in vecs)
recall@k protocol
- Golden set: (question → ids of relevant passages).
- Evaluate for k ∈ {5, 10, 20}: recall@k = % of questions for which at least one relevant passage is in the top-k.
- Compare variants: model, dim, ann, params, filters.
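Script (pseudo)
# recall.py
import json
from embed import embed_texts
# search(qvec, k) -> [ids] is the retrieval call under test (FAISS, pgvector, ...)
def recall_at_k(queries, gold, k=10):
    """Fraction of queries with at least one gold passage in the top-k."""
    ok = 0
    for q in queries:
        qvec = embed_texts([q.text])[0]
        top = search(qvec, k=k)  # returns [ids]
        if set(top) & set(gold[q.id]):
            ok += 1
    return ok / len(queries) if queries else 0.0
report = {
    "k5": recall_at_k(queries, gold, 5),
    "k10": recall_at_k(queries, gold, 10),
    "k20": recall_at_k(queries, gold, 20),
}
with open("recall_report.json", "w") as f:
    json.dump(report, f, indent=2)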
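To sanity-check the metric itself before wiring in a real index, it can help to run it on a hand-made example where the answer is known. The sketch below is self-contained and uses invented toy ids (q1..q4, c0..c9); it takes pre-computed ranked results rather than calling an embedding model, which is a simplification of the script above.

```python
def recall_at_k(ranked_ids: dict, gold: dict, k: int) -> float:
    """Share of questions with at least one relevant id in their top-k."""
    if not gold:
        return 0.0
    hits = sum(
        1 for qid, relevant in gold.items()
        if set(ranked_ids[qid][:k]) & set(relevant)
    )
    return hits / len(gold)

# Toy data: ranked result ids per question (best first) and the golden set.
ranked_ids = {
    "q1": ["c3", "c7", "c1", "c9"],
    "q2": ["c2", "c8", "c5", "c4"],
    "q3": ["c6", "c1", "c2", "c3"],
    "q4": ["c9", "c8", "c7", "c6"],
}
gold = {"q1": ["c3"], "q2": ["c5", "c0"], "q3": ["c3"], "q4": ["c0"]}

report = {f"k{k}": recall_at_k(ranked_ids, gold, k) for k in (1, 3, 4)}
print(report)   # {'k1': 0.25, 'k3': 0.5, 'k4': 0.75}
```

Recall can only stay flat or rise as k grows; a report where recall@20 is lower than recall@10 points to a bug in the evaluation rather than in the index.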