2) Ingestion & parsing (structure-aware)
Objectif : transformer du code & des schémas DB en objets structurés + chunks textuels riches en contexte (modèles, champs, contraintes, migrations, admin/DRF).
Python AST / Tree-sitter Introspection MySQL / Postgres Migrations → état courant Admin & DRF
Parsers AST & Tree-sitter — extraction des modèles Django
🎯 But
- Repérer toutes les classes
class X(models.Model)dans les apps. - Capturer nom, app_label (Meta), docstring.
- Lister les champs avec type + arguments nommés.
🧩 AST Python (rapide & sans dépendances C)
import ast, pathlib
class ModelVisitor(ast.NodeVisitor):
def __init__(self): self.models=[]
def visit_ClassDef(self, node):
# hérite-t-elle de models.Model ?
bases = [getattr(b, "attr", None) or getattr(b, "id", None) for b in node.bases]
if "Model" in bases:
item = {
"name": node.name,
"doc": ast.get_docstring(node) or "",
"fields": [], "meta": {}
}
for stmt in node.body:
# fields = Assign(Name, Call(models.XxxField(...)))
if isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
ftype = getattr(getattr(stmt.value.func, "attr", None), "lower", lambda:"")()
if ftype.endswith("field"):
fname = stmt.targets[0].id
kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in stmt.value.keywords}
item["fields"].append({"name": fname, "type": ftype, "args": kwargs})
# class Meta:
if isinstance(stmt, ast.ClassDef) and stmt.name=="Meta":
for a in stmt.body:
if isinstance(a, ast.Assign) and a.targets and hasattr(a.targets[0],"id"):
item["meta"][a.targets[0].id] = ast.literal_eval(a.value)
self.models.append(item)
root = pathlib.Path("apps")
for p in root.rglob("models.py"):
tree = ast.parse(p.read_text(encoding="utf8"))
v = ModelVisitor(); v.visit(tree)
Variante Tree-sitter : plus robuste aux code styles/alias ; utile si du Python non “évaluable” (f-strings complexes).
🧭 Points d’attention
- Champs dynamiques construits via
getattr/ alias → préférer Tree-sitter. - Import alias (
from django.db import models as m) → normaliser. app_labeldéfini dansMetaou viaapps.py→ fallback module.- Support des champs custom (
MyJSONField) → catégoriser par héritage deField.
Exemple de résultat (extrait)
{
"name": "Invoice", "doc": "Factures clients",
"fields": [{"name":"ref","type":"charfield","args":{"max_length":32,"db_index":true}}],
"meta": {"db_table":"billing_invoice","ordering":["-created_at"]}
}Champs & arguments — types, options et validations
📚 Typologie fréquente
CharField,TextField,IntegerField,DecimalFieldDateTimeField,BooleanField,JSONFieldForeignKey,ManyToManyField(on_delete, related_name…)
🔧 Options clés
null,blank,default,choicesdb_index,unique,index_togethervalidators(Regex, Min/Max, custom)
🔎 Extraction des kwargs (AST)
def kw_to_dict(call: ast.Call):
out={}
for kw in call.keywords:
try: out[kw.arg]=ast.literal_eval(kw.value)
except Exception: out[kw.arg]="__expr__"
return out
Astuce : marquer les expressions non évaluables par
"__expr__" pour post-traitement.🧪 Validators (exemple)
from django.core.validators import RegexValidator
code = RegexValidator(regex=r"^[A-Z0-9]{8}$", message="REF invalide")🧱 ForeignKey et on_delete
- Capturer
to(modèle cible),on_delete,related_name. - Détecter FK orphelines (cible absente / renommée).
- Tracer la direction & la cardinalité pour le graphe.
Extrait de champ sérialisé
{"name":"customer","type":"foreignkey",
"args":{"to":"crm.Customer","on_delete":"CASCADE","db_index":true}}Class Meta — db_table, indexes, constraints, ordering…
🧩 Extraction (AST)
# Dans visit_ClassDef: pour "class Meta:"
if isinstance(stmt, ast.ClassDef) and stmt.name=="Meta":
for a in stmt.body:
if isinstance(a, ast.Assign) and hasattr(a.targets[0],"id"):
key=a.targets[0].id
try: meta_val=ast.literal_eval(a.value)
except: meta_val="__expr__"
item["meta"][key]=meta_val
📌 Champs courants
db_table,ordering,unique_together,index_togetherindexes(liste demodels.Index(fields=[...], name="..."))constraints(CheckConstraint, UniqueConstraint, FK déduites)
🧾 Normalisation “Index/Constraint”
- Évaluer les expressions si possible (noms d’index générés).
- Traduire en forme canonique pour comparer à la DB.
Index canonique (exemple)
{"table":"billing_invoice","name":"idx_invoice_ref","fields":["ref"],"unique":false}Migrations parser — chronologie → état courant du modèle
🧭 Lecture séquentielle
- Parcourir
app/migrations/*.pydans l’ordre des dépendances. - Extraire la liste des opérations :
AddField,AlterField,AlterUniqueTogether,RemoveField,RunSQL… - Appliquer chaque op sur un state store (dico modèles → champs → meta).
class State:
def __init__(self): self.models = {}
def apply(self, op):
t = type(op).__name__
if t=="AddField": self.models[op.model]["fields"][op.name]=op.field
elif t=="AlterField": self.models[op.model]["fields"][op.name]=op.field
elif t=="RemoveField": self.models[op.model]["fields"].pop(op.name, None)
elif t=="AlterUniqueTogether": self.models[op.model]["meta"]["unique_together"]=op.unique_together
elif t=="RunSQL": pass # marquer comme "non modélisable"
🧪 Déduire l’état courant
- Après la relecture → snapshot “modèle courant” (champs + meta).
- Taguer les changements “non modélisables” (RunSQL) pour contrôle manuel.
- Exporter le diff “models.py ↔ migrations” (détecter dérives).
Trace chronologique (exemple)
[ "0001_initial:AddField ref", "0002_alterunique:unique_together('ref','customer')",
"0003_runsql:CREATE INDEX idx_inv_date ..." ]DB introspection — MySQL/MariaDB & PostgreSQL
🐬 MySQL / MariaDB
DDL
SHOW CREATE TABLE billing_invoice\GIndexes
SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME, NON_UNIQUE
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME='billing_invoice'
ORDER BY SEQ_IN_INDEX;FKs
SELECT CONSTRAINT_NAME, COLUMN_NAME, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = DATABASE()
AND TABLE_NAME='billing_invoice'
AND REFERENCED_TABLE_NAME IS NOT NULL;🐘 PostgreSQL
Indexes
SELECT i.relname AS index_name, a.attname AS column_name, ix.indisunique
FROM pg_class t
JOIN pg_index ix ON t.oid = ix.indrelid
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(ix.indkey)
WHERE t.relname='billing_invoice';FKs
SELECT con.conname AS fk_name, att2.attname AS col, cl.relname AS ref_table
FROM pg_constraint con
JOIN pg_class cl ON con.confrelid = cl.oid
JOIN pg_attribute att2 ON att2.attrelid = con.conrelid AND att2.attnum = ANY(con.conkey)
WHERE con.contype='f' AND con.conrelid = 'billing_invoice'::regclass;Vue normalisée (extrait)
{"table":"billing_invoice",
"indexes":[{"name":"idx_invoice_ref","cols":["ref"],"unique":false}],
"fks":[{"name":"fk_invoice_customer","col":"customer_id","ref":"crm_customer(id)"}]}Diff modèles ↔ DB — détection d’anomalies
🔍 Règles simples
- Index manquant si
db_index=Trueouunique=Truedans le modèle mais absent en DB. - FK orpheline si
ForeignKey(to="X")sans table/clé cible en DB. - Dérive de type (ex.
Char(32)vsVARCHAR(64)). - Méta non appliquée :
unique_together/indexesnon reflétés.
def diff(model_json, db_json):
issues=[]
# indexes
wanted=set((tuple(i["fields"]), i.get("unique",False)) for i in model_json["meta"].get("indexes",[]))
have=set((tuple(ix["cols"]), ix.get("unique",False)) for ix in db_json["indexes"])
for w in wanted-have: issues.append({"type":"MISSING_INDEX","detail":w})
# FKs orphelines
# ...
return issues📋 Rapport d’anomalies (exemple)
[
{"type":"MISSING_INDEX","table":"billing_invoice","index":"idx_invoice_ref"},
{"type":"ORPHAN_FK","table":"billing_invoice","fk":"fk_invoice_customer"},
{"type":"TYPE_DRIFT","table":"billing_invoice","col":"ref","model":"varchar(32)","db":"varchar(64)"}
]Exporter en JSON + “chunker” un texte explicatif pour le QA.
Admin & DRF — ModelAdmin / ModelSerializer
🛠️ ModelAdmin
# admin.py (extraction AST analogue)
class InvoiceAdmin(admin.ModelAdmin):
list_display = ("ref","customer","total","created_at")
list_filter = ("status","created_at")
search_fields = ("ref","customer__name")- Capturer
list_display,list_filter,search_fields. - Enrichir les chunks “UX admin” (filtres, colonnes, recherche).
🔌 DRF ModelSerializer
class InvoiceSer(ModelSerializer):
class Meta:
model = Invoice
fields = ("id","ref","customer","total")
extra_kwargs = {"ref":{"validators":[RegexValidator(r"^[A-Z0-9]{8}$")]}}- Extraire
fields,exclude,validators,read_only. - Relier aux champs du modèle pour détecter incohérences.
Sortie d’ingestion — objets JSON + chunks textuels “structure-aware”
🧱 Objet JSON (modèle)
{
"app": "billing",
"model": "Invoice",
"doc": "Factures clients",
"fields": [
{"name":"ref","type":"char","args":{"max_length":32,"db_index":true}},
{"name":"customer","type":"fk","args":{"to":"crm.Customer","on_delete":"CASCADE"}}
],
"meta": {"db_table":"billing_invoice","ordering":["-created_at"]},
"admin": {"list_filter":["status"], "search_fields":["ref","customer__name"]},
"drf": {"fields":["id","ref","customer","total"]},
"db_schema": {"indexes":[{"name":"idx_invoice_ref","cols":["ref"]}]}
}📦 Chunks contextuels (exemples)
CHUNK: billing.Invoice / Champs
- ref: Char(32), db_index
- customer: FK → crm.Customer (CASCADE)
- total: Decimal(10,2)
CHUNK: billing.Invoice / Contraintes
- unique_together: (ref, customer)
- indexes: idx_invoice_ref(ref)
CHUNK: billing.Invoice / Admin & API
- Admin: list_filter=status; search=ref, customer__name
- DRF: fields=id, ref, customer, totalTip : utiliser des titles H2/H3 pour guider le chunker et encodage des méta (table, champs, contraintes).
Checklist & anti-patterns — ingestion structure-aware
✅ Checklist
- AST → modèles/champs/meta OK sur toutes les apps.
- Migrations lues dans l’ordre → état courant cohérent.
- Introspection DB normalisée (index, FKs, types).
- Diff modèles ↔ DB exporté (JSON + résumé lisible).
- Admin/DRF intégrés (list_filter, search, validators).
- Sortie = objet JSON + 3–5 chunks contextuels par modèle.
⛔ Anti-patterns
- Parser naïf basé sur regex (trop fragile).
- Ignorer
RunSQL→ incohérences silencieuses. - Comparer noms d’index sans normaliser colonnes/ordre.
- Oublier les métadonnées (langue, confidentialité) pour le filtrage RAG.
🚀 Conseils de prod
- Stocker un hash par fichier (models/migrations) pour l’incrémental.
- Limiter les chunks à ~300–600 tokens avec overlap 10–20%.
- Tagger chaque chunk:
app,model,section,lang,pii. - Inclure un score de fraicheur (timestamp commit) dans les métas.
