Outils indispensables du DBA
De lâobservabilitĂ© (Prometheus/Grafana, Zabbix, Nagios, Datadog) Ă lâadministration (pgAdmin, SQL DeveloperâŠ), lâautomatisation (Bash, Python, Ansible, Terraform), la gestion des migrations (Liquibase, Flyway) et les plateformes cloud & Kubernetes (Operators, Helm, RDS/Aurora, Cloud SQL).
KPIs: p95 latence, RPO/RTO, lag réplication, bloat, IOPS CI/CD DB: Liquibase/Flyway + Git + Pipelines DBaaS & K8s: Operators, Helm, Secrets/KMS
1) Monitoring â mĂ©triques, logs, traces, alertes
Stack & principes
- Prometheus (scrape) + Grafana (dashboards) â exporters DB.
- Zabbix/Nagios pour checks actifs/synthétiques, ping ports/latence.
- Datadog (SaaS) : métriques, logs, APM, traces distribuées.
- Core KPIs : p95 latence, QPS/TPS, locks, bloat, IOPS, lag réplication, RPO réel.
Exporters
- PG :
postgres_exporter,pgbouncer_exporter - MySQL/MariaDB :
mysqld_exporter - Oracle : agent OEM, scripts SQL + node exporter
Alertes Prometheus (YAML)
groups:
- name: dba.rules
rules:
- alert: PostgresReplicationLagHigh
expr: pg_replication_lag_bytes > 52428800 # 50MB
for: 5m
labels: {severity: critical}
annotations:
summary: "Lag réplication élevé"
runbook: "/runbooks/postgres/replication-lag.md"
- alert: BackupFreshnessExceeded
expr: time() - last_successful_backup_timestamp > 86400
for: 15m
labels: {severity: warning}
Tableau comparatif
| Outil | Type | Forces | Limites |
|---|---|---|---|
| Prometheus | Pull metrics | OSS, flexible, alerting | Long-term storage à prévoir |
| Zabbix | Agent/poll | DĂ©couverte, auto-remĂ©diation | Courbe dâapprentissage |
| Nagios | Checks | Simplicité, plugins | Moins moderne |
| Datadog | SaaS | Intégrations rapides | Coût récurrent |
Anti-patterns : pas dâalertes sur backups/lag, aucun SLO, mĂ©triques non historisĂ©es.
2) Outils dâadministration (clients, GUIs, diagnostics)
Panorama
- Universel : DBeaver (multi-moteur), HeidiSQL (MySQL/MariaDB).
- PG : pgAdmin, psql, pgBouncer UI.
- Oracle : SQL Developer, OEM.
- MySQL/MariaDB : mysql client, MySQL Shell, phpMyAdmin (basique).
Snippets de diagnostic
-- PostgreSQL Top N par temps
SELECT query, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
-- MySQL InnoDB status
SHOW ENGINE INNODB STATUS\G;
-- Oracle plan
EXPLAIN PLAN FOR SELECT * FROM T WHERE COL = :b1;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());Checklist dâusage sĂ©curisĂ©
- AccĂšs via bastion/VPN + MFA, comptes nominatifs.
- Limiter droits (RO vs RW), logs dâaudit activĂ©s.
- Politique de versions (GUI/clients Ă jour, compat driver).
Anti-patterns : comptes partagés admin, outils non patchés, GUI en accÚs public.
3) Scripts & automatisation (Bash, Python, Ansible, Terraform)
Bash/Python (ops rapides)
# Bash â dump Postgres compressĂ©
pg_dump -Fc -j4 appdb | gzip > /bkp/appdb_$(date +%F).dump.gz
# Python â check lag rĂ©plicas (psycopg/connector)
import psycopg
with psycopg.connect(conninfo) as con:
cur=con.execute("SELECT application_name, write_lag FROM pg_stat_replication")
print(cur.fetchall())Ansible (post-install & tuning)
- name: Tune PostgreSQL
hosts: db
become: yes
tasks:
- ini_file:
path: /var/lib/pgsql/data/postgresql.conf
option: shared_buffers
value: 8GB
- service: name=postgresql state=reloadedTerraform (DBaaS)
resource "aws_db_instance" "pg" {
engine="postgres"
instance_class="db.m6g.large"
multi_az=true
allocated_storage=200
backup_retention_period=35
deletion_protection=true
publicly_accessible=false
}Anti-patterns : scripts non versionnés, pas de dry-run, secrets en clair, pas de rollback.
4) Migrations de schĂ©ma â Liquibase & Flyway
Principes
- Versionner le schéma en Git, idempotent + rollback défini.
- Pattern expand/contract pour zero-downtime (ajout colonne nullable â backfill â rendre NOT NULL).
- Tests : EXPLAIN des nouvelles requĂȘtes + temps de lock/DDL.
Liquibase (XML extrait)
<changeSet id="2025-09-13-001" author="dba">
<addColumn tableName="orders">
<column name="status" type="varchar(16)" defaultValue="NEW"/>
</addColumn>
</changeSet>Flyway (SQL)
-- V20250913__add_index_orders.sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_created
ON orders(customer_id, created_at);Pipeline CI (pseudo-yaml)
steps:
- checkout
- restore-db-from-snapshot
- run: liquibase updateSQL && liquibase update
- run: psql -f tests/explain_checks.sql
- run: python tests/locks_timing.py
- approve: manual_gate
- deploy: prod-windowAnti-patterns : migrations manuelles en prod, DDL longues sans fenĂȘtrage, absence de rollback/backout.
5) Cloud & Kubernetes â Operators, Helm, DBaaS
Operators (PostgreSQL)
- Zalando / Crunchy Data : failover automatique, backups S3, users/roles CRDs.
- Resources : StatefulSets, PVC, PDB (budgets de perturbation).
- Snapshots CSI + rétention, restore de contrÎle.
# Manifest (extrait CRD Zalando)
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata: {name: app-cluster}
spec:
numberOfInstances: 3
volume:
size: 200Gi
users:
app_user: ["reader", "writer"]Helm (valeurs extraites)
replicaCount: 3
resources:
requests: {cpu: "2", memory: "8Gi"}
limits: {cpu: "4", memory: "16Gi"}
persistence:
enabled: true
size: 200Gi
podDisruptionBudget:
minAvailable: 2DBaaS
- AWS : RDS/Aurora (Multi-AZ, Global DB).
- GCP : Cloud SQL / AlloyDB.
- Azure : Flexible Server.
Sécurité & secrets
- Secrets K8s â External Secrets + KMS (AWS/GCP/Azure) ; rotation.
- Réseau : policies, privés par défaut, egress restreint.
- Backups : bucket immuable (Object Lock), encryption server-side.
FinOps
- Right-sizing, auto-pause dev, tiering stockage (Standard/GP3/IO2).
- Alertes ⏠/jour, rapports IOPS/latence.
Anti-patterns : exécuter bases critiques en K8s sans Operator mature, pas de PDB, pas de snapshots, DB publique.
6) ObservabilitĂ© â dashboards, AWR/ASH, perf schema, SLO
Dashboards indispensables
- Santé : uptime, p95/99, erreurs, connexions, saturation pool.
- Stockage : IOPS, latence, espace libre, bloat/fragmentation.
- Réplication : lag, slots, divergence, switchover logs.
- Backups : fraßcheur, durée, taux succÚs.
Outils spécifiques
- PostgreSQL :
pg_stat_statements,pg_stat_activity, extensions. - Oracle : AWR/ASH, ADDM, SQL Monitor.
- MySQL : Performance Schema, slow query log.
SLO & budgets dâerreur
- Dispo ℠99.95%, RPO †5m, RTO †30m ; latence p95 lecture/écriture.
- Alertes âbruitâ limitĂ©es (agrĂ©gation, inhibition, time windows).
Diagramme â chaĂźne dâobservabilitĂ©
Anti-patterns : dashboards non alignĂ©s sur SLO, pas dâAlertmanager, logs sans rĂ©tention ni centralisation.
7) CI/CD & Policy-as-Code pour la base de données
Pipelines DB
- PR â lint SQL, dry-run des migrations, EXPLAIN checks, tests de locks/DDL.
- Gate manuel prod, fenĂȘtres de dĂ©ploiement, backout plan.
- Artifacts : rapports EXPLAIN, AWR/ASH, slow logs.
Policy-as-Code
# Conftest/OPA (rÚgle simplifiée)
package dbpolicy
deny[msg] {
input.change.type == "DROP_TABLE"
msg := "DROP TABLE interdit hors fenĂȘtre approuvĂ©e"
}GitOps
- Source of truth : repo schéma/migrations.
- Tagging/versions = déploiement DB synchronisé avec app.
- Auditabilité complÚte (qui/quoi/quand).
Anti-patterns : migrations par SSH en prod, absence de PR review, pas de tests EXPLAIN.
8) Stacks types par moteur (Admin, Obs, Backup, HA)
| Moteur | Admin | Backup | Obs/Monit | HA/DR | Cloud/K8s |
|---|---|---|---|---|---|
| PostgreSQL | psql, pgAdmin, DBeaver | pg_dump/pg_restore, pg_basebackup, WAL | postgres_exporter, Grafana, pganalyze | Patroni, pg_auto_failover, PITR | RDS/Aurora, Cloud SQL, Operators Zalando/Crunchy |
| Oracle | SQL Developer, OEM | RMAN, Flashback (opt.) | AWR/ASH, OEM | RAC, Data Guard | OCI, Exadata, Autonomous |
| MySQL/MariaDB | mysql client, MySQL Shell, HeidiSQL | mysqldump, XtraBackup/mariabackup | mysqld_exporter, Percona, Orchestrator | Group Replication / Galera, semi-sync | RDS/Aurora MySQL, Azure/GCP, MaxScale |
Diagramme rapide â Toolchain DBA
Anti-patterns : pas de Git pour le schĂ©ma, CI/CD Ă la main, pas dâIaC, absence de mĂ©triques.
