Project Oxygen & Ideo-Lab · IDEO LAB Dashboard 2026

☁️ Scaleway Cloud – Hyper-Dense Guide

A practical, production-focused map of Scaleway services: compute, Kubernetes, serverless, storage, managed data services, observability, security, governance, and cost control.

Core
Compute
Containers
Network
Storage
Data
Observability
Security
FinOps
1.1

Foundations

Account/projects model, API-first, regions/AZ strategy, service boundaries, production basics.

Core · Start · Ops
1.2

Reference Landing Zone

Network segmentation, shared services, logging, secrets, environments, delivery pipelines.

Platform · LZ · Guardrails
1.3

IaC & Automation

Terraform-first patterns, idempotent deploys, drift control, CI gates, reproducible environments.

IaC · Terraform · Drift
1.4

Reference Architectures

3-tier web, container platform, event-driven serverless, data platform patterns.

Architecture · Patterns · Design
1.5

APIs & Tooling

Console vs CLI vs API, access tokens, automation workflows, operational scripting conventions.

API · CLI · Automation
1.6

Cheat-sheet

Quick commands, checklists, deployment templates, incident triage shortcuts.

Quickstart · Checklists · Ops
2.1

Instances (Virtual Machines)

CPU/memory sizing, disk strategy, images, patching, backups, lifecycle and automation.

Compute · VM · Sizing
2.2

Elastic Metal & Dedicated

Bare metal choices, dedicated performance, provisioning model, workloads: DB, GPU, high IO.

Bare Metal · Performance · Ops
2.3

GPU & AI Workloads

GPU sizing, storage throughput, container strategy, batch inference patterns, cost controls.

GPU · AI · Cost
2.4

Images & Bootstrapping

Golden images, cloud-init baseline, configuration drift prevention, secrets injection.

Images · Cloud-init · Hygiene
2.5

Backup & DR

RPO/RTO tiers, snapshot strategy, restore drills, multi-AZ design, runbooks.

Backup · DR · Runbooks
2.6

Operations Playbook

Access, patch cycles, monitoring agents, hardening, incident steps, postmortems.

Ops · SRE · Runbook
3.1

Kubernetes (Kapsule / Kosmos)

Cluster design, node pools, upgrades, network policy, multi-AZ, security & ops patterns.

K8s · Managed · Upgrades
3.2

Container Registry

Immutable images, promotion by digest, scanning gates, SBOM, signing strategy.

Registry · Supply Chain · CI/CD
3.3

Serverless Containers

Stateless web workloads, scale-to-zero, deploy flows, timeouts, concurrency, reliability.

Serverless · Containers · Events
3.4

Serverless Functions

Triggers, packaging, environments, retries, idempotency, dead-letter strategy.

Serverless · Functions · Reliability
3.5

Ingress & Exposure

Edge entry points, TLS, L7 routing, WAF-equivalent strategy, private services exposure.

Ingress · TLS · Edge
3.6

GitOps & Delivery

Declarative deploys, environments, progressive rollout (canary/blue-green), rollback playbooks.

GitOps · Rollouts · Evidence
4.1

Network Core

Private networks/VPC-like patterns, subnetting, segmentation, routing, service boundaries.

Network · Segmentation · Routing
4.2

Public Exposure

Public IP strategy, NAT/egress control, reverse proxy layer, rate limits and resilience.

Public · Edge · DDoS-ish
4.3

Network Security

Security groups, least privilege, egress allowlists, service-to-service isolation.

Security · SG · Egress
4.4

DNS Patterns

Split-horizon, internal naming, private service endpoints, cluster DNS considerations.

DNS · Private · Ops
4.5

Hybrid Connectivity

IP planning, tunnels, routing rules, operational ownership, failover behavior documentation.

Hybrid · Routing · Docs
4.6

Network Troubleshooting

Latency, MTU, DNS failures, packet loss, K8s connectivity issues, structured triage.

Triage · Latency · Runbook
5.1

Object Storage (S3-compatible)

Buckets, lifecycle, versioning strategy, access controls, encryption, backup/archives.

Storage · S3 · Lifecycle
5.2

Block Storage

SSD volumes, IOPS/throughput thinking, DB workloads, snapshots, resize, consistency rules.

Block · SSD · Snapshots
5.3

Storage Backup Strategy

3-2-1 design, immutable copies, restore drills, retention tiers, legal hold patterns.

Backup · Immutability · RPO
5.4

Storage Security

Bucket policies, key management pattern, least privilege, audit logs, access reviews.

Security · IAM · Audit
5.5

Storage Performance

Throughput vs IOPS, parallelism, multipart upload, cache strategy, DB WAL patterns.

Perf · IOPS · Tuning
5.6

Storage Cost Control

Lifecycle policies, cold tiers, log ingestion vs budget, data egress awareness.

FinOps · Lifecycle · Egress
6.1

Managed Relational DB

PostgreSQL/MySQL managed instances: HA mindset, backups, upgrades, monitoring, safe migrations.

DB · PostgreSQL · Managed
6.2

Serverless SQL

Serverless database principles: scale-to-zero, connection patterns, pooling, latency trade-offs.

Serverless · SQL · Trade-offs
6.3

Managed Redis

Low-latency caching, persistence choices, eviction policy, HA, session storage design.

Redis · Cache · Latency
6.4

Managed NoSQL

Document-oriented DB patterns: indexing, TTL, schema evolution, backups, scaling strategy.

NoSQL · Documents · Modeling
6.5

Analytics (Warehouse)

Analytical workloads: ingestion, partitioning, materialized views, cost control, governance basics.

Analytics · Warehouse · Cost
6.6

Search / OpenSearch

Index design, shards/replicas, ingestion pipelines, query profiling, retention, observability.

Search · OpenSearch · Tuning
7.1

Cockpit (Observability)

Unified observability: metrics, logs, dashboards, alert routing, and operational visibility.

Observability · Metrics · Logs
7.2

Central Logging

Log taxonomy, retention tiers, sampling, PII discipline, cost-aware ingestion strategy.

Logs · Cost · Retention
7.3

APM & Tracing

Distributed tracing, correlation IDs, RED/USE metrics, latency SLOs, dependency maps.

APM · Tracing · SLO
7.4

Alerting System

Actionable alerts only: ownership, severity, runbooks, escalation, noise control.

Alerts · Noise · Runbooks
7.5

SRE Workflow

Incident lifecycle, postmortems, error budgets, continuous improvement loops.

SRE · Postmortem · Loop
7.6

Observability Cost Control

High-volume telemetry: sampling, drop rules, archive, and “signal-first” dashboards.

FinOps · Sampling · Archive
8.1

Identity & Access

Service accounts/tokens, RBAC mapping, least privilege, secretless runtime access patterns.

IAM · RBAC · Tokens
8.2

Secrets & Key Management

Secrets lifecycle, rotation, injection patterns, auditability, and “no secrets in code” rules.

Secrets · Rotation · Audit
8.3

Security Baseline

Hardening, patch cadence, vulnerability management, supply chain controls, secure defaults.

Baseline · Hardening · Scan
8.4

Edge Security

TLS, WAF-like controls, rate limiting, bot mitigation, incident playbooks at the edge.

Edge · TLS · Rate limit
8.5

Compliance & Data Protection

Data classification, retention, encryption posture, audit evidence, operational controls.

Compliance · Retention · Evidence
8.6

Security Incident Response

Containment steps, credential rotation, forensic preservation, timeline, and prevention work.

IR · Forensics · Playbook
9.1

FinOps Core

Budgets, tags/labels, showback, cost anomalies, and monthly optimization routines.

FinOps · Budgets · Anomaly
9.2

Compute Cost Playbook

Rightsizing, autoscaling, reserved/commit models, environment shutdown, batch windows.

Optimize · Scale · Shutdown
9.3

Storage Cost Playbook

Lifecycle rules, cold tiers, archive policies, versioning, and egress awareness.

Storage · Lifecycle · Egress
9.4

Data Services Cost

Replica strategy, HA tiers, backup retention, scaling triggers, and query-cost discipline.

Data · HA · Retention
9.5

Observability Spend

Control ingestion, sampling, keep only high-value logs hot, archive the rest.

Logs · Sampling · Archive
9.6

FinOps KPIs

Unit cost metrics, cost per request, cost per tenant, cost per GB stored, cost per deploy.

KPIs · Unit cost · Govern
1.1 Foundations (Projects, API-first mindset, production rules)
Scope & environment model
Recommended environments
- sandbox (experiments)
- dev (integration)
- staging (release candidate)
- prod (strict guardrails)

Core principles
- separate environments by projects and access controls
- define naming standards and ownership labels
- enforce defaults via IaC templates
Rule: treat environments as products: consistent, repeatable, and auditable.
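A minimal sketch of how naming standards and ownership labels could be checked before a deploy. The `<env>-<team>-<service>` pattern and the `owner` / `env` label keys are hypothetical conventions chosen for illustration, not Scaleway requirements.

```python
import re

# Hypothetical convention: <env>-<team>-<service>, lowercase, hyphen-separated.
ENVIRONMENTS = ("sandbox", "dev", "staging", "prod")
NAME_PATTERN = re.compile(
    r"^(?P<env>" + "|".join(ENVIRONMENTS) + r")-(?P<team>[a-z0-9]+)-(?P<service>[a-z0-9-]+)$"
)
REQUIRED_LABELS = ("owner", "env")  # assumed label keys


def validate_resource(name: str, labels: dict[str, str]) -> list[str]:
    """Return a list of violations; an empty list means the resource passes."""
    problems = []
    match = NAME_PATTERN.match(name)
    if match is None:
        problems.append(f"name '{name}' does not match <env>-<team>-<service>")
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing label '{key}'")
    if match and labels.get("env") and labels["env"] != match.group("env"):
        problems.append("label 'env' disagrees with the name prefix")
    return problems
```

A check like this would typically run as a CI gate against the planned resources, so violations are caught before anything is created.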
Production rules (non-negotiable)
  • Everything deployable via IaC (no “clickops” drift).
  • Central logs + alerts from day one (no blind spots).
  • Secrets not stored in app config or repos (rotation required).
  • Least privilege access, time-bound where possible.
  • Backup/restore drills are scheduled and measured.
Service selection framework
Need | Default choice | Escalate to
Fast web API | managed containers / K8s | VMs for special cases
Batch jobs | serverless containers | dedicated compute for heavy IO/GPU
Relational DB | managed DB | bare metal for extreme constraints
Object data | object storage | archive tiers and lifecycle rules
1.2 Reference Landing Zone (Network, shared services, guardrails)
Topology blueprint
Edge (public)
  - reverse proxy / ingress
  - TLS termination
  - rate limiting + bot protection

Private networks
  - app subnet(s)
  - data subnet(s)
  - admin subnet(s) (bastion-like access)

Shared
  - central logging
  - secrets + rotation
  - CI/CD runners (if self-hosted)
  - artifact registry
Rule: keep public surface minimal; everything else private by default.
Shared services (platform subscription equivalent)
  • Central observability workspace (metrics/logs) and alert routing.
  • Secrets store + rotation workflow (and incident “break glass” policy).
  • Container registry and artifact promotion rules.
  • Network egress control points and DNS/naming conventions.
Guardrails (policy-as-code mindset)
  • Enforce naming/labels and ownership on resources.
  • Block direct public exposure of data services unless explicitly approved.
  • Mandatory logging configuration for compute and platforms.
  • Minimum baseline for TLS, credentials, and patching.
Ops evidence: what you must be able to prove
Evidence | How | Why
Who deployed what | CI logs + artifact digests | auditability
Security posture | scan reports + patch reports | risk control
Recoverability | restore drill results | real DR
SLO compliance | dashboards + incidents | customer trust
1.3 IaC & Automation (Terraform-first, drift control, CI gates)
Terraform workflow (gold standard)
Stages
1) fmt + validate
2) plan (saved plan)
3) policy checks (custom)
4) approval gate (prod)
5) apply
6) smoke tests + monitoring hooks
Rule: no apply in prod without a reviewed plan.
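One way to make the "reviewed plan" rule enforceable is to tie the approval to a fingerprint of the saved plan file, so that only the exact plan that was reviewed can be applied. The sketch below assumes this approach; `plan_fingerprint` and `may_apply` are hypothetical helper names.

```python
import hashlib


def plan_fingerprint(plan_bytes: bytes) -> str:
    """SHA-256 of the saved plan file contents."""
    return hashlib.sha256(plan_bytes).hexdigest()


def may_apply(environment: str, plan_bytes: bytes, approved_fingerprints: set[str]) -> bool:
    """Non-prod applies freely; prod requires the exact plan that was reviewed."""
    if environment != "prod":
        return True
    return plan_fingerprint(plan_bytes) in approved_fingerprints
```

The approval gate would record the fingerprint at review time, and the apply stage would call `may_apply` with the bytes of the plan file it is about to use.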
Drift control
  • Scheduled plan to detect drift.
  • Alert on out-of-band changes.
  • Either reconcile (apply) or revert (incident).
  • Track “exceptions” explicitly and time-bound them.
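A scheduled drift check can be sketched with `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 2 when changes are present, and 1 on error. The status names below are illustrative. A non-empty plan can also mean unapplied configuration changes, so the result still needs a human or pipeline decision.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` results to a drift status."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "drift"
    return "error"


def detect_drift(workdir: str) -> str:
    """Run a plan in workdir (needs terraform on PATH and an initialized directory)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```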
1.4 Reference Architectures (web, container platform, serverless, data)
3-tier baseline (private-first)
Internet -> Edge (TLS + routing + rate limits)
  -> App (containers / K8s / VMs in private networks)
    -> Data (managed DB + object storage)
Observability + secrets + backups are platform-wide.
K8s platform baseline
  • Separate system and workload node pools.
  • GitOps deployment with environment overlays.
  • Network policy + minimal service exposure.
  • Supply chain gates: scan + SBOM + signature verification.
  • Observability: metrics + logs + traces as default.
Event-driven serverless baseline
Triggers -> Serverless Functions / Containers
  -> durable storage (DB/object)
  -> dead-letter strategy + alerts
  -> idempotency keys for every handler
Rule: retries without idempotency create data corruption.
Data platform sketch
Ingest -> Object storage (raw)
Transform -> compute (batch / containers)
Serve -> warehouse / search index / APIs
Govern -> access model + retention + audit trail
1.5 APIs & Tooling (Console/CLI/API, automation conventions)
Automation conventions
  • Prefer API/IaC over console for repeatability.
  • Store credentials securely; rotate and audit.
  • Every script must be idempotent and log its actions.
  • Keep a “break glass” playbook, but isolate it.
Script contract
- inputs validated
- dry-run supported
- logs to stdout in structured lines
- exit codes reliable
- safe retries
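The script contract above could look roughly like this in practice. It is a sketch only: the `--project` flag, the target list, and the "stop" action are placeholders, and a real script would call an API where the comments indicate.

```python
import argparse
import json


def log(event: str, **fields) -> None:
    """Emit one structured JSON line on stdout."""
    print(json.dumps({"event": event, **fields}, sort_keys=True))


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Stop idle resources (illustrative).")
    parser.add_argument("--project", required=True)
    parser.add_argument("--dry-run", action="store_true")
    try:
        args = parser.parse_args(argv)  # input validation
    except SystemExit as exc:
        return int(exc.code or 0)  # argparse uses exit code 2 for invalid input
    targets = ["vm-a", "vm-b"]  # placeholder; a real script would query an API
    for target in targets:
        if args.dry_run:
            log("would_stop", project=args.project, target=target)
        else:
            # A real call goes here; stopping an already stopped target
            # should be a no-op so the script is safe to retry.
            log("stop", project=args.project, target=target)
    return 0
```

Calling `main(["--project", "demo", "--dry-run"])` prints two `would_stop` lines and returns 0; missing inputs return a non-zero code that a pipeline can rely on.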
1.6 Cheat-sheet (Checklists, templates, incident shortcuts)
Platform checklist
Landing zone
- private networks segmentation
- edge entry points minimal
- centralized observability + alert routing
- secrets lifecycle + rotation
- backups + restore drills
- IaC modules + CI gates
- supply chain controls for containers
Serverless checklist
Serverless reliability
- idempotency keys
- bounded retries
- dead-letter strategy + alerts
- timeouts sized per workload
- concurrency limits
- structured logs + tracing
Cost checklist
FinOps loop (monthly)
- top 10 spenders review
- rightsizing candidates
- storage lifecycle enforcement
- log ingestion reduction
- idle resources cleanup
- unit cost KPIs (per request / per tenant)
Incident shortcut
Triage steps
1) user impact scope (SLO breach?)
2) recent deployments
3) saturation signals (CPU/mem/IO/conn)
4) network/DNS failures
5) data errors (locks/slow queries)
6) rollback or mitigation
7) postmortem actions
2.1 Instances (VMs) – Sizing, disks, images, patching, lifecycle
Sizing method (no guessing)
Signal | What to watch | Action
CPU | p95 utilization + steal | rightsize / scale out
Memory | pressure + OOM risk | increase RAM / reduce footprint
Disk | IOPS/throughput + queue | move to faster volume / shard
Network | pps + retransmits | tune edge / improve routing
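As an illustration of sizing from measurements rather than guesses, the sketch below computes a nearest-rank p95 from CPU samples and maps it to an action. The 30% and 75% thresholds are assumptions for the example, not vendor guidance.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def cpu_recommendation(cpu_percent: list[float], low: float = 30.0, high: float = 75.0) -> str:
    """Map p95 CPU utilization to a sizing action (thresholds are illustrative)."""
    value = p95(cpu_percent)
    if value >= high:
        return "scale out or upsize"
    if value <= low:
        return "rightsize down"
    return "keep"
```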
Disk strategy (DB-grade thinking)
  • Separate OS disk from data disk when needed.
  • For databases: isolate WAL/redo logs if possible; measure IOPS and fsync latency.
  • Snapshots are not backups unless restore is tested and retention is enforced.
  • Use filesystem options aligned with workload (barriers, journaling choices).
Rule: treat storage latency as a primary production KPI.
Ops baseline
  • SSH via controlled entry (no open world access).
  • Patching cadence + emergency patch process.
  • Central logs and metrics with alerts on saturation.
  • Immutable infrastructure mindset where possible (rebuild over patch drift).
2.2 Elastic Metal & Dedicated – When bare metal is justified
Decision criteria
Constraint | Why metal | Mitigation if not
Extreme IO | lowest latency, dedicated throughput | sharding + caching
Licensing | per-core constraints | optimize core counts
Isolation | strict tenancy needs | strong security baseline
GPU intensive | dedicated accelerators | batch windows + scaling
Rule: bare metal increases operational responsibility—plan automation and monitoring first.
2.3 GPU & AI – Container strategy, batch inference, cost discipline
GPU platform patterns
  • Prefer containers for reproducibility (drivers/toolkit pinned).
  • Separate training vs inference: different scheduling and scaling models.
  • Use batch windows and auto-shutdown for idle GPU time.
Cost controls (mandatory)
  • Define maximum concurrency and max runtime per job.
  • Track cost per 1k inferences / per training epoch.
  • Cache model artifacts in object storage with versioning.
2.4 Images & Bootstrapping – Golden images + cloud-init baseline
Golden image contract
Golden image must include
- base hardening (sshd settings, firewall defaults)
- monitoring agent install step
- log forwarding configuration
- time sync and DNS defaults
- minimal packages only

cloud-init responsibilities
- inject host keys safely
- configure app runtime
- register into monitoring
- pull secrets from secure store
Rule: rebuild is safer than mutate. Keep servers disposable.
2.5 Backup & DR – RPO/RTO tiers, snapshots, restore drills
Tier | Target | Design | Verification
Tier 0 | minutes | multi-AZ + replication | game day drills
Tier 1 | hours | snapshots + managed backups | monthly restores
Tier 2 | day | object backups + manual | quarterly audits
Rule: backups are only real if you restore and measure recovery time.
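A restore drill can be scored mechanically once recovery time is measured. The sketch below assumes concrete upper bounds per tier (15 minutes, 4 hours, 24 hours); the table only gives orders of magnitude, so these numbers are placeholders to replace with agreed targets.

```python
from dataclasses import dataclass

# Assumed RTO upper bounds in seconds; adjust to the targets agreed per tier.
RTO_SECONDS = {"tier0": 15 * 60, "tier1": 4 * 3600, "tier2": 24 * 3600}


@dataclass
class DrillResult:
    tier: str
    restore_seconds: float
    data_verified: bool  # restored data was checked, not just the job status


def drill_passed(result: DrillResult) -> bool:
    """A drill passes only if data was verified and recovery met the tier target."""
    return result.data_verified and result.restore_seconds <= RTO_SECONDS[result.tier]
```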
2.6 Operations Playbook – Access, patching, monitoring, incidents
Operational loop
Daily
- check SLO dashboards
- review alerts + top errors
- confirm backup jobs

Weekly
- patch window for non-prod
- capacity review (CPU/mem/IO)
- vulnerability scan review

Monthly
- cost review + rightsizing
- restore drill
- postmortem action items verification
3.1 Kubernetes (Kapsule / Kosmos) – Cluster design, security, upgrades
Cluster foundation
  • Separate system and workload pools.
  • Define ingress strategy and TLS as a platform standard.
  • Use autoscaling carefully: HPA + cluster autoscaler with safe limits.
  • Pin base images and enforce immutable deployments.
Security essentials
  • Network policies: default deny + allow by service needs.
  • RBAC: least privilege, separate admin from deploy roles.
  • Pod security and runtime constraints (no privileged by default).
  • Supply chain: scan + SBOM + signature validation in CI/CD.
Operations
  • Upgrades: staged, maintenance windows, canary cluster if needed.
  • Observability: cluster + node + workload dashboards.
  • Backups: stateful systems are backed up outside the cluster; configs are GitOps.
Resilience
Resilience checklist
- readiness/liveness probes
- pod disruption budgets
- multi-node spread (anti-affinity)
- rate limits at ingress
- graceful shutdown
- chaos-style drills (optional but valuable)
3.2 Container Registry – Promotion by digest, SBOM, signing, scanning gates
Supply chain gates
Gate | What it checks | Block on
Vuln scan | CVEs in OS/libs | high/critical
SBOM | dependency inventory | missing SBOM
Signature | image provenance | unsigned images
Policy | base image allowlist | unapproved base
Rule: deploy by digest, not mutable tags.
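A deploy gate for this rule can be as small as a check that every image reference is pinned with `@sha256:<64 hex chars>`. The sketch below shows that check; `registry.example.com` is a placeholder hostname.

```python
import re

# repository@sha256:<64 lowercase hex characters>
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """True when the reference uses an immutable digest instead of a tag."""
    return DIGEST_REF.match(image_ref) is not None
```

For example, `registry.example.com/app/api:latest` is rejected, while the same repository followed by `@sha256:` and a full digest is accepted.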
3.3 Serverless Containers – Stateless workloads, scaling, timeouts, reliability
Best for
  • Stateless web APIs and job-like workloads.
  • Scale-to-zero services with bursty traffic.
  • Event-driven handlers packaged as containers.
Rule: keep state in managed services (DB/object), never on ephemeral filesystem.
Reliability checklist
Reliability
- strict request timeout budgeting
- bounded concurrency
- retry policy aligned with idempotency
- dead-letter handling for async patterns
- structured logs + correlation IDs
Cost discipline
  • Track cost per request and cost per job.
  • Cap max scale for “runaway traffic” scenarios.
  • Use caching and edge rate limits to avoid amplification.
3.4 Serverless Functions – Triggers, retries, idempotency, dead-letters
Golden rules
  • Idempotency is mandatory for event handlers.
  • Use deterministic retry strategy (max attempts, backoff, time budget).
  • Write logs as structured events with correlation IDs.
  • Separate “poison messages” to a dead-letter stream and alert on it.
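The retry rule above (max attempts, backoff, time budget) can be sketched as a pure function that returns the delays to wait before each retry. The parameter names are illustrative. The schedule is deterministic on purpose, so it contains no random jitter.

```python
def backoff_schedule(max_attempts: int, base_seconds: float,
                     cap_seconds: float, budget_seconds: float) -> list[float]:
    """Delays before each retry: exponential, capped, and cut off at the time budget."""
    delays = []
    total = 0.0
    for attempt in range(max_attempts - 1):  # the first attempt has no delay
        delay = min(cap_seconds, base_seconds * (2 ** attempt))
        if total + delay > budget_seconds:
            break  # the next retry would exceed the overall time budget
        delays.append(delay)
        total += delay
    return delays
```

When the schedule is exhausted and the handler still fails, the message belongs in the dead-letter stream rather than in another retry.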
Handler skeleton (concept)
- validate payload
- compute idempotency key
- check processed marker
- process business logic
- persist result atomically
- return success
- on error: classify retryable vs non-retryable
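The skeleton above can be sketched as follows. This is an illustration under assumptions: the payload fields `order_id` and `amount` are hypothetical, and the in-memory `processed` dict stands in for a durable store that can write the result and the processed marker atomically.

```python
import hashlib
import json


class NonRetryableError(Exception):
    """The payload is invalid; retrying cannot help (route to dead-letter)."""


processed: dict[str, dict] = {}  # stand-in for a durable store


def idempotency_key(payload: dict) -> str:
    """Stable key: hash of the canonical JSON form of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def handle(payload: dict) -> dict:
    if "order_id" not in payload or "amount" not in payload:  # validate payload
        raise NonRetryableError("missing required fields")
    key = idempotency_key(payload)
    if key in processed:  # already handled: return the stored result
        return processed[key]
    result = {"order_id": payload["order_id"], "charged": payload["amount"]}
    processed[key] = result  # in production: persist result and marker together
    return result
```

Delivering the same event twice returns the same result and leaves a single record, which is what makes retries safe.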
3.5 Ingress & Exposure – TLS, routing, rate limits, private services
Concern | Edge control | Notes
TLS | terminate + rotate certs | enforce modern ciphers
Routing | L7 rules | path-based and host-based
Abuse | rate limits + IP rules | prevent traffic amplification
Private services | internal routing | avoid public endpoints
Rule: if it does not need to be public, do not make it public.
3.6 GitOps & Delivery – Environments, progressive rollouts, rollback playbooks
Release patterns
Pattern | Best for | Requirement
Blue/green | safe cutover | traffic switch + fast rollback
Canary | risk reduction | metric-based promotion
Rings | enterprise | progressive exposure
Rule: rollout strategy requires SLO dashboards and rollback automation.
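Metric-based canary promotion can be reduced to a small decision function that compares the canary's error rate with the baseline. The thresholds below (`min_requests`, `max_ratio`, and the 0.1% floor) are illustrative assumptions, not recommended values.

```python
def canary_decision(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    min_requests: int = 500, max_ratio: float = 1.5) -> str:
    """Return 'wait', 'promote', or 'rollback' from request and error counts."""
    if canary_total < min_requests:
        return "wait"  # not enough traffic to judge
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    # Small absolute floor so a zero-error baseline does not block every canary.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"
```

The same shape works for latency: compare canary p95 against baseline p95 with an allowed ratio, and wire the "rollback" outcome to the rollback automation.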
4.1 Network Core – Segmentation, routing, service boundaries
Segmentation blueprint
Network zones
- edge (public entry)
- app (private workloads)
- data (private databases)
- admin (restricted access)
- shared (observability, registry, secrets)
Rule: segmentation is an incident containment tool, not a checkbox.
4.2 Public Exposure – IP strategy, NAT/egress control, resilience
Public entry rules
  • Terminate TLS at a controlled edge layer.
  • Rate limit by IP and by identity where possible.
  • Implement request timeouts and size limits.
  • Log edge events and alert on anomalies.
Rule: edge defenses must be measurable (traffic, blocks, latency impact).
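Rate limiting by IP or identity is commonly implemented with a token bucket per client key. The sketch below is one such implementation with an injectable clock; the rate of 1 request per second and burst of 2 are arbitrary example values, and a real edge layer would usually provide this natively.

```python
class TokenBucket:
    """Allows `burst` requests at once, then refills at `rate_per_second`."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = 0.0

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def allow_request(client_key: str, now: float) -> bool:
    """client_key can be an IP address or an identity such as a token ID."""
    bucket = buckets.setdefault(client_key, TokenBucket(rate_per_second=1.0, burst=2))
    return bucket.allow(now)
```

Counting allowed and blocked decisions per key gives the measurable signal the rule above asks for.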
4.3 Network Security – Security groups, egress allowlists, isolation
Control | Goal | Common failure
Ingress rules | allow only required ports | 0.0.0.0/0 to admin ports
Egress rules | prevent data exfil | allow all outbound by default
Service isolation | contain compromise | flat network with shared creds
Rule: outbound traffic control is often your strongest last-line defense.
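The most common ingress failure in the table (admin ports open to 0.0.0.0/0) is easy to detect automatically. The sketch below assumes rules exported as dicts with `cidr` and `port` keys, which is a hypothetical shape, and an example set of admin ports.

```python
import ipaddress

ADMIN_PORTS = {22, 3389, 5432, 6379}  # assumed list: SSH, RDP, PostgreSQL, Redis


def risky_ingress(rules: list[dict]) -> list[dict]:
    """Return rules that open an admin port to any address (IPv4 or IPv6)."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"], strict=False)
        open_to_world = network.prefixlen == 0  # 0.0.0.0/0 or ::/0
        if open_to_world and rule["port"] in ADMIN_PORTS:
            flagged.append(rule)
    return flagged
```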
4.4 DNS Patterns – Split-horizon, internal naming, cluster DNS
DNS rules
  • Document resolution chain (who resolves what, where, and why).
  • Use internal names for private services; keep external DNS minimal.
  • For Kubernetes: standardize service discovery and ingress hostnames.
Rule: most “mysterious outages” are DNS + routing + timeouts combined.
4.5 Hybrid Connectivity – IP planning, routing rules, failover ownership
Hybrid contract
Hybrid must define
- prefix plan (no overlaps)
- routing ownership (who changes what)
- failover behavior (tested)
- change windows and rollback
- monitoring for tunnel health
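The "no overlaps" requirement of the prefix plan can be verified with the standard `ipaddress` module. The site names and CIDR ranges in the usage note are made up for the example.

```python
import ipaddress
from itertools import combinations


def overlapping_prefixes(plan: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of named prefixes that overlap, in sorted name order."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].version == networks[b].version
        and networks[a].overlaps(networks[b])
    ]
```

With `{"onprem": "10.0.0.0/16", "cloud-app": "10.0.128.0/20", "cloud-data": "10.1.0.0/20"}` the function reports the `cloud-app` / `onprem` pair, because the cloud range sits inside the on-premises range.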
4.6 Network Troubleshooting – Structured triage for latency, loss, DNS
Triage checklist
Symptom | Check | Action
Timeouts | edge logs + upstream latency | tighten timeouts, fix bottleneck
DNS failures | resolver health + TTL | stabilize DNS chain
Packet loss | retransmits, MTU | fix MTU or routing
Slow K8s | network policy + CNI | trace flows, simplify rules
5.1 Object Storage (S3-compatible) – Lifecycle, versioning, access control
Bucket design
  • Separate buckets by data classification and lifecycle needs.
  • Define naming conventions and ownership labels.
  • Prefer immutable object versions for critical artifacts.
Security rules
  • Least privilege: scoped credentials and access review.
  • Encrypt data and restrict cross-project access.
  • Audit access and alert on anomalies.
Lifecycle policy (cost control)
Lifecycle example
- day 0-30: hot
- day 31-180: cool
- day 181+: archive
- delete markers and old versions per policy
Rule: lifecycle policies are your best storage cost lever.
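The lifecycle example above translates directly into an age-to-tier mapping. The tier names are the generic ones used in this guide; they would need to be mapped to the storage classes the provider actually offers.

```python
from datetime import date


def storage_tier(created: date, today: date) -> str:
    """Tier for an object by age: day 0-30 hot, 31-180 cool, 181+ archive."""
    age_days = (today - created).days
    if age_days <= 30:
        return "hot"
    if age_days <= 180:
        return "cool"
    return "archive"
```

A function like this is useful for auditing: listing objects whose current tier differs from the expected one shows where the lifecycle policy is not being enforced.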
5.2 Block Storage – SSD volumes, DB workloads, snapshots, resize
DB-grade checklist
  • Measure fsync latency and queue depth.
  • Separate write-heavy volumes from OS when needed.
  • Snapshots are not a substitute for logical backups.
  • Test restore path and automate validation.
Rule: if IO latency spikes, your entire platform degrades.
5.3 Storage Backup Strategy – 3-2-1, immutability, restore drills
3-2-1
- 3 copies
- 2 different media (block + object)
- 1 offsite (separate project/zone)

Operational must-haves
- documented restore steps
- monthly restore drill
- retention and deletion protection
5.4 Storage Security – Least privilege, audits, encryption posture
Control | How | Outcome
Least privilege | scoped credentials | reduced blast radius
Access review | monthly audit | remove stale access
Encryption | standardize policy | consistent posture
Logging | central logs + alerts | detect anomalies
5.5 Storage Performance – IOPS, throughput, multipart, caching
Performance guidance
  • Object storage: parallel uploads + multipart for big objects.
  • Block storage: monitor queue depth and fsync latency for databases.
  • Caching: avoid re-downloading artifacts; version them and cache safely.
5.6 Storage Cost Control – Lifecycle, cold tiers, egress awareness
Cost levers
  • Lifecycle transitions and deletion policies.
  • Archive old versions; keep only what you restore.
  • Track egress drivers (CDN/edge caches reduce outbound).
Rule: storage costs explode due to “forgotten” data and missing lifecycle rules.
6.1 Managed Relational DB – HA thinking, backups, upgrades, tuning
HA/DR mindset
  • Know your failure domains and design accordingly.
  • Prefer managed HA where available; document failover behavior.
  • Measure RPO/RTO and validate with drills.
Backups and safe migrations
Safe migration flow
1) backup + verify restore path
2) schema change in small steps
3) dual-write or compatibility window (if needed)
4) monitor errors + latency
5) cleanup after stabilization
Performance loop
DB tuning loop
1) capture slow queries
2) explain/analyze
3) index or rewrite
4) validate with p95/p99
5) regression guardrails (tests + dashboards)
6.2 Serverless SQL – Connection patterns, pooling, latency trade-offs
Serverless database pitfalls
  • Cold start latency can hit first queries—budget for it.
  • Connection storms are common: use pooling or connection limits.
  • Long transactions reduce scalability—keep transactions short.
Rule: serverless DB is an application architecture decision, not only a DB decision.
6.3 Managed Redis – Caching strategy, persistence, eviction, HA
Topic | Decision | Rule
TTL | per key class | no infinite TTL without justification
Eviction | policy choice | align with data criticality
Persistence | if needed | cache != source of truth
HA | replication | test failover behavior
6.4 Managed NoSQL – Indexing, schema evolution, TTL, backups
Modeling checklist
  • Design queries first, then indexes.
  • Use explicit version fields for schema evolution.
  • TTL for ephemeral data and cost control.
  • Backup strategy independent from the DB engine.
6.5 Analytics (Warehouse) – Partitioning, ingestion, cost discipline
Warehouse rules
- ingest in append-only patterns
- partition by time and key dimensions
- keep hot and cold datasets separate
- track cost per query / per dashboard
- implement retention and archiving
Rule: analytics cost is dominated by data scanned and query concurrency.
7.1 Cockpit – Unified observability (metrics, logs, dashboards, alerting)
Observability model
Signals
- metrics (fast, low cost)
- logs (deep, higher cost)
- traces (request path)

System
- dashboards for SLOs
- alerts wired to runbooks
- retention policies as code
Rule: if you cannot see it, you cannot operate it.
7.2 Central Logging – Taxonomy, retention tiers, sampling, PII discipline
Log taxonomy
Levels
- audit (security relevant)
- error (actionable failures)
- warn (degradation)
- info (operational events)
- debug (short retention, controlled)
Rule: log retention is a cost and a compliance requirement—treat it as policy.
7.3 APM & Tracing – Correlation IDs, RED/USE, latency SLOs
Minimum viable tracing
  • Correlation ID across services and logs.
  • Trace external dependencies (DB, cache, HTTP calls).
  • Track p95/p99 latency and error rate for each service.
Rule: trace sampling must preserve high-error and high-latency requests.
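A sampling decision that follows this rule keeps every error and slow trace and samples the rest. The sketch below assumes the decision is made after the request completes (tail-based), with an illustrative 1000 ms slow threshold and 5% sample rate.

```python
import zlib


def keep_trace(trace_id: str, had_error: bool, duration_ms: float,
               slow_ms: float = 1000.0, sample_percent: int = 5) -> bool:
    """Always keep error and slow traces; sample the remainder by trace ID."""
    if had_error or duration_ms >= slow_ms:
        return True
    # Deterministic hash so every service makes the same decision for a trace ID.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 100
    return bucket < sample_percent
```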
7.4 Alerting System – Actionable alerts, ownership, runbooks, escalation
Alert | Condition | Runbook
SLO breach | error rate or latency over threshold | rollback / mitigate / scale
Saturation | CPU/mem/IO high + queue | rightsize / scale / shard
Security | auth anomalies | rotate creds / block / investigate
Backup failure | job missing or error | repair + re-run + verify restore
Rule: if an alert cannot be acted upon, it is noise.
7.5 SRE Workflow – Incidents, postmortems, error budgets
Incident lifecycle
Detect -> Triage -> Mitigate -> Recover -> Postmortem

Postmortem must include
- timeline
- root cause
- contributing factors
- detection gaps
- action items with owners and deadlines
Rule: postmortems are improvement engines, not blame tools.
7.6 Observability Cost Control – Sampling, drop rules, archive strategy
Cost levers
  • Sampling for traces and high-volume logs.
  • Keep audit/security logs high priority; reduce verbose app logs.
  • Short hot retention, long archive retention.
Rule: keep the signal hot; archive the history.
8.1 Identity & Access – Tokens, RBAC mapping, least privilege
Access model
Principles
- least privilege by role
- separate admin vs deploy vs read-only
- time-bound access for sensitive actions
- credential rotation policy
- audit trail for privileged operations
Rule: credentials without rotation are liabilities.
8.2 Secrets & Key Management – Rotation, injection, auditability
Secrets lifecycle
Phase | What | Control
Create | generate securely | no manual weak secrets
Store | secure vault | access logs
Inject | runtime fetch | no secrets in images
Rotate | scheduled | alert on failures
Revoke | incident response | fast containment
Rule: design secrets rotation before production launch.
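The "rotate on schedule, alert on failures" control can be backed by a simple age check over rotation timestamps. The sketch assumes the last-rotated times are available from the secrets store's metadata or audit log; the credential names in the usage note are invented.

```python
from datetime import datetime, timedelta


def overdue_credentials(last_rotated: dict[str, datetime],
                        max_age: timedelta, now: datetime) -> list[str]:
    """Names of credentials whose last rotation is older than max_age."""
    return sorted(
        name for name, rotated in last_rotated.items() if now - rotated > max_age
    )
```

With a 90-day policy, a `db-password` rotated four months ago is reported while an `api-token` rotated two weeks ago is not; a non-empty result is what should raise the alert.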
8.3 Security Baseline – Hardening, patching, scanning, secure defaults
Baseline checklist
  • OS hardening and minimal packages.
  • Patch cadence + emergency patch process.
  • Container scanning + signed images.
  • Runtime controls: least privilege and no privileged containers by default.
  • Audit logs routed centrally and retained.
8.4 Edge Security – TLS, rate limits, bot mitigation, incident playbooks
Edge controls
Controls
- strict TLS configuration
- request size limits
- rate limit by IP and by token
- allowlist for admin endpoints
- anomaly detection from edge logs
- fast block / unblock workflow
Rule: your edge is your blast radius boundary.
8.5 Compliance & Data Protection – Classification, retention, encryption, evidence
Area | Control | Proof
Data class | labels + access policy | inventory report
Retention | policy-as-code | config snapshots
Encryption | standard posture | audit checks
Backups | drills | restore logs
8.6 Security Incident Response – Containment, rotation, forensics, prevention
IR playbook
1) contain: block entry, isolate systems
2) preserve evidence: logs, snapshots
3) rotate credentials: tokens, DB creds, registry secrets
4) eradicate: patch, remove persistence
5) recover: restore services, monitor
6) learn: postmortem + guardrails
Rule: rotate creds early; attackers love long-lived secrets.
9.1 FinOps Core – Budgets, showback, anomalies, monthly routines
FinOps loop
Weekly
- anomaly detection review
- top spenders quick scan

Monthly
- rightsizing and idle cleanup
- storage lifecycle enforcement
- log ingestion reduction
- unit cost KPI review
Rule: cost is an engineering metric. Make it visible to teams.
9.2 Compute Cost – Rightsize, autoscale, shutdown schedules, batch windows
Lever | Action | Proof
Rightsize | adjust CPU/RAM | utilization report
Scale | autoscale safely | SLO stability
Shutdown | stop non-prod nightly | schedule evidence
Batch | run heavy jobs in windows | cost per job
9.3 Storage Cost – Lifecycle rules, retention, egress reduction
Top storage wastes
  • No lifecycle rules (everything stays hot forever).
  • Unlimited versions and no cleanup.
  • Unbounded logs in object storage with no retention.
  • Unexpected egress due to lack of caching/edge.
Rule: lifecycle without enforcement is only documentation.
9.4 Data Services Cost – HA tiers, backups, scaling triggers, query discipline
Data cost drivers
- always-on replicas
- long retention for backups/logs
- inefficient queries scanning too much data
- overprovisioned instance sizes

Controls
- rightsizing reviews
- query performance budgets
- retention as policy
9.5 Observability Spend – Ingestion control, sampling, archive tiers
Signal-first policy
  • Keep audit/security logs hot, with long retention.
  • Sample traces aggressively but keep “slow/error” traces.
  • Archive bulk logs; keep dashboards based on SLO signals.
9.6 FinOps KPIs – Unit economics for cloud
KPI | Definition | Use
Cost / 1k requests | infra spend divided by traffic | scale economics
Cost / tenant | monthly spend per customer | pricing sanity
Cost / GB stored | storage + lifecycle efficiency | retention tuning
Cost / deploy | CI/CD + artifact + test spend | pipeline efficiency
Rule: unit costs drive architecture decisions more than raw “monthly spend”.
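The first two KPIs reduce to simple ratios. The sketch below shows the arithmetic; the function names are illustrative and the figures in the usage note are made-up inputs, not measurements.

```python
def cost_per_1k_requests(infra_spend: float, requests: int) -> float:
    """Infrastructure spend divided by traffic, expressed per 1,000 requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return infra_spend / requests * 1000


def cost_per_tenant(monthly_spend: float, tenants: int) -> float:
    """Monthly spend divided evenly across active tenants."""
    if tenants <= 0:
        raise ValueError("tenants must be positive")
    return monthly_spend / tenants
```

For instance, a spend of 1200 over 4,000,000 requests gives about 0.30 per 1k requests, and the same spend across 40 tenants gives 30 per tenant. Tracking these per month shows whether an architecture change improved unit economics even when total spend went up.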