☁️ OVHcloud – Hyper-Dense Cloud Guide (Public Cloud, Bare Metal, K8s, Data, Security, Observability, FinOps)

1.1 Foundations – Account/projects, regions, ownership, production rules

Environment model

Recommended environments
- sandbox (experiments)
- dev (integration)
- staging (release candidate)
- prod (strict guardrails)

Core principles
- separate environments with clear access boundaries
- enforce naming/labels and ownership
- default to private networking and minimal public surface

Rule: every environment must be reproducible from IaC and observable from day one.
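The naming/labels and ownership principle above can be made enforceable with a small check that CI runs against each resource. The sketch below is illustrative only: the label keys (`env`, `owner`, `cost-center`) and the `<env>-<component>` name pattern are assumptions, not an OVHcloud convention.

```python
import re

# Hypothetical conventions; adjust to your organisation's standard.
ALLOWED_ENVS = {"sandbox", "dev", "staging", "prod"}
REQUIRED_LABELS = ("env", "owner", "cost-center")
NAME_PATTERN = re.compile(r"^(sandbox|dev|staging|prod)-[a-z0-9]+(-[a-z0-9]+)*$")


def validate_resource(name: str, labels: dict) -> list:
    """Return a list of violations; an empty list means the resource complies."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(f"name '{name}' does not match <env>-<component> pattern")
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing label '{key}'")
    env = labels.get("env")
    if env and env not in ALLOWED_ENVS:
        problems.append(f"unknown env '{env}'")
    # The name prefix and the env label should agree.
    if env in ALLOWED_ENVS and not name.startswith(env + "-"):
        problems.append("name prefix and env label disagree")
    return problems
```

Failing the pipeline when the returned list is non-empty keeps unlabeled or mis-scoped resources out of every environment, not only prod.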

Production non-negotiables

Everything deployable via IaC (no unmanaged drift).
Central logs + metrics + alert routing (no blind spots).
Secrets are never stored in repositories or baked into images.
Least privilege and separation of duties.
Backups and restore drills scheduled and measured.

Selection framework

Need	Default	Escalate to
Fast web API	MKS + managed DB	Instances/metal for special constraints
High IO DB	Managed DB when it fits	Bare Metal for extreme latency/throughput
Unstructured data	Object Storage	Cold Archive for long retention
Private connectivity	vRack	Hybrid patterns (tunnels, routing governance)

1.2 Reference Landing Zone – Segmentation, shared services, guardrails

Topology blueprint

Edge (public)
  - TLS termination + routing
  - rate limits + request size limits
  - security logging (edge events)

Private networks
  - app subnet(s)
  - data subnet(s)
  - admin subnet(s)

Shared
  - secrets lifecycle
  - observability workspace
  - artifact/container registry policy
  - backups + restore drills

Rule: default to private. Only the edge is public.

Shared services

Central logs and metrics with alert routing.
Secrets store and rotation workflow.
Supply chain controls for containers (scan/SBOM/signing).
CI/CD pipelines with promotion gates (dev → staging → prod).

Guardrails

Block direct public exposure of data services unless approved.
Mandatory logging on compute and platforms.
Least privilege and access reviews.
Backup retention and restore drill evidence required.

Operational evidence

Evidence	How	Why
Who deployed what	CI logs + artifact digests	auditability
Recoverability	restore drill results	real DR
Security posture	scan reports + patch logs	risk control
SLO compliance	dashboards + incidents	reliability

1.3 OVHcloud Portfolio Map – Public Cloud vs Bare Metal vs platforms

Decision principles

Option	Best for	Trade-off
Public Cloud	elastic workloads, fast iteration	you must design HA/ops
Managed K8s (MKS)	container platform	platform ops discipline needed
Managed DB	reduce DBA toil	feature/extension constraints
Bare Metal	extreme IO, isolation	more responsibility

Common architecture patterns

Pattern A: Public Cloud app
- Instances/MKS (app)
- Managed DB (data)
- Object Storage (artifacts)
- Logs Data Platform (logs)

Pattern B: High IO core on metal
- Bare Metal for DB/storage layer
- Public Cloud/MKS for stateless app
- vRack private network between tiers

Rule: choose metal only with strong automation and observability maturity.

1.4 IaC & Automation – Terraform-first workflows and drift control

Terraform workflow (gold standard)

Stages
1) fmt + validate
2) plan (saved artifact)
3) policy checks (custom)
4) approval gate (prod)
5) apply
6) smoke tests + monitoring hooks

Rule: no production apply without a reviewed plan artifact.
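As one possible shape for the custom policy check in stage 3, the sketch below reads the JSON form of a saved plan (as produced by `terraform show -json <planfile>`) and blocks plans that delete or replace resources. The function names and the `allow_destroy` switch are hypothetical; the `resource_changes[].change.actions` structure is the part taken from Terraform's plan JSON output.

```python
import json


def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement appears as ["delete", "create"] or ["create", "delete"].
        if "delete" in actions:
            flagged.append(rc.get("address", "<unknown>"))
    return flagged


def gate(plan_json: str, allow_destroy: bool = False) -> int:
    """Return a process exit code: 0 = pass, 1 = blocked."""
    flagged = destructive_changes(json.loads(plan_json))
    if flagged and not allow_destroy:
        for address in flagged:
            print(f"BLOCKED: {address} would be destroyed or replaced")
        return 1
    return 0
```

A pipeline could call `gate()` on the plan artifact and require an explicit, reviewed override before setting `allow_destroy=True` for prod.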

Drift control

Scheduled plan to detect drift.
Alert on out-of-band changes.
Reconcile via code or revert as an incident.
Time-bound exceptions with explicit ownership.
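A scheduled drift check can lean on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when changes are pending. The sketch below is a minimal wrapper under that assumption; the status names are illustrative, and exit code 2 only indicates drift when the checked-out code matches what was last applied.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    if code == 0:
        return "in_sync"
    if code == 2:
        return "drift"
    return "error"


def check_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report its status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # non-zero exit codes are expected and handled above
    )
    return classify_plan_exit(result.returncode)
```

Routing `"drift"` and `"error"` to different alerts keeps a broken pipeline from being mistaken for an out-of-band change.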

1.5 Console / CLI / API – Operational conventions

Automation contract

Script contract
- inputs validated
- dry-run supported
- structured logs to stdout
- reliable exit codes
- safe retries and idempotency

Credential policy
- short-lived when possible
- stored in vault/secrets store
- rotated and audited
- break-glass procedure documented

Rule: automation without audit trails becomes a security risk.
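The script contract above can be sketched as a small skeleton: validated inputs, a dry-run flag, one JSON log line per event on stdout, distinct exit codes, and retries around an idempotent operation. Everything here (the `--target` flag, the event names, the exit-code meanings) is a hypothetical convention, not a prescribed interface.

```python
import argparse
import json
import time


def log(event: str, **fields) -> None:
    """Emit one JSON object per line on stdout."""
    print(json.dumps({"event": event, **fields}), flush=True)


def with_retries(operation, attempts: int = 3, base_delay: float = 0.5):
    """Call an idempotent operation, retrying with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:  # narrow this to transient errors in real code
            log("retry", attempt=attempt, error=str(exc))
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def main(argv: list, operation=lambda: "ok") -> int:
    """Exit codes (assumed): 0 = success, 1 = operation failed, 2 = bad input."""
    parser = argparse.ArgumentParser(description="example automation task")
    parser.add_argument("--target", required=True)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)

    if not args.target.strip():
        log("invalid_input", field="target")
        return 2
    if args.dry_run:
        log("dry_run", target=args.target)
        return 0
    try:
        result = with_retries(operation)
    except Exception as exc:
        log("failed", target=args.target, error=str(exc))
        return 1
    log("done", target=args.target, result=result)
    return 0
```

A real entry point would call `sys.exit(main(sys.argv[1:]))`; retries are only safe because the wrapped operation is assumed to be idempotent.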

Cheat-sheet – Checklists, templates, triage shortcuts

Platform checklist

Landing zone
- private segmentation + minimal public edge
- centralized logs/metrics + alert routing
- secrets lifecycle + rotation
- backups + restore drills
- IaC modules + CI gates
- container supply chain controls

Incident shortcut

Triage steps
1) user impact + SLO breach?
2) recent deployments (last 60 minutes)
3) saturation (CPU/mem/IO/conn)
4) network/DNS failures
5) data layer errors (locks/slow queries)
6) rollback or mitigation
7) postmortem actions

Cost checklist

Monthly FinOps loop
- top spenders review
- rightsizing candidates
- storage lifecycle enforcement
- log ingestion reduction
- idle resources cleanup
- unit cost KPIs (per request / per tenant)

Kubernetes checklist

K8s production
- separate system/workload node pools
- PDB + probes + graceful shutdown
- network policies (default deny)
- GitOps with promotion
- SLO dashboards and rollback automation

2.1 Public Cloud Compute – Instances, sizing, disks, scaling, hygiene

Sizing method

Signal	Watch	Action
CPU	p95 + saturation	rightsize or scale out
Memory	pressure + OOM risk	increase RAM or reduce footprint
Disk	IO latency + queue	move to faster disks, shard, cache
Network	pps + retransmits	edge tuning, rate limiting, routing

Disk strategy

Separate OS disk from data disk for write-heavy workloads.
For databases: measure fsync latency and queue depth.
Snapshots are not backups unless restore is tested.
Standardize encryption and retention policies.

Rule: storage latency is a first-class production KPI.

Operational baseline

Access via controlled entry (no global admin ports).
Patching cadence + emergency patch workflow.
Central logging and metrics with saturation alerts.
Prefer rebuild over mutate (disposable servers mindset).

2.2 Bare Metal / Dedicated – When metal is justified

Decision table

Constraint	Why metal	Mitigation if not
Extreme IO	lowest latency and dedicated throughput	sharding + caching + async
Isolation	strict tenancy / compliance needs	strong segmentation + IAM
Licensing	per-core constraints	optimize core counts
Virtualization stack	VMware/Nutanix-like workloads	managed where possible

Rule: metal increases operational responsibility; put automation and observability in place first.

2.3 Virtualization Platforms – Ownership, operations, cost model

Platform ownership checklist

Who patches the hypervisor and management plane?
How are backups done (VM-level vs app-level)?
How do you observe saturation (CPU ready, storage latency, network pps)?
What is the failure domain and recovery process?

TCO and risk framing

TCO drivers
- always-on capacity (reserved compute)
- storage replication and backups
- licensing
- operations staffing
- outage blast radius if governance is weak

Rule: virtualization is a platform product; treat it like one.

2.4 Images & Bootstrapping – Golden images + cloud-init baseline

Golden image contract

Golden image includes
- minimal packages and hardened defaults
- monitoring/log forwarding agent install
- time sync and DNS defaults
- baseline firewall rules

cloud-init responsibilities
- inject runtime config
- fetch secrets from secure store
- register node into monitoring
- configure service discovery

Rule: rebuild is safer than mutate. Keep servers disposable.

2.5 Backup & DR – RPO/RTO tiers and restore drills

Tier	Recovery target (RPO/RTO)	Design	Verification
Tier 0	minutes	multi-zone + replication	game day drills
Tier 1	hours	managed backups + snapshots	monthly restores
Tier 2	day	object backups + archive	quarterly audits

Rule: backups are real only if you restore and measure recovery time.
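To turn "restore and measure" into evidence, a drill record can be compared against the tier's recovery target. The numeric targets below (15 minutes, 4 hours, 1 day) are assumptions chosen to match the "minutes / hours / day" tiers; substitute your own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed recovery-time targets per tier; the guide only gives orders of magnitude.
RTO_TARGETS = {
    "tier0": timedelta(minutes=15),
    "tier1": timedelta(hours=4),
    "tier2": timedelta(days=1),
}


@dataclass
class RestoreDrill:
    tier: str
    started: datetime
    finished: datetime
    data_verified: bool  # e.g. row counts or checksums matched after restore


def evaluate(drill: RestoreDrill) -> dict:
    """Compare measured restore time with the tier target."""
    measured = drill.finished - drill.started
    target = RTO_TARGETS[drill.tier]
    return {
        "tier": drill.tier,
        "measured_seconds": measured.total_seconds(),
        "target_seconds": target.total_seconds(),
        "passed": drill.data_verified and measured <= target,
    }
```

Storing the returned dictionary alongside the drill logs gives auditors both the measurement and the pass/fail decision.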

2.6 Operations Playbook – Patch cycles, monitoring, incidents

Operational loop

Daily
- check SLO dashboards
- review alerts and top errors
- confirm backups

Weekly
- patch non-prod
- capacity review (CPU/mem/IO/conn)
- vulnerability review

Monthly
- cost review and rightsizing
- restore drill
- verify postmortem actions

3.1 Managed Kubernetes Service (MKS) – Design, security, upgrades

Cluster foundation

Separate system and workload node pools.
Define ingress/TLS strategy as a platform standard.
Use autoscaling with strict limits to avoid runaway spend/outages.
Pin base images and enforce immutable deployments.

MKS is OVHcloud’s managed Kubernetes offering.

Security essentials

Network policies: default deny, allow only required flows.
RBAC: least privilege, separate admins from deploy roles.
Pod security: avoid privileged containers, enforce baseline constraints.
Supply chain gates: scan + SBOM + signing policy.
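A default-deny posture is usually expressed as one NetworkPolicy per namespace that selects every pod and lists both policy types without any allow rules. The sketch below builds that manifest as JSON, which `kubectl apply -f` accepts; the policy name and the namespace used in the usage note are placeholders.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that denies all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector = every pod in the namespace
            "policyTypes": ["Ingress", "Egress"],  # no rules listed = deny both
        },
    }


def render(namespace: str) -> str:
    """Render the policy as a JSON manifest."""
    return json.dumps(default_deny_policy(namespace), indent=2)
```

For example, `render("payments")` yields a manifest for a hypothetical `payments` namespace. Denying egress also blocks DNS lookups, so an explicit allow rule for the cluster DNS service normally ships together with this policy.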

Operations

Upgrades: staged rollout with maintenance windows.
Observability: cluster/node/workload dashboards and alert routing.
Backups: state is backed up outside the cluster; configs are GitOps.

Resilience checklist

Resilience
- readiness/liveness probes
- pod disruption budgets
- anti-affinity / topology spread
- rate limits at ingress
- graceful shutdown
- rollback automation and SLO gates

3.2 Managed Rancher Service (MRS) – Multi-cluster governance

Why Rancher in enterprise

Central governance for many clusters.
Unified RBAC, policies, and visibility across environments.
Standardized cluster provisioning and upgrades.

Common pitfalls

Over-privileged global admin roles.
No cluster lifecycle policy (old clusters never upgraded).
Missing evidence: who changed what, when, and why.

OVHcloud documents MRS/MKS integration patterns and cluster creation guides.

3.3 Container Supply Chain – Scanning, SBOM, signing, promotion by digest

Supply chain gates

Gate	Checks	Block on
Vulnerability scan	CVEs in OS/libs	high/critical
SBOM	dependency inventory	missing inventory
Signature	provenance	unsigned images
Policy	approved base images	unapproved base

Rule: deploy by digest, not mutable tags. Keep promotion as code.
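The "deploy by digest" rule can be checked mechanically before a manifest is promoted. The sketch below flags container images that are not pinned with an `@sha256:` digest; it assumes a Deployment-style manifest already parsed into a dictionary, and the function names are illustrative.

```python
import re

# An image reference pinned by digest ends with @sha256:<64 hex characters>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned(image: str) -> bool:
    """True when the image reference is pinned by a sha256 digest."""
    return bool(DIGEST_RE.search(image))


def unpinned_images(manifest: dict) -> list:
    """Return image references in a workload manifest that use mutable tags."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [
        c.get("image", "<missing>")
        for c in containers
        if not is_pinned(c.get("image", ""))
    ]
```

A promotion job could refuse to proceed whenever `unpinned_images()` returns a non-empty list.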

3.4 Ingress & Exposure – TLS, routing, rate limits, private services

Concern	Edge control	Notes
TLS	terminate + rotate certs	enforce modern ciphers
Routing	L7 rules	host/path-based routing
Abuse	rate limits + blocks	prevent amplification
Private services	internal exposure	avoid public endpoints

Rule: if it does not need to be public, do not make it public.

3.5 GitOps & Delivery – Progressive rollouts and rollback playbooks

Release patterns

Pattern	Best for	Requirement
Blue/green	safe cutover	traffic switch + fast rollback
Canary	risk reduction	metric-based promotion
Rings	enterprise	progressive exposure

Rule: rollout strategy requires SLO dashboards and automated rollback paths.

3.6 Kubernetes Operations – Autoscaling, capacity, SLO, backup of state

K8s ops must-haves
- node pool templates and capacity guardrails
- cluster autoscaler with max bounds
- resource requests/limits enforced
- admission policies for baseline security
- SLO dashboards (p95 latency + error rate)
- stateful backups outside the cluster

OVHcloud maintains extensive MKS operational documentation (node pools, autoscaler, audit logs).

4.1 Network Core – Segmentation, routing, failure domains

Segmentation blueprint

Network zones
- edge (public entry)
- app (private workloads)
- data (private databases)
- admin (restricted access)
- shared (observability, registry, secrets)

Rule: segmentation is an incident containment tool, not a checkbox.

4.2 vRack Private Network – Private interconnect patterns

What vRack enables

Private network connectivity across OVHcloud services.
Tier isolation (app/data/admin) without public exposure.
Hybrid patterns: metal + public cloud combined.

Operational rules

Explicit routing governance: who changes what and when.
Document failure behavior and test it.
Monitor latency, packet loss, and DNS resolution paths.

OVHcloud lists vRack as a private connectivity building block in Bare Metal connectivity options.

4.3 BYOIP & Additional IP – Portability and migration strategy

Why IP portability matters

Preserve network identity during migrations.
Reduce DNS churn and partner allowlist changes.
Enable staged cutovers with controlled routing.

Migration playbook (high-level)
1) prepare parallel stack
2) validate security + observability
3) move/attach IP ranges (when applicable)
4) controlled traffic shift
5) rollback readiness and post-cutover monitoring

OVHcloud references BYOIP as a Bare Metal capability.

4.4 Edge Exposure – TLS posture, abuse controls, and resilience

Edge controls
- strict TLS configuration and rotation
- request size limits and timeouts
- rate limiting by IP and identity
- allowlist for admin endpoints
- anomaly detection from edge logs
- fast block/unblock workflow (runbook)

Rule: the edge is the boundary of your blast radius.
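Rate limiting by IP or identity is commonly implemented with a token bucket per key. The sketch below shows the mechanism in-process with an injectable clock; at a real edge this logic lives in the proxy or load balancer, and the rate and capacity values are placeholders.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._now = now
        self._tokens = capacity
        self._last = now()

    def allow(self, cost: float = 1.0) -> bool:
        current = self._now()
        # Refill according to elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (current - self._last) * self.rate)
        self._last = current
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class KeyedRateLimiter:
    """One bucket per key (client IP, API token, or user identity)."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self._rate, self._capacity, self._now = rate, capacity, now
        self._buckets = {}

    def allow(self, key: str) -> bool:
        bucket = self._buckets.setdefault(
            key, TokenBucket(self._rate, self._capacity, self._now)
        )
        return bucket.allow()
```

Keeping one bucket per key means a noisy client exhausts only its own allowance; a production version would also evict idle keys.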

4.5 DNS Patterns – Split-horizon, internal naming, resilience

DNS rules

Document resolution chain and ownership.
Use internal names for private services.
Standardize ingress hostnames and certificates.
Monitor NXDOMAIN spikes and resolver latency.

Rule: many outages are DNS + routing + timeouts combined.

4.6 Network Troubleshooting – Latency, MTU, DNS failures, loss

Symptom	Check	Action
Timeouts	edge logs + upstream latency	fix bottleneck, tighten timeouts
DNS failures	resolver health + TTL	stabilize resolver chain
Packet loss	retransmits, MTU	fix MTU/routing, isolate noisy neighbors
K8s connectivity	CNI + network policy	trace flows, simplify rules

5.1 Object Storage – Buckets, lifecycle, versioning, access, encryption

Bucket design

Separate buckets by data classification and lifecycle needs.
Define naming conventions and ownership labels.
Prefer immutable versions for critical artifacts.

OVHcloud positions Object Storage for unstructured data and backups.

Security rules

Least privilege with scoped credentials.
Restrict cross-project access and enforce encryption posture.
Audit access and alert on anomalies via centralized logging.

Lifecycle policy (cost control)

Lifecycle example
- day 0-30: hot
- day 31-180: cool
- day 181+: archive
- delete old versions per retention rules

Rule: lifecycle policies are the strongest lever for storage spend control.
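The lifecycle example above maps directly to an age-based rule. The sketch below encodes the same boundaries (30 and 180 days) so a compliance job can report which tier each object should be in; the function names and the input shape are assumptions.

```python
def storage_tier(age_days: int) -> str:
    """Map object age to the tier from the lifecycle example."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 30:
        return "hot"
    if age_days <= 180:
        return "cool"
    return "archive"


def plan_transitions(object_ages: dict) -> dict:
    """Given {object_key: age_in_days}, return {object_key: target_tier}."""
    return {key: storage_tier(age) for key, age in object_ages.items()}
```

Comparing the planned tier with an object's actual storage class shows where lifecycle rules are missing or not being applied.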

5.2 Block Storage – IOPS/throughput thinking for DB-grade workloads

DB-grade checklist

Measure fsync latency and IO queue depth.
Separate write-heavy volumes from OS when needed.
Snapshots are not a substitute for logical backups.
Restore path is tested and documented.

OVHcloud lists Block Storage as a Public Cloud storage category.

5.3 File Storage – Shared file systems for instances and containers

Design rules

Know access patterns: many small IO vs large sequential streams.
Keep multi-tenant isolation explicit (permissions and network boundaries).
Monitor latency and throughput; define a performance budget.
Avoid using shared file storage as a database log device.

OVHcloud lists File Storage in Public Cloud storage offerings.

5.4 Cold Archive & Retention – Long-term data and compliance tiers

Archive rules
- treat archive as immutable when possible
- keep retrieval times and costs documented
- define retention by classification (audit, legal, analytics)
- test restore from archive quarterly

OVHcloud lists Cold Archive for long-term storage in Public Cloud storage options.

5.5 Storage Security – Least privilege, audits, encryption posture

Control	How	Outcome
Least privilege	scoped credentials	reduced blast radius
Access review	monthly audit	remove stale access
Encryption	standardize policy	consistent posture
Logging	central logs + alerts	detect anomalies

5.6 Storage Cost Control – Lifecycle rules and egress awareness

Top storage wastes

No lifecycle rules: everything stays hot forever.
Unlimited versions without cleanup.
High-volume logs stored without retention tiers.
Unexpected egress due to missing caching/edge strategy.

Rule: storage cost explodes due to forgotten data and missing lifecycle enforcement.

6.1 Public Cloud Databases – HA mindset, backups, upgrades, performance

HA/DR mindset

Know failure domains and document failover behavior.
Measure RPO/RTO and validate with drills.
Keep application timeouts aligned with failover behavior.

Backups

Backup contract
- automated backups enabled
- retention policy defined
- restore runbook documented
- monthly restore drill executed
- evidence stored (timestamps, results)

Performance loop

DB tuning loop
1) capture slow queries
2) explain/analyze
3) index or rewrite
4) validate p95/p99 latency
5) protect with regression dashboards

Safe migrations

Safe migration flow
1) backup + verified restore path
2) schema change in small steps
3) compatibility window (dual-write if needed)
4) monitor errors and latency
5) cleanup after stabilization

OVHcloud provides managed database offerings within Public Cloud Databases.

6.2 Cache Layer (Redis Patterns) – TTL, eviction, HA, session storage

Topic	Decision	Rule
TTL	per data class	no infinite TTL without justification
Eviction	policy choice	align with data criticality
Persistence	optional	cache is not source of truth
HA	replication/failover	test failover behavior

6.3 Search / Indexing – Mapping, ingestion, retention, cost control

Area	What matters	Action
Mapping	field types, analyzers	freeze mapping early
Ingestion	bulk + backpressure	avoid overload loops
Retention	index lifecycle	rollover + delete
Cost	storage + query load	sample logs, archive cold data

Rule: search performance is mapping + shard sizing + ingestion control.

6.4 Data Platform Blueprint – Ingest, store, transform, serve, govern

Data platform sketch
Ingest -> Object Storage (raw)
Transform -> compute (batch / containers)
Serve -> APIs + indexes + warehouse
Govern -> access model + retention + audit trail

Rules
- treat raw data as immutable
- separate hot and cold datasets
- define unit-cost metrics per pipeline

Rule: governance is not optional; build it into the architecture.

6.5 Safe Data Migrations – Zero-downtime patterns and rollback readiness

Zero-downtime patterns

Expand/contract schema (add fields first, remove later).
Backfill with throttling and progress evidence.
Compatibility window with dual reads/writes if required.
Cutover with feature flags and rapid rollback.

Rule: migration success is measured by user impact, not by “schema applied”.
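"Backfill with throttling and progress evidence" can be sketched as a batched loop that pauses between batches and reports how far it has got. The batch size, the pause, and the `apply_batch` callback are assumptions; the callback stands in for whatever idempotent update the migration performs.

```python
import time


def batched(ids: list, size: int):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def backfill(ids: list, apply_batch, batch_size: int = 500,
             pause_seconds: float = 0.1, sleep=time.sleep) -> dict:
    """Apply an idempotent update batch by batch, pausing to limit DB load."""
    processed = 0
    batches = 0
    for batch in batched(ids, batch_size):
        apply_batch(batch)
        processed += len(batch)
        batches += 1
        # Progress evidence: one line per batch, suitable for log collection.
        print(f"backfill progress: {processed}/{len(ids)}")
        sleep(pause_seconds)
    return {"processed": processed, "batches": batches}
```

If the job is interrupted, it can be restarted from the last logged position as long as `apply_batch` is safe to repeat.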

6.6 Data Cost Control – Retention, replicas, query budgets, unit cost KPIs

Data cost drivers
- always-on replicas
- long retention for backups/logs
- inefficient queries scanning too much data
- overprovisioned instance sizes

Controls
- rightsizing reviews
- query performance budgets
- retention policy-as-code
- archive cold datasets

7.1 Logs Data Platform (LDP) – Collect, store, analyze logs (managed)

What LDP is for

Central log collection and retention.
Operational dashboards and investigations.
Security/audit log correlation.

Ingestion discipline

Log policy
- keep high-signal logs hot (errors, audit)
- sample verbose logs (debug)
- apply retention tiers
- alert on log pipeline failures

OVHcloud describes Logs Data Platform as a turnkey managed log collection/analysis solution.

7.2 Metrics & Dashboards – Golden signals, SLO, saturation and capacity

RED (request-driven services)
- Rate (traffic)
- Errors (error rate)
- Duration (latency)

USE (infra)
- Utilization
- Saturation
- Errors

Dashboards
- SLO: p95 latency + error rate
- saturation: CPU/mem/IO/conn
- capacity trend: 30d forecast

Rule: dashboards must tell you “what to do next”, not only “what happened”.
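The SLO dashboard inputs (p95 latency and error rate) can be computed from raw samples as below. The sketch uses the nearest-rank percentile definition; the thresholds passed to `slo_report` are placeholders, and a metrics backend would normally do this calculation for you.

```python
import math


def percentile(values: list, q: int) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(latencies_ms: list, errors: int, total: int,
               p95_target_ms: float, max_error_rate: float) -> dict:
    """Summarise p95 latency and error rate, and flag a breach of either target."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / total if total else 0.0
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "breach": p95 > p95_target_ms or error_rate > max_error_rate,
    }
```

The `breach` flag is the piece that answers "what to do next": it can drive an alert or a rollout gate directly.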

7.3 APM & Tracing – Correlation IDs, sampling, dependency maps

Minimum viable tracing

Correlation ID across services and logs.
Trace external dependencies (DB, cache, HTTP calls).
Sampling prioritizes slow and failed traces over routine ones.

Rule: preserve high-latency and high-error traces; sample the rest.
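The sampling rule can be expressed as a decision made once a trace is complete: always keep errors and slow requests, and keep a fixed fraction of everything else. The sketch hashes the trace ID so that every service reaches the same decision for a given trace; the 1-second threshold and 5% rate are assumed defaults.

```python
import hashlib


def keep_trace(trace_id: str, duration_ms: float, has_error: bool,
               slow_threshold_ms: float = 1000.0, sample_rate: float = 0.05) -> bool:
    """Keep all slow or failed traces; sample the remainder deterministically."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Map the trace ID to a stable value in [0, 1) and compare with the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2 ** 32
    return bucket < sample_rate
```

Deterministic hashing avoids half-sampled traces, where one service keeps its spans and another drops them.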

7.4 Alerting & On-call – Actionable alerts only

Alert	Condition	Runbook
SLO breach	error rate/latency over threshold	rollback/mitigate/scale
Saturation	CPU/mem/IO high + queue	rightsize/shard/cache
Security	auth anomalies	rotate creds/block/investigate
Backup failure	job missing/error	repair/re-run/verify restore

Rule: if an alert cannot be acted upon, it is noise.

7.5 SRE Workflow – Incidents, postmortems, error budgets

Incident lifecycle
Detect -> Triage -> Mitigate -> Recover -> Postmortem

Postmortem includes
- timeline
- root cause and contributing factors
- detection gaps
- action items with owners and deadlines

Rule: postmortems are improvement engines, not blame tools.
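Error budgets follow arithmetically from the SLO target: the budget is the share of requests allowed to fail in the window. A minimal sketch, assuming a request-based SLO and hypothetical field names:

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute allowed failures and how much of the budget has been consumed."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "consumed_ratio": failed_requests / allowed if allowed else float("inf"),
    }
```

With a 99.9% target and one million requests, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget. A consumed ratio approaching 1 is a common trigger for pausing risky releases.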

7.6 Observability Cost Control – Sampling, drop rules, archive tiers

Cost levers

Sampling for traces and high-volume logs.
Keep audit/security logs high priority.
Short hot retention, long archive retention.

Rule: keep the signal hot; archive the history.

8.1 Identity & Access – Least privilege, separation of duties, auditing

Principles
- least privilege by role
- separate admin vs deploy vs read-only
- time-bound access for sensitive actions
- credential rotation policy
- audit trail for privileged operations

Rule: credentials without rotation are liabilities.

8.2 Secrets & Key Management – Rotation, injection, auditability

Phase	What	Control
Create	secure generation	no weak secrets
Store	vault/secrets store	access logs
Inject	runtime fetch	no secrets baked into images
Rotate	scheduled	alert on failures
Revoke	incident response	fast containment

Rule: design secrets rotation before production launch.

8.3 Security Baseline – Hardening, patching, scanning, secure defaults

Baseline checklist

OS hardening and minimal packages.
Patch cadence + emergency patch workflow.
Container scanning + signed images.
Runtime controls: least privilege and no privileged containers by default.
Audit logs routed centrally and retained.

8.4 Edge Security – TLS, rate limits, abuse controls, incident playbooks

Controls
- strict TLS configuration
- request size limits
- rate limiting by IP and token
- allowlist for admin endpoints
- anomaly detection from edge logs
- fast block/unblock workflow

Rule: edge protections must be measurable (traffic, blocks, latency impact).

8.5 Compliance & Data Protection – Classification, retention, evidence

Area	Control	Proof
Classification	labels + access policy	inventory report
Retention	policy-as-code	config snapshots
Encryption	standard posture	audit checks
Backups	restore drills	restore logs

8.6 Security Incident Response – Containment, rotation, forensics, recovery

IR playbook
1) contain: block entry, isolate systems
2) preserve evidence: logs, snapshots
3) rotate credentials: tokens, DB creds, registry secrets
4) eradicate: patch, remove persistence
5) recover: restore services, monitor
6) learn: postmortem + guardrails

Rule: rotate credentials early; attackers love long-lived secrets.

9.1 FinOps Core – Budgets, showback, anomalies, monthly routines

Weekly
- anomaly review
- top spenders scan

Monthly
- rightsizing + idle cleanup
- storage lifecycle enforcement
- log ingestion reduction
- unit cost KPI review

Rule: cost is an engineering metric. Make it visible to teams.

9.2 Compute Cost – Rightsize, autoscale guardrails, shutdown schedules

Lever	Action	Proof
Rightsize	adjust CPU/RAM	utilization report
Scale	autoscale safely	SLO stability
Shutdown	stop non-prod nightly	schedule evidence
Tiering	metal only when justified	latency/throughput proof

9.3 Storage Cost – Lifecycle rules, retention, egress reduction

Top storage wastes

No lifecycle policies (everything stays hot forever).
Old versions kept indefinitely.
Audit logs not tiered (hot vs archive).
Uncontrolled data egress patterns.

Rule: enforce lifecycle as code; review compliance monthly.

9.4 Data Services Cost – HA tiers, backup retention, query discipline

Cost drivers
- always-on replicas
- long backup retention
- overprovisioned instance sizes
- inefficient queries

Controls
- rightsizing reviews
- retention limits
- query budgets and slow-query governance

9.5 Observability Spend – Ingestion control, sampling, archive tiers

Signal-first policy

Keep audit/security logs on long retention.
Sample verbose application logs.
Archive bulk logs; keep hot only what drives decisions.

LDP is positioned as a managed log platform with pricing and features.

9.6 FinOps KPIs – Unit economics for cloud

KPI	Definition	Use
Cost / 1k requests	infra spend divided by traffic	scale economics
Cost / tenant	monthly spend per customer	pricing sanity
Cost / GB stored	storage + lifecycle efficiency	retention tuning
Cost / deploy	CI/CD + tests + artifacts	pipeline efficiency

Rule: unit costs drive architecture decisions more than raw monthly spend.
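The KPI definitions in the table reduce to simple ratios. The sketch below computes three of them from monthly totals; the input figures in the usage note are invented for illustration.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Divide, rejecting zero or negative denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def cost_per_1k_requests(infra_spend: float, requests: int) -> float:
    """Infrastructure spend divided by traffic, per 1,000 requests."""
    return _ratio(infra_spend, requests) * 1000


def cost_per_tenant(monthly_spend: float, tenants: int) -> float:
    """Monthly spend divided by the number of customers."""
    return _ratio(monthly_spend, tenants)


def cost_per_gb_stored(storage_spend: float, gb_stored: float) -> float:
    """Storage spend divided by stored volume in GB."""
    return _ratio(storage_spend, gb_stored)
```

For example, 1,200 in monthly spend over 6 million requests works out to about 0.20 per 1,000 requests; tracking that figure over time shows whether scaling is getting cheaper or more expensive per unit of work.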
