☁️ OVHcloud – Hyper-Dense Cloud Guide
Production-focused map of OVHcloud: Public Cloud, Bare Metal, Managed Kubernetes Service, data services, storage, networking, security, observability, and FinOps.
Foundations
Account/project strategy, API-first usage, regions/AZ thinking, operational ownership model.
Tags: Core, Start, Ops
Reference Landing Zone
Network segmentation, shared services, observability, secrets, CI/CD, environments, guardrails.
Tags: Platform, LZ, Guardrails
OVHcloud Portfolio Map
Public Cloud vs Bare Metal vs hosted platforms: when to pick what, and why.
Tags: Portfolio, Decision, Patterns
IaC & Automation
Terraform-first workflows, drift control, promotion gates, reproducible environments.
Tags: IaC, Terraform, Drift
Console / CLI / API
Operational conventions: idempotent scripts, secure credentials, break-glass processes.
Tags: API, Automation, Control
Cheat-sheet
Checklists, deployment templates, incident triage shortcuts, cost-control levers.
Tags: Quickstart, Checklists, Ops
Public Cloud Compute (Instances)
Sizing, images, disks, scaling, automation, patching, and production hygiene.
Tags: Compute, VM, Sizing
Bare Metal / Dedicated Servers
When metal is justified: IO, licensing, isolation, virtualization stacks, predictable performance.
Tags: Bare Metal, Performance, Virtualization
Virtualization Platforms
How to think about VMware/Nutanix/OpenStack-style layers (ownership, ops, and costs).
Tags: Virtualization, Ops, TCO
Images & Bootstrapping
Golden images, cloud-init baseline, drift prevention, secrets injection patterns.
Tags: Images, Cloud-init, Hygiene
Backup & DR
RPO/RTO tiers, snapshots vs backups, restore drills, runbooks, multi-zone design.
Tags: Backup, DR, Runbooks
Operations Playbook
Patch cycles, monitoring agents, incidents, postmortems, and continuous improvement loops.
Tags: Ops, SRE, Runbook
Managed Kubernetes Service (MKS)
Cluster design, node pools, upgrades, network policy, security posture, operational patterns.
Tags: K8s, Managed, Upgrades
Managed Rancher Service (MRS)
Multi-cluster governance and operational control plane for Kubernetes fleets.
Tags: Rancher, Fleet, Governance
Container Supply Chain
Registry strategy, promotion by digest, scanning gates, SBOM, signing, provenance.
Tags: Registry, Supply Chain, CI/CD
Ingress & Exposure
TLS termination, routing, rate limiting, private service exposure, edge observability.
Tags: Ingress, TLS, Edge
GitOps & Delivery
Declarative deploys, progressive rollout (canary/blue-green), rollbacks, evidence.
Tags: GitOps, Rollouts, Evidence
Kubernetes Operations
Autoscaling, capacity guardrails, backup of state, SLO dashboards, and on-call playbooks.
Tags: K8s, Ops, SLO
Network Core
Private segmentation, routing, service boundaries, and failure-domain awareness.
Tags: Network, Segmentation, Routing
vRack Private Network
Private interconnect patterns: isolate tiers, reduce exposure, unify multi-service architectures.
Tags: vRack, Private, Topology
BYOIP & Additional IP
Network identity portability, migration strategy, and controlled public addressing.
Tags: IP, Portability, Migration
Edge Exposure & Hardening
TLS, rate limits, bot mitigation, abuse detection, and incident playbooks at the edge.
Tags: Edge, TLS, Rate limit
DNS Patterns
Split-horizon, internal naming, service discovery, and operational resilience around DNS.
Tags: DNS, Private, Ops
Network Troubleshooting
Latency, MTU, DNS failures, packet loss, K8s connectivity: structured triage and fixes.
Tags: Triage, Latency, Runbook
Object Storage
S3-compatible storage patterns: buckets, lifecycle, versioning, access control, encryption.
Tags: Storage, S3, Lifecycle
Block Storage
Persistent disks, throughput/IOPS thinking, DB workloads, snapshot strategy, restore validation.
Tags: Block, SSD, Snapshots
File Storage
Shared file systems for instances/containers: access patterns, performance, and isolation rules.
Tags: File, Shared, Perf
Cold Archive & Retention
Long-term storage, immutable archives, legal retention, and disaster recovery data tiers.
Tags: Archive, Retention, DR
Storage Security
Bucket policies, least privilege, encryption posture, audit logs, and access reviews.
Tags: Security, IAM, Audit
Storage Cost Control
Lifecycle transitions, version cleanup, log ingestion discipline, and egress awareness.
Tags: FinOps, Lifecycle, Egress
Public Cloud Databases
Managed databases: HA mindset, backups, upgrades, safe migrations, and performance loops.
Tags: DB, Managed, HA
Cache Layer (Redis Patterns)
TTL strategy, eviction, session storage, HA, and cache-aside vs write-through tradeoffs.
Tags: Redis, Cache, Latency
Search / Indexing Patterns
Index lifecycle, ingestion backpressure, query profiling, retention, and cost control.
Tags: Search, Index, Tuning
Data Platform Blueprint
Ingest → store → transform → serve → govern. Practical architecture patterns and pitfalls.
Tags: Data, Architecture, Govern
Safe Data Migrations
Zero-downtime patterns, phased rollouts, dual-write windows, validation, rollback playbooks.
Tags: Migrations, Risk, Rollback
Data Cost Control
Retention policies, replica strategy, query budgets, and unit-cost observability.
Tags: FinOps, Retention, Unit cost
Logs Data Platform (LDP)
Central log collection, storage, analysis, dashboards, ingestion discipline and retention tiers.
Tags: Observability, Logs, Retention
Metrics & Dashboards
Golden signals, RED/USE, SLO dashboards, saturation detection, and capacity planning loops.
Tags: Metrics, SLO, Capacity
APM & Tracing
Correlation IDs, distributed traces, sampling rules, dependency maps, and latency budgets.
Tags: APM, Tracing, Latency
Alerting & On-call
Actionable alerts, ownership, severity model, escalation, runbooks, and noise control.
Tags: Alerts, Noise, Runbooks
SRE Workflow
Incident lifecycle, postmortems, error budgets, reliability investments and governance.
Tags: SRE, Postmortem, Loop
Observability Cost Control
Sampling, drop rules, hot vs archive retention, and signal-first telemetry design.
Tags: FinOps, Sampling, Archive
Identity & Access
Least privilege, separation of duties, token hygiene, time-bound access, auditing.
Tags: IAM, RBAC, Audit
Secrets & Key Management
Secrets lifecycle: creation, storage, injection, rotation, revocation; no secrets in images.
Tags: Secrets, Rotation, Evidence
Security Baseline
Hardening, patch cadence, vulnerability management, supply chain controls for containers.
Tags: Baseline, Hardening, Scan
Edge Security
TLS posture, WAF-like controls, rate limiting, bot mitigation, and incident response at edge.
Tags: Edge, TLS, Abuse
Compliance & Data Protection
Classification, retention, encryption, evidence, audit trails, and operational controls.
Tags: Compliance, Retention, Evidence
Security Incident Response
Containment, credential rotation, forensics preservation, recovery, and prevention actions.
Tags: IR, Forensics, Playbook
FinOps Core
Budgets, labels, showback, anomalies, and a monthly optimization routine.
Tags: FinOps, Budgets, Anomaly
Compute Cost Playbook
Rightsizing, autoscaling guardrails, shutdown schedules, and workload tiering.
Tags: Optimize, Scale, Shutdown
Storage Cost Playbook
Lifecycle rules, cold tiers, archive policies, version cleanup, and egress control.
Tags: Storage, Lifecycle, Egress
Data Services Cost
Replica strategy, HA tiers, backup retention, scaling triggers, and query discipline.
Tags: Data, HA, Retention
Observability Spend
Ingestion control, sampling, hot retention, archive tiers, and dashboard efficiency.
Tags: Logs, Sampling, Archive
FinOps KPIs
Unit costs: cost per request, per tenant, per GB stored, and per deploy.
Tags: KPIs, Unit cost, Govern
Environment model
Recommended environments
- sandbox (experiments)
- dev (integration)
- staging (release candidate)
- prod (strict guardrails)
Core principles
- separate environments with clear access boundaries
- enforce naming/labels and ownership
- default to private networking and minimal public surface
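The "enforce naming/labels and ownership" principle can be checked mechanically before a resource is created. The sketch below is illustrative only: the `<env>-<service>` naming pattern, the `owner` and `env` label keys, and the `violations` helper are assumptions, not OVHcloud conventions.

```python
import re

# Assumed convention: <env>-<service>[-<suffix>], lowercase, hyphen-separated.
NAME_PATTERN = re.compile(r"^(sandbox|dev|staging|prod)-[a-z0-9]+(-[a-z0-9]+)*$")
REQUIRED_LABELS = {"owner", "env"}  # assumed minimum label set


def violations(name: str, labels: dict) -> list:
    """Return naming and labelling problems for one resource; empty means compliant."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name must look like <env>-<service>")
    for label in sorted(REQUIRED_LABELS - set(labels)):
        problems.append(f"missing label: {label}")
    return problems
```

A check like this fits naturally into a CI policy step, so that non-compliant resources fail before apply rather than being cleaned up afterwards.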
Production non-negotiables
- Everything deployable via IaC (no unmanaged drift).
- Central logs + metrics + alert routing (no blind spots).
- Secrets are never stored in repositories or baked into images.
- Least privilege and separation of duties.
- Backups and restore drills scheduled and measured.
Selection framework
| Need | Default | Escalate to |
|---|---|---|
| Fast web API | MKS + managed DB | Instances/metal for special constraints |
| High IO DB | Managed DB when fit | Bare Metal for extreme latency/throughput |
| Unstructured data | Object Storage | Cold Archive for long retention |
| Private connectivity | vRack | Hybrid patterns (tunnels, routing governance) |
Topology blueprint
Edge (public)
- TLS termination + routing
- rate limits + request size limits
- security logging (edge events)
Private networks
- app subnet(s)
- data subnet(s)
- admin subnet(s)
Shared
- secrets lifecycle
- observability workspace
- artifact/container registry policy
- backups + restore drills
Shared services
- Central logs and metrics with alert routing.
- Secrets store and rotation workflow.
- Supply chain controls for containers (scan/SBOM/signing).
- CI/CD pipelines with promotion gates (dev → staging → prod).
Guardrails
- Block direct public exposure of data services unless approved.
- Mandatory logging on compute and platforms.
- Least privilege and access reviews.
- Backup retention and restore drill evidence required.
Operational evidence
| Evidence | How | Why |
|---|---|---|
| Who deployed what | CI logs + artifact digests | auditability |
| Recoverability | restore drill results | real DR |
| Security posture | scan reports + patch logs | risk control |
| SLO compliance | dashboards + incidents | reliability |
Decision principles
| Option | Best for | Trade-off |
|---|---|---|
| Public Cloud | elastic workloads, fast iteration | you must design HA/ops |
| Managed K8s (MKS) | container platform | platform ops discipline needed |
| Managed DB | reduce DBA toil | feature/extension constraints |
| Bare Metal | extreme IO, isolation | more responsibility |
Common architecture patterns
Pattern A: Public Cloud app
- Instances/MKS (app)
- Managed DB (data)
- Object Storage (artifacts)
- Logs Data Platform (logs)
Pattern B: High IO core on metal
- Bare Metal for DB/storage layer
- Public Cloud/MKS for stateless app
- vRack private network between tiers
Terraform workflow (gold standard)
Stages
1) fmt + validate
2) plan (saved artifact)
3) policy checks (custom)
4) approval gate (prod)
5) apply
6) smoke tests + monitoring hooks
Drift control
- Scheduled plan to detect drift.
- Alert on out-of-band changes.
- Reconcile via code or revert as an incident.
- Time-bound exceptions with explicit ownership.
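A scheduled drift check can be sketched as a thin wrapper around `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when changes are pending. The function names and the `workdir` argument are illustrative assumptions, and alert routing is deliberately left out.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a drift status; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def detect_drift(workdir: str) -> str:
    """Run a read-only plan in `workdir` (hypothetical path) and report drift."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A `drift` result would typically open an alert, after which the change is either codified or reverted, matching the reconcile rule above.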
Automation contract
Script contract
- inputs validated
- dry-run supported
- structured logs to stdout
- reliable exit codes
- safe retries and idempotency
Credential policy
- short-lived when possible
- stored in vault/secrets store
- rotated and audited
- break-glass procedure documented
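One way to honour the script contract is a small entry point that validates input, supports `--dry-run`, logs JSON lines to stdout, and returns a meaningful exit code. This is a minimal sketch: the `--name` argument, the event names, and the exit code 2 for invalid input are assumptions chosen for illustration.

```python
import argparse
import json


def log(event: str, **fields) -> None:
    """Emit one JSON object per line on stdout (structured logs)."""
    print(json.dumps({"event": event, **fields}, sort_keys=True))


def run(argv: list) -> int:
    """Hypothetical ops task; returns the process exit code."""
    parser = argparse.ArgumentParser(description="Hypothetical idempotent ops task")
    parser.add_argument("--name", required=True)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)

    # Validate inputs before doing any work.
    if not args.name.replace("-", "").isalnum():
        log("invalid_input", name=args.name)
        return 2

    if args.dry_run:
        # Report what would happen without changing anything.
        log("dry_run", name=args.name)
        return 0

    # The real change goes here and must be safe to repeat.
    log("applied", name=args.name)
    return 0
```

Calling `sys.exit(run(sys.argv[1:]))` from a `__main__` guard turns this into a command-line tool whose exit code a pipeline can trust.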
Platform checklist
Landing zone
- private segmentation + minimal public edge
- centralized logs/metrics + alert routing
- secrets lifecycle + rotation
- backups + restore drills
- IaC modules + CI gates
- container supply chain controls
Incident shortcut
Triage steps
1) user impact + SLO breach?
2) recent deployments (last 60 minutes)
3) saturation (CPU/mem/IO/conn)
4) network/DNS failures
5) data layer errors (locks/slow queries)
6) rollback or mitigation
7) postmortem actions
Cost checklist
Monthly FinOps loop
- top spenders review
- rightsizing candidates
- storage lifecycle enforcement
- log ingestion reduction
- idle resources cleanup
- unit cost KPIs (per request / per tenant)
Kubernetes checklist
K8s production
- separate system/workload node pools
- PDB + probes + graceful shutdown
- network policies (default deny)
- GitOps with promotion
- SLO dashboards and rollback automation
Sizing method
| Signal | Watch | Action |
|---|---|---|
| CPU | p95 + saturation | rightsize or scale out |
| Memory | pressure + OOM risk | increase RAM or reduce footprint |
| Disk | IO latency + queue | move to faster disks, shard, cache |
| Network | pps + retransmits | edge tuning, rate limiting, routing |
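The CPU row of the table can be turned into a simple decision helper. The sketch below computes a nearest-rank p95 over utilisation samples and suggests an action; the 30% and 80% thresholds and the function names are assumptions to tune per workload.

```python
def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def cpu_action(samples: list, low: float = 30.0, high: float = 80.0) -> str:
    """Suggest an action from p95 CPU utilisation in percent (assumed thresholds)."""
    value = p95(samples)
    if value >= high:
        return "scale out or upsize"
    if value <= low:
        return "rightsize down"
    return "keep"
```

Using p95 rather than the mean keeps short bursts visible without letting a single spike drive the decision.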
Disk strategy
- Separate OS disk from data disk for write-heavy workloads.
- For databases: measure fsync latency and queue depth.
- Snapshots are not backups unless restore is tested.
- Standardize encryption and retention policies.
Operational baseline
- Access via controlled entry (no global admin ports).
- Patching cadence + emergency patch workflow.
- Central logging and metrics with saturation alerts.
- Prefer rebuild over mutate (disposable-server mindset).
Decision table
| Constraint | Why metal | Mitigation if not |
|---|---|---|
| Extreme IO | lowest latency and dedicated throughput | sharding + caching + async |
| Isolation | strict tenancy / compliance needs | strong segmentation + IAM |
| Licensing | per-core constraints | optimize core counts |
| Virtualization stack | VMware/Nutanix-like workloads | managed where possible |
Platform ownership checklist
- Who patches the hypervisor and management plane?
- How are backups done (VM-level vs app-level)?
- How do you observe saturation (CPU ready, storage latency, network pps)?
- What is the failure domain and recovery process?
TCO and risk framing
TCO drivers
- always-on capacity (reserved compute)
- storage replication and backups
- licensing
- operations staffing
- outage blast radius if governance is weak
Golden image contract
Golden image includes
- minimal packages and hardened defaults
- monitoring/log forwarding agent install
- time sync and DNS defaults
- baseline firewall rules
cloud-init responsibilities
- inject runtime config
- fetch secrets from secure store
- register node into monitoring
- configure service discovery
| Tier | Target | Design | Verification |
|---|---|---|---|
| Tier 0 | minutes | multi-zone + replication | game day drills |
| Tier 1 | hours | managed backups + snapshots | monthly restores |
| Tier 2 | day | object backups + archive | quarterly audits |
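The tier table can back an automated freshness check. The sketch below reads the Target column as a recovery point objective, which is an assumption, and the concrete durations (15 minutes, 4 hours, 1 day) are placeholders since the table only gives orders of magnitude.

```python
from datetime import datetime, timedelta

# Assumed concrete targets; the table above only gives orders of magnitude.
RPO_TARGETS = {
    "tier0": timedelta(minutes=15),
    "tier1": timedelta(hours=4),
    "tier2": timedelta(days=1),
}


def rpo_breached(tier: str, last_backup: datetime, now: datetime) -> bool:
    """True when the newest restorable copy is older than the tier target."""
    return now - last_backup > RPO_TARGETS[tier]
```

Feeding this from the timestamp of the last verified restore, rather than the last backup job, keeps the check aligned with the rule that untested backups do not count.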
Operational loop
Daily
- check SLO dashboards
- review alerts and top errors
- confirm backups
Weekly
- patch non-prod
- capacity review (CPU/mem/IO/conn)
- vulnerability review
Monthly
- cost review and rightsizing
- restore drill
- verify postmortem actions
Cluster foundation
- Separate system and workload node pools.
- Define ingress/TLS strategy as a platform standard.
- Use autoscaling with strict limits to avoid runaway spend/outages.
- Pin base images and enforce immutable deployments.
Security essentials
- Network policies: default deny, allow only required flows.
- RBAC: least privilege, separate admins from deploy roles.
- Pod security: avoid privileged containers, enforce baseline constraints.
- Supply chain gates: scan + SBOM + signing policy.
Operations
- Upgrades: staged rollout with maintenance windows.
- Observability: cluster/node/workload dashboards and alert routing.
- Backups: state is backed up outside the cluster; configs are GitOps.
Resilience checklist
Resilience
- readiness/liveness probes
- pod disruption budgets
- anti-affinity / topology spread
- rate limits at ingress
- graceful shutdown
- rollback automation and SLO gates
Why Rancher in enterprise
- Central governance for many clusters.
- Unified RBAC, policies, and visibility across environments.
- Standardized cluster provisioning and upgrades.
Common pitfalls
- Over-privileged global admin roles.
- No cluster lifecycle policy (old clusters never upgraded).
- Missing evidence: who changed what, when, and why.
Supply chain gates
| Gate | Checks | Block on |
|---|---|---|
| Vulnerability scan | CVEs in OS/libs | high/critical |
| SBOM | dependency inventory | missing inventory |
| Signature | provenance | unsigned images |
| Policy | approved base images | unapproved base |
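The gates in the table can be evaluated as one promotion check per image. This is a sketch under stated assumptions: `ImageReport`, its fields, and the `APPROVED_BASES` allowlist are hypothetical stand-ins for whatever the scanner, SBOM tool, and signing tool actually report.

```python
from dataclasses import dataclass


@dataclass
class ImageReport:
    """Hypothetical summary of pipeline findings for one image digest."""
    max_severity: str  # "none", "low", "medium", "high" or "critical"
    has_sbom: bool
    signed: bool
    base_image: str


APPROVED_BASES = {"registry.example/base/debian:12"}  # placeholder allowlist


def blocking_reasons(report: ImageReport) -> list:
    """Return the gates that block promotion; an empty list means pass."""
    reasons = []
    if report.max_severity in {"high", "critical"}:
        reasons.append("vulnerability scan")
    if not report.has_sbom:
        reasons.append("sbom")
    if not report.signed:
        reasons.append("signature")
    if report.base_image not in APPROVED_BASES:
        reasons.append("policy")
    return reasons
```

Returning every failing gate, instead of stopping at the first, gives the team a complete list to fix in one pass.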
| Concern | Edge control | Notes |
|---|---|---|
| TLS | terminate + rotate certs | enforce modern ciphers |
| Routing | L7 rules | host/path-based routing |
| Abuse | rate limits + blocks | prevent amplification |
| Private services | internal exposure | avoid public endpoints |
Release patterns
| Pattern | Best for | Requirement |
|---|---|---|
| Blue/green | safe cutover | traffic switch + fast rollback |
| Canary | risk reduction | metric-based promotion |
| Rings | enterprise | progressive exposure |
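Metric-based promotion for a canary can be reduced to a small comparison against the baseline. The sketch below is an assumption-heavy illustration: the 1.2x error-rate margin, the latency budget parameter, and the function name are not taken from any specific rollout tool.

```python
def canary_decision(
    baseline_error_rate: float,
    canary_error_rate: float,
    canary_p95_ms: float,
    p95_budget_ms: float,
    max_error_ratio: float = 1.2,
) -> str:
    """Return "promote" or "rollback" from canary metrics (assumed thresholds)."""
    if canary_p95_ms > p95_budget_ms:
        return "rollback"
    # Allow the canary a small margin over the baseline error rate.
    if canary_error_rate > baseline_error_rate * max_error_ratio:
        return "rollback"
    return "promote"
```

In practice the decision would be repeated at each traffic step, with enough requests per step for the error rates to be meaningful.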
K8s ops must-haves
- node pool templates and capacity guardrails
- cluster autoscaler with max bounds
- resource requests/limits enforced
- admission policies for baseline security
- SLO dashboards (p95 latency + error rate)
- stateful backups outside the cluster
Segmentation blueprint
Network zones
- edge (public entry)
- app (private workloads)
- data (private databases)
- admin (restricted access)
- shared (observability, registry, secrets)
What vRack enables
- Private network connectivity across OVHcloud services.
- Tier isolation (app/data/admin) without public exposure.
- Hybrid patterns: metal + public cloud combined.
Operational rules
- Explicit routing governance: who changes what and when.
- Document failure behavior and test it.
- Monitor latency, packet loss, and DNS resolution paths.
Why IP portability matters
- Preserve network identity during migrations.
- Reduce DNS churn and partner allowlist changes.
- Enable staged cutovers with controlled routing.
Migration playbook (high-level)
1) prepare parallel stack
2) validate security + observability
3) move/attach IP ranges (when applicable)
4) controlled traffic shift
5) rollback readiness and post-cutover monitoring
Edge controls
- strict TLS configuration and rotation
- request size limits and timeouts
- rate limiting by IP and identity
- allowlist for admin endpoints
- anomaly detection from edge logs
- fast block/unblock workflow (runbook)
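Rate limiting by IP or identity is commonly implemented with a token bucket per key. The sketch below keeps state in process memory for illustration; a real edge deployment would usually rely on the proxy or a shared store, and the class name and parameters are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-key token bucket: `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._buckets = {}  # key -> (tokens, last_seen)

    def allow(self, key: str) -> bool:
        """Consume one token for `key` (an IP or identity) if available."""
        now = self._clock()
        tokens, last = self._buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return True
        self._buckets[key] = (tokens, now)
        return False
```

The injectable `clock` makes the limiter testable without sleeping; the key can be a client IP, an API token, or both combined.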
DNS rules
- Document resolution chain and ownership.
- Use internal names for private services.
- Standardize ingress hostnames and certificates.
- Monitor NXDOMAIN spikes and resolver latency.
| Symptom | Check | Action |
|---|---|---|
| Timeouts | edge logs + upstream latency | fix bottleneck, tighten timeouts |
| DNS failures | resolver health + TTL | stabilize resolver chain |
| Packet loss | retransmits, MTU | fix MTU/routing, isolate noisy neighbors |
| K8s connectivity | CNI + network policy | trace flows, simplify rules |
Bucket design
- Separate buckets by data classification and lifecycle needs.
- Define naming conventions and ownership labels.
- Prefer immutable versions for critical artifacts.
Security rules
- Least privilege with scoped credentials.
- Restrict cross-project access and enforce encryption posture.
- Audit access and alert on anomalies via centralized logging.
Lifecycle policy (cost control)
Lifecycle example
- day 0-30: hot
- day 31-180: cool
- day 181+: archive
- delete old versions per retention rules
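The example schedule maps directly to a function of object age. This is a minimal sketch of that mapping only; the tier names come from the example, the function name is hypothetical, and real transitions would be configured as bucket lifecycle rules rather than computed in application code.

```python
def storage_tier(age_days: int) -> str:
    """Return the tier an object of the given age belongs to in the example schedule."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days <= 30:
        return "hot"
    if age_days <= 180:
        return "cool"
    return "archive"
```

A helper like this is mostly useful for auditing: comparing where objects are with where the policy says they should be.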
DB-grade checklist
- Measure fsync latency and IO queue depth.
- Separate write-heavy volumes from OS when needed.
- Snapshots are not a substitute for logical backups.
- Restore path is tested and documented.
Design rules
- Know access patterns: many small IO vs large sequential streams.
- Keep multi-tenant isolation explicit (permissions and network boundaries).
- Monitor latency and throughput; define a performance budget.
- Avoid using shared file storage as a database log device.
Archive rules
- treat archive as immutable when possible
- keep retrieval times and costs documented
- define retention by classification (audit, legal, analytics)
- test restore from archive quarterly
| Control | How | Outcome |
|---|---|---|
| Least privilege | scoped credentials | reduced blast radius |
| Access review | monthly audit | remove stale access |
| Encryption | standardize policy | consistent posture |
| Logging | central logs + alerts | detect anomalies |
Top storage wastes
- No lifecycle rules: everything stays hot forever.
- Unlimited versions without cleanup.
- High-volume logs stored without retention tiers.
- Unexpected egress due to missing caching/edge strategy.
HA/DR mindset
- Know failure domains and document failover behavior.
- Measure RPO/RTO and validate with drills.
- Keep application timeouts aligned with failover behavior.
Backups
Backup contract
- automated backups enabled
- retention policy defined
- restore runbook documented
- monthly restore drill executed
- evidence stored (timestamps, results)
Performance loop
DB tuning loop
1) capture slow queries
2) explain/analyze
3) index or rewrite
4) validate p95/p99 latency
5) protect with regression dashboards
Safe migrations
Safe migration flow
1) backup + verified restore path
2) schema change in small steps
3) compatibility window (dual-write if needed)
4) monitor errors and latency
5) cleanup after stabilization
| Topic | Decision | Rule |
|---|---|---|
| TTL | per data class | no infinite TTL without justification |
| Eviction | policy choice | align with data criticality |
| Persistence | optional | cache is not source of truth |
| HA | replication/failover | test failover behavior |
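The TTL and persistence rules can be illustrated with a cache-aside read path. The sketch below uses an in-process dictionary as a stand-in for Redis so it runs without a server; `TTLCache` and `cache_aside` are hypothetical names, and a stored `None` is treated as a miss for simplicity.

```python
import time
from typing import Callable


class TTLCache:
    """In-process stand-in for a Redis-style cache; every entry must have a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: object, ttl_seconds: float) -> None:
        if ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be positive (no infinite TTL)")
        self._data[key] = (self._clock() + ttl_seconds, value)


def cache_aside(cache: TTLCache, key: str, load: Callable[[], object],
                ttl_seconds: float):
    """Read through the cache; on a miss, load from the source of truth and store."""
    value = cache.get(key)
    if value is None:
        value = load()
        cache.set(key, value, ttl_seconds)
    return value
```

Because the loader always reads the source of truth, losing the cache costs latency but not data, which is the point of the "cache is not source of truth" rule.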
| Area | What matters | Action |
|---|---|---|
| Mapping | field types, analyzers | freeze mapping early |
| Ingestion | bulk + backpressure | avoid overload loops |
| Retention | index lifecycle | rollover + delete |
| Cost | storage + query load | sample logs, archive cold data |
Data platform sketch
Ingest -> Object Storage (raw)
Transform -> compute (batch / containers)
Serve -> APIs + indexes + warehouse
Govern -> access model + retention + audit trail
Rules
- treat raw data as immutable
- separate hot and cold datasets
- define unit-cost metrics per pipeline
Zero-downtime patterns
- Expand/contract schema (add fields first, remove later).
- Backfill with throttling and progress evidence.
- Compatibility window with dual reads/writes if required.
- Cutover with feature flags and rapid rollback.
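The throttled backfill step can be sketched as a batch loop that pauses between batches and records progress as evidence. The names `backfill` and `migrate_batch`, the default batch size, and the progress record shape are assumptions; the migration callback itself is expected to be idempotent.

```python
import time
from typing import Callable, Sequence


def backfill(
    ids: Sequence[int],
    migrate_batch: Callable[[Sequence[int]], None],
    batch_size: int = 100,
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Migrate ids in batches, pausing between them; return progress records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    progress = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        migrate_batch(batch)  # must be idempotent so a rerun is safe
        done = start + len(batch)
        progress.append({"done": done, "total": len(ids)})
        if done < len(ids):
            sleep(pause_seconds)  # throttle to protect the live workload
    return progress
```

Persisting the progress records, rather than only returning them, lets an interrupted backfill resume from the last completed batch.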
Data cost drivers
- always-on replicas
- long retention for backups/logs
- inefficient queries scanning too much data
- overprovisioned instance sizes
Controls
- rightsizing reviews
- query performance budgets
- retention policy-as-code
- archive cold datasets
What LDP is for
- Central log collection and retention.
- Operational dashboards and investigations.
- Security/audit log correlation.
Ingestion discipline
Log policy
- keep high-signal logs hot (errors, audit)
- sample verbose logs (debug)
- apply retention tiers
- alert on log pipeline failures
Golden signals (RED)
- Rate (traffic)
- Errors (error rate)
- Duration (latency)
USE (infra)
- Utilization
- Saturation
- Errors
Dashboards
- SLO: p95 latency + error rate
- saturation: CPU/mem/IO/conn
- capacity trend: 30d forecast
Minimum viable tracing
- Correlation ID across services and logs.
- Trace external dependencies (DB, cache, HTTP calls).
- Sampling keeps slow/error traces by priority.
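Priority sampling can be expressed as a keep/drop decision made once a trace is complete. This sketch always keeps error and slow traces and samples the rest; the 500 ms threshold, the 5% base rate, and the function name are assumptions rather than defaults of any tracing product.

```python
import random
from typing import Optional


def keep_trace(
    duration_ms: float,
    is_error: bool,
    slow_threshold_ms: float = 500.0,
    base_rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> bool:
    """Always keep error and slow traces; sample the remainder at base_rate."""
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    source = rng if rng is not None else random
    return source.random() < base_rate
```

Deciding after the trace finishes (tail sampling) is what makes it possible to prioritise slow and failed requests; head sampling cannot know the outcome in advance.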
| Alert | Condition | Runbook |
|---|---|---|
| SLO breach | error rate/latency over threshold | rollback/mitigate/scale |
| Saturation | CPU/mem/IO high + queue | rightsize/shard/cache |
| Security | auth anomalies | rotate creds/block/investigate |
| Backup failure | job missing/error | repair/re-run/verify restore |
Incident lifecycle
Detect -> Triage -> Mitigate -> Recover -> Postmortem
Postmortem includes
- timeline
- root cause and contributing factors
- detection gaps
- action items with owners and deadlines
Cost levers
- Sampling for traces and high-volume logs.
- Keep audit/security logs high priority.
- Short hot retention, long archive retention.
Principles
- least privilege by role
- separate admin vs deploy vs read-only
- time-bound access for sensitive actions
- credential rotation policy
- audit trail for privileged operations
| Phase | What | Control |
|---|---|---|
| Create | secure generation | no weak secrets |
| Store | vault/secrets store | access logs |
| Inject | runtime fetch | no secrets baked into images |
| Rotate | scheduled | alert on failures |
| Revoke | incident response | fast containment |
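For the Create phase, Python's standard `secrets` module provides cryptographically strong random tokens. The sketch below wraps `secrets.token_urlsafe` with a minimum size; the 32-byte floor and the function name are assumptions, and storage and injection are out of scope here.

```python
import secrets

MIN_BYTES = 32  # assumed floor: 256 bits of entropy


def generate_secret(nbytes: int = MIN_BYTES) -> str:
    """Return a URL-safe random token; refuse requests below the entropy floor."""
    if nbytes < MIN_BYTES:
        raise ValueError(f"nbytes must be at least {MIN_BYTES}")
    return secrets.token_urlsafe(nbytes)
```

The generated value should go straight into the secrets store and never be logged or written to a repository.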
Baseline checklist
- OS hardening and minimal packages.
- Patch cadence + emergency patch workflow.
- Container scanning + signed images.
- Runtime controls: least privilege and no privileged containers by default.
- Audit logs routed centrally and retained.
Controls
- strict TLS configuration
- request size limits
- rate limiting by IP and token
- allowlist for admin endpoints
- anomaly detection from edge logs
- fast block/unblock workflow
| Area | Control | Proof |
|---|---|---|
| Classification | labels + access policy | inventory report |
| Retention | policy-as-code | config snapshots |
| Encryption | standard posture | audit checks |
| Backups | restore drills | restore logs |
IR playbook
1) contain: block entry, isolate systems
2) preserve evidence: logs, snapshots
3) rotate credentials: tokens, DB creds, registry secrets
4) eradicate: patch, remove persistence
5) recover: restore services, monitor
6) learn: postmortem + guardrails
Weekly
- anomaly review
- top spenders scan
Monthly
- rightsizing + idle cleanup
- storage lifecycle enforcement
- log ingestion reduction
- unit cost KPI review
| Lever | Action | Proof |
|---|---|---|
| Rightsize | adjust CPU/RAM | utilization report |
| Scale | autoscale safely | SLO stability |
| Shutdown | stop non-prod nightly | schedule evidence |
| Tiering | metal only when justified | latency/throughput proof |
Top storage wastes
- No lifecycle policies (everything stays hot forever).
- Old versions kept indefinitely.
- Audit logs not tiered (hot vs archive).
- Uncontrolled data egress patterns.
Cost drivers
- always-on replicas
- long backup retention
- overprovisioned instance sizes
- inefficient queries
Controls
- rightsizing reviews
- retention limits
- query budgets and slow-query governance
Signal-first policy
- Keep audit/security logs on long retention.
- Sample verbose application logs.
- Archive bulk logs; keep hot only what drives decisions.
| KPI | Definition | Use |
|---|---|---|
| Cost / 1k requests | infra spend divided by traffic | scale economics |
| Cost / tenant | monthly spend per customer | pricing sanity |
| Cost / GB stored | storage + lifecycle efficiency | retention tuning |
| Cost / deploy | CI/CD + tests + artifacts | pipeline efficiency |
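The first two KPIs are simple ratios, sketched below. The function names are hypothetical, and splitting spend evenly across tenants is an assumption; real showback often weights tenants by usage.

```python
def cost_per_1k_requests(infra_spend: float, requests: int) -> float:
    """Infra spend divided by traffic, expressed per 1,000 requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return infra_spend * 1000 / requests


def cost_per_tenant(monthly_spend: float, tenants: int) -> float:
    """Monthly spend split evenly per customer (an even split is an assumption)."""
    if tenants <= 0:
        raise ValueError("tenants must be positive")
    return monthly_spend / tenants
```

For example, a spend of 500 over 2,000,000 requests works out to 0.25 per 1,000 requests, in whatever currency the spend is tracked.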
