☁️ Scaleway Cloud – Hyper-Dense Guide
A practical, production-focused map of Scaleway services: compute, Kubernetes, serverless, storage, managed data services, observability, security, governance, and cost control.
Foundations
Account/projects model, API-first, regions/AZ strategy, service boundaries, production basics.
Tags: Core, Start, Ops
Reference Landing Zone
Network segmentation, shared services, logging, secrets, environments, delivery pipelines.
Tags: Platform, LZ, Guardrails
IaC & Automation
Terraform-first patterns, idempotent deploys, drift control, CI gates, reproducible environments.
Tags: IaC, Terraform, Drift
Reference Architectures
3-tier web, container platform, event-driven serverless, data platform patterns.
Tags: Architecture, Patterns, Design
APIs & Tooling
Console vs CLI vs API, access tokens, automation workflows, operational scripting conventions.
Tags: API, CLI, Automation
Cheat-sheet
Quick commands, checklists, deployment templates, incident triage shortcuts.
Tags: Quickstart, Checklists, Ops
Instances (Virtual Machines)
CPU/memory sizing, disk strategy, images, patching, backups, lifecycle and automation.
Tags: Compute, VM, Sizing
Elastic Metal & Dedicated
Bare metal choices, dedicated performance, provisioning model, workloads: DB, GPU, high IO.
Tags: Bare Metal, Performance, Ops
GPU & AI Workloads
GPU sizing, storage throughput, container strategy, batch inference patterns, cost controls.
Tags: GPU, AI, Cost
Images & Bootstrapping
Golden images, cloud-init baseline, configuration drift prevention, secrets injection.
Tags: Images, Cloud-init, Hygiene
Backup & DR
RPO/RTO tiers, snapshot strategy, restore drills, multi-AZ design, runbooks.
Tags: Backup, DR, Runbooks
Operations Playbook
Access, patch cycles, monitoring agents, hardening, incident steps, postmortems.
Tags: Ops, SRE, Runbook
Kubernetes (Kapsule / Kosmos)
Cluster design, node pools, upgrades, network policy, multi-AZ, security & ops patterns.
Tags: K8s, Managed, Upgrades
Container Registry
Immutable images, promotion by digest, scanning gates, SBOM, signing strategy.
Tags: Registry, Supply Chain, CI/CD
Serverless Containers
Stateless web workloads, scale-to-zero, deploy flows, timeouts, concurrency, reliability.
Tags: Serverless, Containers, Events
Serverless Functions
Triggers, packaging, environments, retries, idempotency, dead-letter strategy.
Tags: Serverless, Functions, Reliability
Ingress & Exposure
Edge entry points, TLS, L7 routing, WAF-equivalent strategy, private services exposure.
Tags: Ingress, TLS, Edge
GitOps & Delivery
Declarative deploys, environments, progressive rollout (canary/blue-green), rollback playbooks.
Tags: GitOps, Rollouts, Evidence
Network Core
Private networks/VPC-like patterns, subnetting, segmentation, routing, service boundaries.
Tags: Network, Segmentation, Routing
Public Exposure
Public IP strategy, NAT/egress control, reverse proxy layer, rate limits and resilience.
Tags: Public, Edge, DDoS-ish
Network Security
Security groups, least privilege, egress allowlists, service-to-service isolation.
Tags: Security, SG, Egress
DNS Patterns
Split-horizon, internal naming, private service endpoints, cluster DNS considerations.
Tags: DNS, Private, Ops
Hybrid Connectivity
IP planning, tunnels, routing rules, operational ownership, failover behavior documentation.
Tags: Hybrid, Routing, Docs
Network Troubleshooting
Latency, MTU, DNS failures, packet loss, K8s connectivity issues, structured triage.
Tags: Triage, Latency, Runbook
Object Storage (S3-compatible)
Buckets, lifecycle, versioning strategy, access controls, encryption, backup/archives.
Tags: Storage, S3, Lifecycle
Block Storage
SSD volumes, IOPS/throughput thinking, DB workloads, snapshots, resize, consistency rules.
Tags: Block, SSD, Snapshots
Storage Backup Strategy
3-2-1 design, immutable copies, restore drills, retention tiers, legal hold patterns.
Tags: Backup, Immutability, RPO
Storage Security
Bucket policies, key management pattern, least privilege, audit logs, access reviews.
Tags: Security, IAM, Audit
Storage Performance
Throughput vs IOPS, parallelism, multipart upload, cache strategy, DB WAL patterns.
Tags: Perf, IOPS, Tuning
Storage Cost Control
Lifecycle policies, cold tiers, log ingestion vs budget, data egress awareness.
Tags: FinOps, Lifecycle, Egress
Managed Relational DB
PostgreSQL/MySQL managed instances: HA mindset, backups, upgrades, monitoring, safe migrations.
Tags: DB, PostgreSQL, Managed
Serverless SQL
Serverless database principles: scale-to-zero, connection patterns, pooling, latency trade-offs.
Tags: Serverless, SQL, Trade-offs
Managed Redis
Low-latency caching, persistence choices, eviction policy, HA, session storage design.
Tags: Redis, Cache, Latency
Managed NoSQL
Document-oriented DB patterns: indexing, TTL, schema evolution, backups, scaling strategy.
Tags: NoSQL, Documents, Modeling
Analytics (Warehouse)
Analytical workloads: ingestion, partitioning, materialized views, cost control, governance basics.
Tags: Analytics, Warehouse, Cost
Search / OpenSearch
Index design, shards/replicas, ingestion pipelines, query profiling, retention, observability.
Tags: Search, OpenSearch, Tuning
Cockpit (Observability)
Unified observability: metrics, logs, dashboards, alert routing, and operational visibility.
Tags: Observability, Metrics, Logs
Central Logging
Log taxonomy, retention tiers, sampling, PII discipline, cost-aware ingestion strategy.
Tags: Logs, Cost, Retention
APM & Tracing
Distributed tracing, correlation IDs, RED/USE metrics, latency SLOs, dependency maps.
Tags: APM, Tracing, SLO
Alerting System
Actionable alerts only: ownership, severity, runbooks, escalation, noise control.
Tags: Alerts, Noise, Runbooks
SRE Workflow
Incident lifecycle, postmortems, error budgets, continuous improvement loops.
Tags: SRE, Postmortem, Loop
Observability Cost Control
High-volume telemetry: sampling, drop rules, archive, and “signal-first” dashboards.
Tags: FinOps, Sampling, Archive
Identity & Access
Service accounts/tokens, RBAC mapping, least privilege, secretless runtime access patterns.
Tags: IAM, RBAC, Tokens
Secrets & Key Management
Secrets lifecycle, rotation, injection patterns, auditability, and “no secrets in code” rules.
Tags: Secrets, Rotation, Audit
Security Baseline
Hardening, patch cadence, vulnerability management, supply chain controls, secure defaults.
Tags: Baseline, Hardening, Scan
Edge Security
TLS, WAF-like controls, rate limiting, bot mitigation, incident playbooks at the edge.
Tags: Edge, TLS, Rate limit
Compliance & Data Protection
Data classification, retention, encryption posture, audit evidence, operational controls.
Tags: Compliance, Retention, Evidence
Security Incident Response
Containment steps, credential rotation, forensic preservation, timeline, and prevention work.
Tags: IR, Forensics, Playbook
FinOps Core
Budgets, tags/labels, showback, cost anomalies, and monthly optimization routines.
Tags: FinOps, Budgets, Anomaly
Compute Cost Playbook
Rightsizing, autoscaling, reserved/commit models, environment shutdown, batch windows.
Tags: Optimize, Scale, Shutdown
Storage Cost Playbook
Lifecycle rules, cold tiers, archive policies, versioning, and egress awareness.
Tags: Storage, Lifecycle, Egress
Data Services Cost
Replica strategy, HA tiers, backup retention, scaling triggers, and query-cost discipline.
Tags: Data, HA, Retention
Observability Spend
Control ingestion, sampling, keep only high-value logs hot, archive the rest.
Tags: Logs, Sampling, Archive
FinOps KPIs
Unit cost metrics, cost per request, cost per tenant, cost per GB stored, cost per deploy.
Tags: KPIs, Unit cost, Govern
Scope & environment model
Recommended environments
- sandbox (experiments)
- dev (integration)
- staging (release candidate)
- prod (strict guardrails)
Core principles
- separate environments by projects and access controls
- define naming standards and ownership labels
- enforce defaults via IaC templates
Production rules (non-negotiable)
- Everything deployable via IaC (no “clickops” drift).
- Central logs + alerts from day one (no blind spots).
- Secrets not stored in app config or repos (rotation required).
- Least privilege access, time-bound where possible.
- Backup/restore drills are scheduled and measured.
Service selection framework
| Need | Default choice | Escalate to |
|---|---|---|
| Fast web API | managed containers / K8s | VMs for special cases |
| Batch jobs | serverless containers | dedicated compute for heavy IO/GPU |
| Relational DB | managed DB | bare metal for extreme constraints |
| Object data | object storage | archive tiers and lifecycle rules |
Topology blueprint
Edge (public)
- reverse proxy / ingress
- TLS termination
- rate limiting + bot protection
Private networks
- app subnet(s)
- data subnet(s)
- admin subnet(s) (bastion-like access)
Shared
- central logging
- secrets + rotation
- CI/CD runners (if self-hosted)
- artifact registry
Shared services (platform subscription equivalent)
- Central observability workspace (metrics/logs) and alert routing.
- Secrets store + rotation workflow (and incident “break glass” policy).
- Container registry and artifact promotion rules.
- Network egress control points and DNS/naming conventions.
Guardrails (policy-as-code mindset)
- Enforce naming/labels and ownership on resources.
- Block direct public exposure of data services unless explicitly approved.
- Mandatory logging configuration for compute and platforms.
- Minimum baseline for TLS, credentials, and patching.
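As an illustration of the guardrails above, here is a minimal policy-check sketch. The resource dictionary shape, the label names (`owner`, `env`), the `<env>-<service>` naming pattern, and the `exposure_approved` flag are all assumptions made for this example, not a Scaleway API.

```python
import re

# Hypothetical conventions: adapt the labels and pattern to your own standard.
REQUIRED_LABELS = ("owner", "env")
NAME_PATTERN = re.compile(r"^(sandbox|dev|staging|prod)-[a-z0-9]+(-[a-z0-9]+)*$")
DATA_KINDS = {"database", "redis", "object_bucket"}


def guardrail_violations(resource):
    """Return a list of human-readable violations for one resource description."""
    found = []
    if not NAME_PATTERN.match(resource.get("name", "")):
        found.append("name does not follow the <env>-<service> convention")
    labels = resource.get("labels", {})
    for label in REQUIRED_LABELS:
        if not labels.get(label):
            found.append(f"missing label: {label}")
    is_data_service = resource.get("kind") in DATA_KINDS
    if is_data_service and resource.get("public") and not resource.get("exposure_approved"):
        found.append("data service is publicly exposed without approval")
    return found
```

A CI gate could run this over the planned resources and fail the pipeline when any list is non-empty.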
Ops evidence: what you must be able to prove
| Evidence | How | Why |
|---|---|---|
| Who deployed what | CI logs + artifact digests | auditability |
| Security posture | scan reports + patch reports | risk control |
| Recoverability | restore drill results | real DR |
| SLO compliance | dashboards + incidents | customer trust |
Terraform workflow (gold standard)
Stages
1) fmt + validate
2) plan (saved plan)
3) policy checks (custom)
4) approval gate (prod)
5) apply
6) smoke tests + monitoring hooks
Drift control
- Scheduled plan to detect drift.
- Alert on out-of-band changes.
- Either reconcile (apply) or revert (incident).
- Track “exceptions” explicitly and time-bound them.
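A scheduled drift check can be sketched around `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when the plan contains changes. The status names and the `detect_drift` helper below are illustrative assumptions. What to do with a "drift" result (alert, reconcile, or revert) stays a team decision.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANING = {0: "in_sync", 1: "error", 2: "drift"}


def classify_plan_exit(code):
    """Map a plan exit code to a drift status; unknown codes count as errors."""
    return PLAN_EXIT_MEANING.get(code, "error")


def detect_drift(workdir):
    """Run a non-interactive plan in `workdir` and return the drift status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Running `detect_drift` from a scheduler and alerting on anything other than `"in_sync"` covers the first two bullets above.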
3-tier baseline (private-first)
Internet -> Edge (TLS + routing + rate limits)
-> App (containers / K8s / VMs in private networks)
-> Data (managed DB + object storage)
Observability + secrets + backups are platform-wide.
K8s platform baseline
- Separate system and workload node pools.
- GitOps deployment with environment overlays.
- Network policy + minimal service exposure.
- Supply chain gates: scan + SBOM + signature verification.
- Observability: metrics + logs + traces as default.
Event-driven serverless baseline
Triggers
-> Serverless Functions / Containers
-> durable storage (DB/object)
-> dead-letter strategy + alerts
-> idempotency keys for every handler
Data platform sketch
Ingest -> Object storage (raw)
Transform -> compute (batch / containers)
Serve -> warehouse / search index / APIs
Govern -> access model + retention + audit trail
Automation conventions
- Prefer API/IaC over console for repeatability.
- Store credentials securely; rotate and audit.
- Every script must be idempotent and log its actions.
- Keep a “break glass” playbook, but isolate it.
Script contract
- inputs validated
- dry-run supported
- logs to stdout in structured lines
- exit codes reliable
- safe retries
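The script contract can be sketched as a small Python entry point. The `--name` argument, the event names, and the exit-code values are hypothetical choices for illustration; the point is the shape: validate, support dry-run, log structured lines, return a reliable exit code.

```python
import argparse
import json
import re

EXIT_OK = 0
EXIT_INVALID_INPUT = 2


def log(event, **fields):
    """Emit one structured JSON line on stdout."""
    print(json.dumps({"event": event, **fields}, sort_keys=True), flush=True)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical 'ensure resource' script")
    parser.add_argument("--name", required=True)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)

    # Validate inputs before touching anything.
    if not re.fullmatch(r"[a-z0-9-]{1,63}", args.name):
        log("input.invalid", name=args.name)
        return EXIT_INVALID_INPUT

    if args.dry_run:
        # Report what would happen without changing state.
        log("plan", action="ensure", name=args.name, dry_run=True)
        return EXIT_OK

    # Real work goes here; it should be safe to repeat (idempotent).
    log("apply", action="ensure", name=args.name)
    return EXIT_OK
```

When run as a program, wrap the call as `sys.exit(main())` so the shell sees the same exit code the function returns.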
Platform checklist
Landing zone
- private networks segmentation
- edge entry points minimal
- centralized observability + alert routing
- secrets lifecycle + rotation
- backups + restore drills
- IaC modules + CI gates
- supply chain controls for containers
Serverless checklist
Serverless reliability
- idempotency keys
- bounded retries
- dead-letter strategy + alerts
- timeouts sized per workload
- concurrency limits
- structured logs + tracing
Cost checklist
FinOps loop (monthly)
- top 10 spenders review
- rightsizing candidates
- storage lifecycle enforcement
- log ingestion reduction
- idle resources cleanup
- unit cost KPIs (per request / per tenant)
Incident shortcut
Triage steps
1) user impact scope (SLO breach?)
2) recent deployments
3) saturation signals (CPU/mem/IO/conn)
4) network/DNS failures
5) data errors (locks/slow queries)
6) rollback or mitigation
7) postmortem actions
Sizing method (no guessing)
| Signal | What to watch | Action |
|---|---|---|
| CPU | p95 utilization + steal | rightsize / scale out |
| Memory | pressure + OOM risk | increase RAM / reduce footprint |
| Disk | IOPS/throughput + queue | move to faster volume / shard |
| Network | pps + retransmits | tune edge / improve routing |
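To make the CPU row concrete, the sketch below computes a nearest-rank p95 from utilization samples and maps it to an action. The 80% and 20% thresholds are illustrative assumptions, not recommendations from this guide.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sequence of numbers."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def cpu_action(samples, high=80.0, low=20.0):
    """Suggest an action from p95 CPU utilization (percent). Thresholds are assumed."""
    value = p95(samples)
    if value >= high:
        return "scale out or upsize"
    if value <= low:
        return "rightsize down"
    return "keep"
```

Using p95 rather than the mean keeps short bursts visible without letting a single spike drive the decision.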
Disk strategy (DB-grade thinking)
- Separate OS disk from data disk when needed.
- For databases: isolate WAL/redo logs if possible; measure IOPS and fsync latency.
- Snapshots are not backups unless restore is tested and retention is enforced.
- Use filesystem options aligned with workload (barriers, journaling choices).
Ops baseline
- SSH via controlled entry (no open world access).
- Patching cadence + emergency patch process.
- Central logs and metrics with alerts on saturation.
- Immutable infrastructure mindset where possible (rebuild over patch drift).
Decision criteria
| Constraint | Why metal | Mitigation if not |
|---|---|---|
| Extreme IO | lowest latency, dedicated throughput | sharding + caching |
| Licensing | per-core constraints | optimize core counts |
| Isolation | strict tenancy needs | strong security baseline |
| GPU intensive | dedicated accelerators | batch windows + scaling |
GPU platform patterns
- Prefer containers for reproducibility (drivers/toolkit pinned).
- Separate training vs inference: different scheduling and scaling models.
- Use batch windows and auto-shutdown for idle GPU time.
Cost controls (mandatory)
- Define maximum concurrency and max runtime per job.
- Track cost per 1k inferences / per training epoch.
- Cache model artifacts in object storage with versioning.
Golden image contract
Golden image must include
- base hardening (sshd settings, firewall defaults)
- monitoring agent install step
- log forwarding configuration
- time sync and DNS defaults
- minimal packages only
cloud-init responsibilities
- inject host keys safely
- configure app runtime
- register into monitoring
- pull secrets from secure store
| Tier | Target | Design | Verification |
|---|---|---|---|
| Tier 0 | minutes | multi-AZ + replication | game day drills |
| Tier 1 | hours | snapshots + managed backups | monthly restores |
| Tier 2 | day | object backups + manual | quarterly audits |
Operational loop
Daily
- check SLO dashboards
- review alerts + top errors
- confirm backup jobs
Weekly
- patch window for non-prod
- capacity review (CPU/mem/IO)
- vulnerability scan review
Monthly
- cost review + rightsizing
- restore drill
- postmortem action items verification
Cluster foundation
- Separate system and workload pools.
- Define ingress strategy and TLS as a platform standard.
- Use autoscaling carefully: HPA + cluster autoscaler with safe limits.
- Pin base images and enforce immutable deployments.
Security essentials
- Network policies: default deny + allow by service needs.
- RBAC: least privilege, separate admin from deploy roles.
- Pod security and runtime constraints (no privileged by default).
- Supply chain: scan + SBOM + signature validation in CI/CD.
Operations
- Upgrades: staged, maintenance windows, canary cluster if needed.
- Observability: cluster + node + workload dashboards.
- Backups: stateful systems are backed up outside the cluster; configs are GitOps.
Resilience
Resilience checklist
- readiness/liveness probes
- pod disruption budgets
- multi-node spread (anti-affinity)
- rate limits at ingress
- graceful shutdown
- chaos-style drills (optional but valuable)
Supply chain gates
| Gate | What it checks | Block on |
|---|---|---|
| Vuln scan | CVEs in OS/libs | high/critical |
| SBOM | dependency inventory | missing SBOM |
| Signature | image provenance | unsigned images |
| Policy | base image allowlist | unapproved base |
Best for
- Stateless web APIs and job-like workloads.
- Scale-to-zero services with bursty traffic.
- Event-driven handlers packaged as containers.
Reliability checklist
Reliability
- strict request timeout budgeting
- bounded concurrency
- retry policy aligned with idempotency
- dead-letter handling for async patterns
- structured logs + correlation IDs
Cost discipline
- Track cost per request and cost per job.
- Cap max scale for “runaway traffic” scenarios.
- Use caching and edge rate limits to avoid amplification.
Golden rules
- Idempotency is mandatory for event handlers.
- Use deterministic retry strategy (max attempts, backoff, time budget).
- Write logs as structured events with correlation IDs.
- Separate “poison messages” to a dead-letter stream and alert on it.
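The deterministic retry rule (max attempts, backoff, time budget) could look like the sketch below. The default values and the injectable `sleep` and `clock` parameters are assumptions chosen to keep the example testable; no jitter is applied so the delays stay predictable.

```python
import time


def retry(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, time_budget=30.0,
          sleep=time.sleep, clock=time.monotonic):
    """Call fn() with bounded exponential backoff; re-raise when limits are hit."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            out_of_attempts = attempt == max_attempts
            out_of_time = (clock() - start) + delay > time_budget
            if out_of_attempts or out_of_time:
                raise  # the caller routes the message to the dead-letter stream
            sleep(delay)
```

A production version would normally retry only errors classified as retryable instead of catching every exception.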
Handler skeleton (concept)
- validate payload
- compute idempotency key
- check processed marker
- process business logic
- persist result atomically
- return success
- on error: classify retryable vs non-retryable
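The handler skeleton can be sketched in Python as follows. The `order_id` field, the exception class names, and the plain dictionary used as the processed-marker store are hypothetical stand-ins; a real handler would use a durable store with an atomic write.

```python
import hashlib
import json


class RetryableError(Exception):
    """Transient failure: the platform may deliver the event again."""


class NonRetryableError(Exception):
    """Permanent failure: send the event to the dead-letter stream."""


def idempotency_key(payload):
    """Stable key derived from the canonical JSON form of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def handle(payload, store, process):
    # 1) validate payload
    if "order_id" not in payload:
        raise NonRetryableError("missing order_id")
    # 2) compute idempotency key, 3) check processed marker
    key = idempotency_key(payload)
    if key in store:
        return store[key]
    # 4) process business logic, classifying errors
    try:
        result = process(payload)
    except (TimeoutError, ConnectionError) as exc:
        raise RetryableError(str(exc)) from exc
    # 5) persist result (a dict write stands in for an atomic durable write)
    store[key] = result
    # 6) return success
    return result
```

Delivering the same event twice returns the stored result and does not run the business logic again.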
| Concern | Edge control | Notes |
|---|---|---|
| TLS | terminate + rotate certs | enforce modern ciphers |
| Routing | L7 rules | path-based and host-based |
| Abuse | rate limits + IP rules | prevent traffic amplification |
| Private services | internal routing | avoid public endpoints |
Release patterns
| Pattern | Best for | Requirement |
|---|---|---|
| Blue/green | safe cutover | traffic switch + fast rollback |
| Canary | risk reduction | metric-based promotion |
| Rings | enterprise | progressive exposure |
Segmentation blueprint
Network zones
- edge (public entry)
- app (private workloads)
- data (private databases)
- admin (restricted access)
- shared (observability, registry, secrets)
Public entry rules
- Terminate TLS at a controlled edge layer.
- Rate limit by IP and by identity where possible.
- Implement request timeouts and size limits.
- Log edge events and alert on anomalies.
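Rate limiting by IP or identity is often implemented as a token bucket per key. The sketch below is one such implementation under assumed parameters (`rate` in tokens per second, `capacity` as burst size). It is in-process only, so a real edge layer would need shared state or the proxy's built-in limiter.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; each request costs one token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """One bucket per key, where the key is a client IP or an identity/token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}

    def allow(self, key):
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[key].allow()
```

Checking both `limiter.allow(client_ip)` and `limiter.allow(api_token)` gives the "by IP and by identity" behavior described above.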
| Control | Goal | Common failure |
|---|---|---|
| Ingress rules | allow only required ports | 0.0.0.0/0 to admin ports |
| Egress rules | prevent data exfil | allow all outbound by default |
| Service isolation | contain compromise | flat network with shared creds |
DNS rules
- Document resolution chain (who resolves what, where, and why).
- Use internal names for private services; keep external DNS minimal.
- For Kubernetes: standardize service discovery and ingress hostnames.
Hybrid contract
Hybrid must define
- prefix plan (no overlaps)
- routing ownership (who changes what)
- failover behavior (tested)
- change windows and rollback
- monitoring for tunnel health
Triage checklist
| Symptom | Check | Action |
|---|---|---|
| Timeouts | edge logs + upstream latency | tighten timeouts, fix bottleneck |
| DNS failures | resolver health + TTL | stabilize DNS chain |
| Packet loss | retransmits, MTU | fix MTU or routing |
| Slow K8s | network policy + CNI | trace flows, simplify rules |
Bucket design
- Separate buckets by data classification and lifecycle needs.
- Define naming conventions and ownership labels.
- Prefer immutable object versions for critical artifacts.
Security rules
- Least privilege: scoped credentials and access review.
- Encrypt data and restrict cross-project access.
- Audit access and alert on anomalies.
Lifecycle policy (cost control)
Lifecycle example
- day 0-30: hot
- day 31-180: cool
- day 181+: archive
- delete markers and old versions per policy
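The age thresholds in the lifecycle example translate directly into a small function. The tier names mirror the example (hot, cool, archive) and are not actual storage-class identifiers; mapping them to provider classes is left as an assumption to verify.

```python
from datetime import date


def storage_tier(age_days):
    """Return the tier for an object of the given age, per the example policy."""
    if age_days < 0:
        raise ValueError("age_days must be >= 0")
    if age_days <= 30:
        return "hot"
    if age_days <= 180:
        return "cool"
    return "archive"


def tier_for(last_modified, today):
    """Convenience wrapper working on dates instead of a day count."""
    return storage_tier((today - last_modified).days)
```

The same function can drive a report that lists objects sitting in a hotter tier than the policy allows.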
DB-grade checklist
- Measure fsync latency and queue depth.
- Separate write-heavy volumes from OS when needed.
- Snapshots are not a substitute for logical backups.
- Test restore path and automate validation.
3-2-1
- 3 copies
- 2 different media (block + object)
- 1 offsite (separate project/zone)
Operational must-haves
- documented restore steps
- monthly restore drill
- retention and deletion protection
| Control | How | Outcome |
|---|---|---|
| Least privilege | scoped credentials | reduced blast radius |
| Access review | monthly audit | remove stale access |
| Encryption | standardize policy | consistent posture |
| Logging | central logs + alerts | detect anomalies |
Performance guidance
- Object storage: parallel uploads + multipart for big objects.
- Block storage: monitor queue depth and fsync latency for databases.
- Caching: avoid re-downloading artifacts; version them and cache safely.
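For the object storage bullet, the sketch below shows how a large object can be split into byte ranges and uploaded in parallel. `upload_part` is a hypothetical callable supplied by the caller (for example a wrapper around an S3-compatible client); the part size and worker count are assumptions to tune.

```python
from concurrent.futures import ThreadPoolExecutor


def part_ranges(size, part_size):
    """Split `size` bytes into inclusive (start, end) ranges of at most `part_size`."""
    if size < 0 or part_size <= 0:
        raise ValueError("size must be >= 0 and part_size must be > 0")
    return [(start, min(start + part_size, size) - 1) for start in range(0, size, part_size)]


def upload_in_parallel(size, part_size, upload_part, workers=4):
    """Call upload_part(start, end) for every range; results keep part order."""
    ranges = part_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: upload_part(r[0], r[1]), ranges))
```

Keeping results in part order matters because multipart completion typically needs the parts listed in sequence.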
Cost levers
- Lifecycle transitions and deletion policies.
- Archive old versions; keep only what you restore.
- Track egress drivers (CDN/edge caches reduce outbound).
HA/DR mindset
- Know your failure domains and design accordingly.
- Prefer managed HA where available; document failover behavior.
- Measure RPO/RTO and validate with drills.
Backups and safe migrations
Safe migration flow
1) backup + verify restore path
2) schema change in small steps
3) dual-write or compatibility window (if needed)
4) monitor errors + latency
5) cleanup after stabilization
Performance loop
DB tuning loop
1) capture slow queries
2) explain/analyze
3) index or rewrite
4) validate with p95/p99
5) regression guardrails (tests + dashboards)
Serverless database pitfalls
- Cold start latency can hit the first queries; budget for it.
- Connection storms are common: use pooling or connection limits.
- Long transactions reduce scalability; keep transactions short.
| Topic | Decision | Rule |
|---|---|---|
| TTL | per key class | no infinite TTL without justification |
| Eviction | policy choice | align with data criticality |
| Persistence | if needed | cache != source of truth |
| HA | replication | test failover behavior |
Modeling checklist
- Design queries first, then indexes.
- Use explicit version fields for schema evolution.
- TTL for ephemeral data and cost control.
- Backup strategy independent from the DB engine.
Warehouse rules
- ingest in append-only patterns
- partition by time and key dimensions
- keep hot and cold datasets separate
- track cost per query / per dashboard
- implement retention and archiving
| Area | What matters | Action |
|---|---|---|
| Mapping | field types, analyzers | freeze mapping early |
| Shards | parallelism vs overhead | size shards sensibly |
| Ingestion | bulk + backpressure | avoid overload loops |
| Retention | index lifecycle | rollover + delete |
Observability model
Signals
- metrics (fast, low cost)
- logs (deep, higher cost)
- traces (request path)
System
- dashboards for SLOs
- alerts wired to runbooks
- retention policies as code
Log taxonomy
Levels
- audit (security relevant)
- error (actionable failures)
- warn (degradation)
- info (operational events)
- debug (short retention, controlled)
Minimum viable tracing
- Correlation ID across services and logs.
- Trace external dependencies (DB, cache, HTTP calls).
- Track p95/p99 latency and error rate for each service.
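A correlation ID only helps if every service reuses the incoming one and stamps it on each log line. The sketch below assumes an `X-Correlation-ID` header, which is a common convention rather than something this guide prescribes, and emits JSON lines on stdout.

```python
import json
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name


def ensure_correlation_id(headers):
    """Reuse the caller's correlation ID, or mint a new one at the edge."""
    return headers.get(CORRELATION_HEADER) or uuid.uuid4().hex


def log_event(correlation_id, event, **fields):
    """Print and return one structured log line carrying the correlation ID."""
    line = json.dumps({"correlation_id": correlation_id, "event": event, **fields},
                      sort_keys=True)
    print(line)
    return line
```

Passing the same ID on outgoing HTTP, DB, and cache calls lets logs and traces be joined per request.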
| Alert | Condition | Runbook |
|---|---|---|
| SLO breach | error rate or latency over threshold | rollback / mitigate / scale |
| Saturation | CPU/mem/IO high + queue | rightsize / scale / shard |
| Security | auth anomalies | rotate creds / block / investigate |
| Backup failure | job missing or error | repair + re-run + verify restore |
Incident lifecycle
Detect -> Triage -> Mitigate -> Recover -> Postmortem
Postmortem must include
- timeline
- root cause
- contributing factors
- detection gaps
- action items with owners and deadlines
Cost levers
- Sampling for traces and high-volume logs.
- Keep audit/security logs high priority; reduce verbose app logs.
- Short hot retention, long archive retention.
Access model
Principles
- least privilege by role
- separate admin vs deploy vs read-only
- time-bound access for sensitive actions
- credential rotation policy
- audit trail for privileged operations
Secrets lifecycle
| Phase | What | Control |
|---|---|---|
| Create | generate securely | no manual weak secrets |
| Store | secure vault | access logs |
| Inject | runtime fetch | no secrets in images |
| Rotate | scheduled | alert on failures |
| Revoke | incident response | fast containment |
Baseline checklist
- OS hardening and minimal packages.
- Patch cadence + emergency patch process.
- Container scanning + signed images.
- Runtime controls: least privileges and no privileged containers by default.
- Audit logs routed centrally and retained.
Edge controls
Controls
- strict TLS configuration
- request size limits
- rate limit by IP and by token
- allowlist for admin endpoints
- anomaly detection from edge logs
- fast block / unblock workflow
| Area | Control | Proof |
|---|---|---|
| Data class | labels + access policy | inventory report |
| Retention | policy-as-code | config snapshots |
| Encryption | standard posture | audit checks |
| Backups | drills | restore logs |
IR playbook
1) contain: block entry, isolate systems
2) preserve evidence: logs, snapshots
3) rotate credentials: tokens, DB creds, registry secrets
4) eradicate: patch, remove persistence
5) recover: restore services, monitor
6) learn: postmortem + guardrails
FinOps loop
Weekly
- anomaly detection review
- top spenders quick scan
Monthly
- rightsizing and idle cleanup
- storage lifecycle enforcement
- log ingestion reduction
- unit cost KPI review
| Lever | Action | Proof |
|---|---|---|
| Rightsize | adjust CPU/RAM | utilization report |
| Scale | autoscale safely | SLO stability |
| Shutdown | stop non-prod nightly | schedule evidence |
| Batch | run heavy jobs in windows | cost per job |
Top storage wastes
- No lifecycle rules (everything stays hot forever).
- Unlimited versions and no cleanup.
- Unbounded logs in object storage with no retention.
- Unexpected egress due to lack of caching/edge.
Data cost drivers
- always-on replicas
- long retention for backups/logs
- inefficient queries scanning too much data
- overprovisioned instance sizes
Controls
- rightsizing reviews
- query performance budgets
- retention as policy
Signal-first policy
- Keep audit/security logs hot, with long retention.
- Sample traces aggressively but keep “slow/error” traces.
- Archive bulk logs; keep dashboards based on SLO signals.
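"Sample aggressively but keep slow/error traces" can be expressed as a deterministic keep/drop decision. The 1000 ms slow threshold and 5% sample rate below are illustrative assumptions. Hashing the trace ID means every service makes the same decision for the same trace.

```python
import zlib


def keep_trace(trace_id, duration_ms, is_error, slow_ms=1000, sample_percent=5):
    """Always keep error and slow traces; keep a fixed percentage of the rest."""
    if is_error or duration_ms >= slow_ms:
        return True
    # Stable bucket in [0, 100) derived from the trace ID.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 100
    return bucket < sample_percent
```

Because the decision depends only on the trace ID and thresholds, it can run at the collector without coordination between services.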
| KPI | Definition | Use |
|---|---|---|
| Cost / 1k requests | infra spend divided by traffic | scale economics |
| Cost / tenant | monthly spend per customer | pricing sanity |
| Cost / GB stored | storage + lifecycle efficiency | retention tuning |
| Cost / deploy | CI/CD + artifact + test spend | pipeline efficiency |
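The KPIs in the table are all ratios of spend to a unit count, so one helper covers them. The figures in the usage note are made-up inputs for illustration, not benchmarks.

```python
def unit_cost(spend, units, per=1):
    """Spend divided by unit count, scaled to `per` units (e.g. per=1000 for 1k requests)."""
    if units <= 0:
        raise ValueError("units must be > 0")
    return spend * per / units
```

For example, `unit_cost(500.0, 2_000_000, per=1000)` gives the cost per 1k requests for a hypothetical 500 of monthly spend and two million requests, and `unit_cost(1200.0, 40)` gives the cost per tenant for 40 tenants.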
