☁️ Scaleway Cloud – Hyper-Dense Guide (Compute, K8s, Serverless, Storage, Data, Observability)

1.1 Foundations (Projects, API-first mindset, production rules)

Scope & environment model

Recommended environments
- sandbox (experiments)
- dev (integration)
- staging (release candidate)
- prod (strict guardrails)

Core principles
- separate environments by projects and access controls
- define naming standards and ownership labels
- enforce defaults via IaC templates

Rule: treat environments as products: consistent, repeatable, and auditable.
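
A minimal sketch of how a naming standard and ownership labels could be checked before a resource is created. The `<env>-<team>-<purpose>` pattern and the `owner` / `cost-center` label keys are assumptions made for illustration, not a Scaleway convention; adapt them to your own standard.

```python
import re

# Hypothetical convention: <env>-<team>-<purpose>, lowercase, hyphen-separated.
ENVIRONMENTS = ("sandbox", "dev", "staging", "prod")
NAME_PATTERN = re.compile(
    r"^(?:" + "|".join(ENVIRONMENTS) + r")-[a-z0-9]+-[a-z0-9-]+$"
)
REQUIRED_LABELS = ("owner", "cost-center")  # assumed label keys


def validate_resource(name: str, labels: dict) -> list:
    """Return a list of violations; an empty list means the resource passes."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(f"name '{name}' does not match <env>-<team>-<purpose>")
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing label '{key}'")
    return problems
```

Running a check like this in CI, before the IaC plan stage, keeps the standard enforceable rather than aspirational.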

Production rules (non-negotiable)

Everything deployable via IaC (no “clickops” drift).
Central logs + alerts from day one (no blind spots).
Secrets are never stored in app config or repos, and rotation is required.
Least privilege access, time-bound where possible.
Backup/restore drills are scheduled and measured.

Service selection framework

Need	Default choice	Escalate to
Fast web API	managed containers / K8s	VMs for special cases
Batch jobs	serverless containers	dedicated compute for heavy IO/GPU
Relational DB	managed DB	bare metal for extreme constraints
Object data	object storage	archive tiers and lifecycle rules

1.2 Reference Landing Zone (Network, shared services, guardrails)

Topology blueprint

Edge (public)
  - reverse proxy / ingress
  - TLS termination
  - rate limiting + bot protection

Private networks
  - app subnet(s)
  - data subnet(s)
  - admin subnet(s) (bastion-like access)

Shared
  - central logging
  - secrets + rotation
  - CI/CD runners (if self-hosted)
  - artifact registry

Rule: keep public surface minimal; everything else private by default.

Shared services (platform subscription equivalent)

Central observability workspace (metrics/logs) and alert routing.
Secrets store + rotation workflow (and incident “break glass” policy).
Container registry and artifact promotion rules.
Network egress control points and DNS/naming conventions.

Guardrails (policy-as-code mindset)

Enforce naming/labels and ownership on resources.
Block direct public exposure of data services unless explicitly approved.
Mandatory logging configuration for compute and platforms.
Minimum baseline for TLS, credentials, and patching.

Ops evidence: what you must be able to prove

Evidence	How	Why
Who deployed what	CI logs + artifact digests	auditability
Security posture	scan reports + patch reports	risk control
Recoverability	restore drill results	real DR
SLO compliance	dashboards + incidents	customer trust

1.3 IaC & Automation (Terraform-first, drift control, CI gates)

Terraform workflow (gold standard)

Stages
1) fmt + validate
2) plan (saved plan)
3) policy checks (custom)
4) approval gate (prod)
5) apply
6) smoke tests + monitoring hooks

Rule: no apply in prod without a reviewed plan.
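
The policy-check stage could be sketched as a function over the JSON form of a saved plan (the output of `terraform show -json <planfile>`). The `resource_changes` and `change.actions` fields follow Terraform's JSON plan representation; the "flag every delete" rule is only an illustrative policy, and the function name is hypothetical.

```python
def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "delete" appears both in plain deletes and in delete/create replacements.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hand-written sample in the same shape as the JSON plan output.
sample_plan = {
    "resource_changes": [
        {"address": "scaleway_instance_server.web", "change": {"actions": ["create"]}},
        {"address": "scaleway_rdb_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "scaleway_object_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
}
```

A CI job can fail the pipeline when the returned list is non-empty, so a destructive change in prod always reaches the approval gate with an explicit flag.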

Drift control

Scheduled plan to detect drift.
Alert on out-of-band changes.
Either reconcile (apply) or revert (incident).
Track “exceptions” explicitly and time-bound them.
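
A sketch of the drift loop, assuming the scheduled job runs `terraform plan -detailed-exitcode` (exit code 0 means no changes, 1 means error, 2 means changes are present). The exception register, a mapping from resource address to expiry date, is a hypothetical structure used to show time-bound exceptions.

```python
from datetime import date


def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "drift"
    return "error"


def unexcused_drift(drifted: list, exceptions: dict, today: date) -> list:
    """Return drifted addresses with no exception or with an expired one."""
    return [
        addr for addr in drifted
        if addr not in exceptions or exceptions[addr] < today
    ]
```

Only the addresses returned by `unexcused_drift` need an alert; anything covered by a still-valid exception stays quiet until its expiry date passes.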

1.4 Reference Architectures (web, container platform, serverless, data)

3-tier baseline (private-first)

Internet -> Edge (TLS + routing + rate limits)
  -> App (containers / K8s / VMs in private networks)
    -> Data (managed DB + object storage)
Observability + secrets + backups are platform-wide.

K8s platform baseline

Separate system and workload node pools.
GitOps deployment with environment overlays.
Network policy + minimal service exposure.
Supply chain gates: scan + SBOM + signature verification.
Observability: metrics + logs + traces as default.

Event-driven serverless baseline

Triggers -> Serverless Functions / Containers
  -> durable storage (DB/object)
  -> dead-letter strategy + alerts
  -> idempotency keys for every handler

Rule: retries without idempotency create data corruption.

Data platform sketch

Ingest -> Object storage (raw)
Transform -> compute (batch / containers)
Serve -> warehouse / search index / APIs
Govern -> access model + retention + audit trail

1.5 APIs & Tooling (Console/CLI/API, automation conventions)

Automation conventions

Prefer API/IaC over console for repeatability.
Store credentials securely; rotate and audit.
Every script must be idempotent and log its actions.
Keep a “break glass” playbook, but isolate it.

Script contract
- inputs validated
- dry-run supported
- logs to stdout in structured lines
- exit codes reliable
- safe retries
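
The script contract above can be sketched as a small skeleton. The script purpose (cleaning up a named resource), the flag names, and the exit-code values are assumptions chosen for illustration.

```python
import argparse
import json


def log(event: str, **fields) -> None:
    """Emit one structured (JSON) log line on stdout."""
    print(json.dumps({"event": event, **fields}, sort_keys=True))


def run(argv: list) -> int:
    """Return a process exit code: 0 = success, 2 = invalid input."""
    parser = argparse.ArgumentParser(description="Hypothetical cleanup script")
    parser.add_argument("--target", required=True, help="resource name to clean up")
    parser.add_argument("--dry-run", action="store_true", help="log actions only")
    args = parser.parse_args(argv)

    # Inputs validated before any side effect.
    if not args.target.strip():
        log("invalid_input", reason="empty target")
        return 2

    if args.dry_run:
        log("would_delete", target=args.target)
        return 0

    # Real work goes here. To be safe under retries it must be idempotent,
    # e.g. deleting an already-deleted resource counts as success.
    log("deleted", target=args.target)
    return 0

# Entry point when used as a script: sys.exit(run(sys.argv[1:]))
```

Returning the exit code from `run` instead of calling `sys.exit` inside it keeps the logic testable without spawning a process.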

Cheat-sheet (Checklists, templates, incident shortcuts)

Platform checklist

Landing zone
- private networks segmentation
- edge entry points minimal
- centralized observability + alert routing
- secrets lifecycle + rotation
- backups + restore drills
- IaC modules + CI gates
- supply chain controls for containers

Serverless checklist

Serverless reliability
- idempotency keys
- bounded retries
- dead-letter strategy + alerts
- timeouts sized per workload
- concurrency limits
- structured logs + tracing

Cost checklist

FinOps loop (monthly)
- top 10 spenders review
- rightsizing candidates
- storage lifecycle enforcement
- log ingestion reduction
- idle resources cleanup
- unit cost KPIs (per request / per tenant)

Incident shortcut

Triage steps
1) user impact scope (SLO breach?)
2) recent deployments
3) saturation signals (CPU/mem/IO/conn)
4) network/DNS failures
5) data errors (locks/slow queries)
6) rollback or mitigation
7) postmortem actions

2.1 Instances (VMs) – Sizing, disks, images, patching, lifecycle

Sizing method (no guessing)

Signal	What to watch	Action
CPU	p95 utilization + steal	rightsize / scale out
Memory	pressure + OOM risk	increase RAM / reduce footprint
Disk	IOPS/throughput + queue	move to faster volume / shard
Network	pps + retransmits	tune edge / improve routing
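
As an illustration of the CPU row, the sketch below computes a nearest-rank p95 from utilization samples and maps it to an action. The 80% and 20% thresholds are assumptions, not recommended values; derive yours from SLOs and headroom targets.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cpu_sizing_action(cpu_samples: list, high: float = 80.0, low: float = 20.0) -> str:
    """Suggest an action from p95 CPU utilization (percent)."""
    p95 = percentile(cpu_samples, 95)
    if p95 > high:
        return "scale out or upsize"
    if p95 < low:
        return "downsize candidate"
    return "keep"
```

Using p95 rather than the mean avoids sizing for an average that hides sustained peaks.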

Disk strategy (DB-grade thinking)

Separate OS disk from data disk when needed.
For databases: isolate WAL/redo logs if possible; measure IOPS and fsync latency.
Snapshots are not backups unless restore is tested and retention is enforced.
Use filesystem options aligned with workload (barriers, journaling choices).

Rule: treat storage latency as a primary production KPI.

Ops baseline

SSH via controlled entry (no open world access).
Patching cadence + emergency patch process.
Central logs and metrics with alerts on saturation.
Immutable infrastructure mindset where possible (rebuild over patch drift).

2.2 Elastic Metal & Dedicated – When bare metal is justified

Decision criteria

Constraint	Why metal	Mitigation if not
Extreme IO	lowest latency, dedicated throughput	sharding + caching
Licensing	per-core constraints	optimize core counts
Isolation	strict tenancy needs	strong security baseline
GPU intensive	dedicated accelerators	batch windows + scaling

Rule: bare metal increases operational responsibility, so plan automation and monitoring first.

2.3 GPU & AI – Container strategy, batch inference, cost discipline

GPU platform patterns

Prefer containers for reproducibility (drivers/toolkit pinned).
Separate training vs inference: different scheduling and scaling models.
Use batch windows and auto-shutdown for idle GPU time.

Cost controls (mandatory)

Define maximum concurrency and max runtime per job.
Track cost per 1k inferences / per training epoch.
Cache model artifacts in object storage with versioning.

2.4 Images & Bootstrapping – Golden images + cloud-init baseline

Golden image contract

Golden image must include
- base hardening (sshd settings, firewall defaults)
- monitoring agent install step
- log forwarding configuration
- time sync and DNS defaults
- minimal packages only

cloud-init responsibilities
- inject host keys safely
- configure app runtime
- register into monitoring
- pull secrets from secure store

Rule: rebuild is safer than mutate. Keep servers disposable.

2.5 Backup & DR – RPO/RTO tiers, snapshots, restore drills

Tier	Target	Design	Verification
Tier 0	minutes	multi-AZ + replication	game day drills
Tier 1	hours	snapshots + managed backups	monthly restores
Tier 2	day	object backups + manual	quarterly audits

Rule: backups are only real if you restore and measure recovery time.
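
A restore drill becomes measurable once its duration is compared against the tier target. In the sketch below the concrete targets (15 minutes, 4 hours, 1 day) are assumed values standing in for the "minutes / hours / day" tiers of the table, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed recovery-time targets per tier; replace with your agreed values.
RTO_TARGETS = {
    "tier0": timedelta(minutes=15),
    "tier1": timedelta(hours=4),
    "tier2": timedelta(days=1),
}


def evaluate_drill(tier: str, restore_started: datetime, service_verified: datetime) -> dict:
    """Measure a restore drill and compare it with the tier's target."""
    elapsed = service_verified - restore_started
    if elapsed < timedelta(0):
        raise ValueError("service_verified is earlier than restore_started")
    return {
        "tier": tier,
        "recovery_seconds": int(elapsed.total_seconds()),
        "rto_met": elapsed <= RTO_TARGETS[tier],
    }
```

Storing each result gives the "restore drill results" evidence over time rather than a one-off pass or fail.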

2.6 Operations Playbook – Access, patching, monitoring, incidents

Operational loop

Daily
- check SLO dashboards
- review alerts + top errors
- confirm backup jobs

Weekly
- patch window for non-prod
- capacity review (CPU/mem/IO)
- vulnerability scan review

Monthly
- cost review + rightsizing
- restore drill
- postmortem action items verification

3.1 Kubernetes (Kapsule / Kosmos) – Cluster design, security, upgrades

Cluster foundation

Separate system and workload pools.
Define ingress strategy and TLS as a platform standard.
Use autoscaling carefully: HPA + cluster autoscaler with safe limits.
Pin base images and enforce immutable deployments.

Security essentials

Network policies: default deny + allow by service needs.
RBAC: least privilege, separate admin from deploy roles.
Pod security and runtime constraints (no privileged by default).
Supply chain: scan + SBOM + signature validation in CI/CD.

Operations

Upgrades: staged, maintenance windows, canary cluster if needed.
Observability: cluster + node + workload dashboards.
Backups: stateful systems are backed up outside the cluster; configs are GitOps.

Resilience

Resilience checklist
- readiness/liveness probes
- pod disruption budgets
- multi-node spread (anti-affinity)
- rate limits at ingress
- graceful shutdown
- chaos-style drills (optional but valuable)

3.2 Container Registry – Promotion by digest, SBOM, signing, scanning gates

Supply chain gates

Gate	What it checks	Block on
Vuln scan	CVEs in OS/libs	high/critical
SBOM	dependency inventory	missing SBOM
Signature	image provenance	unsigned images
Policy	base image allowlist	unapproved base

Rule: deploy by digest, not mutable tags.
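
A deploy gate for this rule can be as small as a check that every image reference ends in a `sha256` digest. The registry host and image names in the usage note are placeholders.

```python
import re

# An image reference pinned by digest ends with "@sha256:" plus 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """True if the reference identifies an immutable image digest."""
    return bool(DIGEST_RE.search(image_ref))


def unpinned_images(image_refs: list) -> list:
    """Return the references that still rely on a mutable tag."""
    return [ref for ref in image_refs if not is_pinned_by_digest(ref)]
```

For example, `registry.example.com/team/api:latest` would be rejected, while the same repository followed by `@sha256:<64 hex chars>` would pass.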

3.3 Serverless Containers – Stateless workloads, scaling, timeouts, reliability

Best for

Stateless web APIs and job-like workloads.
Scale-to-zero services with bursty traffic.
Event-driven handlers packaged as containers.

Rule: keep state in managed services (DB/object), never on ephemeral filesystem.

Reliability checklist

Reliability
- strict request timeout budgeting
- bounded concurrency
- retry policy aligned with idempotency
- dead-letter handling for async patterns
- structured logs + correlation IDs

Cost discipline

Track cost per request and cost per job.
Cap max scale for “runaway traffic” scenarios.
Use caching and edge rate limits to avoid amplification.

3.4 Serverless Functions – Triggers, retries, idempotency, dead-letters

Golden rules

Idempotency is mandatory for event handlers.
Use deterministic retry strategy (max attempts, backoff, time budget).
Write logs as structured events with correlation IDs.
Separate “poison messages” to a dead-letter stream and alert on it.
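
A deterministic retry strategy can be sketched as a wrapper with a maximum attempt count, exponential backoff, and a total time budget. The default values and the `NonRetryableError` class are assumptions for illustration; `sleep` and `clock` are injectable only to make the wrapper testable.

```python
import time


class NonRetryableError(Exception):
    """Raised by a handler when retrying cannot help (e.g. invalid payload)."""


def call_with_retries(fn, max_attempts=4, base_delay=0.5, time_budget=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call fn() with bounded, deterministic exponential backoff."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except NonRetryableError:
            raise  # poison message: let the caller route it to a dead-letter stream
        except Exception:
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            out_of_attempts = attempt == max_attempts
            out_of_time = (clock() - start) + delay > time_budget
            if out_of_attempts or out_of_time:
                raise
            sleep(delay)
```

When the wrapper finally raises, the caller is the right place to send the event to the dead-letter stream and fire the alert.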

Handler skeleton (concept)
- validate payload
- compute idempotency key
- check processed marker
- process business logic
- persist result atomically
- return success
- on error: classify retryable vs non-retryable
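
The handler skeleton above might look like this in practice. Everything here is a sketch: the `order_id` field, the in-memory store, and hashing the payload to derive the key are assumptions. A real store must be durable and must make the save step atomic, and many event sources already provide an event ID that is a better key than a payload hash.

```python
import hashlib
import json


class InMemoryStore:
    """Stand-in for a durable store (DB/object); not safe across processes."""

    def __init__(self):
        self.results = {}

    def get(self, key):
        return self.results.get(key)

    def save(self, key, result):
        # setdefault keeps the first result if two writers race on the same key.
        self.results.setdefault(key, result)


def idempotency_key(event: dict) -> str:
    """Derive a stable key from the canonical JSON form of the event."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def handle(event: dict, store, process):
    if "order_id" not in event:  # validate payload (non-retryable failure)
        raise ValueError("missing order_id")
    key = idempotency_key(event)  # compute idempotency key
    previous = store.get(key)  # check processed marker
    if previous is not None:
        return previous  # duplicate delivery: return the stored result
    result = process(event)  # business logic
    store.save(key, result)  # persist result
    return result
```

With this shape, a retried or duplicated delivery returns the stored result instead of running the business logic a second time.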

3.5 Ingress & Exposure – TLS, routing, rate limits, private services

Concern	Edge control	Notes
TLS	terminate + rotate certs	enforce modern ciphers
Routing	L7 rules	path-based and host-based
Abuse	rate limits + IP rules	prevent traffic amplification
Private services	internal routing	avoid public endpoints

Rule: if it does not need to be public, do not make it public.

3.6 GitOps & Delivery – Environments, progressive rollouts, rollback playbooks

Release patterns

Pattern	Best for	Requirement
Blue/green	safe cutover	traffic switch + fast rollback
Canary	risk reduction	metric-based promotion
Rings	enterprise	progressive exposure

Rule: rollout strategy requires SLO dashboards and rollback automation.

4.1 Network Core – Segmentation, routing, service boundaries

Segmentation blueprint

Network zones
- edge (public entry)
- app (private workloads)
- data (private databases)
- admin (restricted access)
- shared (observability, registry, secrets)

Rule: segmentation is an incident containment tool, not a checkbox.

4.2 Public Exposure – IP strategy, NAT/egress control, resilience

Public entry rules

Terminate TLS at a controlled edge layer.
Rate limit by IP and by identity where possible.
Implement request timeouts and size limits.
Log edge events and alert on anomalies.

Rule: edge defenses must be measurable (traffic, blocks, latency impact).

4.3 Network Security – Security groups, egress allowlists, isolation

Control	Goal	Common failure
Ingress rules	allow only required ports	0.0.0.0/0 to admin ports
Egress rules	prevent data exfil	allow all outbound by default
Service isolation	contain compromise	flat network with shared creds

Rule: outbound traffic control is often your strongest last-line defense.
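
The first common failure in the table (admin ports open to the world) is easy to detect automatically. The rule format (dicts with `cidr` and `port`) and the admin port set are assumptions; map them from whatever your security group export looks like.

```python
import ipaddress

ADMIN_PORTS = {22, 3389}  # assumed set: SSH and RDP


def risky_ingress(rules: list) -> list:
    """Return rules that open an admin port to every address (IPv4 or IPv6)."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"], strict=False)
        # A prefix length of 0 means 0.0.0.0/0 or ::/0.
        if network.prefixlen == 0 and rule["port"] in ADMIN_PORTS:
            flagged.append(rule)
    return flagged
```

The same loop structure works for egress: flag any outbound rule whose destination has prefix length 0 and is not on the allowlist.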

4.4 DNS Patterns – Split-horizon, internal naming, cluster DNS

DNS rules

Document resolution chain (who resolves what, where, and why).
Use internal names for private services; keep external DNS minimal.
For Kubernetes: standardize service discovery and ingress hostnames.

Rule: most “mysterious outages” are DNS + routing + timeouts combined.

4.5 Hybrid Connectivity – IP planning, routing rules, failover ownership

Hybrid contract

Hybrid must define
- prefix plan (no overlaps)
- routing ownership (who changes what)
- failover behavior (tested)
- change windows and rollback
- monitoring for tunnel health
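
The "no overlaps" requirement of the prefix plan can be verified mechanically with the standard `ipaddress` module. The site names and CIDR ranges in the usage note are made up.

```python
import ipaddress
from itertools import combinations


def overlapping_prefixes(plan: dict) -> list:
    """Return pairs of site names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        # Only compare like with like: IPv4 against IPv4, IPv6 against IPv6.
        if net_a.version == net_b.version and net_a.overlaps(net_b)
    ]
```

For example, a plan with `onprem` at 10.0.0.0/16 and `cloud-app` at 10.0.128.0/20 is reported as a conflict, because the second range sits inside the first.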

4.6 Network Troubleshooting – Structured triage for latency, loss, DNS

Triage checklist

Symptom	Check	Action
Timeouts	edge logs + upstream latency	tighten timeouts, fix bottleneck
DNS failures	resolver health + TTL	stabilize DNS chain
Packet loss	retransmits, MTU	fix MTU or routing
Slow K8s	network policy + CNI	trace flows, simplify rules

5.1 Object Storage (S3-compatible) – Lifecycle, versioning, access control

Bucket design

Separate buckets by data classification and lifecycle needs.
Define naming conventions and ownership labels.
Prefer immutable object versions for critical artifacts.

Security rules

Least privilege: scoped credentials and access review.
Encrypt data and restrict cross-project access.
Audit access and alert on anomalies.

Lifecycle policy (cost control)

Lifecycle example
- day 0-30: hot
- day 31-180: cool
- day 181+: archive
- delete markers and old versions per policy

Rule: lifecycle policies are your best storage cost lever.
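
The lifecycle example maps directly to a function of object age. The tier names are the conceptual ones from the example above, not provider storage-class identifiers.

```python
def storage_tier(age_days: int) -> str:
    """Return the tier an object should be in, per the example policy."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 30:
        return "hot"
    if age_days <= 180:
        return "cool"
    return "archive"
```

A periodic job can compare this expected tier against each object's actual tier to confirm that the policy is enforced and not just documented.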

5.2 Block Storage – SSD volumes, DB workloads, snapshots, resize

DB-grade checklist

Measure fsync latency and queue depth.
Separate write-heavy volumes from OS when needed.
Snapshots are not a substitute for logical backups.
Test restore path and automate validation.

Rule: if IO latency spikes, your entire platform degrades.

5.3 Storage Backup Strategy – 3-2-1, immutability, restore drills

3-2-1
- 3 copies
- 2 different media (block + object)
- 1 offsite (separate project/zone)

Operational must-haves
- documented restore steps
- monthly restore drill
- retention and deletion protection

5.4 Storage Security – Least privilege, audits, encryption posture

Control	How	Outcome
Least privilege	scoped credentials	reduced blast radius
Access review	monthly audit	remove stale access
Encryption	standardize policy	consistent posture
Logging	central logs + alerts	detect anomalies

5.5 Storage Performance – IOPS, throughput, multipart, caching

Performance guidance

Object storage: parallel uploads + multipart for big objects.
Block storage: monitor queue depth and fsync latency for databases.
Caching: avoid re-downloading artifacts; version them and cache safely.
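
A sketch of the "parallel uploads + multipart" idea: split an object into byte ranges and hand each range to a worker. `upload_part` is a hypothetical callback standing in for whatever S3-compatible client call you use; part size limits and maximum part counts vary by provider and are not encoded here.

```python
from concurrent.futures import ThreadPoolExecutor


def part_ranges(size: int, part_size: int) -> list:
    """Split `size` bytes into inclusive (start, end) byte ranges."""
    if size < 0 or part_size <= 0:
        raise ValueError("size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, size) - 1)
        for start in range(0, size, part_size)
    ]


def upload_in_parts(size: int, part_size: int, upload_part, workers: int = 4) -> list:
    """Call upload_part(part_number, start, end) in parallel; results in part order."""
    ranges = part_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(upload_part, number, start, end)
            for number, (start, end) in enumerate(ranges, start=1)
        ]
        return [future.result() for future in futures]
```

Collecting results in part order matters because multipart completion calls typically need the parts listed by ascending part number.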

5.6 Storage Cost Control – Lifecycle, cold tiers, egress awareness

Cost levers

Lifecycle transitions and deletion policies.
Archive old versions; keep only what you would actually restore.
Track egress drivers (CDN/edge caches reduce outbound).

Rule: storage costs explode due to “forgotten” data and missing lifecycle rules.

6.1 Managed Relational DB – HA thinking, backups, upgrades, tuning

HA/DR mindset

Know your failure domains and design accordingly.
Prefer managed HA where available; document failover behavior.
Measure RPO/RTO and validate with drills.

Backups and safe migrations

Safe migration flow
1) backup + verify restore path
2) schema change in small steps
3) dual-write or compatibility window (if needed)
4) monitor errors + latency
5) cleanup after stabilization

Performance loop

DB tuning loop
1) capture slow queries
2) explain/analyze
3) index or rewrite
4) validate with p95/p99
5) regression guardrails (tests + dashboards)

6.2 Serverless SQL – Connection patterns, pooling, latency trade-offs

Serverless database pitfalls

Cold start latency can hit the first queries; budget for it.
Connection storms are common: use pooling or connection limits.
Long transactions reduce scalability; keep transactions short.

Rule: serverless DB is an application architecture decision, not only a DB decision.

6.3 Managed Redis – Caching strategy, persistence, eviction, HA

Topic	Decision	Rule
TTL	per key class	no infinite TTL without justification
Eviction	policy choice	align with data criticality
Persistence	if needed	cache != source of truth
HA	replication	test failover behavior

6.4 Managed NoSQL – Indexing, schema evolution, TTL, backups

Modeling checklist

Design queries first, then indexes.
Use explicit version fields for schema evolution.
TTL for ephemeral data and cost control.
Backup strategy independent from the DB engine.

6.5 Analytics (Warehouse) – Partitioning, ingestion, cost discipline

Warehouse rules
- ingest in append-only patterns
- partition by time and key dimensions
- keep hot and cold datasets separate
- track cost per query / per dashboard
- implement retention and archiving

Rule: analytics cost is dominated by data scanned and query concurrency.

6.6 Search / OpenSearch – Index design, shards, ingestion, tuning

Area	What matters	Action
Mapping	field types, analyzers	freeze mapping early
Shards	parallelism vs overhead	size shards sensibly
Ingestion	bulk + backpressure	avoid overload loops
Retention	index lifecycle	rollover + delete

Rule: search performance is mapping + shard sizing + ingestion control.

7.1 Cockpit – Unified observability (metrics, logs, dashboards, alerting)

Observability model

Signals
- metrics (fast, low cost)
- logs (deep, higher cost)
- traces (request path)

System
- dashboards for SLOs
- alerts wired to runbooks
- retention policies as code

Rule: if you cannot see it, you cannot operate it.

7.2 Central Logging – Taxonomy, retention tiers, sampling, PII discipline

Log taxonomy

Levels
- audit (security relevant)
- error (actionable failures)
- warn (degradation)
- info (operational events)
- debug (short retention, controlled)

Rule: log retention is both a cost and a compliance requirement; treat it as policy.

7.3 APM & Tracing – Correlation IDs, RED/USE, latency SLOs

Minimum viable tracing

Correlation ID across services and logs.
Trace external dependencies (DB, cache, HTTP calls).
Track p95/p99 latency and error rate for each service.

Rule: trace sampling must preserve high-error and high-latency requests.
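
One way to satisfy this rule is a tail-based decision: the sampler sees the finished trace, always keeps errors and slow requests, and keeps a deterministic fraction of the rest. The 1000 ms threshold and 5% base rate are assumed values, and the function is a sketch rather than any tracing library's API.

```python
import hashlib


def keep_trace(trace_id: str, duration_ms: float, has_error: bool,
               slow_ms: float = 1000.0, base_rate: float = 0.05) -> bool:
    """Decide whether a completed trace is retained."""
    if has_error or duration_ms >= slow_ms:
        return True  # never drop the traces needed for debugging
    # Hash the trace ID into [0, 1) so every service makes the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < base_rate
```

Hashing the trace ID instead of calling a random generator keeps the decision consistent across services, so a sampled trace is never missing half of its spans.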

7.4 Alerting System – Actionable alerts, ownership, runbooks, escalation

Alert	Condition	Runbook
SLO breach	error rate or latency over threshold	rollback / mitigate / scale
Saturation	CPU/mem/IO high + queue	rightsize / scale / shard
Security	auth anomalies	rotate creds / block / investigate
Backup failure	job missing or error	repair + re-run + verify restore

Rule: if an alert cannot be acted upon, it is noise.

7.5 SRE Workflow – Incidents, postmortems, error budgets

Incident lifecycle
Detect -> Triage -> Mitigate -> Recover -> Postmortem

Postmortem must include
- timeline
- root cause
- contributing factors
- detection gaps
- action items with owners and deadlines

Rule: postmortems are improvement engines, not blame tools.

7.6 Observability Cost Control – Sampling, drop rules, archive strategy

Cost levers

Sampling for traces and high-volume logs.
Keep audit/security logs high priority; reduce verbose app logs.
Short hot retention, long archive retention.

Rule: keep the signal hot; archive the history.

8.1 Identity & Access – Tokens, RBAC mapping, least privilege

Access model

Principles
- least privilege by role
- separate admin vs deploy vs read-only
- time-bound access for sensitive actions
- credential rotation policy
- audit trail for privileged operations

Rule: credentials without rotation are liabilities.

8.2 Secrets & Key Management – Rotation, injection, auditability

Secrets lifecycle

Phase	What	Control
Create	generate securely	no manual weak secrets
Store	secure vault	access logs
Inject	runtime fetch	no secrets in images
Rotate	scheduled	alert on failures
Revoke	incident response	fast containment

Rule: design secrets rotation before production launch.

8.3 Security Baseline – Hardening, patching, scanning, secure defaults

Baseline checklist

OS hardening and minimal packages.
Patch cadence + emergency patch process.
Container scanning + signed images.
Runtime controls: least privileges and no privileged containers by default.
Audit logs routed centrally and retained.

8.4 Edge Security – TLS, rate limits, bot mitigation, incident playbooks

Edge controls

Controls
- strict TLS configuration
- request size limits
- rate limit by IP and by token
- allowlist for admin endpoints
- anomaly detection from edge logs
- fast block / unblock workflow

Rule: your edge is your blast radius boundary.

8.5 Compliance & Data Protection – Classification, retention, encryption, evidence

Area	Control	Proof
Data class	labels + access policy	inventory report
Retention	policy-as-code	config snapshots
Encryption	standard posture	audit checks
Backups	drills	restore logs

8.6 Security Incident Response – Containment, rotation, forensics, prevention

IR playbook
1) contain: block entry, isolate systems
2) preserve evidence: logs, snapshots
3) rotate credentials: tokens, DB creds, registry secrets
4) eradicate: patch, remove persistence
5) recover: restore services, monitor
6) learn: postmortem + guardrails

Rule: rotate creds early; attackers love long-lived secrets.

9.1 FinOps Core – Budgets, showback, anomalies, monthly routines

FinOps loop

Weekly
- anomaly detection review
- top spenders quick scan

Monthly
- rightsizing and idle cleanup
- storage lifecycle enforcement
- log ingestion reduction
- unit cost KPI review

Rule: cost is an engineering metric. Make it visible to teams.

9.2 Compute Cost – Rightsize, autoscale, shutdown schedules, batch windows

Lever	Action	Proof
Rightsize	adjust CPU/RAM	utilization report
Scale	autoscale safely	SLO stability
Shutdown	stop non-prod nightly	schedule evidence
Batch	run heavy jobs in windows	cost per job

9.3 Storage Cost – Lifecycle rules, retention, egress reduction

Top storage wastes

No lifecycle rules (everything stays hot forever).
Unlimited versions and no cleanup.
Unbounded logs in object storage with no retention.
Unexpected egress due to lack of caching/edge.

Rule: lifecycle without enforcement is only documentation.

9.4 Data Services Cost – HA tiers, backups, scaling triggers, query discipline

Data cost drivers
- always-on replicas
- long retention for backups/logs
- inefficient queries scanning too much data
- overprovisioned instance sizes

Controls
- rightsizing reviews
- query performance budgets
- retention as policy

9.5 Observability Spend – Ingestion control, sampling, archive tiers

Signal-first policy

Keep audit/security logs hot, with long retention.
Sample traces aggressively but keep “slow/error” traces.
Archive bulk logs; keep dashboards based on SLO signals.

9.6 FinOps KPIs – Unit economics for cloud

KPI	Definition	Use
Cost / 1k requests	infra spend divided by traffic	scale economics
Cost / tenant	monthly spend per customer	pricing sanity
Cost / GB stored	storage + lifecycle efficiency	retention tuning
Cost / deploy	CI/CD + artifact + test spend	pipeline efficiency

Rule: unit costs drive architecture decisions more than raw “monthly spend”.
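
The first two KPIs in the table reduce to simple arithmetic. The sketch below assumes spend and traffic are measured over the same period; function names and the example figures in the usage note are illustrative.

```python
def cost_per_1k_requests(infra_spend: float, requests: int) -> float:
    """Infra spend divided by traffic, expressed per 1,000 requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return infra_spend * 1000 / requests


def cost_per_tenant(monthly_spend: float, tenants: int) -> float:
    """Average monthly spend per customer."""
    if tenants <= 0:
        raise ValueError("tenants must be positive")
    return monthly_spend / tenants
```

For instance, 1,200 of spend over 4,000,000 requests gives 0.30 per 1k requests, a number that can be tracked release over release even when total monthly spend grows with traffic.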
