🛡️ Oracle Data Guard – Guide complet (DR, Broker, Runbooks)
Architecture, redo transport, protection modes, broker, switchover/failover, FSFO, monitoring et troubleshooting.
Vue d ensemble
Pourquoi Data Guard, principes, vocabulaire, composants.
ConceptsDRRPO/RTOArchitecture
Primary/Standby, redo, services, SYNC/ASYNC, LGWR/ARCH.
RedoTransportApplyProtection modes
Maximum Protection / Availability / Performance, impact RPO.
SYNCRPORiskPrerequis
Force logging, standby redo logs, archivelog, network, listener.
SetupSRLARCHBuild Standby
RMAN duplicate, active duplicate, restore from backup, standby controlfile.
RMANDuplicateAutomationRedo transport
LOG_ARCHIVE_DEST_n, VALID_FOR, DB_UNIQUE_NAME, FAL, NET_TIMEOUT.
LOG_ARCHIVE_DESTFALTuningRedo apply
MRP, real-time apply, delay, apply lag, SRL usage.
MRPReal-timeLagData Guard Broker
DGMGRL, config, observer, validation, role transitions.
BrokerDGMGRLOperationsSwitchover / Failover
Runbooks, checks, sessions, services, DNS/VIP, app cutover.
RunbookRTOValidationFSFO
Fast-Start Failover, observer, thresholds, safety, split brain guard.
FSFOObserverSafetyMonitoring
V$DATAGUARD_STATS, apply/transport lag, broker status, alerts.
ViewsMetricsAlertingTuning & perf
SYNC vs ASYNC, NET_TIMEOUT, compression, parallel apply, I/O, SRL sizing.
PerfNetworkApplySecurite
Encryption in transit, wallets, password files, least privilege, ports.
TLSWalletHardeningTests DR
Game days, chaos drills, lag injection, role transition rehearsal.
DR DrillGame DayProofTroubleshooting
Gap, ORA- errors, stuck apply, network drops, broker warnings.
IncidentsGapFixCheat-sheet
Commands SQL/DGMGRL, quick checks, decision tree.
CommandsChecksBroker Wizard
Generate a safe baseline config (DGMGRL + init params).
WizardBaselineRunbook complet
Checklist pre/post, comms, rollback, cutover app, KPIs.
RunbookOpsAuditScenario
Primary and physical standby are healthy. Redo transport unexpectedly drops for 12 minutes due to a network change. During the outage, primary continues generating redo (ASYNC transport). When network is restored, standby reports a gap. Objective: restore transport/apply fast, quantify impact, and produce evidence for postmortem.
Environment assumptions
- Physical standby with SRLs and real-time apply enabled
- Broker present (optional)
- Transport mode ASYNC (primary continues during outage)
- Monitoring on v$dataguard_stats + v$archive_dest_status
Incident objectives
- Restore redo transport reliability
- Resolve gap and restart apply
- Measure peak lag and time-to-catch-up
- Produce clear timeline and corrective actions
Initial triage commands
-- Transport error signal
SELECT dest_id, status, error, destination, transmit_mode
FROM v$archive_dest_status
WHERE status <> 'INACTIVE'
ORDER BY dest_id;
-- Lag snapshot
SELECT name, value, unit, time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag','apply finish time');
-- Apply processes (standby)
SELECT process, status, client_process, thread#, sequence#
FROM v$managed_standby
ORDER BY process;Timeline (minute-by-minute)
| Time | Observation | Evidence (what to capture) | Decision / Action |
|---|---|---|---|
| T+00 | Alert: transport lag increases, archive dest shows error | v$archive_dest_status, alert log snippet, network ping/RTT | Declare incident, assign roles, freeze changes |
| T+02 | Primary continues OK (ASYNC), standby apply slows/stops | v$dataguard_stats, v$managed_standby | Confirm business impact: none yet, DR posture degraded |
| T+04 | Transport remains down, redo backlog grows | Archived log rate on primary, lag trend | Escalate to network team, request revert/rollback |
| T+08 | Network team identifies firewall/route change | Change ticket ID, firewall counters (drops), traceroute | Request immediate fix and confirm expected ETA |
| T+12 | Transport restored, but standby shows gap | v$archived_log missing sequences, dest status now VALID | Start gap resolution (FAL or manual) |
| T+14 | Standby fetches missing logs (FAL) / or manual copy begins | RFS activity, new archivelogs appear, applied advances | Keep apply running, monitor catch-up rate |
| T+18 | Apply catches up, lag returns within SLO | transport/apply lag near baseline | Close incident, open postmortem, document actions |
Evidence bundle checklist
- Transport status snapshots (before, during, after)
- Lag snapshots (peak + recovery curve)
- Missing sequence range (gap start/end)
- Change reference (network) and remediation steps
Gap resolution playbook
After transport returns, standby may still be missing archived logs. If FAL is configured, it can auto-fetch. Otherwise, copy missing logs manually and register them. Then restart apply and verify catch-up.
Step 1: identify missing sequences
-- Standby: last received and last applied per thread
SELECT thread#,
MAX(sequence#) AS last_received,
MAX(CASE WHEN applied='YES' THEN sequence# END) AS last_applied
FROM v$archived_log
GROUP BY thread#
ORDER BY thread#;
-- Standby: list gaps (manual inspection approach)
SELECT thread#, sequence#, applied
FROM v$archived_log
WHERE first_time > SYSDATE - 1
ORDER BY thread#, sequence#;Step 2: verify transport state
SELECT dest_id, status, error, destination
FROM v$archive_dest_status
WHERE status <> 'INACTIVE'
ORDER BY dest_id;
-- Broker (optional)
-- DGMGRL> SHOW CONFIGURATION;
-- DGMGRL> SHOW DATABASE STBY;Step 3a: FAL auto-fetch (preferred if configured)
-- Standby: confirm FAL settings
SHOW PARAMETER fal_server;
SHOW PARAMETER fal_client;
-- If needed (example)
ALTER SYSTEM SET fal_server='PRIM' SCOPE=BOTH;
ALTER SYSTEM SET fal_client='STBY' SCOPE=BOTH;Step 3b: manual copy + register (fallback)
-- Standby: register manually copied archivelogs
ALTER DATABASE REGISTER LOGFILE '/path/to/arch_1_12345_*.arc';
ALTER DATABASE REGISTER LOGFILE '/path/to/arch_1_12346_*.arc';Step 4: restart apply and validate
-- Restart apply (real-time)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
-- Validate lag
SELECT name, value, unit
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
-- Apply processes
SELECT process, status, thread#, sequence#
FROM v$managed_standby
ORDER BY process;KPIs to report (what leadership cares about)
| KPI | Definition | How to compute | Target |
|---|---|---|---|
| Transport downtime | Duration transport was failing | T_restore - T_fail | As low as possible |
| Peak transport lag | Max shipping delay observed | Max(v$dataguard_stats.transport lag) | Within SLO |
| Peak apply lag | Max apply delay observed | Max(v$dataguard_stats.apply lag) | Within SLO |
| Time to catch up | Time from restore to lag back to baseline | T_baseline - T_restore | Minutes, not hours |
| Gap size | Missing sequences range | (last_received - last_applied) | Small |
Postmortem structure (template)
Root cause
- Network change impacted transport path (firewall/route policy)
- Monitoring detected lag growth within X minutes
- Transport restored after rollback/fix at T+12
Corrective actions
- Add change guardrail: replication path validation pre-change
- Add alert: transport errors + lag slope threshold
- Run quarterly drill: gap resolution under load
- Document escalation runbook and ownership
One-shot summary commands (copy/paste)
-- Transport status
SELECT dest_id, status, error, destination, transmit_mode
FROM v$archive_dest_status
ORDER BY dest_id;
-- Lag snapshot
SELECT name, value, unit, time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag','apply finish time');
-- Last applied per thread
SELECT thread#,
MAX(CASE WHEN applied='YES' THEN sequence# END) AS last_applied
FROM v$archived_log
GROUP BY thread#
ORDER BY thread#;Redo transport network on Exadata
In Exadata environments, redo transport performance is often limited not by storage, but by network latency and packet stability between primary and standby sites. Especially in SYNC mode, commit latency is directly impacted by network RTT.
Critical network factors
- Round-trip latency (RTT) — affects SYNC commits
- Packet loss — kills effective throughput
- Bandwidth headroom — must exceed peak redo rate
- Jitter stability — redo bursts amplify jitter issues
Design patterns
- Use dedicated replication network path if possible
- Measure p95/p99 latency, not only average
- Align SYNC mode with realistic latency budget
- Validate firewall throughput under sustained load
Redo generation vs network capacity
-- Approx redo generation (MB per minute)
SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24:MI') AS minute,
ROUND(SUM(blocks*block_size)/1024/1024) AS mb
FROM v$archived_log
WHERE first_time > SYSDATE - (1/24)
GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24:MI')
ORDER BY minute;Standby Redo Logs on ASM (Exadata)
On Exadata, redo and standby redo logs typically reside in ASM disk groups. Placement and redundancy choice impact redo write latency and apply stability.
| Item | Recommendation | Why |
|---|---|---|
| Disk group | +DATA or dedicated redo diskgroup | Low latency write path |
| Redundancy | Normal or High (per policy) | Balance between protection and latency |
| Size parity | Match primary redo size | Consistent log switch behavior |
Example: create SRL on ASM
-- Example using OMF in ASM
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
-- Verify placement
SELECT group#, thread#, bytes/1024/1024 AS size_mb, status
FROM gv$standby_log
ORDER BY thread#, group#;IO baseline for standby on Exadata
Active Data Guard + heavy reporting can stress IO. Establish a measurable IO baseline to ensure redo apply never starves.
Primary baseline metrics
- Redo generation rate (MB/s)
- Log file sync latency
- Log switch frequency
Standby baseline metrics
- Apply rate (MB/s)
- SRL write latency
- Apply lag stability under load
Apply visibility
-- Apply rate visibility
SELECT process, status, thread#, sequence#
FROM gv$managed_standby
ORDER BY process;
-- Lag metrics
SELECT name, value, unit
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');Common Exadata + Data Guard pitfalls
| Symptom | Likely cause | First check | Action |
|---|---|---|---|
| Commit latency spikes in SYNC | Network RTT instability | Latency p95/p99 | Re-evaluate SYNC or network path |
| Apply lag under reporting load | Standby IO/CPU contention | OS + apply metrics | Resource governance, limit reporting concurrency |
| Frequent log switches | Redo logs too small | v$log switch frequency | Increase redo size |
| SRL not used in real-time | Misconfiguration | gv$standby_log status | Fix SRL count/size per thread |
Quick health snapshot
SELECT database_role, open_mode, protection_level
FROM v$database;
SELECT name, value, unit
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
SELECT dest_id, status, error, transmit_mode
FROM v$archive_dest_status
ORDER BY dest_id;Definition (simple)
Active Data Guard allows a physical standby database to be open read-only while it is still applying redo. This enables real-time reporting, offload queries, and read-only services without stopping recovery.
Typical use cases
- Reporting / BI offload (heavy reads)
- Read-only APIs (analytics endpoints)
- Batch extracts and exports
- Backups on standby to offload primary (strategy dependent)
What ADG is not
- Not a multi-master solution
- Not a replacement for application read/write splitting discipline
- Not a free feature: usage is tied to licensing
Key operational invariant
Standby is open read-only
AND redo apply is still running
=> queries and apply share CPU/IO, so resource governance matters.Licensing caveats (practical)
Active Data Guard is a separately licensed option in many Oracle distributions. What matters operationally: do not assume read-only open + apply is covered by "basic Data Guard". Align with your Oracle licensing policy and your contract terms.
| Feature | Why it matters | Risk if ignored | Mitigation |
|---|---|---|---|
| Read-only open + redo apply | This is the ADG capability | Non-compliance exposure | Confirm entitlement before enabling |
| Offload backups on standby | Common architecture choice | Audit questions | Document usage and licensing position |
| Real-time query services | Apps may start using standby heavily | Unplanned cost and capacity needs | Define service policies and guardrails |
Enable ADG (standby operations)
ADG requires the standby to be opened read-only, and apply to continue. A typical flow: start managed recovery, open read-only (or open then start apply, depending on version and policy).
Standby: open read-only
-- Standby: if mounted
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
-- Open read-only
ALTER DATABASE OPEN READ ONLY;Standby: start apply while open
-- Start redo apply while open read-only (ADG)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
USING CURRENT LOGFILE
DISCONNECT FROM SESSION;Validate state
-- Role and open mode
SELECT database_role, open_mode
FROM v$database;
-- Apply status
SELECT process, status, client_process, thread#, sequence#
FROM v$managed_standby
ORDER BY process;
-- Lag metrics
SELECT name, value, unit
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');Best practices (battle-tested)
Performance & stability
- Separate services: APP_RW on primary, APP_RO on standby
- Resource governance: ensure MRP always has CPU/IO headroom
- Monitor lag trend: alert on increasing apply lag, not only threshold
- SRL hygiene: correct size/count per thread (RAC)
- Storage latency: redo apply is sensitive to IO jitter
Query design on standby
- Avoid unbounded scans that saturate IO
- Prefer indexed access paths, limit concurrency
- Schedule heavy reports away from peak redo rates
- Use read-only routing and explicit timeouts
- Keep stats strategy consistent (do not destabilize plans)
Operational guardrails
| Guardrail | Why | How | Success signal |
|---|---|---|---|
| Dedicated read-only service | Prevents accidental writes and isolates routing | srvctl service + app config | No write attempts on standby |
| Lag SLO | Ensures DR readiness | Alert on apply lag and trend | Lag stays within budget |
| Capacity headroom | Avoids apply starvation | CPU/IO reservation | MRP stable under load |
| Change control | Standby is production-grade | Patch windows + drills | No surprises during switchover |
Common pitfalls
- Opening standby read-only for reporting without resource controls => apply lag spikes.
- Routing too much traffic to standby => standby becomes critical path and hurts DR posture.
- Missing SRLs (or wrong SRL sizing) => real-time apply not effective.
- Not rehearsing service cutover => app outages during planned switchovers.
Quick monitoring block (copy/paste)
-- ADG: verify open read-only and apply running
SELECT database_role, open_mode FROM v$database;
SELECT name, value, unit
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
SELECT process, status, thread#, sequence#
FROM v$managed_standby
ORDER BY process;
SELECT dest_id, status, error, transmit_mode
FROM v$archive_dest_status
ORDER BY dest_id;RAC + Data Guard: core mental model
In RAC, each instance has its own redo thread. Data Guard transport/apply must handle multiple threads consistently. Your goal is to keep all threads shipping and applying without gaps, and to ensure services move cleanly during role transitions.
Key objects
- Redo threads: one per RAC instance (THREAD#)
- Online redo logs: per thread, on primary
- Standby redo logs (SRL): per thread, on standby
- RFS: one receiver per incoming stream
- MRP: applies redo; must see all threads
What breaks first (typical)
- Missing SRLs on one thread => no real-time apply for that thread
- SRLs too few/small => frequent switches, bursts, lag growth
- Services not pinned => application cutover chaos
- Thread disable/enable drift after maintenance
Quick visibility checks
-- Threads and instances (primary)
SELECT thread#, instance_name, host_name
FROM gv$instance
ORDER BY thread#;
-- Redo log groups per thread (primary)
SELECT thread#, group#, bytes/1024/1024 AS size_mb, status
FROM gv$log
ORDER BY thread#, group#;
-- Standby logs per thread (standby)
SELECT thread#, group#, bytes/1024/1024 AS size_mb, status
FROM gv$standby_log
ORDER BY thread#, group#;SRL per thread: sizing and count
Baseline rule: for each thread on the standby, create SRLs with the same size as the primary online redo logs, and with a count of (#online redo log groups per thread + 1). This enables real-time apply and reduces lag.
| Item | Per thread? | Baseline | Notes |
|---|---|---|---|
| SRL size | Yes | Match primary online redo size | Keep consistent across all threads |
| SRL count | Yes | #online redo groups + 1 | Increase if redo rate is high or switches are frequent |
| SRL placement | Yes | Fast, stable storage | SRL write latency affects apply readiness |
Example: create SRLs for 2-thread RAC
Example assumes each thread has 3 online redo groups on primary; create 4 SRLs per thread on standby. Adjust SIZE to match your environment.
-- Standby: create SRLs for THREAD 1
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
-- Standby: create SRLs for THREAD 2
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;SRL health checks
-- Standby: SRL inventory and state
SELECT thread#, group#, sequence#, bytes/1024/1024 AS size_mb, status
FROM gv$standby_log
ORDER BY thread#, group#;
-- Standby: apply processes visibility
SELECT inst_id, process, status, thread#, sequence#
FROM gv$managed_standby
ORDER BY inst_id, process;Services: clean application cutover
In RAC, client connectivity should be service-based. During role transitions, services must be started on the new primary and stopped on the new standby. Prefer cluster-managed services (srvctl) and align service placement with instance roles.
Recommended approach
- Use SCAN/VIP + service names, not node hostnames
- Define service goals (TP/THROUGHPUT) and preferred instances
- Ensure service is enabled only for PRIMARY_ROLE (policy)
- Keep a separate read-only service for standby (if ADG)
Typical service layout
- APP_RW: primary read-write (production)
- APP_RO: standby read-only (reporting) if Active Data Guard
- BATCH: batch windows, can be pinned to specific instances
Service inspection (database)
-- Services visible inside the database
SELECT name, network_name, pdb
FROM dba_services
ORDER BY name;
-- Role/open mode check (for gating service changes)
SELECT database_role, open_mode
FROM v$database;Service control (cluster - example)
# Example commands (run as grid/oracle depending on setup)
# List services
srvctl status service -d
# Start/stop service
srvctl start service -d -s APP_RW
srvctl stop service -d -s APP_RW Switchover gates (RAC-specific)
| Gate | Check | Expected | If not OK |
|---|---|---|---|
| Thread readiness | gv$instance + gv$log | All threads enabled, consistent log config | Fix thread enablement and log group parity |
| SRLs per thread | gv$standby_log | SRLs exist for every thread, correct size/count | Create/resize SRLs |
| Lag across threads | v$dataguard_stats + gv$managed_standby | Transport/apply lag stable, no gaps | Resolve gap, investigate standby bottlenecks |
| Service plan | srvctl status service | Services mapped and ready on target primary | Fix srvctl definitions and preferred instances |
Thread and gap quick checks
-- Standby: confirm sequences are being received/applied for all threads
SELECT thread#,
MAX(sequence#) AS last_seq,
MAX(CASE WHEN applied='YES' THEN sequence# END) AS last_applied_seq
FROM v$archived_log
GROUP BY thread#
ORDER BY thread#;
-- Lag metrics
SELECT name, value, unit
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');Pourquoi Data Guard
Data Guard = solution native Oracle pour HA/DR via replication basee sur redo. Objectif: survivre a une panne site / stockage / corruption / erreur humaine, en gardant un chemin clair de role transition (switchover/failover) avec verifications.
Ce que Data Guard apporte
- Protection du primaire par un standby synchronise.
- RPO configurable: zero data loss possible (selon mode, distance, latence).
- RTO reduit: bascule proceduree et repetable.
- Offload lecture/reporting sur standby (Active Data Guard si licence).
Ce que Data Guard ne remplace pas
- Backup strategy (RMAN, immutability, testing restores).
- Application-level resilience (retry, idempotency, timeouts).
- Operational discipline (runbooks, drills, comms).
Mental model
Redo is the truth.
Primary generates redo.
Transport ships redo.
Apply replays redo.
Role transition is controlled and validated.RPO / RTO (pratique)
| Metric | Meaning | How Data Guard impacts it | Typical lever |
|---|---|---|---|
| RPO | Data loss tolerance | SYNC modes can achieve near/zero data loss | Protection mode + SYNC/ASYNC + network latency |
| RTO | Time to restore service | Broker + runbooks can cut downtime | Automation, DNS/VIP, service management, rehearsals |
Composants cle
- Primary: production, generates redo (LGWR/ARCH).
- Standby: physical standby (recovery apply) or logical (SQL apply, legacy).
- Redo transport: network shipping via LOG_ARCHIVE_DEST_n.
- Apply: MRP (Managed Recovery Process) + real-time apply with SRLs.
- Broker: DGMGRL framework for config + monitoring + transitions.
- Observer: FSFO external process watching health to auto-failover.
Key data guard views
V$DATABASE
V$ARCHIVE_DEST_STATUS
V$ARCHIVED_LOG
V$DATAGUARD_STATS
V$MANAGED_STANDBY
V$STANDBY_LOGPatterns frequents
| Pattern | When | Notes |
|---|---|---|
| 1 primary + 1 physical standby | Classic DR | Keep it simple, automate transitions |
| Multi-standby | DR + reporting | One standby can be far, another near |
| Cascaded standby | Network constraints | Standby ships redo to another standby |
Le pipeline redo
Primary
LGWR writes redo to online redo logs
ARCH archives (optional) to archive logs
LOG_ARCHIVE_DEST_n ships redo to Standby
Standby
RFS receives redo
SRLs store redo (recommended)
MRP applies redo (real-time apply possible)Processes (quick)
- LGWR: redo writer on primary
- ARCH: archive logs creation
- RFS: remote file server on standby receiving redo
- MRP: managed recovery applying redo
Real-time apply
SYNC vs ASYNC (what actually changes)
| Mode | Primary commit waits? | Typical RPO | Tradeoff |
|---|---|---|---|
| SYNC | Yes (ACK required) | Near/zero | Commit latency depends on network |
| ASYNC | No (best effort) | Small loss possible | Better latency, higher throughput |
Standby Redo Logs sizing (SRL)
SRLs should match online redo log size and count. General baseline: (#online redo log groups + 1) per thread on the standby. For RAC: do it per thread.
-- Example (single instance)
-- If primary has 3 online redo log groups, create 4 SRLs
ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;Network checklist (ops)
- Latency budget measured (p95/p99), not guessed.
- Stable DNS/SCAN or static endpoints for TNS.
- Firewall rules: listener ports + broker if used.
- Packet loss monitoring (loss kills SYNC effectiveness).
- Time sync (NTP/chrony) on all hosts.
What you are really choosing
| Mode | Transport | Behavior | Risk / Notes |
|---|---|---|---|
| Maximum Protection | SYNC + AFFIRM (typical) | Zero data loss target, primary may stop if no standby ack | Strongest RPO, requires robust network/standby availability |
| Maximum Availability | SYNC (with fallback) | Zero/near-zero, but primary continues if standby temporarily unavailable | Balanced, common for critical systems |
| Maximum Performance | ASYNC | Primary does not wait for standby | Best latency, some data loss possible on disaster |
Quick SQL checks
SELECT database_role, protection_mode, protection_level
FROM v$database;Primary must-have
-- Archivelog
ARCHIVE LOG LIST;
-- Force logging (recommended)
ALTER DATABASE FORCE LOGGING;
-- Standby file management
ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH;
-- Optional: flashback for fast reinstatement
-- (depends on licensing/strategy)
SELECT flashback_on FROM v$database;Parameter baseline (typical)
-- Use unique name
ALTER SYSTEM SET db_unique_name='PRIM' SCOPE=SPFILE;
-- If using Data Guard Broker
ALTER SYSTEM SET dg_broker_start=TRUE SCOPE=BOTH;
ALTER SYSTEM SET dg_broker_config_file1='+DATA' SCOPE=SPFILE;
ALTER SYSTEM SET dg_broker_config_file2='+FRA' SCOPE=SPFILE;Standby must-have
- Same DBID / same DB_NAME, distinct DB_UNIQUE_NAME
- Compatible init params (memory, processes, etc.)
- SRLs sized and provisioned
- Listener/TNS connectivity validated both ways
- Password file synced (or wallet strategy)
Validation checklist
| Check | How | Expected |
|---|---|---|
| Connectivity | tnsping + sqlplus | Stable connect |
| Archivelog | ARCHIVE LOG LIST | Enabled |
| Force logging | v$database.force_logging | YES |
| SRLs | v$standby_log | Present |
RMAN active duplicate (common)
Requires network connectivity and auxiliary instance in NOMOUNT. This is often the fastest path for initial sync.
-- On standby host (auxiliary)
-- 1) Create init.ora/spfile with db_name same as primary, db_unique_name different
-- 2) Start instance nomount
STARTUP NOMOUNT;
-- RMAN (from a client that can reach both)
rman target sys@PRIM auxiliary sys@STBY
DUPLICATE TARGET DATABASE
FOR STANDBY
FROM ACTIVE DATABASE
DORECOVER
NOFILENAMECHECK;RMAN duplicate from backup
-- 1) Take backup on primary (including archivelogs)
rman target /
BACKUP DATABASE PLUS ARCHIVELOG;
-- 2) Restore on standby host, then duplicate
rman target sys@PRIM auxiliary /
DUPLICATE TARGET DATABASE
FOR STANDBY
BACKUP LOCATION '/backup/location'
NOFILENAMECHECK;Controlfile method (manual baseline)
Useful as a mental model or last resort. RMAN is recommended for correctness and speed.
-- Primary
ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/stby.ctl';
-- Copy controlfile + datafiles via backup/restore strategy
-- On standby: mount with standby controlfilePost-build steps
- Set LOG_ARCHIVE_DEST_n on both ends.
- Create SRLs on standby.
- Start managed recovery (MRP).
- Verify apply/transport lag is stable.
-- Standby: start apply
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
-- Check MRP/RFS
SELECT process, status, thread#, sequence# FROM v$managed_standby;Minimal transport example
-- Primary
ALTER SYSTEM SET log_archive_config='DG_CONFIG=(PRIM,STBY)' SCOPE=BOTH;
ALTER SYSTEM SET log_archive_dest_2=
'service=STBY
ASYNC
valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE)
db_unique_name=STBY' SCOPE=BOTH;
ALTER SYSTEM SET log_archive_dest_state_2=ENABLE SCOPE=BOTH;SYNC example (higher protection)
ALTER SYSTEM SET log_archive_dest_2=
'service=STBY
SYNC AFFIRM
net_timeout=30
reopen=10
valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE)
db_unique_name=STBY' SCOPE=BOTH;Gap detection and fetch
Gaps happen with network outages or standby restarts. FAL can auto-fetch missing logs.
-- Standby
ALTER SYSTEM SET fal_server='PRIM' SCOPE=BOTH;
ALTER SYSTEM SET fal_client='STBY' SCOPE=BOTH;
-- Inspect archive sequences
SELECT thread#, sequence#, applied
FROM v$archived_log
ORDER BY thread#, sequence#;Transport status
SELECT dest_id, status, error, destination, target, transmit_mode
FROM v$archive_dest_status
WHERE dest_id IN (1,2);Start / stop apply
-- Start (real-time apply)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
-- Stop
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;Verify MRP
SELECT process, status, client_process, thread#, sequence#
FROM v$managed_standby
ORDER BY process;Lag metrics (official)
SELECT name, value, unit, time_computed, datum_time
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag','apply finish time');Real-time apply requirements
- Standby redo logs present
- Transport configured correctly
- MRP started with USING CURRENT LOGFILE
SELECT group#, thread#, sequence#, bytes/1024/1024 AS size_mb, status
FROM v$standby_log
ORDER BY thread#, group#;Common symptoms and first checks
| Symptom | Likely cause | First check | Action |
|---|---|---|---|
| Apply lag grows | Standby IO/CPU bottleneck | v$managed_standby, OS metrics | Scale standby, tune storage, review parallelism |
| Transport errors | Network/TNS/listener | v$archive_dest_status.error | Fix connectivity, validate service names |
| Gap detected | Missing archivelogs | v$archived_log sequences | FAL, manual copy/register logs |
Enable broker (both sides)
ALTER SYSTEM SET dg_broker_start=TRUE SCOPE=BOTH;
DGMGRL connect
dgmgrl sys@PRIM
dgmgrl sys@STBYCreate broker configuration
DGMGRL> CREATE CONFIGURATION dg_cfg AS
PRIMARY DATABASE IS PRIM
CONNECT IDENTIFIER IS PRIM;
DGMGRL> ADD DATABASE STBY AS
CONNECT IDENTIFIER IS STBY
MAINTAINED AS PHYSICAL;
DGMGRL> ENABLE CONFIGURATION;Validate and inspect
DGMGRL> SHOW CONFIGURATION;
DGMGRL> SHOW DATABASE PRIM;
DGMGRL> SHOW DATABASE STBY;
DGMGRL> VALIDATE DATABASE PRIM;
DGMGRL> VALIDATE DATABASE STBY;Operational commands
-- Switchover (planned)
DGMGRL> SWITCHOVER TO STBY;
-- Failover (unplanned)
DGMGRL> FAILOVER TO STBY;
-- Reinstate old primary (after failover)
DGMGRL> REINSTATE DATABASE PRIM;
-- Start/stop apply via broker
DGMGRL> EDIT DATABASE STBY SET STATE='APPLY-ON';
DGMGRL> EDIT DATABASE STBY SET STATE='APPLY-OFF';Switchover (planned) — minimal safe sequence
- Confirm standby is synchronized (transport/apply lag near 0).
- Confirm no unresolved gaps.
- Freeze risky ops (DDL windows), coordinate with app teams.
- Execute switchover via broker (preferred).
- Validate services, open mode, and client connectivity.
-- Gate checks
SELECT name, value FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
-- Broker switchover
DGMGRL> SWITCHOVER TO STBY;
DGMGRL> SHOW CONFIGURATION;Failover (unplanned) — what matters
- Decide: failover now vs attempt to recover primary.
- If potential data loss is unacceptable, stop and escalate.
- Prefer broker failover if available; else manual.
-- Broker failover
DGMGRL> FAILOVER TO STBY;
-- After failover:
DGMGRL> SHOW CONFIGURATION;
-- Then evaluate REINSTATE for old primary when ready.Go/No-go gates (template)
| Gate | Check | Go criteria | No-go action |
|---|---|---|---|
| Lag | v$dataguard_stats | apply lag small and stable | Investigate standby perf, network |
| Gaps | v$archived_log | No missing sequences | Resolve gap (FAL/manual) |
| Protection | v$database | Expected protection level | Align transport mode / broker props |
| Business | Ops comms | Approved window | Reschedule |
Application cutover (practical)
- Prefer service-based routing (DB services) rather than hard-coded hostnames.
- Use DNS TTL discipline or VIP/SCAN abstractions.
- Plan connection draining and retry logic on app side.
-- Example: verify role and open mode
SELECT database_role, open_mode FROM v$database;
-- Example: services (if used)
SELECT name, network_name, pdb FROM dba_services;What FSFO does
- Observer watches primary health via broker.
- If conditions match, broker triggers automatic failover.
- Goal: reduce RTO dramatically for hard failures.
Key concepts
| Item | Meaning |
|---|---|
| Observer | External process, should run on a third host |
| Threshold | How long primary must be unreachable before failover |
| Protection | FSFO can be configured with failover conditions aligned to RPO |
Minimal FSFO commands
DGMGRL> ENABLE FAST_START FAILOVER;
DGMGRL> SHOW FAST_START FAILOVER;
-- Start observer (run on third host)
DGMGRL> START OBSERVER;Core metrics (copy/paste)
-- Role and mode
SELECT database_role, open_mode, protection_mode, protection_level
FROM v$database;
-- Lag metrics
SELECT name, value, unit, time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag','apply finish time');
-- Transport status
SELECT dest_id, status, error, destination, target, transmit_mode
FROM v$archive_dest_status
WHERE status <> 'INACTIVE'
ORDER BY dest_id;
-- Apply processes
SELECT process, status, client_process, thread#, sequence#
FROM v$managed_standby
ORDER BY process;
-- Applied archive logs
SELECT thread#, sequence#, first_time, next_time, applied
FROM v$archived_log
ORDER BY thread#, sequence#;What to alert on
| Signal | Threshold idea | Interpretation | Action |
|---|---|---|---|
| Transport lag | p95 > X seconds | Network/shipping issue | Check v$archive_dest_status, network loss, listener |
| Apply lag | growing trend | Standby apply cannot keep up | Investigate IO/CPU, SRLs, parallelism |
| Broker warning | any warning | Config drift or gap | SHOW CONFIGURATION / VALIDATE DATABASE |
Transport levers
- SYNC/ASYNC: choose based on latency and RPO
- NET_TIMEOUT: avoid long commit stalls in SYNC
- REOPEN: retry interval on errors
- COMPRESSION: reduce bandwidth (version/licensing dependent)
-- Example of stability-oriented SYNC settings
ALTER SYSTEM SET log_archive_dest_2=
'service=STBY SYNC AFFIRM
net_timeout=30 reopen=10
valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE)
db_unique_name=STBY' SCOPE=BOTH;Apply levers
- Standby IO: SRLs + fast redo write path
- CPU headroom for MRP
- Avoid noisy neighbors on standby (dedicate resources)
- Keep storage latency predictable
Redo rate quick check (approx)
-- Approx redo generation rate by time window
SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24:MI') AS minute,
ROUND(SUM(blocks*block_size)/1024/1024) AS mb
FROM v$archived_log
WHERE first_time > SYSDATE - (1/24)
GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24:MI')
ORDER BY minute;Hardening checklist
- Restrict listener to expected sources (firewall + valid node checking if applicable).
- Encrypt traffic where possible (TLS/TCPS strategy).
- Use strong password file practices, rotate credentials.
- Monitor auth failures and unexpected network peers.
- Ensure backups are protected (immutability, access controls).
Drill plan (template)
| Drill | Frequency | Goal | Evidence |
|---|---|---|---|
| Switchover rehearsal | Quarterly | Prove planned cutover works | RTO, checklist, logs |
| Failover simulation | 2x/year | Prove disaster path | Decision record, RPO estimate |
| Gap injection | Quarterly | Validate gap resolution | MTTR on gap |
| Observer fail test (FSFO) | 2x/year | Validate automation safety | Observer logs |
First 5 minutes (triage)
-- Role/mode
SELECT database_role, open_mode FROM v$database;
-- Lag
SELECT name, value, unit FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
-- Transport errors
SELECT dest_id, status, error, destination
FROM v$archive_dest_status
WHERE status <> 'INACTIVE'
ORDER BY dest_id;
-- Apply status
SELECT process, status, thread#, sequence#
FROM v$managed_standby
ORDER BY process;Gap playbook (manual path)
- Identify missing sequences on standby.
- Fetch missing archivelogs from primary (copy or FAL).
- Register logs if needed.
- Restart MRP.
-- Find missing sequences (example approach)
SELECT thread#, MIN(sequence#) AS min_seq, MAX(sequence#) AS max_seq
FROM v$archived_log
GROUP BY thread#
ORDER BY thread#;
-- Register log on standby if manually copied
ALTER DATABASE REGISTER LOGFILE '/path/to/archivelog.arc';
-- Restart apply
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;Apply stuck (common causes)
| Cause | Signal | Check | Fix |
|---|---|---|---|
| Standby IO bottleneck | Apply lag grows | OS storage latency | Improve storage, isolate standby workload |
| SRL issues | No real-time apply | v$standby_log status | Create/resize SRLs |
| Archive gap | MRP waits for log | v$archived_log applied | Resolve gap |
Broker warnings (pattern)
DGMGRL> SHOW CONFIGURATION;
DGMGRL> SHOW DATABASE STBY;
DGMGRL> VALIDATE DATABASE STBY;SQL quick checks
-- Role
SELECT database_role, open_mode FROM v$database;
-- Protection
SELECT protection_mode, protection_level FROM v$database;
-- Lag
SELECT name, value, unit FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
-- Transport
SELECT dest_id, status, error, transmit_mode
FROM v$archive_dest_status
ORDER BY dest_id;
-- Apply
SELECT process, status, thread#, sequence#
FROM v$managed_standby
ORDER BY process;DGMGRL essentials
dgmgrl sys@PRIM
SHOW CONFIGURATION;
SHOW DATABASE PRIM;
SHOW DATABASE STBY;
VALIDATE DATABASE PRIM;
VALIDATE DATABASE STBY;
SWITCHOVER TO STBY;
FAILOVER TO STBY;
REINSTATE DATABASE PRIM;
EDIT DATABASE STBY SET STATE='APPLY-ON';
EDIT DATABASE STBY SET STATE='APPLY-OFF';Decision tree (fast)
If transport lag is high:
Check v$archive_dest_status.error
Validate TNS/listener/firewall
Check packet loss / latency
If apply lag is high:
Check standby IO/CPU
Validate SRLs and real-time apply
Look for gaps and MRP status
If broker shows warnings:
Run VALIDATE DATABASE
Fix SRL/config drift
Re-check lag and statesInputs
Generated script
Copy/paste into a terminal (DGMGRL). Adjust as needed.
Click "Generate" to produce the script.
Switchover runbook (template)
Pre-checks
- Confirm lag near zero (transport + apply)
- Confirm no gaps
- Confirm broker config healthy (if used)
- Confirm maintenance window approved and comms sent
- Prepare rollback plan (timeboxed)
Execution
- Drain app connections (or switch services)
- Broker switchover
- Validate new primary open mode and services
- Resume app traffic
Post-checks
-- Validate role and protection
SELECT database_role, open_mode, protection_level FROM v$database;
-- Validate lag
SELECT name, value, unit FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
-- Broker health
DGMGRL> SHOW CONFIGURATION;