Project Oxygen & Ideo-LabIDEO LAB Dashboard 2026

🛡️ Oracle Data Guard – Guide complet (DR, Broker, Runbooks)

Architecture, redo transport, protection modes, broker, switchover/failover, FSFO, monitoring et troubleshooting.

1.1

Vue d ensemble

Pourquoi Data Guard, principes, vocabulaire, composants.

ConceptsDRRPO/RTO
1.2

Architecture

Primary/Standby, redo, services, SYNC/ASYNC, LGWR/ARCH.

RedoTransportApply
1.3

Protection modes

Maximum Protection / Availability / Performance, impact RPO.

SYNCRPORisk
2.1

Prerequis

Force logging, standby redo logs, archivelog, network, listener.

SetupSRLARCH
2.2

Build Standby

RMAN duplicate, active duplicate, restore from backup, standby controlfile.

RMANDuplicateAutomation
2.3

Redo transport

LOG_ARCHIVE_DEST_n, VALID_FOR, DB_UNIQUE_NAME, FAL, NET_TIMEOUT.

LOG_ARCHIVE_DESTFALTuning
3.1

Redo apply

MRP, real-time apply, delay, apply lag, SRL usage.

MRPReal-timeLag
3.2

Data Guard Broker

DGMGRL, config, observer, validation, role transitions.

BrokerDGMGRLOperations
3.3

Switchover / Failover

Runbooks, checks, sessions, services, DNS/VIP, app cutover.

RunbookRTOValidation
4.1

FSFO

Fast-Start Failover, observer, thresholds, safety, split brain guard.

FSFOObserverSafety
4.2

Monitoring

V$DATAGUARD_STATS, apply/transport lag, broker status, alerts.

ViewsMetricsAlerting
4.3

Tuning & perf

SYNC vs ASYNC, NET_TIMEOUT, compression, parallel apply, I/O, SRL sizing.

PerfNetworkApply
5.1

Securite

Encryption in transit, wallets, password files, least privilege, ports.

TLSWalletHardening
5.2

Tests DR

Game days, chaos drills, lag injection, role transition rehearsal.

DR DrillGame DayProof
5.3

Troubleshooting

Gap, ORA- errors, stuck apply, network drops, broker warnings.

IncidentsGapFix
6.1

Cheat-sheet

Commands SQL/DGMGRL, quick checks, decision tree.

CommandsChecks
6.2

Broker Wizard

Generate a safe baseline config (DGMGRL + init params).

WizardBaseline
6.3

Runbook complet

Checklist pre/post, comms, rollback, cutover app, KPIs.

RunbookOpsAudit
Incident timeline — transport down 12 minutes (gap, recovery, KPIs)
Scenario

Primary and physical standby are healthy. Redo transport unexpectedly drops for 12 minutes due to a network change. During the outage, primary continues generating redo (ASYNC transport). When network is restored, standby reports a gap. Objective: restore transport/apply fast, quantify impact, and produce evidence for postmortem.

Environment assumptions
  • Physical standby with SRLs and real-time apply enabled
  • Broker present (optional)
  • Transport mode ASYNC (primary continues during outage)
  • Monitoring on v$dataguard_stats + v$archive_dest_status
Incident objectives
  • Restore redo transport reliability
  • Resolve gap and restart apply
  • Measure peak lag and time-to-catch-up
  • Produce clear timeline and corrective actions
Initial triage commands
-- Transport error signal
                    SELECT dest_id, status, error, destination, transmit_mode
                    FROM v$archive_dest_status
                    WHERE status <> 'INACTIVE'
                    ORDER BY dest_id;

                    -- Lag snapshot
                    SELECT name, value, unit, time_computed
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag','apply finish time');

                    -- Apply processes (standby)
                    SELECT process, status, client_process, thread#, sequence#
                    FROM v$managed_standby
                    ORDER BY process;
Timeline (minute-by-minute)
TimeObservationEvidence (what to capture)Decision / Action
T+00Alert: transport lag increases, archive dest shows errorv$archive_dest_status, alert log snippet, network ping/RTTDeclare incident, assign roles, freeze changes
T+02Primary continues OK (ASYNC), standby apply slows/stopsv$dataguard_stats, v$managed_standbyConfirm business impact: none yet, DR posture degraded
T+04Transport remains down, redo backlog growsArchived log rate on primary, lag trendEscalate to network team, request revert/rollback
T+08Network team identifies firewall/route changeChange ticket ID, firewall counters (drops), tracerouteRequest immediate fix and confirm expected ETA
T+12Transport restored, but standby shows gapv$archived_log missing sequences, dest status now VALIDStart gap resolution (FAL or manual)
T+14Standby fetches missing logs (FAL) / or manual copy beginsRFS activity, new archivelogs appear, applied advancesKeep apply running, monitor catch-up rate
T+18Apply catches up, lag returns within SLOtransport/apply lag near baselineClose incident, open postmortem, document actions
Evidence bundle checklist
  • Transport status snapshots (before, during, after)
  • Lag snapshots (peak + recovery curve)
  • Missing sequence range (gap start/end)
  • Change reference (network) and remediation steps
Gap resolution playbook

After transport returns, standby may still be missing archived logs. If FAL is configured, it can auto-fetch. Otherwise, copy missing logs manually and register them. Then restart apply and verify catch-up.

Step 1: identify missing sequences
-- Standby: last received and last applied per thread
                            SELECT thread#,
                            MAX(sequence#) AS last_received,
                            MAX(CASE WHEN applied='YES' THEN sequence# END) AS last_applied
                            FROM v$archived_log
                            GROUP BY thread#
                            ORDER BY thread#;

                            -- Standby: list gaps (manual inspection approach)
                            SELECT thread#, sequence#, applied
                            FROM v$archived_log
                            WHERE first_time > SYSDATE - 1
                            ORDER BY thread#, sequence#;
Step 2: verify transport state
SELECT dest_id, status, error, destination
                            FROM v$archive_dest_status
                            WHERE status <> 'INACTIVE'
                            ORDER BY dest_id;

                            -- Broker (optional)
                            -- DGMGRL> SHOW CONFIGURATION;
                            -- DGMGRL> SHOW DATABASE STBY;
Step 3a: FAL auto-fetch (preferred if configured)
-- Standby: confirm FAL settings
                    SHOW PARAMETER fal_server;
                    SHOW PARAMETER fal_client;

                    -- If needed (example)
                    ALTER SYSTEM SET fal_server='PRIM' SCOPE=BOTH;
                    ALTER SYSTEM SET fal_client='STBY' SCOPE=BOTH;
Step 3b: manual copy + register (fallback)
-- Standby: register manually copied archivelogs
                    ALTER DATABASE REGISTER LOGFILE '/path/to/arch_1_12345_*.arc';
                    ALTER DATABASE REGISTER LOGFILE '/path/to/arch_1_12346_*.arc';
Step 4: restart apply and validate
-- Restart apply (real-time)
                    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
                    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;

                    -- Validate lag
                    SELECT name, value, unit
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');

                    -- Apply processes
                    SELECT process, status, thread#, sequence#
                    FROM v$managed_standby
                    ORDER BY process;
If apply lag does not shrink after gap is resolved, suspect standby resource contention (IO/CPU) or SRL issues.
KPIs to report (what leadership cares about)
KPIDefinitionHow to computeTarget
Transport downtimeDuration transport was failingT_restore - T_failAs low as possible
Peak transport lagMax shipping delay observedMax(v$dataguard_stats.transport lag)Within SLO
Peak apply lagMax apply delay observedMax(v$dataguard_stats.apply lag)Within SLO
Time to catch upTime from restore to lag back to baselineT_baseline - T_restoreMinutes, not hours
Gap sizeMissing sequences range(last_received - last_applied)Small
Postmortem structure (template)
Root cause
  • Network change impacted transport path (firewall/route policy)
  • Monitoring detected lag growth within X minutes
  • Transport restored after rollback/fix at T+12
Corrective actions
  • Add change guardrail: replication path validation pre-change
  • Add alert: transport errors + lag slope threshold
  • Run quarterly drill: gap resolution under load
  • Document escalation runbook and ownership
One-shot summary commands (copy/paste)
-- Transport status
                    SELECT dest_id, status, error, destination, transmit_mode
                    FROM v$archive_dest_status
                    ORDER BY dest_id;

                    -- Lag snapshot
                    SELECT name, value, unit, time_computed
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag','apply finish time');

                    -- Last applied per thread
                    SELECT thread#,
                    MAX(CASE WHEN applied='YES' THEN sequence# END) AS last_applied
                    FROM v$archived_log
                    GROUP BY thread#
                    ORDER BY thread#;
When you write the incident report, keep it numeric: exact downtime, peak lag, catch-up time, and concrete preventions.
Exadata + Data Guard — redo transport network, SRL on ASM, IO baseline
Redo transport network on Exadata

In Exadata environments, redo transport performance is often limited not by storage, but by network latency and packet stability between primary and standby sites. Especially in SYNC mode, commit latency is directly impacted by network RTT.

Critical network factors
  • Round-trip latency (RTT) — affects SYNC commits
  • Packet loss — kills effective throughput
  • Bandwidth headroom — must exceed peak redo rate
  • Jitter stability — redo bursts amplify jitter issues
Design patterns
  • Use dedicated replication network path if possible
  • Measure p95/p99 latency, not only average
  • Align SYNC mode with realistic latency budget
  • Validate firewall throughput under sustained load
Redo generation vs network capacity
-- Approx redo generation (MB per minute)
                    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24:MI') AS minute,
                    ROUND(SUM(blocks*block_size)/1024/1024) AS mb
                    FROM v$archived_log
                    WHERE first_time > SYSDATE - (1/24)
                    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24:MI')
                    ORDER BY minute;
Rule: network sustained throughput must be comfortably higher than peak redo rate, otherwise transport lag grows during spikes.
Standby Redo Logs on ASM (Exadata)

On Exadata, redo and standby redo logs typically reside in ASM disk groups. Placement and redundancy choice impact redo write latency and apply stability.

ItemRecommendationWhy
Disk group+DATA or dedicated redo diskgroupLow latency write path
RedundancyNormal or High (per policy)Balance between protection and latency
Size parityMatch primary redo sizeConsistent log switch behavior
Example: create SRL on ASM
-- Example using OMF in ASM
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;

                    -- Verify placement
                    SELECT group#, thread#, bytes/1024/1024 AS size_mb, status
                    FROM gv$standby_log
                    ORDER BY thread#, group#;
On Exadata, redo latency is usually low. If apply lag grows, suspect CPU contention or network before blaming storage.
IO baseline for standby on Exadata

Active Data Guard + heavy reporting can stress IO. Establish a measurable IO baseline to ensure redo apply never starves.

Primary baseline metrics
  • Redo generation rate (MB/s)
  • Log file sync latency
  • Log switch frequency
Standby baseline metrics
  • Apply rate (MB/s)
  • SRL write latency
  • Apply lag stability under load
Apply visibility
-- Apply rate visibility
                    SELECT process, status, thread#, sequence#
                    FROM gv$managed_standby
                    ORDER BY process;

                    -- Lag metrics
                    SELECT name, value, unit
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');
Best practice: standby apply rate should exceed peak redo generation rate. Otherwise lag accumulates during business spikes.
Common Exadata + Data Guard pitfalls
SymptomLikely causeFirst checkAction
Commit latency spikes in SYNCNetwork RTT instabilityLatency p95/p99Re-evaluate SYNC or network path
Apply lag under reporting loadStandby IO/CPU contentionOS + apply metricsResource governance, limit reporting concurrency
Frequent log switchesRedo logs too smallv$log switch frequencyIncrease redo size
SRL not used in real-timeMisconfigurationgv$standby_log statusFix SRL count/size per thread
Exadata gives you strong IO performance, but Data Guard safety still depends primarily on correct network design and disciplined operations.
Quick health snapshot
SELECT database_role, open_mode, protection_level
                    FROM v$database;

                    SELECT name, value, unit
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');

                    SELECT dest_id, status, error, transmit_mode
                    FROM v$archive_dest_status
                    ORDER BY dest_id;
Active Data Guard — read-only access while redo apply is running
Definition (simple)

Active Data Guard allows a physical standby database to be open read-only while it is still applying redo. This enables real-time reporting, offload queries, and read-only services without stopping recovery.

Typical use cases
  • Reporting / BI offload (heavy reads)
  • Read-only APIs (analytics endpoints)
  • Batch extracts and exports
  • Backups on standby to offload primary (strategy dependent)
What ADG is not
  • Not a multi-master solution
  • Not a replacement for application read/write splitting discipline
  • Not a free feature: usage is tied to licensing
Key operational invariant
Standby is open read-only
                    AND redo apply is still running
                    => queries and apply share CPU/IO, so resource governance matters.
Licensing caveats (practical)

Active Data Guard is a separately licensed option in many Oracle distributions. What matters operationally: do not assume read-only open + apply is covered by "basic Data Guard". Align with your Oracle licensing policy and your contract terms.

FeatureWhy it mattersRisk if ignoredMitigation
Read-only open + redo applyThis is the ADG capabilityNon-compliance exposureConfirm entitlement before enabling
Offload backups on standbyCommon architecture choiceAudit questionsDocument usage and licensing position
Real-time query servicesApps may start using standby heavilyUnplanned cost and capacity needsDefine service policies and guardrails
Compliance habit: keep a simple internal doc stating (1) what ADG features are enabled, (2) where, (3) why, and (4) who approved it. This saves days during audits.
Enable ADG (standby operations)

ADG requires the standby to be opened read-only, and apply to continue. A typical flow: start managed recovery, open read-only (or open then start apply, depending on version and policy).

Standby: open read-only
-- Standby: if mounted
                            ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

                            -- Open read-only
                            ALTER DATABASE OPEN READ ONLY;
Standby: start apply while open
-- Start redo apply while open read-only (ADG)
                            ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
                            USING CURRENT LOGFILE
                            DISCONNECT FROM SESSION;
Validate state
-- Role and open mode
                    SELECT database_role, open_mode
                    FROM v$database;

                    -- Apply status
                    SELECT process, status, client_process, thread#, sequence#
                    FROM v$managed_standby
                    ORDER BY process;

                    -- Lag metrics
                    SELECT name, value, unit
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');
If apply does not run while open, check SRLs, broker state (if used), and alert log for recovery errors.
Best practices (battle-tested)
Performance & stability
  • Separate services: APP_RW on primary, APP_RO on standby
  • Resource governance: ensure MRP always has CPU/IO headroom
  • Monitor lag trend: alert on increasing apply lag, not only threshold
  • SRL hygiene: correct size/count per thread (RAC)
  • Storage latency: redo apply is sensitive to IO jitter
Query design on standby
  • Avoid unbounded scans that saturate IO
  • Prefer indexed access paths, limit concurrency
  • Schedule heavy reports away from peak redo rates
  • Use read-only routing and explicit timeouts
  • Keep stats strategy consistent (do not destabilize plans)
Operational guardrails
GuardrailWhyHowSuccess signal
Dedicated read-only servicePrevents accidental writes and isolates routingsrvctl service + app configNo write attempts on standby
Lag SLOEnsures DR readinessAlert on apply lag and trendLag stays within budget
Capacity headroomAvoids apply starvationCPU/IO reservationMRP stable under load
Change controlStandby is production-gradePatch windows + drillsNo surprises during switchover
Common pitfalls
  • Opening standby read-only for reporting without resource controls => apply lag spikes.
  • Routing too much traffic to standby => standby becomes critical path and hurts DR posture.
  • Missing SRLs (or wrong SRL sizing) => real-time apply not effective.
  • Not rehearsing service cutover => app outages during planned switchovers.
Quick monitoring block (copy/paste)
-- ADG: verify open read-only and apply running
                    SELECT database_role, open_mode FROM v$database;

                    SELECT name, value, unit
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');

                    SELECT process, status, thread#, sequence#
                    FROM v$managed_standby
                    ORDER BY process;

                    SELECT dest_id, status, error, transmit_mode
                    FROM v$archive_dest_status
                    ORDER BY dest_id;
RAC + Data Guard — threads, SRL per thread, services and role transitions
RAC + Data Guard: core mental model

In RAC, each instance has its own redo thread. Data Guard transport/apply must handle multiple threads consistently. Your goal is to keep all threads shipping and applying without gaps, and to ensure services move cleanly during role transitions.

Key objects
  • Redo threads: one per RAC instance (THREAD#)
  • Online redo logs: per thread, on primary
  • Standby redo logs (SRL): per thread, on standby
  • RFS: one receiver per incoming stream
  • MRP: applies redo; must see all threads
What breaks first (typical)
  • Missing SRLs on one thread => no real-time apply for that thread
  • SRLs too few/small => frequent switches, bursts, lag growth
  • Services not pinned => application cutover chaos
  • Thread disable/enable drift after maintenance
Quick visibility checks
-- Threads and instances (primary)
                    SELECT thread#, instance_name, host_name
                    FROM gv$instance
                    ORDER BY thread#;

                    -- Redo log groups per thread (primary)
                    SELECT thread#, group#, bytes/1024/1024 AS size_mb, status
                    FROM gv$log
                    ORDER BY thread#, group#;

                    -- Standby logs per thread (standby)
                    SELECT thread#, group#, bytes/1024/1024 AS size_mb, status
                    FROM gv$standby_log
                    ORDER BY thread#, group#;
SRL per thread: sizing and count

Baseline rule: for each thread on the standby, create SRLs with the same size as the primary online redo logs, and with a count of (#online redo log groups per thread + 1). This enables real-time apply and reduces lag.

ItemPer thread?BaselineNotes
SRL sizeYesMatch primary online redo sizeKeep consistent across all threads
SRL countYes#online redo groups + 1Increase if redo rate is high or switches are frequent
SRL placementYesFast, stable storageSRL write latency affects apply readiness
Example: create SRLs for 2-thread RAC

Example assumes each thread has 3 online redo groups on primary; create 4 SRLs per thread on standby. Adjust SIZE to match your environment.

-- Standby: create SRLs for THREAD 1
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 SIZE 1024M;

                    -- Standby: create SRLs for THREAD 2
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 SIZE 1024M;
Validation tip: after starting managed recovery with real-time apply, you should see SRLs cycling through ACTIVE/UNASSIGNED states.
SRL health checks
-- Standby: SRL inventory and state
                    SELECT thread#, group#, sequence#, bytes/1024/1024 AS size_mb, status
                    FROM gv$standby_log
                    ORDER BY thread#, group#;

                    -- Standby: apply processes visibility
                    SELECT inst_id, process, status, thread#, sequence#
                    FROM gv$managed_standby
                    ORDER BY inst_id, process;
Services: clean application cutover

In RAC, client connectivity should be service-based. During role transitions, services must be started on the new primary and stopped on the new standby. Prefer cluster-managed services (srvctl) and align service placement with instance roles.

Recommended approach
  • Use SCAN/VIP + service names, not node hostnames
  • Define service goals (TP/THROUGHPUT) and preferred instances
  • Ensure service is enabled only for PRIMARY_ROLE (policy)
  • Keep a separate read-only service for standby (if ADG)
Typical service layout
  • APP_RW: primary read-write (production)
  • APP_RO: standby read-only (reporting) if Active Data Guard
  • BATCH: batch windows, can be pinned to specific instances
Service inspection (database)
-- Services visible inside the database
                    SELECT name, network_name, pdb
                    FROM dba_services
                    ORDER BY name;

                    -- Role/open mode check (for gating service changes)
                    SELECT database_role, open_mode
                    FROM v$database;
Service control (cluster - example)
# Example commands (run as grid/oracle depending on setup)
                    # List services
                    srvctl status service -d 

                    # Start/stop service
                    srvctl start service -d  -s APP_RW
                    srvctl stop  service -d  -s APP_RW
Operational principle: switch data role first, then move services, then validate app connectivity and error rate.
Switchover gates (RAC-specific)
GateCheckExpectedIf not OK
Thread readinessgv$instance + gv$logAll threads enabled, consistent log configFix thread enablement and log group parity
SRLs per threadgv$standby_logSRLs exist for every thread, correct size/countCreate/resize SRLs
Lag across threadsv$dataguard_stats + gv$managed_standbyTransport/apply lag stable, no gapsResolve gap, investigate standby bottlenecks
Service plansrvctl status serviceServices mapped and ready on target primaryFix srvctl definitions and preferred instances
Thread and gap quick checks
-- Standby: confirm sequences are being received/applied for all threads
                    SELECT thread#,
                    MAX(sequence#) AS last_seq,
                    MAX(CASE WHEN applied='YES' THEN sequence# END) AS last_applied_seq
                    FROM v$archived_log
                    GROUP BY thread#
                    ORDER BY thread#;

                    -- Lag metrics
                    SELECT name, value, unit
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');
If one thread lags behind, treat it as a first-class incident: RAC + DG safety requires consistent multi-thread apply.
1.1 Overview — Oracle Data Guard (concepts, goals, mental model)
Pourquoi Data Guard

Data Guard = solution native Oracle pour HA/DR via replication basee sur redo. Objectif: survivre a une panne site / stockage / corruption / erreur humaine, en gardant un chemin clair de role transition (switchover/failover) avec verifications.

Ce que Data Guard apporte
  • Protection du primaire par un standby synchronise.
  • RPO configurable: zero data loss possible (selon mode, distance, latence).
  • RTO reduit: bascule proceduree et repetable.
  • Offload lecture/reporting sur standby (Active Data Guard si licence).
Ce que Data Guard ne remplace pas
  • Backup strategy (RMAN, immutability, testing restores).
  • Application-level resilience (retry, idempotency, timeouts).
  • Operational discipline (runbooks, drills, comms).
Mental model
Redo is the truth.
                    Primary generates redo.
                    Transport ships redo.
                    Apply replays redo.
                    Role transition is controlled and validated.
RPO / RTO (pratique)
MetricMeaningHow Data Guard impacts itTypical lever
RPOData loss toleranceSYNC modes can achieve near/zero data lossProtection mode + SYNC/ASYNC + network latency
RTOTime to restore serviceBroker + runbooks can cut downtimeAutomation, DNS/VIP, service management, rehearsals
Rule of thumb: RPO is negotiated with physics (latency, distance) and RTO with discipline (automation + drills).
Composants cle
  • Primary: production, generates redo (LGWR/ARCH).
  • Standby: physical standby (recovery apply) or logical (SQL apply, legacy).
  • Redo transport: network shipping via LOG_ARCHIVE_DEST_n.
  • Apply: MRP (Managed Recovery Process) + real-time apply with SRLs.
  • Broker: DGMGRL framework for config + monitoring + transitions.
  • Observer: FSFO external process watching health to auto-failover.
Key data guard views
V$DATABASE
                    V$ARCHIVE_DEST_STATUS
                    V$ARCHIVED_LOG
                    V$DATAGUARD_STATS
                    V$MANAGED_STANDBY
                    V$STANDBY_LOG
Patterns frequents
PatternWhenNotes
1 primary + 1 physical standbyClassic DRKeep it simple, automate transitions
Multi-standbyDR + reportingOne standby can be far, another near
Cascaded standbyNetwork constraintsStandby ships redo to another standby
1.2 Architecture — redo generation, transport, apply, SRLs
Le pipeline redo
Primary
                    LGWR writes redo to online redo logs
                    ARCH archives (optional) to archive logs
                    LOG_ARCHIVE_DEST_n ships redo to Standby

                    Standby
                    RFS receives redo
                    SRLs store redo (recommended)
                    MRP applies redo (real-time apply possible)
Processes (quick)
  • LGWR: redo writer on primary
  • ARCH: archive logs creation
  • RFS: remote file server on standby receiving redo
  • MRP: managed recovery applying redo
Real-time apply
Real-time apply uses SRLs so apply can start before log is archived. This reduces apply lag and improves RTO readiness.
SYNC vs ASYNC (what actually changes)
ModePrimary commit waits?Typical RPOTradeoff
SYNCYes (ACK required)Near/zeroCommit latency depends on network
ASYNCNo (best effort)Small loss possibleBetter latency, higher throughput
Important: "SYNC" in LOG_ARCHIVE_DEST means the primary commit can wait for transport ack. Choose your protection mode based on latency budget.
Standby Redo Logs sizing (SRL)

SRLs should match online redo log size and count. General baseline: (#online redo log groups + 1) per thread on the standby. For RAC: do it per thread.

-- Example (single instance)
                    -- If primary has 3 online redo log groups, create 4 SRLs
                    ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;
                    ALTER DATABASE ADD STANDBY LOGFILE SIZE 1024M;
If SRLs are too small, log switches increase and network/apply can become bursty.
Network checklist (ops)
  • Latency budget measured (p95/p99), not guessed.
  • Stable DNS/SCAN or static endpoints for TNS.
  • Firewall rules: listener ports + broker if used.
  • Packet loss monitoring (loss kills SYNC effectiveness).
  • Time sync (NTP/chrony) on all hosts.
1.3 Protection modes — Maximum Protection / Availability / Performance
What you are really choosing
ModeTransportBehaviorRisk / Notes
Maximum ProtectionSYNC + AFFIRM (typical)Zero data loss target, primary may stop if no standby ackStrongest RPO, requires robust network/standby availability
Maximum AvailabilitySYNC (with fallback)Zero/near-zero, but primary continues if standby temporarily unavailableBalanced, common for critical systems
Maximum PerformanceASYNCPrimary does not wait for standbyBest latency, some data loss possible on disaster
Recommendation: decide with business owners using a simple table: latency budget, acceptable data loss, and disaster probability.
Quick SQL checks
SELECT database_role, protection_mode, protection_level
                FROM v$database;
2.1 Prerequisites — primary baseline (archivelog, force logging, SRL plan)
Primary must-have
-- Archivelog
                        ARCHIVE LOG LIST;

                        -- Force logging (recommended)
                        ALTER DATABASE FORCE LOGGING;

                        -- Standby file management
                        ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH;

                        -- Optional: flashback for fast reinstatement
                        -- (depends on licensing/strategy)
                        SELECT flashback_on FROM v$database;
Parameter baseline (typical)
-- Use unique name
                        ALTER SYSTEM SET db_unique_name='PRIM' SCOPE=SPFILE;

                        -- If using Data Guard Broker
                        ALTER SYSTEM SET dg_broker_start=TRUE SCOPE=BOTH;
                        ALTER SYSTEM SET dg_broker_config_file1='+DATA' SCOPE=SPFILE;
                        ALTER SYSTEM SET dg_broker_config_file2='+FRA'  SCOPE=SPFILE;
Standby must-have
  • Same DBID / same DB_NAME, distinct DB_UNIQUE_NAME
  • Compatible init params (memory, processes, etc.)
  • SRLs sized and provisioned
  • Listener/TNS connectivity validated both ways
  • Password file synced (or wallet strategy)
Validation checklist
CheckHowExpected
Connectivitytnsping + sqlplusStable connect
ArchivelogARCHIVE LOG LISTEnabled
Force loggingv$database.force_loggingYES
SRLsv$standby_logPresent
2.2 Build standby — RMAN duplicate (active / backup-based)
RMAN active duplicate (common)

Requires network connectivity and auxiliary instance in NOMOUNT. This is often the fastest path for initial sync.

-- On standby host (auxiliary)
                    -- 1) Create init.ora/spfile with db_name same as primary, db_unique_name different
                    -- 2) Start instance nomount
                    STARTUP NOMOUNT;

                    -- RMAN (from a client that can reach both)
                    rman target sys@PRIM auxiliary sys@STBY

                    DUPLICATE TARGET DATABASE
                    FOR STANDBY
                    FROM ACTIVE DATABASE
                    DORECOVER
                    NOFILENAMECHECK;
Tip: plan your file name conversions (db_file_name_convert, log_file_name_convert) if not using OMF/ASM.
RMAN duplicate from backup
-- 1) Take backup on primary (including archivelogs)
                    rman target /
                    BACKUP DATABASE PLUS ARCHIVELOG;

                    -- 2) Restore on standby host, then duplicate
                    rman target sys@PRIM auxiliary /

                    DUPLICATE TARGET DATABASE
                    FOR STANDBY
                    BACKUP LOCATION '/backup/location'
                    NOFILENAMECHECK;
Controlfile method (manual baseline)

Useful as a mental model or last resort. RMAN is recommended for correctness and speed.

-- Primary
                    ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/stby.ctl';

                    -- Copy controlfile + datafiles via backup/restore strategy
                    -- On standby: mount with standby controlfile
Post-build steps
  • Set LOG_ARCHIVE_DEST_n on both ends.
  • Create SRLs on standby.
  • Start managed recovery (MRP).
  • Verify apply/transport lag is stable.
-- Standby: start apply
                    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;

                    -- Check MRP/RFS
                    SELECT process, status, thread#, sequence# FROM v$managed_standby;
2.3 Redo transport — LOG_ARCHIVE_DEST_n patterns and gap handling
Minimal transport example
-- Primary
                        ALTER SYSTEM SET log_archive_config='DG_CONFIG=(PRIM,STBY)' SCOPE=BOTH;

                        ALTER SYSTEM SET log_archive_dest_2=
                        'service=STBY
                        ASYNC
                        valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE)
                        db_unique_name=STBY' SCOPE=BOTH;

                        ALTER SYSTEM SET log_archive_dest_state_2=ENABLE SCOPE=BOTH;
SYNC example (higher protection)
ALTER SYSTEM SET log_archive_dest_2=
                        'service=STBY
                        SYNC AFFIRM
                        net_timeout=30
                        reopen=10
                        valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE)
                        db_unique_name=STBY' SCOPE=BOTH;
Gap detection and fetch

Gaps happen with network outages or standby restarts. FAL can auto-fetch missing logs.

-- Standby
                        ALTER SYSTEM SET fal_server='PRIM' SCOPE=BOTH;
                        ALTER SYSTEM SET fal_client='STBY' SCOPE=BOTH;

                        -- Inspect archive sequences
                        SELECT thread#, sequence#, applied
                        FROM v$archived_log
                        ORDER BY thread#, sequence#;
Transport status
SELECT dest_id, status, error, destination, target, transmit_mode
                        FROM v$archive_dest_status
                        WHERE dest_id IN (1,2);
If you see transport errors: validate TNS, listener, firewall, and that DB_UNIQUE_NAME matches DG_CONFIG.
3.1 Redo apply — MRP, real-time apply, lag, delay, recovery states
Start / stop apply
-- Start (real-time apply)
                    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;

                    -- Stop
                    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
Verify MRP
SELECT process, status, client_process, thread#, sequence#
                    FROM v$managed_standby
                    ORDER BY process;
Lag metrics (official)
SELECT name, value, unit, time_computed, datum_time
                    FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag','apply finish time');
Monitor both: transport lag (shipping) and apply lag (recovery). Apply lag can be caused by IO/CPU pressure on standby.
Real-time apply requirements
  • Standby redo logs present
  • Transport configured correctly
  • MRP started with USING CURRENT LOGFILE
SELECT group#, thread#, sequence#, bytes/1024/1024 AS size_mb, status
                    FROM v$standby_log
                    ORDER BY thread#, group#;
Common symptoms and first checks
SymptomLikely causeFirst checkAction
Apply lag growsStandby IO/CPU bottleneckv$managed_standby, OS metricsScale standby, tune storage, review parallelism
Transport errorsNetwork/TNS/listenerv$archive_dest_status.errorFix connectivity, validate service names
Gap detectedMissing archivelogsv$archived_log sequencesFAL, manual copy/register logs
3.2 Data Guard Broker — DGMGRL, config, validation, transitions
Enable broker (both sides)
ALTER SYSTEM SET dg_broker_start=TRUE SCOPE=BOTH;
DGMGRL connect
dgmgrl sys@PRIM
                    dgmgrl sys@STBY
Create broker configuration
DGMGRL> CREATE CONFIGURATION dg_cfg AS
                    PRIMARY DATABASE IS PRIM
                    CONNECT IDENTIFIER IS PRIM;

                    DGMGRL> ADD DATABASE STBY AS
                    CONNECT IDENTIFIER IS STBY
                    MAINTAINED AS PHYSICAL;

                    DGMGRL> ENABLE CONFIGURATION;
Validate and inspect
DGMGRL> SHOW CONFIGURATION;
                    DGMGRL> SHOW DATABASE PRIM;
                    DGMGRL> SHOW DATABASE STBY;

                    DGMGRL> VALIDATE DATABASE PRIM;
                    DGMGRL> VALIDATE DATABASE STBY;
Validation catches common misconfigurations (SRLs, connectivity, protection mismatch, lag).
Operational commands
-- Switchover (planned)
                    DGMGRL> SWITCHOVER TO STBY;

                    -- Failover (unplanned)
                    DGMGRL> FAILOVER TO STBY;

                    -- Reinstate old primary (after failover)
                    DGMGRL> REINSTATE DATABASE PRIM;

                    -- Start/stop apply via broker
                    DGMGRL> EDIT DATABASE STBY SET STATE='APPLY-ON';
                    DGMGRL> EDIT DATABASE STBY SET STATE='APPLY-OFF';
3.3 Switchover / Failover — runbooks, gates, app cutover
Switchover (planned) — minimal safe sequence
  1. Confirm standby is synchronized (transport/apply lag near 0).
  2. Confirm no unresolved gaps.
  3. Freeze risky ops (DDL windows), coordinate with app teams.
  4. Execute switchover via broker (preferred).
  5. Validate services, open mode, and client connectivity.
-- Gate checks
                    SELECT name, value FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');

                    -- Broker switchover
                    DGMGRL> SWITCHOVER TO STBY;
                    DGMGRL> SHOW CONFIGURATION;
Failover (unplanned) — what matters
  • Decide: failover now vs attempt to recover primary.
  • If potential data loss is unacceptable, stop and escalate.
  • Prefer broker failover if available; else manual.
-- Broker failover
                    DGMGRL> FAILOVER TO STBY;

                    -- After failover:
                    DGMGRL> SHOW CONFIGURATION;
                    -- Then evaluate REINSTATE for old primary when ready.
Go/No-go gates (template)
GateCheckGo criteriaNo-go action
Lagv$dataguard_statsapply lag small and stableInvestigate standby perf, network
Gapsv$archived_logNo missing sequencesResolve gap (FAL/manual)
Protectionv$databaseExpected protection levelAlign transport mode / broker props
BusinessOps commsApproved windowReschedule
Application cutover (practical)
  • Prefer service-based routing (DB services) rather than hard-coded hostnames.
  • Use DNS TTL discipline or VIP/SCAN abstractions.
  • Plan connection draining and retry logic on app side.
-- Example: verify role and open mode
                    SELECT database_role, open_mode FROM v$database;

                    -- Example: services (if used)
                    SELECT name, network_name, pdb FROM dba_services;
4.1 FSFO — Fast-Start Failover (observer, thresholds, safety)
What FSFO does
  • Observer watches primary health via broker.
  • If conditions match, broker triggers automatic failover.
  • Goal: reduce RTO dramatically for hard failures.
Key concepts
ItemMeaning
ObserverExternal process, should run on a third host
ThresholdHow long primary must be unreachable before failover
ProtectionFSFO can be configured with failover conditions aligned to RPO
Minimal FSFO commands
DGMGRL> ENABLE FAST_START FAILOVER;
                        DGMGRL> SHOW FAST_START FAILOVER;

                        -- Start observer (run on third host)
                        DGMGRL> START OBSERVER;
Safety note: FSFO requires deep governance. Test it during drills and make sure split-brain prevention is sound (network + quorum design).
4.2 Monitoring — SQL views, broker status, alerting signals
Core metrics (copy/paste)
-- Role and mode
                SELECT database_role, open_mode, protection_mode, protection_level
                FROM v$database;

                -- Lag metrics
                SELECT name, value, unit, time_computed
                FROM v$dataguard_stats
                WHERE name IN ('transport lag','apply lag','apply finish time');

                -- Transport status
                SELECT dest_id, status, error, destination, target, transmit_mode
                FROM v$archive_dest_status
                WHERE status <> 'INACTIVE'
                ORDER BY dest_id;

                -- Apply processes
                SELECT process, status, client_process, thread#, sequence#
                FROM v$managed_standby
                ORDER BY process;

                -- Applied archive logs
                SELECT thread#, sequence#, first_time, next_time, applied
                FROM v$archived_log
                ORDER BY thread#, sequence#;
What to alert on
SignalThreshold ideaInterpretationAction
Transport lagp95 > X secondsNetwork/shipping issueCheck v$archive_dest_status, network loss, listener
Apply laggrowing trendStandby apply cannot keep upInvestigate IO/CPU, SRLs, parallelism
Broker warningany warningConfig drift or gapSHOW CONFIGURATION / VALIDATE DATABASE
4.3 Tuning — transport/apply performance and stability
Transport levers
  • SYNC/ASYNC: choose based on latency and RPO
  • NET_TIMEOUT: avoid long commit stalls in SYNC
  • REOPEN: retry interval on errors
  • COMPRESSION: reduce bandwidth (version/licensing dependent)
-- Example of stability-oriented SYNC settings
                        ALTER SYSTEM SET log_archive_dest_2=
                        'service=STBY SYNC AFFIRM
                        net_timeout=30 reopen=10
                        valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE)
                        db_unique_name=STBY' SCOPE=BOTH;
Apply levers
  • Standby IO: SRLs + fast redo write path
  • CPU headroom for MRP
  • Avoid noisy neighbors on standby (dedicate resources)
  • Keep storage latency predictable
Practical approach: measure redo rate (MB/s) on primary and ensure standby sustain rate is higher, otherwise apply lag will grow under load.
Redo rate quick check (approx)
-- Approx redo generation rate by time window
                SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24:MI') AS minute,
                ROUND(SUM(blocks*block_size)/1024/1024) AS mb
                FROM v$archived_log
                WHERE first_time > SYSDATE - (1/24)
                GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24:MI')
                ORDER BY minute;
5.1 Security — transport hardening, credentials, ports, least privilege
Hardening checklist
  • Restrict listener to expected sources (firewall + valid node checking if applicable).
  • Encrypt traffic where possible (TLS/TCPS strategy).
  • Use strong password file practices, rotate credentials.
  • Monitor auth failures and unexpected network peers.
  • Ensure backups are protected (immutability, access controls).
Security posture: DR is not only availability; it is also a privileged path into your database. Treat it like production-grade connectivity.
5.2 DR tests — drills, game days, evidence, and continuous readiness
Drill plan (template)
DrillFrequencyGoalEvidence
Switchover rehearsalQuarterlyProve planned cutover worksRTO, checklist, logs
Failover simulation2x/yearProve disaster pathDecision record, RPO estimate
Gap injectionQuarterlyValidate gap resolutionMTTR on gap
Observer fail test (FSFO)2x/yearValidate automation safetyObserver logs
Always record: timeline, what failed, what was slow, what was confusing. Then update runbook and automate the painful steps.
5.3 Troubleshooting — gap, apply stuck, transport errors, broker warnings
First 5 minutes (triage)
-- Role/mode
                    SELECT database_role, open_mode FROM v$database;

                    -- Lag
                    SELECT name, value, unit FROM v$dataguard_stats
                    WHERE name IN ('transport lag','apply lag');

                    -- Transport errors
                    SELECT dest_id, status, error, destination
                    FROM v$archive_dest_status
                    WHERE status <> 'INACTIVE'
                    ORDER BY dest_id;

                    -- Apply status
                    SELECT process, status, thread#, sequence#
                    FROM v$managed_standby
                    ORDER BY process;
Decide quickly: is it a transport problem (network) or an apply problem (standby resources)?
Gap playbook (manual path)
  1. Identify missing sequences on standby.
  2. Fetch missing archivelogs from primary (copy or FAL).
  3. Register logs if needed.
  4. Restart MRP.
-- Find missing sequences (example approach)
                    SELECT thread#, MIN(sequence#) AS min_seq, MAX(sequence#) AS max_seq
                    FROM v$archived_log
                    GROUP BY thread#
                    ORDER BY thread#;

                    -- Register log on standby if manually copied
                    ALTER DATABASE REGISTER LOGFILE '/path/to/archivelog.arc';

                    -- Restart apply
                    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
                    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
Apply stuck (common causes)
CauseSignalCheckFix
Standby IO bottleneckApply lag growsOS storage latencyImprove storage, isolate standby workload
SRL issuesNo real-time applyv$standby_log statusCreate/resize SRLs
Archive gapMRP waits for logv$archived_log appliedResolve gap
Broker warnings (pattern)
DGMGRL> SHOW CONFIGURATION;
                    DGMGRL> SHOW DATABASE STBY;
                    DGMGRL> VALIDATE DATABASE STBY;
Broker warnings often correlate with SRL mis-sizing, apply state mismatch, or transport connectivity.
6.1 Cheat-sheet — SQL + DGMGRL quick checks and decision tree
SQL quick checks
-- Role
                        SELECT database_role, open_mode FROM v$database;

                        -- Protection
                        SELECT protection_mode, protection_level FROM v$database;

                        -- Lag
                        SELECT name, value, unit FROM v$dataguard_stats
                        WHERE name IN ('transport lag','apply lag');

                        -- Transport
                        SELECT dest_id, status, error, transmit_mode
                        FROM v$archive_dest_status
                        ORDER BY dest_id;

                        -- Apply
                        SELECT process, status, thread#, sequence#
                        FROM v$managed_standby
                        ORDER BY process;
DGMGRL essentials
dgmgrl sys@PRIM

                        SHOW CONFIGURATION;
                        SHOW DATABASE PRIM;
                        SHOW DATABASE STBY;

                        VALIDATE DATABASE PRIM;
                        VALIDATE DATABASE STBY;

                        SWITCHOVER TO STBY;
                        FAILOVER TO STBY;

                        REINSTATE DATABASE PRIM;

                        EDIT DATABASE STBY SET STATE='APPLY-ON';
                        EDIT DATABASE STBY SET STATE='APPLY-OFF';
Decision tree (fast)
If transport lag is high:
                Check v$archive_dest_status.error
                Validate TNS/listener/firewall
                Check packet loss / latency

                If apply lag is high:
                Check standby IO/CPU
                Validate SRLs and real-time apply
                Look for gaps and MRP status

                If broker shows warnings:
                Run VALIDATE DATABASE
                Fix SRL/config drift
                Re-check lag and states
6.2 Broker Wizard — generate baseline DGMGRL script
Inputs
This wizard outputs a safe baseline. You still must align protection mode and LOG_ARCHIVE_DEST parameters.
Generated script

Copy/paste into a terminal (DGMGRL). Adjust as needed.

Click "Generate" to produce the script.
6.3 Runbook complet — prechecks, execution, postchecks, rollback
Switchover runbook (template)
Pre-checks
  1. Confirm lag near zero (transport + apply)
  2. Confirm no gaps
  3. Confirm broker config healthy (if used)
  4. Confirm maintenance window approved and comms sent
  5. Prepare rollback plan (timeboxed)
Execution
  1. Drain app connections (or switch services)
  2. Broker switchover
  3. Validate new primary open mode and services
  4. Resume app traffic
Post-checks
-- Validate role and protection
                SELECT database_role, open_mode, protection_level FROM v$database;

                -- Validate lag
                SELECT name, value, unit FROM v$dataguard_stats
                WHERE name IN ('transport lag','apply lag');

                -- Broker health
                DGMGRL> SHOW CONFIGURATION;
Rollback concept
Rollback for a planned switchover is typically "switch back" if issues are detected within a short window. Make rollback criteria explicit (KPIs, error budget, max downtime).