10 July, 2021

Failover to a Standby with Data Loss

 One scenario I consider when testing a Standby Database -- actually testing a Disaster Recovery  site --  is to simulate complete loss of the Primary Database (at the Production site)

Previously, I have demonstrated two methods of "Destructive" Disaster Recovery site testing :

1. Being able to Switchover to the Standby (at a Disaster Recovery site), verifying that transactions created at the Disaster Recovery site do actually flow back to the Production site and, finally, reverting both databases to their pre-test image)

2. Opening a Standby Database as a Snapshot Standby, verify that it is writable and, finally, reverting it to being a "normal" Standby


But what if you want to simulate complete loss of the Production site (Primary database) -- signifying that you may have even lost some transactions, but needing to open the Standby for Read-Write operations ?  OR you actually encounter such a situation where the Primary goes down when the Standby is lagging ? 

On my Primary I had these configured for MAXIMUM AVAILABILITY (with SYNC and AFFIRM for log_archive_dest_2) :

SQL>alter system set log_archive_dest_2='SERVICE=STDBYDB SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=STDBYDB';

System altered.

SQL>
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 1207955552 bytes
Fixed Size                  9134176 bytes
Variable Size             822083584 bytes
Database Buffers          369098752 bytes
Redo Buffers                7639040 bytes
Database mounted.
SQL>
SQL> alter database SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY ;

Database altered.

SQL> shutdown
ORA-01109: database not open


Database dismounted.
ORACLE instance shut down.
SQL> startup;
ORACLE instance started.

Total System Global Area 1207955552 bytes
Fixed Size                  9134176 bytes
Variable Size             822083584 bytes
Database Buffers          369098752 bytes
Redo Buffers                7639040 bytes
Database mounted.
Database opened.
SQL>
SQL> select protection_mode, open_mode, database_role from v$database;

PROTECTION_MODE      OPEN_MODE            DATABASE_ROLE
-------------------- -------------------- ----------------
MAXIMUM AVAILABILITY READ WRITE           PRIMARY

SQL>


This is information of the Primary Database (at the Production site)  when Primary and Standby are in sync :

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /opt/oracle/archivelog/ORCLCDB
Oldest online log sequence     235
Next log sequence to archive   237
Current log sequence           237
SQL>


And this is at the Standby Database (at the Disaster Recovery site) :
 
 rfs (PID:4415): Changing standby controlfile to MAXIMUM AVAILABILITY mode
 rfs (PID:4415): Changing standby controlfile to RESYNCHRONIZATION level
 rfs (PID:4415): Standby controlfile consistent with primary
 
Recovery of Online Redo Log: Thread 1 Group 5 Seq 237 Reading mem 0
  Mem# 0: /opt/oracle/oradata/STDBYDB/stdbredo02.log



My Standby database / site goes down for a short while.  
So the Primary reports :

LGWR (PID:4280): Failed to archive LNO:1 T-1.S-238, error=3113
LGWR (PID:4280): Error 1041 disconnecting from LAD:2 standby host 'STDBYDB'
2021-07-10T17:58:05.264661+08:00
Thread 1 advanced to log sequence 239 (LGWR switch)
  Current log# 2 seq# 239 mem# 0: /opt/oracle/oradata/ORCLCDB/redo02.log
2021-07-10T17:58:05.390312+08:00
ARC1 (PID:4345): Archived Log entry 524 added for T-1.S-238 ID 0xa7521ccd LAD:1


and sometime later

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /opt/oracle/archivelog/ORCLCDB
Oldest online log sequence     242
Next log sequence to archive   244
Current log sequence           244
SQL> select current_scn from v$database;

CURRENT_SCN
-----------
   12768459

SQL>


We now know that the Standby (last applying Online Redo Sequence#237 [it went on to 238 before the Instance/Server crashed]) is lagging behind the Primary (current Redo Log 244, SCN 12768459 .

Reminder : The configuration is in MAXIMUM AVAILABILITY mode (not MAXIMUM PROTECTION).

Now, my Primary goes down.

So the FAILOVER must be issued from the Standby when I bring it up.  Let my try that now

SQL> startup mount;
ORACLE instance started.

Total System Global Area 1207955552 bytes
Fixed Size                  9134176 bytes
Variable Size             436207616 bytes
Database Buffers          754974720 bytes
Redo Buffers                7639040 bytes
Database mounted.
SQL> create restore point before_failover guarantee flashback database;

Restore point created.

SQL>
SQL> select protection_mode, open_mode, database_role from v$database;

PROTECTION_MODE      OPEN_MODE            DATABASE_ROLE
-------------------- -------------------- ----------------
MAXIMUM AVAILABILITY MOUNTED              PHYSICAL STANDBY

SQL> show parameter db_unique_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_unique_name                       string      STDBYDB
SQL>
SQL> alter database failover to STDBYDB;

Database altered.

SQL>
SQL> select protection_mode, open_mode, database_role from v$database;

PROTECTION_MODE      OPEN_MODE            DATABASE_ROLE
-------------------- -------------------- ----------------
MAXIMUM PERFORMANCE  MOUNTED              PRIMARY

SQL> alter database open;

Database altered.

SQL>


Let's see what the alert log on the Standby shows

ARC1 (PID:6205): FAL: Error 12541 connecting to ORCLCDB for fetching gap sequence

2021-07-10T18:04:34.507622+08:00
RVWR started with pid=45, OS id=11261
2021-07-10T18:04:38.704492+08:00
Allocating 8388608 bytes in shared pool for flashback generation buffer.
Allocated 5430064 bytes in shared pool for flashback generation buffer
2021-07-10T18:04:39.802983+08:00
Created guaranteed restore point BEFORE_FAILOVER

2021-07-10T18:05:50.402988+08:00
alter database failover to STDBYDB
2021-07-10T18:05:50.403086+08:00
.... (PID:6240): The Time Management Interface (TMI) is being enabled for role transition
.... (PID:6240): information.  This will result in messages beingoutput to the alert log
.... (PID:6240): file with the prefix 'TMI: '.  This is being enabled to make the timing of
.... (PID:6240): the various stages of the role transition available for diagnostic purposes.
.... (PID:6240): This output will end when the role transition is complete.
TMI: dbsdrv failover to target BEGIN 2021-07-10 18:05:50.403742
Terminal Recovery requested in process 6240
TMI: adbdrv termRecovery BEGIN 2021-07-10 18:05:50.408509
2021-07-10T18:05:50.494984+08:00
Attempt to do a Terminal Recovery (STDBYDB)
TMI: adbdrv termRecovery END 2021-07-10 18:05:50.495053
2021-07-10T18:05:50.495642+08:00
Media Recovery Start: Managed Standby Recovery (STDBYDB)
 Started logmerger process
2021-07-10T18:05:50.532463+08:00
PR00 (PID:11393): Managed Standby Recovery not using Real Time Apply
max_pdb is 5
2021-07-10T18:05:51.094491+08:00
Parallel Media Recovery started with 2 slaves
2021-07-10T18:05:51.233396+08:00
stopping change tracking
2021-07-10T18:05:51.583393+08:00
PR00 (PID:11393): Begin: SRL archival
PR00 (PID:11393): End: SRL archival
PR00 (PID:11393): Terminal Recovery timestamp is '07/10/2021 18:05:51'
PR00 (PID:11393): Terminal Recovery: applying standby redo logs.
PR00 (PID:11393): Terminal Recovery: thread 1 seq# 238 redo required
2021-07-10T18:05:51.682050+08:00
PR00 (PID:11393): Terminal Recovery:
2021-07-10T18:05:51.685886+08:00
Recovery of Online Redo Log: Thread 1 Group 4 Seq 238 Reading mem 0
  Mem# 0: /opt/oracle/oradata/STDBYDB/stdbredo01.log
Terminal Recovery finished with No-Data-Loss
2021-07-10T18:05:51.985957+08:00
Incomplete Recovery applied until change 12767732 time 07/10/2021 17:57:59
2021-07-10T18:05:52.074427+08:00
Media Recovery Complete (STDBYDB)
2021-07-10T18:05:52.655358+08:00
Terminal Recovery: successful completion
PR00 (PID:11393): Forcing ARSCN to IRSCN for TR SCN:0x0000000000c2d1f4
PR00 (PID:11393): Attempt to set limbo arscn SCN:0x0000000000c2d1f4 irscn SCN:0x0000000000c2d1f4
PR00 (PID:11393): Resetting standby activation ID 2807176397 (0xa7521ccd)
stopping change tracking
2021-07-10T18:05:52.873082+08:00
ALTER DATABASE SWITCHOVER TO PRIMARY (STDBYDB)
2021-07-10T18:05:52.873684+08:00
Maximum wait for role transition is 15 minutes.
TMI: kcv_commit_to_so_to_primary wait for MRP to finish BEGIN 2021-07-10 18:05:52.873693
TMI: kcv_commit_to_so_to_primary wait for MRP to finish END 2021-07-10 18:05:52.873771
TMI: kcv_commit_to_so_to_primary Switchover from physical BEGIN 2021-07-10 18:05:52.874148
Backup controlfile written to trace file /opt/oracle/diag/rdbms/stdbydb/STDBYDB/trace/STDBYDB_ora_6240.trc
2021-07-10T18:05:53.571015+08:00
Standby terminal recovery start SCN: 12767670
RESETLOGS after complete recovery through change 12767732
NET  (PID:6240): ORL pre-clearing operation disabled by switchover
Online log /opt/oracle/oradata/STDBYDB/redo01.log: Thread 1 Group 1 was previously cleared
Online log /opt/oracle/oradata/STDBYDB/redo02.log: Thread 1 Group 2 was previously cleared
Online log /opt/oracle/oradata/STDBYDB/redo03.log: Thread 1 Group 3 was previously cleared
2021-07-10T18:05:53.907884+08:00
Standby became primary SCN: 12767669
2021-07-10T18:05:53.908157+08:00
Setting recovery target incarnation to 4
2021-07-10T18:05:53.966494+08:00
NET  (PID:6240): RT: Role transition work is not done
NET  (PID:6240): The Time Management Interface (TMI) is being enabled for role transition
NET  (PID:6240): information.  This will result in messages beingoutput to the alert log
NET  (PID:6240): file with the prefix 'TMI: '.  This is being enabled to make the timing of
NET  (PID:6240): the various stages of the role transition available for diagnostic purposes.
NET  (PID:6240): This output will end when the role transition is complete.
NET  (PID:6240): Redo network throttle feature is disabled at mount time
2021-07-10T18:05:54.087343+08:00
NET  (PID:6240): Database role cleared from PHYSICAL STANDBY [kcvs.c:1030]
Switchover: Complete - Database mounted as primary
TMI: kcv_commit_to_so_to_primary Switchover from physical END 2021-07-10 18:05:54.088069
TMI: dbsdrv failover to target END 2021-07-10 18:05:54.088135
Failover completed with No-Data-Loss.
Completed: alter database failover to STDBYDB
2021-07-10T18:06:33.754882+08:00
alter database open
2021-07-10T18:06:33.755022+08:00
TMI: adbdrv open database BEGIN 2021-07-10 18:06:33.754898
Ping without log force is disabled:
  instance mounted in exclusive mode.
2021-07-10T18:06:33.945688+08:00
Crash Recovery excluding pdb 2 which was cleanly closed.
2021-07-10T18:06:33.945807+08:00
Crash Recovery excluding pdb 4 which was cleanly closed.
Endian type of dictionary set to little
2021-07-10T18:06:33.999216+08:00
Assigning activation ID 2848626420 (0xa9ca96f4)
2021-07-10T18:06:34.009104+08:00
LGWR (PID:6045): LAD:2 is UNSYNCHRONIZED
2021-07-10T18:06:34.503600+08:00
Thread 1 advanced to log sequence 2 (thread open)
2021-07-10T18:06:34.503635+08:00
ARC3 (PID:6209): Becoming the 'no SRL' ARCH
2021-07-10T18:06:34.503832+08:00
Redo log for group 2, sequence 2 is not located on DAX storage
Thread 1 opened at log sequence 2
  Current log# 2 seq# 2 mem# 0: /opt/oracle/oradata/STDBYDB/redo02.log
Successful open of redo thread 1
2021-07-10T18:06:34.521347+08:00
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
stopping change tracking
2021-07-10T18:06:34.662361+08:00
TT03 (PID:11461): Sleep 5 seconds and then try to clear SRLs in 2 time(s)
2021-07-10T18:06:34.883461+08:00
ARC0 (PID:6199): Archived Log entry 259 added for T-1.S-1 ID 0xa9ca96f4 LAD:1
2021-07-10T18:06:35.784688+08:00
Undo initialization recovery: err:0 start: 1407469 end: 1407660 diff: 191 ms (0.2 seconds)
[6240] Successfully onlined Undo Tablespace 2.
Undo initialization online undo segments: err:0 start: 1407660 end: 1408339 diff: 679 ms (0.7 seconds)
Undo initialization finished serial:0 start:1407469 end:1408406 diff:937 ms (0.9 seconds)
Dictionary check beginning
2021-07-10T18:06:36.927538+08:00
Dictionary check complete
Verifying minimum file header compatibility for tablespace encryption for pdb 1..
Verifying file header compatibility for tablespace encryption completed for pdb 1
Database Characterset is AL32UTF8
No Resource Manager plan active
2021-07-10T18:06:39.177858+08:00


TT00 (PID:6201): Error 12541 received logging on to the standby
TT00 (PID:6201): Check whether the listener is up and running.
2021-07-10T18:06:39.185611+08:00
TT03 (PID:11461): Sleep 5 seconds and then try to clear SRLs in 3 time(s)
2021-07-10T18:06:39.733886+08:00
joxcsys_required_dirobj_exists: directory object exists with required path /opt/oracle/product/19c/dbhome_1/javavm/admin/, pid 6240 cid 1
replication_dependency_tracking turned off (no async multimaster replication found)
2021-07-10T18:06:42.855357+08:00
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Starting background process AQPC
2021-07-10T18:06:43.050601+08:00
AQPC started with pid=50, OS id=11483
PDB$SEED(2):Autotune of undo retention is turned on.
2021-07-10T18:06:44.060956+08:00
PDB$SEED(2):Endian type of dictionary set to little
2021-07-10T18:06:44.314613+08:00
TT03 (PID:11461): Enabling archival of deferred physical standby SRLs
2021-07-10T18:06:44.532641+08:00
TT03 (PID:11461): Archived Log entry 260 added for T-1.S-238 ID 0xa7521ccd LAD:1
2021-07-10T18:06:44.681960+08:00
PDB$SEED(2):Undo initialization finished serial:0 start:1416557 end:1416557 diff:0 ms (0.0 seconds)
PDB$SEED(2):Pluggable database PDB$SEED dictionary check beginning
2021-07-10T18:06:48.986311+08:00
PDB$SEED(2):Pluggable Database PDB$SEED Dictionary check complete
PDB$SEED(2):Database Characterset for PDB$SEED is AL32UTF8
2021-07-10T18:07:04.127503+08:00
PDB$SEED(2):Opening pdb with no Resource Manager plan active
2021-07-10T18:07:05.037256+08:00
QPI: opatch file present, opatch
QPI: qopiprep.bat file present
2021-07-10T18:07:06.501677+08:00
Starting background process CJQ0
2021-07-10T18:07:06.511870+08:00
CJQ0 started with pid=47, OS id=11538
Completed: alter database open


Note the messages 

 PR00 (PID:11393): Terminal Recovery:
2021-07-10T18:05:51.685886+08:00
Recovery of Online Redo Log: Thread 1 Group 4 Seq 238 Reading mem 0
  Mem# 0: /opt/oracle/oradata/STDBYDB/stdbredo01.log
Terminal Recovery finished with No-Data-Loss
2021-07-10T18:05:51.985957+08:00
Incomplete Recovery applied until change 12767732 time 07/10/2021 17:57:59
2021-07-10T18:05:52.074427+08:00
Media Recovery Complete (STDBYDB)
2021-07-10T18:05:52.655358+08:00
Terminal Recovery: successful completion

Standby terminal recovery start SCN: 12767670

Standby became primary SCN: 12767669

Failover completed with No-Data-Loss.
Completed: alter database failover to STDBYDB


The Standby completed Recovery to Sequence#238 and SCN 12767669 which are much lower then the last known ArchiveLog (Sequence#243) and SCN (something higher than 12768459).

I had expected to see an ORA-16472 error requiring the usage of the "ALTER DATABASE ACTIVATE STANDBY DATABASE" command.  Unfortunately, in this case, the Standby assumes that there are no transactions from the Primary and assumes that there is no Data Loss !

So, the next time you do a Failover, you have to be careful to identify whether there is actual Data Loss if you are unable to access the Primary database.

Now, at the Standby I can query

SQL> select scn_to_timestamp(12767669) from dual;

SCN_TO_TIMESTAMP(12767669)
---------------------------------------------------------------------------
10-JUL-21 05.48.25.000000000 PM

SQL>


which certainly tells me that the last transaction at the Standby is as of  05:48pm (approximately,  the SCN_TO_TIMESTAMP provides us an approximate time only), even though I startup the Standby at 06:04pm.  
Fortunately, when I can access the Primary and read the alert.log, I find that the Primary went down after 06:02pm, with the last ArchiveLog generated at 6:00pm


2021-07-10T18:00:12.509208+08:00
Thread 1 advanced to log sequence 244 (LGWR switch)
  Current log# 1 seq# 244 mem# 0: /opt/oracle/oradata/ORCLCDB/redo01.log
2021-07-10T18:00:12.634091+08:00
ARC2 (PID:4347): Archived Log entry 529 added for T-1.S-243 ID 0xa7521ccd LAD:1


So I can confirm that there IS Data Loss !

FYI , ORA-16472 is :


oracle19c>oerr ora 16472
16472, 0000, "failover failed due to data loss"
// *Cause:  The database was in MaxAvailability or MaxProtection mode but
//          terminal recovery finished with data loss.
// *Action: For a data loss failover, issue ALTER DATABASE ACTIVATE STANDBY
//          DATABASE command.
oracle19c>


This signifies that a FAILOVER command has not succeeded and an ALTER DATABASE ACTIVATE STANDBY DATABASE must be issued.

No comments: