Oracle GoldenGate Coordinated Replicat unsynchronized mode troubleshooting

According to the Oracle GoldenGate Coordinated Replicat documentation if the Replicat is stopped in an unclean manner the threads may be unsynchronized. Different threads may reach different checkpoint positions.

Let’s find out what what might be the consequences of this situation.

The ggsci STOP REPLICAT commands causes all Coordinated Replicat threads to synchronize all threads before actually stopping them. Finally all threads have equal checkpoints. The same operation (equal checkpoints) can be achieved through the SYNCHRONIZE REPLICAT command.

According to the documentation any PARAM file modification should be done in a synchronized state.

Let’s analyze the following scenarios:

  1. Stop the Replicat unsynchronized (kill -9), modify the PARAM file, start the Replicat,
  2. Disaster recovery scenario: the target database is recovered to a point in time, the replication has to be resumed from that point (assuming that we do have all necessary trail files).

In the second scenario the database is loaded to some point in time. At any randomly chosen point in time the Replicat checkpoints would be unsynchronized.

1. Environment

All scenarios described in this post are run on x86_64 Linux machine with:

  1. OGG 12.3.0.1.2 + patch 171208
  2. Database 12.2.0.1 + RU 12.2.0.1.180417

I am not running the test multiple times (with many different versions) since the results should rather similar. This is just plain basic OGG functionality and I do not expect any difference in older versions.

For the purpose of the test I am using the following replication:

Source tables are:

CREATE TABLE adam.TAB1(
  PK1 numeric not null
);
ALTER TABLE adam.TAB1 ADD CONSTRAINT TAB1_PK PRIMARY KEY(PK1);

CREATE TABLE adam.TAB2(
  PK2 numeric not null,
  FK1 numeric not null
);
ALTER TABLE adam.TAB2 ADD CONSTRAINT TAB2_PK PRIMARY KEY(PK2);

Target tables are:

CREATE TABLE anna.TAB1(
  PK1 numeric not null
);
ALTER TABLE anna.TAB1 ADD CONSTRAINT TAB1_PK PRIMARY KEY(PK1);
GRANT SELECT, INSERT, UPDATE, DELETE ON anna.TAB1 TO ggadmin;

CREATE TABLE anna.TAB2(
  PK2 numeric not null,
  FK1 numeric not null
);
ALTER TABLE anna.TAB2 ADD CONSTRAINT TAB2_PK PRIMARY KEY(PK2);
GRANT SELECT, INSERT, UPDATE, DELETE ON anna.TAB2 TO ggadmin;

Extract param significant options:

TABLE adam.TAB1;
TABLE adam.TAB2;

Replicat param significant options:

GROUPTRANSOPS 1
MAP adam.TAB1, TARGET anna.TAB1, COLMAP(PK1 = PK1), THREADRANGE(1-10, PK1);
MAP adam.TAB2, TARGET anna.TAB2, COLMAP(PK2 = PK2, FK1 = FK1), THREADRANGE(1-10, PK2);

The GROUPTRANSOPS is not necessary for the replication but helps to make the test quicker. It should not influence the nature of Coordinated Replicat.

2. PARAM change scenario

Let’s imagine a scenario, where the Replicat would stop in an unclean fashion (like kill -9) and then before starting the Replicat process again, the PARAM file would be modified. According to the documentation the Replicat should be synchronized before modifying the PARAM file. But this kind scenarios can happen. Let’s find out what are the consequences.

In our example someone would modify the PARAM file, and then there was a reboot and after restarting the machine the Replicat would continue its work using new values in PARAM file. To simulate this scenario I am going to generate a lot of transactions.

First let’s stop the Replicat process:

stop replicat repl

And on the source database let’s create a lot of load:

BEGIN 
  FOR x IN 1..100000 LOOP
    insert into adam.TAB1 select x * 10 + 0, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 0, x * 10 + 1, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 1, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 1, x * 10 + 2, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 2, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 2, x * 10 + 3, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 3, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 3, x * 10 + 4, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 4, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 4, x * 10 + 5, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 5, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 5, x * 10 + 6, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 6, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 6, x * 10 + 7, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 7, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 7, x * 10 + 8, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 8, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 8, x * 10 + 9, current_scn from v$database;
    insert into adam.TAB1 select x * 10 + 9, current_scn from v$database;
    insert into adam.TAB2 select x * 10 + 9, x * 10 + 0, current_scn from v$database;
    commit;
  END LOOP;
END;
/

The Replicat process is configured to use 10 threads. Every transaction (at the source database) which makes 20 INSERT operations would be divided (at the target database) into about 10 transactions. Of course since the Coordinated Replicat uses some hashing function to choose the Replicat processes we are not sure if 10 consecutive values would divide our load exactly to 10 distinct processes, but this is rough assumption which can be used.

Both insert operations (to adam.TAB1 and adam.TAB2 tables with the same PK1 or PK2 column should be handled by the same thread. As a results those operations should be always processed together.

On the target database I have configured flashback database for the ease of testing.

Let’s start the Replicat process:

start replicat repl

And let’s look how the Replicat is working:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
     708484

SQL> select count(*) from anna.TAB1;

COUNT(*)
----------
     48834

select count(*) from anna.TAB2;

COUNT(*)
----------
     48838

SQL> select GROUP_NAME, LOG_CMPLT_CSN from ggadmin.CKPTAB order by GROUP_NAME;

GROUP_NAME               LOG_CMPLT_CSN
------------------------ ------------------------------
REPL
REPL001                  740762
REPL002                  742029
REPL003                  741960
REPL004                  740463
REPL005                  740900
REPL006                  740555
REPL007                  741638
REPL008                  744011
REPL009                  742629
REPL010                  740969

11 rows selected.

The thread commit checkpoint SCN’s are uneven but that is exactly what we have expected here.

Now, lets kill the Replicat processes:

killall -9 replicat

And let’s check again what exactly has been replicated:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
     737290

SQL> select count(*) from anna.TAB1;

COUNT(*)
----------
     67208

SQL> select count(*) from anna.TAB2;

COUNT(*)
----------
     67208

SQL> select GROUP_NAME, LOG_CMPLT_CSN from ggadmin.CKPTAB order by GROUP_NAME;

GROUP_NAME               LOG_CMPLT_CSN
------------------------ ------------------------------
REPL
REPL001                  782382
REPL002                  783810
REPL003                  785100
REPL004                  784134
REPL005                  783879
REPL006                  782612
REPL007                  784065
REPL008                  785907
REPL009                  783994
REPL010                  781876

11 rows selected.

Now we can modify the PARAM file to:

MAP adam.TAB1, TARGET anna.TAB1, COLMAP(PK1 = PK1, SRC_SCN = @GETENV('TRANSACTION','CSN')), THREADRANGE(1-10, PK1);
MAP adam.TAB2, TARGET anna.TAB2, COLMAP(PK2 = PK2, FK1 = FK1, SRC_SCN = @GETENV('TRANSACTION','CSN')), THREADRANGE(1-10, FK1);

I have only modified the last parameter in the second line from PK2 to FK1.

And start the Replicat process.

As expected, it would fail with ABEND with the following log:

2018-05-11T18:26:50.226+0200 WARNING OGG-01004 Oracle GoldenGate Delivery for Oracle, repl.prm: Aborted grouped transaction on ANNA.TAB2, Database error 1 (OCI Error ORA-00001: unique constraint (ANNA.TAB2_PK) violated (status = 1), SQL <INSERT INTO "ANNA"."TAB2" ("PK2","FK1","SRC_SCN") VALUES (:a0,:a1,:a2)>).
2018-05-11T18:26:50.226+0200 WARNING OGG-01003 Oracle GoldenGate Delivery for Oracle, repl.prm: Repositioning to rba 15410875 in seqno 0.
2018-05-11T18:26:50.226+0200 WARNING OGG-01154 Oracle GoldenGate Delivery for Oracle, repl.prm: SQL error 1 mapping ADAM.TAB2 to ANNA.TAB2 OCI Error ORA-00001: unique constraint (ANNA.TAB2_PK) violated (status = 1), SQL <INSERT INTO "ANNA"."TAB2" ("PK2","FK1","SRC_SCN") VALUES (:a0,:a1,:a2)>.
2018-05-11T18:26:50.226+0200 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, repl.prm: Error mapping from ADAM.TAB2 to ANNA.TAB2.
2018-05-11T18:26:50.226+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.290+0200 WARNING OGG-01004 Oracle GoldenGate Delivery for Oracle, repl.prm: Aborted grouped transaction on ANNA.TAB2, Database error 1 (OCI Error ORA-00001: unique constraint (ANNA.TAB2_PK) violated (status = 1), SQL <INSERT INTO "ANNA"."TAB2" ("PK2","FK1","SRC_SCN") VALUES (:a0,:a1,:a2)>).
2018-05-11T18:26:50.290+0200 WARNING OGG-01003 Oracle GoldenGate Delivery for Oracle, repl.prm: Repositioning to rba 15413169 in seqno 0.
2018-05-11T18:26:50.290+0200 WARNING OGG-01154 Oracle GoldenGate Delivery for Oracle, repl.prm: SQL error 1 mapping ADAM.TAB2 to ANNA.TAB2 OCI Error ORA-00001: unique constraint (ANNA.TAB2_PK) violated (status = 1), SQL <INSERT INTO "ANNA"."TAB2" ("PK2","FK1","SRC_SCN") VALUES (:a0,:a1,:a2)>.
2018-05-11T18:26:50.251+0200 ERROR OGG-06004 Oracle GoldenGate Delivery for Oracle, repl.prm: Coordinated Replicat thread REPL009 exited unexpectedly. Please check the report file of REPL009 for more details.
2018-05-11T18:26:50.290+0200 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, repl.prm: Error mapping from ADAM.TAB2 to ANNA.TAB2.
2018-05-11T18:26:50.290+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.251+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.259+0200 WARNING OGG-06048 Oracle GoldenGate Delivery for Oracle, repl.prm: Error in Replicat communication: Broken pipe.
2018-05-11T18:26:50.282+0200 ERROR OGG-06051 Oracle GoldenGate Delivery for Oracle, repl.prm: Connection to Master unexpectedly terminated.
2018-05-11T18:26:50.282+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.306+0200 WARNING OGG-06048 Oracle GoldenGate Delivery for Oracle, repl.prm: Error in Replicat communication: Broken pipe.
2018-05-11T18:26:50.265+0200 WARNING OGG-06048 Oracle GoldenGate Delivery for Oracle, repl.prm: Error in Replicat communication: Broken pipe.
2018-05-11T18:26:50.308+0200 WARNING OGG-06048 Oracle GoldenGate Delivery for Oracle, repl.prm: Error in Replicat communication: Broken pipe.
2018-05-11T18:26:50.309+0200 ERROR OGG-06052 Oracle GoldenGate Delivery for Oracle, repl.prm: Error communicating with Master.
2018-05-11T18:26:50.309+0200 ERROR OGG-06052 Oracle GoldenGate Delivery for Oracle, repl.prm: Error communicating with Master.
2018-05-11T18:26:50.310+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.265+0200 ERROR OGG-06052 Oracle GoldenGate Delivery for Oracle, repl.prm: Error communicating with Master.
2018-05-11T18:26:50.259+0200 ERROR OGG-06052 Oracle GoldenGate Delivery for Oracle, repl.prm: Error communicating with Master.
2018-05-11T18:26:50.259+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.310+0200 WARNING OGG-06048 Oracle GoldenGate Delivery for Oracle, repl.prm: Error in Replicat communication: Broken pipe.
2018-05-11T18:26:50.311+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.265+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.310+0200 ERROR OGG-06052 Oracle GoldenGate Delivery for Oracle, repl.prm: Error communicating with Master.
2018-05-11T18:26:50.310+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.260+0200 WARNING OGG-06048 Oracle GoldenGate Delivery for Oracle, repl.prm: Error in Replicat communication: Broken pipe.
2018-05-11T18:26:50.312+0200 WARNING OGG-06048 Oracle GoldenGate Delivery for Oracle, repl.prm: Error in Replicat communication: Broken pipe.
2018-05-11T18:26:50.260+0200 ERROR OGG-06052 Oracle GoldenGate Delivery for Oracle, repl.prm: Error communicating with Master.
2018-05-11T18:26:50.260+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.
2018-05-11T18:26:50.312+0200 ERROR OGG-06052 Oracle GoldenGate Delivery for Oracle, repl.prm: Error communicating with Master.
2018-05-11T18:26:50.312+0200 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, repl.prm: PROCESS ABENDING.

But still the Replicat was able to make some work before abending:

SQL> select * from anna.TAB1 where ...;

PK1 
----------
 67334
 67349
 67359

SQL> select * from anna.TAB2 where ...;

PK2        FK1
---------- ----------
 67333     67334
 67348     67349
 67358     67359

Now the target database is spoiled.

The checkpoint tables are also spoiled with messy data:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
     737736

SQL> select count(*) from anna.TAB1;

COUNT(*)
----------
     67220

SQL> select count(*) from anna.TAB2;

COUNT(*)
----------
     67220

SQL> select GROUP_NAME, LOG_CMPLT_CSN from ggadmin.CKPTAB order by GROUP_NAME;

GROUP_NAME               LOG_CMPLT_CSN
------------------------ ------------------------------
REPL
REPL001                  782382
REPL002                  783856
REPL003                  785100
REPL004                  784180
REPL005                  783994
REPL006                  782612
REPL007                  784134
REPL008                  785907
REPL009                  784019
REPL010                  781876

11 rows selected.

Some transactions have been replicated using one kind of mapping, and some – using another kind. This is the worst possible scenario. It is now very hard to decode from the trail how the transactions have been divide between threads. All the work has to be done manually to undo the changes and achieve a synchronized state.

The conclusion of this test is that the Replicat process does not check if the PARAM file has been modified since the previous run. It always assumes that the PARAM file has not been modified.

A summary of this scenario, but for 2 threads can be visualized in the following picture:

Resulta of Coordinated Replicat change in param file after proces termination

So actually now the target database needs a resynchronization. So let’s move to the second scenario.

3. Disaster recovery scenario

The second scenario is quite simple. Something went wrong and the database has to be moved back in time to an earlier SCN.

There is even a MOS note 2199555.1 which describes this kind of scenario. Actually the MOS note says about recovering the checkpoint file in OGG. But … this is actually the same scenario. It is just like the database has been recovered to an earlier state. And the checkpoint table is not in sync with the checkpoint file. So now if we do “repair” the checkpoint file the Replicat should be able to resume work from the exact point in time as necessary.

The good thing is that after loading the database from backup also the checkpoint table is loaded and it should be absolutely in sync with the tables.

So, let’s move the database back to some earlier SCN. Since I have configured flashback database it should be easy:

SQL> connect / as sysdba
SQL> shutdown immediate;
SQL> startup mount;
SQL> flashback database to scn 708484;
SQL> alter database open resetlogs;

I have chosen some SCN from a time before my database got broken.

After restoring the database let’s look at the data and checkpoint information:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
     709616

SQL> select count(*) from anna.TAB1;

COUNT(*)
----------
     48818

SQL> select count(*) from anna.TAB2;

COUNT(*)
----------
     48818

SQL> select GROUP_NAME, LOG_CMPLT_CSN from ggadmin.CKPTAB order by GROUP_NAME;

GROUP_NAME               LOG_CMPLT_CSN
------------------------ ------------------------------
REPL
REPL001                  740693
REPL002                  741845
REPL003                  741845
REPL004                  740394
REPL005                  740808
REPL006                  740486
REPL007                  741546
REPL008                  743919
REPL009                  742560
REPL010                  740831

11 rows selected.

According to the MOS note I should move the Replicat SCN back in time to some SCN BEFORE LOW WATERMARK. Low watermark for Coordinated Replicat would be the minimum of LOG_CMPLT_CSN for all processes. For example I can set it to 740000. Actually any number lower than 740808 is good  as long as we have the required trail files.

start replicat repl atcsn 740000

And … that’s all. The replicat resumed work correctly, and after replicating all rows the data is in sync. Work is done.

The only thing that I have to make sure that I use the same PARAM file as I have used at the time that this particular SCN had been replicated (the SCN which I have used for flashback database).

4. Conclusions

The conclusions are quire simple:

  1. Never modify the PARAM files while the Coordinated Replicat is in an unsynchronized state – a good practice would be to stop Replicat, make sure it is synchronized (if not, run synchronize), modify the PARAM file, an then start Replicat,
  2. If you modify the PARAM files always note the SCN’s of the database at the time when the modification takes place – in case of having many versions you will have to use the same PARAM files that has been used at point in time of the loaded database. So, always keep a full history of the Coordinated Replicat PARAM files.

This concludes the tests I have planned for Coordinated Replicat. If you are interested in troubleshooting and parallel aspects of Integrated and Parallel Replicat please come to POUG 2018 to Sopot (Poland) and to see my presentation on POUG 2018 Oracle International Conference. You will find many more interesting topics which I do not mention in my blog posts. And most interesting of them will appear first on POUG, and later on (maybe) also on this blog.

 

Tags: , , ,

LEAVE A REPLY

loading
×