Tuesday

Cloning leftovers

Today while trying to run netca in a cloned environment, I was greeted with the following error:

UnsatisfiedLinkError exception loading native library: njni10
Exception in thread "main" java.lang.UnsatisfiedLinkError: get
at oracle.net.common.NetGetEnv.get(Native Method)
at oracle.net.config.Config.getNetDir(Unknown Source)

I noticed that under LD_LIBRARY_PATH ($ORACLE_HOME/lib32) a bunch of symbolic links were still pointing to the location of the SRC environment I cloned from. Included in that list was the file "ldflags”.

The solution was to fix all of these broken symbolic links and relink. Relinking tho I hit errors:

ld: fatal: library -ljava: not found
ld: fatal: library -ljvm: not found
ld: fatal: File processing errors.

Luckily I stumbled upon Note: 367535.1, which pointed me to the file $ORACLE_HOME/sysman/lib/env_sysman.mk. Within that makefile JRE_LIB_DIR was pointing to the SRC environment I cloned from as well. I changed it to the correct location for this environment and executed the relink command again. This time it was successful.

So basically, FYI, that if you hit weird errors in a cloned environment, it could be the result of some invalid pointers that weren’t modified during Rapid Clone. If you know of a fix, or have a handy list of things to check, feel free to share. I’ll add this to my post clone checklist.

Friday

Recovery Scenario – Losing a Redo Log file.

I’m in the process of studying for the OCP exam and I personally find it much easier to remember something if I write it down.   What better way to do this than creating a blog post?  Unfortunately its some pretty basic stuff so it might not garner much interest.

My test environment has the following redo log configuration:

1  select group#, sequence#, bytes/1024/1024 as "Size (MB)", members, status


2* from v$log


L> /


 


  GROUP#  SEQUENCE#  Size (MB)    MEMBERS STATUS


-------- ---------- ---------- ---------- ----------------


       1          5         50          2 INACTIVE


       2          6         50          2 CURRENT


       3          4         50          2 INACTIVE






 




SQL> l


  1  select group#, member


  2  from v$logfile


  3* order by 1


SQL> /


 


    GROUP# MEMBER


---------- ----------------------------------------


         1 /home/oracle/oradata/orcl/redo01.log


         1 /home/oracle/oradata/orcl/redo01a.log


         2 /home/oracle/oradata/orcl/redo02a.log


         2 /home/oracle/oradata/orcl/redo02.log


         3 /home/oracle/oradata/orcl/redo03a.log


         3 /home/oracle/oradata/orcl/redo03.log




As you can see, there are 3 redo log groups with 2 members each.   In order to simulate a failure I am going to remove one of the members.




SQL> !rm /home/oracle/oradata/orcl/redo01a.log


 


SQL> alter system switch logfile;


 


System altered.


 


SQL> /


 


System altered.




Oracle will not detect a problem until it tries to use the missing redo log file so I had to issue two logfile switches before redo01a.log was accessed.  At that point the following errors show up in the alert.log.



Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:

ORA-00313: open failed for members of log group 1 of thread 1


ORA-00312: online log 1 thread 1: '/home/oracle/oradata/orcl/redo01a.log'


ORA-27037: unable to obtain file status


Linux Error: 2: No such file or directory


Additional information: 3


Thu Jan 15 15:40:44 2009


Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:


ORA-00321: log 1 of thread 1, cannot update log file header


ORA-00312: online log 1 thread 1: '/home/oracle/oradata/orcl/redo01a.log'


Thu Jan 15 15:40:45 2009


Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:


ORA-00313: open failed for members of log group 1 of thread 1



As well, if we take another look at v$logfile, the missing redo log will have a status of INVALID:




SQL> 


    GROUP# MEMBER                                   STATUS


---------- ---------------------------------------- -------


         3 /home/oracle/oradata/orcl/redo03.log


         2 /home/oracle/oradata/orcl/redo02.log


         1 /home/oracle/oradata/orcl/redo01.log


         1 /home/oracle/oradata/orcl/redo01a.log    INVALID


         2 /home/oracle/oradata/orcl/redo02a.log


         3 /home/oracle/oradata/orcl/redo03a.log




In this particular case, the loss of redo01a.log file is not critical.  Since the group is multiplexed (more than 1 member) Oracle can operate without impacting users.  Each time Oracle tries to use the missing redo log file the errors will be displayed again in alert log.



There are two methods in order to recover form this scenario:




  1. Issue the CLEAR LOGFILE command.


  2. Drop the missing member and add a new one.



Which one should you choose?  That depends on the circumstances surrounding why the redo log is missing.  The CLEAR LOGFILE command reinitializes the logfile group, similiar to dropping and adding each member.  If the failure is due to a corrupted disk, then this command will fail as it recreates the group in place.   If this is the case, you will need to use the second method of dropping and adding a new member so you can specify a new location.



Lets try issuing the CLEAR LOGFILE command:




SQL> alter database clear logfile group 1;


 


Database altered.






Thats it, everything is back to normal.



Lets try the other method of dropping and adding a new member:




SQL>  !rm  /home/oracle/oradata/orcl/redo01a.log


 


SQL> alter database drop logfile member '/home/oracle/oradata/orcl/redo01a.log';


 


Database altered.


 


SQL> alter database add logfile member '/home/oracle/oradata/orcl/redo01a.log' to group 1;


 


Database altered.


 



So, what happens if you lose all members of a redo log group?  The steps you have to take are dependent on the groups status and whether or not the database is in archivelog mode. 



INACTIVE: 




  • If the affected redo log group has a status of INACTIVE it is no longer required for crash recovery.  If you are in NOARCHIVELOG mode issue either CLEAR LOGFILE or recreate the group manually.


  • If you are in ARCHIVELOG mode and the group has been archived, issue either of the steps above.


  • If you are in ARCHIVELOG mode and the group hasn’t been archived then issue CLEAR UNARCHIVED LOGFILE.  If you don’t specify the UNARCHIVED keyword you will receive an error. After you execute this command, you will have a gap in your archivelogs, so perform a complete backup (including the control file) ASAP.



ACTIVE:




  • If the redo log group has a status of ACTIVE, it is still required for crash recovery.  Issue the command ALTER SYSTEM CHECKPOINT, If successful then follow the steps above for INACTIVE.


  • If the checkpoint fails, then you need to perform recovery.  If you are in NOARCHIVELOG mode then you need to perform a complete recovery of the last cold backup.



  • If the checkpoint fails and you are in ARCHIVELOG mode then you can perform an INCOMPLETE recovery up to the previous log file.



CURRENT:



The current redo log group is the one which Oracle is currently writing to.   If you lose this group or files become corrupted, LGWR may terminate.  For example:



Corrupted files:



Fri Jan 30 14:14:55 2009

Errors in file /apps/oravis/db/tech_st/10.2.0/admin/test/bdump/test_lgwr_25599.trc:


ORA-00316: log 1 of thread 1, type 8944 in header is not log file


ORA-00312: online log 1 thread 1: '/oradata/dbf2/testdata/test/redo01.log'


Fri Jan 30 14:14:55 2009


LGWR: terminating instance due to error 316


Instance terminated by LGWR, pid = 25599



Missing files:



Fri Jan 30 15:04:34 2009

Errors in file /apps/oravis/db/tech_st/10.2.0/admin/test/bdump/test_lgwr_1434.trc:


ORA-00320: cannot read file header from log 1 of thread 1


ORA-00312: online log 1 thread 1: '/oradata/dbf2/testdata/test/redo01.log'


ORA-27091: unable to queue I/O


ORA-27072: File I/O error


SVR4 Error: 11: Resource temporarily unavailable


Additional information: 4


Additional information: 1


LGWR: terminating instance due to error 320


Instance terminated by LGWR, pid = 1434






  • If LGWR terminates you have to recover the database:

    • If you are in ARCHIVELOG mode, perform an incomplete recovery up to the previous log file.


    • If you are in NOARCHIVELOG, perform a complete recovery of the last cold backup.




  • If the database is still online, you can try and clear the logfile group. If the command is not successful then you will need to recover.