Wednesday

VIP Failover

I finally have my 10g RAC environment up and running under VMware. After a sigh of relief I checked its status using crs_stat -t and noticed my VIP and a few other services, such as the database instance, ASM instance and listener, were down. Over time I noticed the VIP bouncing from node to node: it would fail over from node1 to node2 and, after a while, fail back from node2 to node1, shutting down the services dependent on it in the process.

The VIP status check (handled by $ORA_CRS_HOME/bin/racgvip) consists of three checks, as I understand it.

1. On Linux, mii-tool is used to check the status of the interface.
2. If check 1 fails, the gateway is pinged via the interface being checked.
3. If check 2 fails, the number of packets received by the interface is checked over a preset interval. Metalink says this is 2 seconds, but I believe it's 6 seconds.

If all 3 of these checks fail, it concludes the interface has failed and will initiate a VIP failover.
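As I read it, the first two checks amount to something like the outline below. This is just a simplified sketch of the logic, not the actual racgvip code; the interface name and gateway address are placeholders, and the packet-count check (check 3) is sketched further down.

IF=eth0            # interface being checked (placeholder)
GW=192.168.2.1     # default gateway (placeholder)
if /sbin/mii-tool $IF 2>/dev/null | grep -q "link ok"; then
    echo "check 1 passed: $IF reports link ok"
elif ping -c 3 -I $IF $GW > /dev/null 2>&1; then
    echo "check 2 passed: gateway $GW answered"
else
    echo "checks 1 and 2 failed: fall through to the packet-count check"
fi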


The first check fails in my environment. When mii-tool is executed on the command line I receive:


[root@raclinux1 /]#/sbin/mii-tool eth0
SIOCGMIIPHY on 'eth0' failed: Operation not supported


The mii-tool man page states that this error occurs because the interface does not support MII queries.

The second check also fails in my environment. I am using private IP space for my VMware images, and while at home I have a pingable gateway, at work I do not. The following note describes how to set FAIL_WHEN_DEFAULTGW_NOT_FOUND=0 in the $ORA_CRS_HOME/bin/racgvip script.

Subject: CRS-0215: Could not start resource 'ora..vip' Note:356535.1

However, this didn't work as I had expected. Even though my gateway isn't pingable, the VIPs would continue to fail over. After looking at the script, the FAIL_WHEN_DEFAULTGW_NOT_FOUND parameter only applies if you don't have a gateway defined in your network setup. The racgvip script calls netstat to determine the gateway IP address. Since I have a gateway defined in case I'm experimenting from my home network, this parameter had no effect.
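For what it's worth, the gateway lookup is just a parse of the routing table. Something along these lines will pull out the default gateway (my own one-liner, not necessarily the exact command the script uses):

netstat -rn | awk '$1 == "0.0.0.0" {print $2}'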

So that brings me to the last check. Before racgvip pings your gateway (check 2), it calls ifconfig and extracts the number of packets your interface has received. Then it tries to ping your gateway. If the gateway doesn't respond or, as in my case, isn't pingable, it calls ifconfig again and extracts the number of RX packets. If the number of RX packets hasn't changed, it concludes the interface has failed and initiates a VIP failover.
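In other words, check 3 boils down to sampling the RX packet counter twice, a few seconds apart, and seeing whether it moved. A rough equivalent, assuming the old net-tools style ifconfig output and the 6 second wait mentioned earlier:

IF=eth0
RX1=$(/sbin/ifconfig $IF | awk -F'[: ]+' '/RX packets/ {print $4; exit}')
sleep 6
RX2=$(/sbin/ifconfig $IF | awk -F'[: ]+' '/RX packets/ {print $4; exit}')
if [ "$RX1" = "$RX2" ]; then
    echo "no packets received on $IF: interface treated as failed, VIP fails over"
fi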

Even this check was failing in my environment. I edited the racgvip script to print out the number of RX packets each time it was checked. I noticed there wasn't much traffic during some of the checks, and at times none, so a failover was initiated. I believe this is due to a couple of factors. The first is that there is very little database activity since it's a fresh install. The other is that I have a two-node VMware cluster running on a three-year-old laptop, so at times it does freeze momentarily.

The solution to this problem is to extend the time between checks in the racgvip script. To do this, change the CHECK_TIMES variable from 2 to a higher value. I chose 20, but keep in mind that the higher you set this value, the longer it will take for a real failure to be detected.
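For reference, the edit itself is a one-liner; the variable name is as it appears in the copy of racgvip I was working with, so double-check yours:

# in $ORA_CRS_HOME/bin/racgvip
# was: CHECK_TIMES=2
CHECK_TIMES=20     # longer window between samples; real failures take longer to detect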

If this were a proper RAC environment I don't believe I would have hit any of these issues. The benefit, though, is that I now have a better understanding of VIP failover.

Friday

RAC and VMware issues

Over the past little while I've been playing around with RAC on VMware, specifically 10.2.0.1 on Oracle Enterprise Linux 5. Hopefully next week I'll be able to post the install steps, but here are a couple of pointers in case you are working on it now:

1. Randomly, at least one and sometimes both of my VMware instances would hang. At the time of the hang I could see that they were both accessing the shared disks, specifically the voting and OCR disks.

In my vmware.log file I could see the following error:

Msg_Post: Error Mar 12 16:47:25: vmx| [msg.log.error.unrecoverable] VMware Server unrecoverable error: (vmx) Mar 12 16:47:25: vmx| NOT_IMPLEMENTED
C:/ob/bora-56528/pompeii2005/bora/devices/scsi/scsiDisk.c:2874 bugNr=41568


I searched in vain for a solution and finally sent an email to the Oracle-L list. Thankfully Edgar saw my post and provided me with the solution. If you're running on a slow computer (I didn't think my brand spanking new laptop was slow ;) you could have locking issues. In each of your VMware configuration files put the following line:

reslck.timeout="1200"
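The line just gets appended to each node's .vmx file, with the VMs powered off; the paths below are placeholders for wherever your configuration files live:

echo 'reslck.timeout="1200"' >> /vmware/rac1/rac1.vmx
echo 'reslck.timeout="1200"' >> /vmware/rac2/rac2.vmx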

2. You should read the following notes before you start and modify your steps accordingly:

Subject: 10gR2 RAC Install issues on Oracle EL5 or RHEL5 or SLES10 (VIPCA Failures) Doc ID: Note:414163.1


Subject: VIPCA FAILS COMPLAINING THAT INTERFACE IS NOT PUBLIC Doc ID: Note:316583.1



If you are trying to install 10g on Oracle Enterprise Linux 5, as I am, you will hit errors installing clusterware. The first note above describes how a workaround used for a Linux threading bug is no longer valid. So before you run the root.sh script you will need to modify some files.

Note: The first time I installed clusterware I didn't see any errors. It was only when I verified the install that I noticed something was wrong and found this note.
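If I'm remembering the first note correctly, the heart of the fix is that LD_ASSUME_KERNEL (the old LinuxThreads workaround) is no longer supported on these distributions, so it has to be neutralized in a couple of the clusterware scripts. Roughly speaking, and please check the note itself for the exact files and steps:

# in $ORA_CRS_HOME/bin/vipca and $ORA_CRS_HOME/bin/srvctl, after the block
# that sets and exports LD_ASSUME_KERNEL, add:
unset LD_ASSUME_KERNEL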

The second Metalink article describes how vipca (which is executed automatically when you run root.sh during the clusterware install) doesn't like private network IPs being used for your public interface. It describes how to execute vipca manually.
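Running it by hand is straightforward once root.sh has completed: as root, with an X display available (the display value below is only an example), something like:

export DISPLAY=:0.0
cd $ORA_CRS_HOME/bin
./vipca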

I'm not 100% finished yet, so I may encounter more issues. I have to recover from a backup and start my clusterware install again: while I was installing the database software my laptop BSOD'd and corrupted my shared disks.

Tuesday

Patch failed but the fix is another patch!

Fortunately this happens infrequently (in my experience) and is usually caught the first time the patch(es) are applied in a sandbox or dev environment. There could be a few reasons why it happens; a couple off the top of my head:

- Your patch analysis wasn't complete. [If you have been in pre-req hell before, it can be very easy to miss a patch.]
- You've hit a bug and Oracle informs you another patch is required.
- Some large patches, especially family packs, have functional prerequisites. These are usually listed in the README file as a chart showing the product, the feature and the patches required. If you are using, or plan to use, the products listed in the chart, you are required to apply those patches.

Just recently, we hit an issue in which we had to go back and apply a functional pre-req for a module we don't use.

So how do you resolve this? There are two options:

1. Cancel the patch, apply the new patch, then start the first one over again.
2. Back up the first patch, apply the new patch, restore the first patch and let it continue where it left off.

The benefit of option 2 is that it saves deployment time. Obviously, if you are 2-3 hours into a large patch and you choose option 1, you have to start over.

So, the steps to back up the first patch are (a rough command-line sketch follows the list):

1. Use adctrl to shut down all of the workers.
2. Back up the FND_INSTALL_PROCESSES and AD_DEFERRED_JOBS tables.
3. Back up the .rf9 files under $APPL_TOP/admin/[SID]/restart.
4. Drop the tables from step 2.
5. Apply the new patch.
6. Restore the .rf9 files backed up in step 3.
7. Restore the tables backed up in step 2.
8. Recreate the synonyms under apps:
- create synonym AD_DEFERRED_JOBS for APPLSYS.AD_DEFERRED_JOBS;
- create synonym FND_INSTALL_PROCESSES for APPLSYS.FND_INSTALL_PROCESSES;
9. Start adpatch and continue.
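For the curious, here is a rough sketch of steps 2, 3, 4, 6, 7 and 8 from the command line. The _bak names, the restart_bak directory, the SID value and the passwords are placeholders of my own choosing; sanity-check all of it against your environment before using it.

SID=VIS    # your instance name (placeholder)

# steps 2 and 4: back up the restart tables, then drop them
sqlplus applsys/applsys_pw <<'EOF'
create table fnd_install_processes_bak as select * from fnd_install_processes;
create table ad_deferred_jobs_bak as select * from ad_deferred_jobs;
drop table fnd_install_processes;
drop table ad_deferred_jobs;
exit
EOF

# step 3: back up the restart files
cp -r $APPL_TOP/admin/$SID/restart $APPL_TOP/admin/$SID/restart_bak

# step 5: apply the new patch with adpatch as usual

# step 6: put the .rf9 files back
cp $APPL_TOP/admin/$SID/restart_bak/*.rf9 $APPL_TOP/admin/$SID/restart/

# steps 7 and 8: restore the tables and recreate the synonyms under apps
sqlplus applsys/applsys_pw <<'EOF'
create table fnd_install_processes as select * from fnd_install_processes_bak;
create table ad_deferred_jobs as select * from ad_deferred_jobs_bak;
exit
EOF
sqlplus apps/apps_pw <<'EOF'
create synonym ad_deferred_jobs for applsys.ad_deferred_jobs;
create synonym fnd_install_processes for applsys.fnd_install_processes;
exit
EOF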


Now, this only works if one of the workers fails while executing tasks for the first patch. If a fatal error happens elsewhere, you can't use the steps above.