I finally have my 10g RAC environment up and running in a vmware environment. After a sigh of relief I checked the environments status using crs_stat -t and noticed my VIP and a few other services such as the database instance, asm instance and listener are down. Over a period of time I noticed the VIP bouncing from node to node. It would failover from node1 to node2 and after a period of time fail back from node2 to node1 and in the process shutdown services dependent on it.
The VIP status check (handled by $ORA_CRS_BIN/racgvip) consists of 3 things, as I understand it.
1. On linux, the mii-tool is used to check the status of the Interface.
2. If check 1 fails, it pings the gateway via the interface it is checking.
3. If check 2 fails, the number of packets received by the interface is checked over a preset amount of time. Metalink says this is 2 seconds but I believe its 6 seconds.
If all 3 of these checks fail, it concludes the interface has failed and will initiate a VIP failover.
The first check fails in my environment. When mii-tool is executed on the command line I receive:
[root@raclinux1 /]#/sbin/mii-tool eth0
SIOCGMIIPHY on 'eth0' failed: Operation not supported
The mii-tool man page states this error is due to the fact the interface does not support MII queries.
The second check also fails in my environment. I am using private IP space for my vmware images and while at home I have a pingable gateway, at work I do not. The following note describes how to set FAIL_WHEN_DEFAULTGW_NOT_FOUND=0 in the $ORA_CRS_HOME/bin/racgvip script.
Subject: CRS-0215: Could not start resource 'ora..vip' Note:356535.1
However, this didn't work as I had expected. Even though my gateway isn't pingable, the VIP's would continue to failover. After looking at the script, the parameter FAIL_WHEN_DEFAULTGW_NOT_FOUND only applies if you don't have a gateway defined in your network setup. The racgvip script calls netstat to determine the gateway ip address. Since I have a gateway defined in case i'm experimenting from my home network, this parameter had no affect.
So that brings me to the last check. Before racgvip pings your gateway (check 2) it calls ifconfig and extracts the number of packets that your interface has received. Then it tries to ping your gateway. If the gateway doesn't respond or as in my case, not pingable, it calls ifconfig again and extracts the number of RX packets. If the number of RX packets hasn't changed, it concludes the interface has failed and initiates a VIP failover.
Even this check was failing in my environment. I edited the racgvip script to print out the number of RX packets each time it was checked. I noticed there wasn't much traffic during some of the checks and at times none, thus a failover was initiated. I believe this is due to a couple of factors. The first being that there is very little database activity since its a fresh install. The other is that I have a 2 node vmware cluster running on a 3 year old laptop, so at times it does freeze momentarily.
The solution to this problem is to extend the time between checks in the racgvip script. To do this, change the CHECK_TIMES variable from 2 to a higher value. I chose 20 but keep in mind, the higher you increase this value, the longer it will take for a failure to be detected.
If this was a proper RAC environment I don't believe I would have hit any of these issues. The benefit tho, is that now I have a better understanding of VIP failover.