Wednesday

VIP Failover

I finally have my 10g RAC environment up and running in a VMware environment. After a sigh of relief, I checked the environment's status using crs_stat -t and noticed that my VIP and a few other resources, such as the database instance, ASM instance and listener, were down. Over time I noticed the VIP bouncing from node to node. It would fail over from node1 to node2, then after a while fail back from node2 to node1, shutting down the services that depend on it in the process.

The VIP status check (handled by $ORA_CRS_BIN/racgvip) consists of 3 things, as I understand it.

1. On Linux, mii-tool is used to check the link status of the interface.
2. If check 1 fails, it pings the gateway via the interface it is checking.
3. If check 2 fails, the number of packets received by the interface is checked over a preset amount of time. Metalink says this is 2 seconds, but I believe it's 6 seconds.

If all 3 of these checks fail, it concludes the interface has failed and will initiate a VIP failover.
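
The order of the three checks can be sketched as a small shell function. This is only an illustration of the logic as described above, not code taken from racgvip; the function name vip_interface_state and its arguments are made up for the example. Each argument is the exit status of one check, where 0 means the check passed.

```shell
#!/bin/sh
# Illustrative sketch only: models the check order described in this post.
# vip_interface_state is a hypothetical name, not a function in racgvip.
# Arguments: exit status of each check, 0 = passed, non-zero = failed.
vip_interface_state() {
    link_check=$1   # check 1: mii-tool link status
    ping_check=$2   # check 2: ping the gateway through the interface
    rx_check=$3     # check 3: RX packet counter changed during the wait

    if [ "$link_check" -eq 0 ]; then
        echo "up: link reported ok"
    elif [ "$ping_check" -eq 0 ]; then
        echo "up: gateway answered ping"
    elif [ "$rx_check" -eq 0 ]; then
        echo "up: interface is receiving packets"
    else
        echo "failed: initiate VIP failover"
        return 1
    fi
}
```

When checks 1 and 2 can never pass, as in the environment described in this post, the outcome depends entirely on the third argument: vip_interface_state 1 1 0 reports the interface as up, while vip_interface_state 1 1 1 reports a failure.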


The first check fails in my environment. When mii-tool is executed on the command line I receive:


[root@raclinux1 /]# /sbin/mii-tool eth0
SIOCGMIIPHY on 'eth0' failed: Operation not supported


The mii-tool man page states that this error means the interface does not support MII queries.

The second check also fails in my environment. I am using private IP space for my VMware images, and while I have a pingable gateway at home, I do not have one at work. The following note describes how to set FAIL_WHEN_DEFAULTGW_NOT_FOUND=0 in the $ORA_CRS_HOME/bin/racgvip script.

Subject: CRS-0215: Could not start resource 'ora..vip' Note:356535.1

However, this didn't work as I had expected. Even with this set, the VIPs continued to fail over because my gateway isn't pingable. After looking at the script, I found that the parameter FAIL_WHEN_DEFAULTGW_NOT_FOUND only applies if you don't have a gateway defined in your network setup. The racgvip script calls netstat to determine the gateway IP address. Since I have a gateway defined in case I'm experimenting from my home network, this parameter had no effect.
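
To see whether a gateway is defined on a node, the routing table can be inspected the same way. The helper below is a minimal sketch that assumes the Linux netstat -rn column layout (Destination, Gateway, Genmask, Flags, ...). The name default_gateway is made up for this example, and the actual parsing inside racgvip may differ.

```shell
#!/bin/sh
# Sketch: pick the default gateway out of Linux "netstat -rn" output.
# default_gateway is a hypothetical helper; racgvip's real parsing may differ.
# Reads the routing table on stdin and prints the gateway of the default
# route. Exits non-zero when no default route is present.
default_gateway() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2; found = 1; exit }
         END { exit !found }'
}
```

Running netstat -rn | default_gateway on a node prints an address if a gateway is defined. In that case FAIL_WHEN_DEFAULTGW_NOT_FOUND does not come into play, whether or not the address answers pings.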

So that brings me to the last check. Before racgvip pings your gateway (check 2), it calls ifconfig and extracts the number of packets that your interface has received. Then it tries to ping your gateway. If the gateway doesn't respond or, as in my case, isn't pingable, it calls ifconfig again and extracts the number of RX packets. If the number of RX packets hasn't changed, it concludes the interface has failed and initiates a VIP failover.
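
The RX packet comparison can be sketched roughly as follows. The helper names rx_packets and interface_received_traffic are made up for this example, and the pattern assumes ifconfig prints a line containing "RX packets:" followed by the count (the older net-tools layout) or "RX packets " followed by the count (the newer layout). It illustrates the idea rather than reproducing the racgvip code.

```shell
#!/bin/sh
# Sketch of the RX packet comparison described above.
# Both helper names are hypothetical and do not come from racgvip.

# Print the RX packet count found in ifconfig output read from stdin.
# Handles both "RX packets:1234" and "RX packets 1234".
rx_packets() {
    sed -n 's/.*RX packets[: ]\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed if the counter changed between the two samples,
# fail if it stayed the same.
interface_received_traffic() {
    before=$1
    after=$2
    [ "$after" -ne "$before" ]
}
```

A sample would be taken with before=$(/sbin/ifconfig eth0 | rx_packets) ahead of the ping, and again into after once the ping has timed out. On a quiet interface both samples can be identical even though nothing is wrong, which is what makes this check fragile on an idle test cluster.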

Even this check was failing in my environment. I edited the racgvip script to print out the number of RX packets each time it was checked. I noticed there wasn't much traffic during some of the checks, and at times none at all, so a failover was initiated. I believe this is due to a couple of factors. The first is that there is very little database activity, since it's a fresh install. The other is that I have a 2-node VMware cluster running on a 3-year-old laptop, so at times it does freeze momentarily.

The solution to this problem is to extend the time between checks in the racgvip script. To do this, change the CHECK_TIMES variable from 2 to a higher value. I chose 20 but keep in mind, the higher you increase this value, the longer it will take for a failure to be detected.
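
One way to make that edit while keeping a copy of the original script is sketched below. The helper set_check_times is made up for this example, and it assumes the script assigns the variable on a line of the form CHECK_TIMES=2. Check how the line looks in your copy of racgvip before relying on it.

```shell
#!/bin/sh
# Sketch: change CHECK_TIMES while keeping a backup of the original file.
# set_check_times is a hypothetical helper. It assumes the script assigns
# the variable on a line of the form CHECK_TIMES=<number>.
set_check_times() {
    script=$1
    value=$2
    case $value in
        ''|*[!0-9]*) echo "value must be a whole number" >&2; return 1 ;;
    esac
    # Keep the original so the change can be reverted.
    cp -p "$script" "$script.bak" || return 1
    # Rewrite only the assignment line; everything else is copied unchanged.
    sed "s/^\([[:space:]]*CHECK_TIMES=\)[0-9][0-9]*/\1$value/" "$script.bak" > "$script"
}
```

For example, set_check_times "$ORA_CRS_HOME/bin/racgvip" 20 would leave the previous version in racgvip.bak next to the edited script.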

If this were a proper RAC environment, I don't believe I would have hit any of these issues. The benefit, though, is that I now have a better understanding of VIP failover.

12 comments:

ebrian said...

Helpful information, thanks for posting!

krishnan said...

I tried to install a RAC cluster using VMware on Windows Vista, but when I reboot the first VM it says clustering is not supported and deletes all the files related to the cluster.

I have installed it successfully on Windows XP using VMware Server. When I install VMware Server on Windows Vista it gives an error stating it is not compatible, so I installed VMware Workstation instead, but clustering does not work on Workstation.


Can anyone help with how I can install an Oracle RAC cluster on Windows Vista? Which VMware software should I use?


Thanks a lot in advance

Dave said...

Hey krishnan, I don't believe clusters are supported on VMware Workstation. I haven't tried it personally, but a couple of friends using Workstation started to follow my guides and received pop-up warnings when they changed the VMware instance config file to enable clustering.

I use VMware Server because it's free and supports clustering.

krishnan said...

Hi Dave,
Thanks for your quick response. But when I install VMware Server on Windows Vista, it gives compatibility errors during installation. If you have installed VMware Server on Vista, can you please tell me which version you installed?

Dave said...

I'm currently using 1.0.3 on a Vista machine. I have 2 XP machines which are running 1.0.4 and 1.0.5.

SRIRAM said...

Hi,

This post is very useful. My question is: if I have a standalone machine which does not have a pingable default gateway or a router, what should I set the default gateway to?

Should I configure the MS Loopback Adapter and set a dummy gateway there?

It would be great if you could post something on how to configure the network components in a home PC environment.

Dave said...

SRIRAM: It depends on how your home network is set up. If you have a router, it should be your gateway, so you can use your router's IP address as your gateway.

The VIP status check consists of 3 checks. Not having a pingable gateway wasn't enough on its own to cause my issues. The only reason I had them is that running 2 RAC nodes on my laptop was very slow and it seemed to hang for a few seconds at a time. This caused check 3, the number of packets received by the interface, to fail.

If you're running your VMware images on a fast computer, I wouldn't expect you to have this problem.

Anonymous said...

I also faced a similar error of VIP going down every now and then.

Setting CHECK_TIMES=10 resolved the issue, but within a couple of hours I hit the same problem again.

With logging enabled, the log shows the check is satisfied if it can successfully ping any address, so I gave both machines a pingable IP as the gateway and I am no longer getting any errors.

Mine is just a VMware machine and is only for my learning; hopefully others can try this as well.

Cheers
Naresh

Amit said...

Could you please put a disclaimer on this article telling that before testing these in production servers , people should test or check with Oracle support. Changing CHECK_TIMES to 20 can lead to issues as oracle will take longer to know that vip is not reachable. Anyway, I am not sure if this is supported. If it is, it would be nice to reference the My Oracle Support (aka Metalink) note.

Dave said...

"Could you please put a disclaimer on this article telling that before testing these in production servers"

Amit - I shouldn't even have to mention that. Before any changes are made in production they should be properly tested. The idea behind this article is to talk about how the racgvip script works and the reason I had to modify it, which was that I was running a test 2-node RAC cluster on a slow laptop.

I wouldn't expect someone to have to modify CHECK_TIMES in a real RAC environment.

"Changing CHECK_TIMES to 20 can lead to issues as oracle will take longer to know that vip is not reachable."

I mention that as well: "keep in mind, the higher you increase this value, the longer it will take for a failure to be detected."

