Tuesday

Security Patch Woes

A few weeks ago we applied a number of security patches. For various reasons we were a bit behind schedule and had to push a couple of releases out to production. Since then we have encountered three bugs, one of which crashed production just before month end. Two of the bugs were the result of upgrading to 10.2.0.3 (a requirement for the security patches). The other was a bug on top of ATG_PF.H.5.

Problem 1:

After cloning, Concurrent Managers fail to start. As per Note:434613.1, this problem exists on top of ATG_PF.H delta 5 and delta 6, as well as R12. During cloning neither the service manager nor the internal monitor is created, so the concurrent managers will not start. A patch has been published to resolve this issue and there is a very easy workaround. We have added the patch to the next release cycle and modified our cloning scripts to incorporate the workaround.

The interesting part about this problem is that we have been on ATG_PF.H delta 5 since Dec, when we upgraded to 11.5.10.2. Since then we have cloned a test environment over 200 times (it's rebuilt nightly) and did not encounter this bug. We did hit it once back in March for a one-off clone, but since then we have recloned that environment plus many others multiple times and it didn't resurface. I find it interesting that now, after upgrading to 10.2.0.3 (from 10.2.0.2) and applying security patches, we can reproduce the problem consistently.

In retrospect we should have added this patch to the release cycle back in March, but we tend not to recommend a patch unless the problem can be consistently reproduced.


Problem 2:

Users stopped receiving email notifications from workflow. The following error could be seen in the logs:

ORA-06502: PL/SQL: numeric or value error: associative array shape is not consistent with session parameters


A quick search turned up bug 5890966, which mentions that this problem can occur during periods of high activity. Once we hit the bug, emails ceased to be sent. Oracle confirmed that the fix is a mandatory patch for 10.2.0.3 but it has not been published as such yet.

Thanks to the next problem, though, we had to restart our environment and the problem hasn't reoccurred yet. We have added the patch to the next release cycle and hopefully it won't reoccur before then.

Problem 3:

On 10.2.0.3, bug 5907779 can cause sessions to self-hang if dbms_stats is executing... I recall reading a few blog posts about this particular error, but since it wasn't recorded as a mandatory patch we didn't apply it. At least the blog posts helped me identify the problem quickly.

Our statistics gathering jobs are scheduled on weekends, and in this particular case only one type of session was hanging as a result of this bug. This session was spawned by an integration which is scheduled to execute once every few minutes. Unfortunately it wasn't smart enough to detect that previous instances were still running, and of course it isn't monitored on weekends.
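For what it's worth, the "detect previous instances" part can be sketched with a simple lock file. This is not how our integration is written; it is a minimal Python sketch, assuming a Unix host (it relies on fcntl.flock) and a hypothetical lock path chosen by whoever schedules the job. The names single_instance and AlreadyRunning are made up for illustration.

```python
import fcntl
import os
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when another run of the job still holds the lock."""


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive, non-blocking lock on lock_path while the job runs.

    lock_path is a hypothetical location, e.g. somewhere under /var/run.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting
            # behind a previous run that is hung.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise AlreadyRunning(lock_path)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        # Closing the descriptor also drops the lock if the job dies.
        os.close(fd)
```

A scheduled job would wrap its work in "with single_instance(path):" and exit quietly (or raise an alert) on AlreadyRunning. Because the kernel releases the lock when the process exits, a crashed run does not leave a stale lock behind, but a hung run keeps it, which is exactly the case we wanted to catch.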

So as the weekend wore on, more and more sessions accumulated in the database, consuming more and more resources. We didn't realize there was a problem until Sunday night when APO users came online. Unfortunately by then it had progressed to the point where the system ran out of resources, sessions couldn't be killed and we had to reboot the server. Luckily I was able to capture enough information (hang analyze and system state dumps) to confirm that bug 5907779 was the culprit. Everything came back up properly and, as a bonus, the restart temporarily fixed our workflow email issues.
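A weekend check for this kind of pile-up doesn't need to be fancy. The sketch below is an assumption about how one might do it, not something we had in place: it takes session rows that have already been fetched (for example the program column from v$session, with the fetching left out) and reports any program with more concurrent sessions than a threshold. The function name and the default threshold of 5 are hypothetical.

```python
from collections import Counter


def find_pileups(sessions, max_per_program=5):
    """Return {program: session_count} for programs above the threshold.

    sessions is a list of dicts with a "program" key, assumed to have
    been fetched elsewhere (e.g. from v$session).
    """
    counts = Counter(row["program"] for row in sessions)
    return {prog: n for prog, n in counts.items() if n > max_per_program}
```

An integration that normally runs one session every few minutes should never show up in the result, so any hit is worth a page, even on a Saturday.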


Unfortunately these types of problems (at least numbers 2 and 3) are not likely to surface in a test/dev environment. We have some patch review meetings coming up over the next few days to re-examine our processes, but I'm not sure how we can prevent these types of problems in the future. Note:401435.1 lists a number of issues specific to 10.2.0.3. I guess I could have analyzed each of those patches to determine if they were applicable to our environment, but what's to say they wouldn't have introduced additional bugs? Even then, I would have only prevented one issue, since the workflow patch isn't listed in that note. Normally I just review Note:285267.1, which is the EBS 11i and Database FAQ, to make sure there are no known issues.

Feel free to leave a comment describing how you analyze patchsets and full releases...
