Thursday

ORA-01555 and Automatic Undo

In the past couple of years of supporting Oracle E-Business Suite I have only hit ORA-01555: snapshot too old twice: about a year ago and again today.

Before you start to troubleshoot you need to gather some information:
  • If you're unaware of the current undo settings, you should check. One way is to use the show parameter command, but you can also check the database initialization parameter file or spfile.


    SQL> show parameter undo

    NAME                                 TYPE        VALUE
    ------------------------------------ ----------- -------------
    undo_management                      string      AUTO
    undo_retention                       integer     1200
    undo_suppress_errors                 boolean     FALSE
    undo_tablespace                      string      APPS_UNDOTS1


  • Determine the time the error occurred from the alert log; this will help narrow down the window for a query we will use later. Your alert log should have an entry similar to:

    Sat Aug 16 23:05:41 2008
    ORA-01555 caused by SQL statement below (Query Duration=1506 sec, SCN: 0x0000.a8d6d835):
    Sat Aug 16 23:05:41 2008
    SELECT X FROM Y WHERE A=B

Now that you have the automatic undo settings and the time of the error, execute the following query (substitute the timestamp of the error in your alert log in place of the one below, with some buffer at either end):

SQL> select to_char(begin_time,'MM/DD/YYYY HH24:MI') begin_time, UNDOBLKS, UNXPSTEALCNT, EXPSTEALCNT ,
SSOLDERRCNT, NOSPACEERRCNT, MAXQUERYLEN
from v$undostat
where begin_time between to_date('08/16/2008 22:30:00','MM/DD/YYYY HH24:MI:SS')
and to_date('08/16/2008 23:30:00','MM/DD/YYYY HH24:MI:SS')
order by begin_time;


BEGIN_TIME       UNDOBLKS UNXPSTEALCNT EXPSTEALCNT SSOLDERRCNT NOSPACEERRCNT MAXQUERYLEN
---------------- -------- ------------ ----------- ----------- ------------- -----------
08/16/2008 22:32    10394            0           0           0             0         513
08/16/2008 22:42     5693            0           0           0             0         212
08/16/2008 22:52    10853            0           0           0             0         451
08/16/2008 23:02     8950            0           0           0             0        1653
08/16/2008 23:05    11284            0           0           1             0        1506

5 rows selected.


Based on this query we can determine a couple of things, such as the number of transactions which required UNDO during the time period, how many undo blocks they required, and whether there were any (successful) attempts to steal undo space from other transactions.

The query also shows which transactions received ora-1555 errors (SSOLDERRCNT) and whether or not there was enough space in the UNDO tablespace (NOSPACEERRCNT). If you look at the row above with a timestamp of 23:05, this is the transaction which produced my ora-1555 error. We know this because the SSOLDERRCNT column for this row has the value 1 and the timestamp is the same as the one in the log file. Since the MAXQUERYLEN for this transaction (1506) is greater than the undo_retention (1200), one solution would be to increase undo_retention.*

*There are other ways to avoid ora-1555 errors, such as transaction tuning, etc. More detail can be found in Note:10630.1
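With automatic undo management, undo_retention is a dynamic parameter, so raising it does not require a bounce. A minimal sketch (the value matches the choice made below; SCOPE=BOTH assumes you are running from an spfile):

```sql
-- Raise undo_retention (in seconds) on the running instance and in the spfile.
-- SCOPE=BOTH assumes an spfile; if you use a pfile, use SCOPE=MEMORY and
-- edit the pfile manually so the change survives a restart.
ALTER SYSTEM SET undo_retention = 2000 SCOPE=BOTH;
```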

For this particular case I will set it to 2000 and monitor the undo tablespace. By increasing undo_retention, undo is kept for a longer period of time, so you should monitor the undo tablespace to make sure you have enough room. Oracle 9i and later have advisors which can help you determine the impact of raising undo_retention. The following screenshot is taken from Oracle 9i Enterprise Manager, but you can also find better graphs in 10g Database or Grid Control:




*Note: this graph is not from the same database as above. It's just for informational purposes.

We can determine from this graph that if we were to double undo retention from 30 to 60 minutes, we would need at most 70MB of space. This estimate is based on undo activity since the database was started.
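If you don't have the graphical advisor handy, a rough estimate for a given undo_retention can be derived from v$undostat using the commonly cited sizing formula (retention in seconds × peak undo blocks generated per second × block size). A sketch, assuming automatic undo management:

```sql
-- Rough undo space estimate in MB: undo_retention * peak undo blocks/sec
-- * undo tablespace block size. Treat the result as a ballpark figure only.
SELECT (ur * ups * dbs) / 1024 / 1024 AS est_undo_mb
  FROM (SELECT value AS ur
          FROM v$parameter
         WHERE name = 'undo_retention'),
       (SELECT MAX(undoblks / ((end_time - begin_time) * 86400)) AS ups
          FROM v$undostat),
       (SELECT block_size AS dbs
          FROM dba_tablespaces
         WHERE tablespace_name =
               (SELECT UPPER(value) FROM v$parameter
                 WHERE name = 'undo_tablespace'));
```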

If the UNDO tablespace was too small, you would see a value in the NOSPACEERRCNT column. Take a look at the following row with a timestamp of 23:22:


BEGIN_TIME       UNDOBLKS UNXPSTEALCNT EXPSTEALCNT SSOLDERRCNT NOSPACEERRCNT MAXQUERYLEN
---------------- -------- ------------ ----------- ----------- ------------- -----------
08/16/2008 23:22     1034           10           0           1             1        1428

This row has values for UNXPSTEALCNT and NOSPACEERRCNT. This means the transaction was not able to obtain space from the UNDO tablespace and attempted to steal 10 unexpired blocks (UNXPSTEALCNT) from other transactions. In this case we need to add space to the UNDO tablespace because it was not large enough to meet the undo_retention requirement.
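Adding space means resizing an existing undo datafile or adding another one. The file paths and sizes below are made-up examples:

```sql
-- Option 1: extend an existing undo datafile (path and size are examples)
ALTER DATABASE DATAFILE '/u01/oradata/PROD/undots01.dbf' RESIZE 2048M;

-- Option 2: add another datafile to the undo tablespace
ALTER TABLESPACE apps_undots1
  ADD DATAFILE '/u01/oradata/PROD/undots02.dbf' SIZE 1024M;
```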

Tuesday

Statistics, Concurrent requests and Terminations

There is a script on Metalink which can help you automate gathering statistics for an E-Business Suite environment. As you are probably aware, in an EBS environment you have to use FND_STATS or one of the predefined concurrent programs such as "Gather Schema Statistics".
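For reference, the same work can be done manually from SQL*Plus with FND_STATS; the schema name and sample size below are example values only:

```sql
-- Gather statistics for one schema via FND_STATS (EBS requires FND_STATS or
-- the seeded concurrent programs rather than plain DBMS_STATS).
-- 'APPLSYS' and the 10 percent estimate are example values.
EXEC fnd_stats.gather_schema_stats(schemaname => 'APPLSYS', estimate_percent => 10);
```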


Subject: coe_stats.sql - Automates CBO Stats Gathering using FND_STATS and Table sizes
Doc ID: Note:156968.1


This note contains a script which verifies statistics and generates a script and report listing all the tables that need to be analyzed. Beware that coe_stats.sql automatically executes the dynamically generated coe_fix_stats.sql at the end of its run, so make sure you comment out the line "START coe_fix_stats.sql" if you don't want this to happen.

The report also contains some information about each table: whether it's partitioned, the number of rows (at the time it was last analyzed), the sample percent, etc., as well as "Days since last analyze". I immediately noticed that the time since the tables were last analyzed was over a week. Our main Gather Schema Statistics job is scheduled to run every weekend.

What does this have to do with terminations? In a previous post I mentioned that our environment crashed as a result of bug 5907779. This happened while our weekly concurrent request to gather statistics was running. After the environment was bounced the job restarted, but now during business hours. Statistics should always be gathered during periods of low activity, because gathering them can cause performance issues. So I canceled the request via OAM, which changed its status to Phase: Completed, Status: Terminated.

The side effect of terminating a job, which had slipped my mind,* is that it will not be rescheduled. I'm not sure why that happens; it doesn't make much sense to me, but it appears to be by design. The solution is to reschedule the job.

* Apparently 8hrs of sleep in the past 72hrs can have that effect. Partly due to a last night out while on vacation, traveling, falling ill and being paged at 3am.

In summary, the goal of this post was to point out a useful Metalink note and refresh your memory to reschedule canceled requests if you need them to run again.

Friday

A Smart Performer

Alex asked a question on his blog: "As a Manager who would you prefer - a smart performer or a hard worker?" I am not a manager, but my personal preference would be to work with a smart performer. Over the past 10 years I've probably worked with over 100 DBAs, so it's probably safe to say I've seen a pretty broad range of expertise.

I personally find hard workers who don't produce results frustrating to work with. At the end of the day, if I consistently have to solve their issues then they add no value; I might as well do it all myself. I should add that I expect questions as a senior DBA, since that's part of the job. But I expect junior-level questions from junior members of the team, not senior ones. They say there is no such thing as a stupid question, and I agree. However, if you're a senior resource then you should be able to find the answer to junior questions very quickly yourself.

This debate has cropped up on mailing lists and forums many times over the past year. The general consensus seems to be that managers prefer a hard worker who has social skills over a smart performer who may be a little socially inept. I've worked with my share of socially inept DBAs, and even though at times they can make you feel like dirt stuck in the crack of their shoe, at least there is an opportunity to learn from them. Usually these DBAs have that attitude for a reason, and most times it's because of management's inability to 'fix' hard workers who produce no results or consistently repeat past mistakes.

So to answer Alex's question, "Or a fireman who has a smart solution, extinguishes the fire in two hours and saves most of your house. Which one do you choose?" I choose the fireman who saves my house, even if he curses at me for being a moron while doing so.

Tuesday

Everyone Has One

Everyone has an environment that they are tasked with supporting but dread the day a call comes in. For me it's an old 10.7 environment which we keep for historical purposes. I'm not exactly sure how often users need to run reports in there, but I'd say a few times a month. I dread supporting this environment because I have never used 10.7... When I joined this group we had already been on 11i for a few years, and since then the only people who knew anything about 10.7 have left the company.

Recently we received a call that users could not connect. To make a long story short, we had to restore the application server from a backup. I wish I could tell the full story because it's hilarious, but suffice to say I've seen a few bloggers lose their jobs for posting too many environment-specific details. Better safe than sorry.

Luckily I was going on vacation the day after we requested the restore, and I was hoping it would all be resolved by the time I returned. No such luck. Once the server was ready for us I logged in to restart the services, expecting no issues. No such luck. Executing owsctl produced the following error:


> Error initializing CORE
> Please check if the ORACLE_HOME is set correctly.



Great. Luckily a 7-year-old archived document on Metalink pointed me in the right direction: Note:1015720.101. The note covers the case where the location of your ORACLE_HOME has changed but the symbolic links under $ORACLE_HOME/ows/mesg were not updated to the new location.

Well, in this case everything should have been identical, but with no other hits on Metalink it was my only option. Indeed, the symbolic links were pointing to an invalid location. When the unix admin restored the files he had to put them in a temporary location first and copy them over. The restore process must have recreated the symbolic links, because they were still pointing to that temp location.
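The repair itself is simple once you spot the stale links. Here is a self-contained sketch of the technique; the demo paths stand in for $ORACLE_HOME/ows/mesg and the restore's staging area:

```shell
# Set up a demo directory with a symlink whose target no longer exists,
# mimicking links left pointing at the restore's temporary staging area.
mkdir -p demo/mesg demo/new_home
ln -sfn /tmp/restore_staging/core.msb demo/mesg/core.msb

# List symlinks whose targets no longer exist
find demo/mesg -type l ! -exec test -e {} \; -print

# Repoint the stale link at the correct location (-f replaces it in place)
touch demo/new_home/core.msb
ln -sfn "$PWD/demo/new_home/core.msb" demo/mesg/core.msb
```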

Sweet, I found the solution, so I fired up the startup scripts again... No such luck. Now I faced the following error:


OMN-2001, could not contact address server
Information: Application init function oracle_adp_init returned ADI_FATALINIT
Error: The server could not initialize
Information: The server is exiting
OWS-08811: Unable to startup Oracle Web Listener `xyz01'.



This time Note:1016600.102 came to my rescue. OMN-2001 means the admin process was started before the WRB (Web Request Broker). The solution: start the WRB before admin. Then I could proceed to starting the listener.

While the first problem was a result of the restore, the second could have been avoided with good documentation. Now that I know the proper order to start the services (wrb, admin, listener), I was able to find the correct startup scripts.

Friday

DB2 No More

A couple of posts ago I mentioned that the majority of my time was going to be spent learning DB2. I also mentioned I was going to put some feelers out internally to see if any Oracle Apps jobs were available. So over the past few weeks I networked quite a bit, and with luck I managed to find an opening. This week I start part time in an Oracle Applications DBA group, and hopefully as integration work slows down I'll spend more and more time over there.

So now I can dedicate my free time to learning more about Oracle Apps (which hopefully means more blog posts) instead of ramping up on DB2.

Tuesday

Security Patch Woes

A few weeks ago we applied a number of security patches. Due to various reasons we were a bit behind schedule and had to push a couple of releases out to production. Since then we have encountered 3 bugs, one of which crashed production just before month end. 2 of the bugs were the result of upgrading to 10.2.0.3 (a requirement for the security patches); the other was a bug on top of ATG_PF.H.5.

Problem 1:

After cloning, the Concurrent Managers fail to start. As per Note:434613.1, this problem exists on top of ATG_PF.H delta 5 and delta 6, as well as R12. During cloning neither the service manager nor the internal monitor is created, thus the concurrent managers will not start. A patch has been published to resolve this issue, and there is a very easy workaround. We have added the patch to the next release cycle and modified our cloning scripts to incorporate the workaround.

The interesting part about this problem is that we have been on ATG_PF.H.5 since December, when we upgraded to 11.5.10.2. Since then we have cloned a test environment over 200 times (it's rebuilt nightly) and did not encounter this bug. We did hit it once back in March for a one-off clone, but since then we have recloned that environment plus many others multiple times and it didn't resurface. I find it interesting that only now, after upgrading to 10.2.0.3 (from 10.2.0.2) and applying the security patches, can we reproduce the problem consistently.

In retrospect we should have added this patch to the release cycle earlier, but we tend not to recommend that unless a problem can be consistently reproduced.


Problem 2:

Users stopped receiving email notifications from workflow. The following error could be seen in the logs:

ORA-06502: PL/SQL: numeric or value error: associative array shape is not consistent with session parameters


A quick search turned up bug 5890966, which mentions this problem can occur during periods of high activity. Once we encountered this bug, emails ceased to be sent. Oracle confirmed that this is a mandatory patch for 10.2.0.3, but it has not been published as such yet.

Thanks to the next problem, though, we had to restart our environment, and the issue hasn't reoccurred yet. We have added the patch to the next release cycle and hopefully it won't reoccur before then.

Problem 3:

On 10.2.0.3, bug 5907779 can cause sessions to hang themselves if dbms_stats is executing... I recall reading a few blog posts about this particular error, but since it wasn't recorded as a mandatory patch we didn't apply it. At least the blog posts helped me identify the problem quickly.

Our statistics gathering jobs are scheduled on weekends, and in this particular case only one type of session was hanging as a result of this bug. This session was spawned by an integration which is scheduled to execute once every few minutes. Unfortunately, it wasn't smart enough to detect that previous instances were still running, and of course it isn't monitored on weekends.
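One way the integration could have protected itself is a non-blocking lock in its wrapper script. A minimal sketch using Linux's flock; the lock file name and the placeholder work are made up for illustration:

```shell
# Run the payload only if no previous instance still holds the lock.
# /tmp/integration.lock is a made-up path; replace it with a real one.
(
  flock -n 9 || { echo "previous run still active, skipping"; exit 0; }
  # ... the real integration command would go here ...
  echo "integration work done"
) 9>/tmp/integration.lock
```

With this guard, overlapping cron or scheduler invocations exit quietly instead of piling up sessions in the database.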

So as the weekend wore on, more and more sessions consuming more and more resources accumulated in the database. We didn't realize there was a problem until Sunday night when APO users came online. Unfortunately, by then it had progressed to the point where the system ran out of resources, sessions couldn't be killed and we had to reboot the server. Luckily I was able to capture enough information (hang analyze and system state dumps) to confirm that bug 5907779 was the culprit. Everything came back up properly and, as a bonus, the restart temporarily fixed our workflow email issues.


Unfortunately these types of problems (at least numbers 2 and 3) are not likely to surface in a test/dev environment. We have some patch review meetings coming up over the next few days to re-examine our processes, but I'm not sure how we can prevent these types of problems in the future. Note:401435.1 lists a number of issues specific to 10.2.0.3. I guess I could have analyzed each of those patches to determine if they were applicable to our environment, but what's to say they wouldn't have introduced additional bugs? Even then, I would have only prevented one issue, since the workflow patch isn't listed in that note. Normally I just review Note:285267.1, the EBS 11i and Database FAQ, to make sure there are no known issues.

Feel free to leave a comment describing how you analyze patchsets and full releases...