Tuesday

RC-10203 and directory permissions

This past week we hit a weird issue while trying to clone a test environment from production. It took a fair bit of effort to figure out the problem so I thought i'd mention it in case you have a similar setup but haven't added this item to your checklist.

Almost a year ago we moved to a shared application filesystem. So as part of the cloning process we have to run the script:

$ cd
/bin
$ perl adclonectx.pl sharedappltop \
contextfile=
'applications context file for the existing node'

This configures the node to be able to access the shared setup. This script errored out:

Choose a value which will be set as APPLPTMP value on the target node [1]:
AC-10006: Exception - java.io.FileNotFoundException: /admin/template/adxmlctx.tmp thrown while creating OAVars object for file: ../admin/template/adxmlctx.tmp
RC-50004: Error occurred in CloneContext:
RC-10203: The template provided does not match with the context file.
Check logfile /apps/appsora/oraappl/admin/ORA/out/CloneContext_07270115.log for details.
ERROR: context creation not completed successfully.
Please check /tmp/adclonectx.err file for errors


Our environment hasn't changed much and cloning is a pretty regular exercise. So my first thought is what could be different now than previously? To me this type of problem screams setup issues.

On our side everything seemed fine. It wasn't until we noticed the following that any progress was made:

$ pwd
/apps/appsora/oraappl
$ df -k .
df: cannot canonicalize .: Permission denied


Since this is a shared application file system, the file system is mounted via NFS. Apparently the permissions on the mount point (/apps/appsora) were 700 so it wasn't writable by the group (in this case dba). I still wasn't sure why this was causing our problem... I could view, create, modify files since the account I was using was the owner. (UID's and GID's are global)

To fix the permissions problem the file system was unmounted, and the directory permissions changed to 770. Once it was remounted the adclonectx script worked properly. So you may want to add directory permissions to your checklist if your using NFS and EBS.

AWR Operation failed

I know most are talking about 11g but we finally upgraded our production EBS database to 10g a few weeks ago. A few months ago, while a fellow DBA was on vacation, I used 10g's diagnostics features to troubleshoot a hanging issue in their data warehouse DB. The historical view which allows you to see what was happening in a database during a window of time is priceless and easily helped me resolve the issue. How often do users call you as they are experiencing performance problems? Never...! I'll be in a weekly meeting and users will say that they experienced issues 3 days ago! So I was looking forward to this ability.

Much to my surprise tho when I clicked on the performance tab in database control, the data was a couple of days old. 10 snapshots after the upgrade to 10g finished, it suddenly stopped working. I checked the alert log, trace files, etc but the only error I could find was if I tried to change the AWR settings. ie, changing the snapshot interval would consistently produce the following error:


ORA-13516: AWR Operation failed: only a subset of SQL can be issued ORA-06512: at
"SYS.DBMS_WORKLOAD_REPOSITORY", line 10 ORA-06512: at
"SYS.DBMS_WORKLOAD_REPOSITORY", line 33 ORA-06512: at line 1


A search of Metalink didn't result in any good hits so I opened a tar. An RDA and a few updates later they asked me to execute the following query to see if AWR was in restricted mode. In our case, it was not:

SQL> select nam.ksppinm name, val.KSPPSTVL, nam.ksppdesc description
2 from x$ksppi nam, x$ksppsv val
3 where nam.indx = val.indx AND
4 nam.ksppinm = '_awr_restrict_mode'
5 order By 1
6 /

NAME
--------------------------------------------------------------------------------
KSPPSTVL
--------------------------------------------------------------------------------
DESCRIPTION
--------------------------------------------------------------------------------
_awr_restrict_mode
FALSE
AWR Restrict Mode


The only other suggestion was to restart the database. This past weekend we had to apply some one-off patches so we took advantage of the maintenance window and restarted the database. Since then it has been working properly. I'm not a big fan of this kind of solution, I'd rather know the root cause and if its preventable. But in this case it seems Oracle Support is not sure either.

Monday

ORA-00376: file 6 cannot be read at this time

It seems each night I try to hit the sack a little early (1am) I get paged shortly after. Last night was no exception with a page coming in from one of our application support teams. Users in Australia were having issues logging into one of our customer support applications.

All of the environments I support are unix based but our team does support some Windows servers. So while i'm on call i'm responsible for those as well.

The first thing I do is check the alert log and I notice the following errors repeating over and over:

Errors in file d:\orant9i\admin\orcl\udump\orcl_j000_2096.trc:
ORA-12012: error on auto execute of job 22
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: 'D:\ORCL\MYDATA01.DBF'
ORA-06512: at "SCHEMA.STOREDPROCEDURE", line 3
ORA-06512: at line 1


I seemed to recall some of our other windows based database servers having issues in the past with backup software locking the files. So the next thing I do is called the Intel Support group in Australia and ask them if any backups were executed today. Indeed this past weekends full backups failed and they initiated some manual backups today.

The support person confirmed that the OS backup had completed, so I restarted the database. The database opened successfully and I checked the alert logs to verify there were no errors. As it had in the past occurrences, these steps resolved the issue.

Since I couldn't get back to sleep I kept an eye on the alert log and surfed the web on my home PC. About 20 minutes later the error appeared again followed by an email from a manager in Australia.

I was kind of surprised by this so I checked v$datafile. The status of 3 files was "RECOVER". My first thought was, why do I always get the fun issues when i'm on pager?!

This was a relatively small database and before I attempt any recovery I always take a cold backup. (If time allows.) Once I restarted the database I executed the following commands for each datafile that was marked as needing recovery:

SQL> recover datafile 'c:\my\datafile.dbf';
SQL> alter database datafile 'c:\my\datafile.dbf
' online;

Media recovery was successful and the problem did not reoccur. So just a note for those windows DBA's out there. If you have automated backup software ensure that your datafile directories are excluded. While on unix the worst case would be an inconsistent backup, on windows, file locking behaves differently and could cause an outage like the one I experienced.

Well, i'm off to bed shortly and hopefully will get a good nights sleep!

Wednesday

Resetlogs and your Rman catalog

This is one of those, I knew it at one point but its been so long ago items. Also, the manner in which it was discovered had us scratching our heads for a minute. I try not to blog about work specific items, you never know who is watching out there and i'm sure we have all heard of bloggers who have been fired. Playing it safe, lets just say, someone made a boo boo.....

So just a reminder that when a database is opened with resetlogs your creating a new incarnation of the database. If your using an rman catalog, you will have to login to your catalog and execute register database. Otherwise you will get an error:

Recovery Manager: Release 9.2.0.8.0 - Production

Copyright (c) 1995, 2002, Oracle Corporation. All rights reserved.

connected to target database: MYDB (DBID=2508956331)
connected to recovery catalog database

RMAN> 2> 3> 4> 5> 6>
Starting backup at 11-JUL-07
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 07/11/2007 03:30:00
RMAN-06004: ORACLE error from recovery catalog database: RMAN-20003: target database incarnation not found in recovery catalog