The first thing you can do, and you should do this regularly, is run an Exacheck (exachk) report. Oracle Metalink Note 1070954.1 describes how to download Exacheck, and the included user guide and best practices documents describe how to install and run it.
It's important to note that new versions of Exacheck are released fairly regularly, so download, install and run the most current version before you install the QFSDP. Running Exacheck reports on a regular basis is good practice in any case: it is a good way to cross-check your Exadata system for failures and errors, and to catch any configuration changes that might have slipped through.
I've run Exacheck before every patch that I've applied. In many cases we have found problems on the Exacheck report that needed to be corrected before we could apply the patch. While many Exadata environments have ASR (Auto Service Request, the "phone home" feature) set up for their Exadata machines, many do not, for various reasons. If you are working with an Exadata machine that is not set up with ASR, you need to be extra vigilant when reviewing your system's health. The beauty of Exadata is its redundancy, but that redundancy can also mask a variety of problems. If you don't catch and correct those problems, things can and do eventually fail. If you find problems on the day you planned to apply patches, your patching schedule may go out the window as you scramble to replace a bad disk or cable, or fix any number of other small things that can go wrong.
If you look at the Oracle Exadata Assessment Report, which is one of the reports that Exacheck produces, you will see a message similar to this one:
Note! This version of exachk is considered valid for 31 days from today or until a new version is available
So Exacheck itself tells you how much longer the current version remains valid.
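If you keep your Exacheck output around, you can catch a stale version before patch day. The sketch below is a hypothetical helper, not part of exachk: it pulls the day count out of the validity banner quoted above and checks whether that window still covers your planned patch date. It assumes the banner wording matches the message shown here (other exachk releases may phrase it differently), and remember that a newer release supersedes the window regardless of the day count.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Matches the validity banner quoted above, e.g.
# "This version of exachk is considered valid for 31 days from today ..."
# Assumption: other exachk releases may word this banner differently.
VALIDITY_RE = re.compile(r"considered valid for (\d+) days from today")


def days_remaining(report_text: str) -> Optional[int]:
    """Return the number of valid days stated in the banner, or None if absent."""
    match = VALIDITY_RE.search(report_text)
    return int(match.group(1)) if match else None


def still_valid_on(report_text: str, run_date: date, patch_date: date) -> bool:
    """True if the exachk version used on run_date is still inside its
    stated validity window on the planned patch_date."""
    days = days_remaining(report_text)
    if days is None:
        # No banner found: treat validity as unknown and re-check manually.
        return False
    return patch_date <= run_date + timedelta(days=days)


if __name__ == "__main__":
    banner = ("Note! This version of exachk is considered valid for 31 days "
              "from today or until a new version is available")
    print(days_remaining(banner))                                       # 31
    print(still_valid_on(banner, date(2024, 3, 1), date(2024, 3, 20)))  # True
    print(still_valid_on(banner, date(2024, 3, 1), date(2024, 4, 15)))  # False
```

If the check returns False, download the current Exacheck and rerun it before patching rather than relying on the old report.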
So, you have your Exacheck report. What should you look for before you install the patch sets? Exacheck can flag a number of things; some are worth paying attention to, and others are less important.
First, notice the disclaimer on the report that says:
NOTE : exachk is only one part of the MAA Best Practices recommendation methodology. My Oracle Support "Oracle Exadata Best Practices (Doc ID757552.1)" should be reviewed thoroughly as it is the driver for exachk and contains additional operational and diagnostic guidance that is not programmed within exachk.
So, Exacheck is just one part of the overall set of best practices that Oracle recommends, and notice that the disclaimer tells you to review the note thoroughly. Exacheck is not meant to remove the need for the DBA to think; rather, it gives the DBA things to think about. Just because the report flags something does not mean it's bad, and just because the report does not flag something does not mean things are good. At the end of the day, you have to use your experience to interpret the report and decide what is important in your environment and what is an exception.
One output from the Exacheck execution is the Oracle Exadata Assessment Report. At the top of this report is a score assigned to your Exadata system, which is supposed to represent the overall health of the system. While this score is a good guide, it's a bit like the database hit ratio: it can be misleading, and it can cause you to get target fixation. I do not recommend that you focus on raising the score. Rather, review the results, determine which findings are critical and really apply to your environment, and then correct those problems. I've seen people pay way too much attention to this score and set a goal of getting it as close to 100 percent as possible, rather than analyzing the results and using the score as a general metric around which they build reasonable thresholds. Don't get score fixation.
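One way to treat the score as a general metric rather than a target is to alert only when it crosses a threshold you have chosen, and then go read the findings behind the change. In this sketch the floor and the maximum drop between runs are hypothetical values, not Oracle recommendations, and the scores are supplied by you (for example, copied from each report's header), oldest first.

```python
from typing import List

# Hypothetical thresholds: pick values that make sense for your environment.
SCORE_FLOOR = 80   # investigate if the latest score falls below this
MAX_DROP = 5       # investigate if the score drops this much between runs


def score_alerts(history: List[int], floor: int = SCORE_FLOOR,
                 max_drop: int = MAX_DROP) -> List[str]:
    """Return alerts for the most recent score in `history` (oldest first).

    An empty list means no threshold was crossed; it does NOT mean the
    individual findings can be skipped.
    """
    if not history:
        return []
    alerts = []
    latest = history[-1]
    if latest < floor:
        alerts.append(f"score {latest} is below the floor of {floor}")
    if len(history) >= 2:
        drop = history[-2] - latest
        if drop >= max_drop:
            alerts.append(f"score dropped {drop} points since the previous run")
    return alerts


if __name__ == "__main__":
    print(score_alerts([88, 87, 89]))  # []
    print(score_alerts([88, 81]))      # ['score dropped 7 points since the previous run']
```

The alert is only a prompt to go look at what changed; the decision about which findings matter still comes from reviewing them.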
The Exacheck report contains various checks on the Exadata stack, including:
- ASM check
- Cluster wide check
- Database check
- ORACLE_HOME check
- OS Check
- Patch check
- SQL check
- SQL parameter check
- Storage server check
- Infiniband switch check
These checks cover a variety of things, such as version checks, best practices checks and so on. So the question is: what is truly important when it comes to Exacheck and making sure everything is ready for the patch to be applied?
Generally, I will ignore database setting failures out of the gate. This isn't because they are not important, but because a database parameter on a working database that is not set in alignment with best practices generally is not going to impact the application of the patches. So, let's say that you saw the following on your Exacheck report:
FAIL | Database Check | Database control files are not configured as recommended | All Databases |
Would this failure require resolution before you applied your patches? Probably not. That's not to say you shouldn't look into it and figure out whether your databases are adequately protected, but it should not stop you from applying a database patch.
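If you want to sort findings like the one above programmatically, a small parser is enough. This is a sketch that assumes findings are available as pipe-delimited text lines in the form quoted in this post (status, check type, message, optional target); the actual HTML report layout may differ, so adapt the extraction step to whatever output you are reading. The `Finding` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Statuses named in the exachk report text quoted in this post.
STATUSES = {"FAIL", "WARNING", "ERROR", "INFO"}


@dataclass
class Finding:
    status: str             # FAIL, WARNING, ERROR or INFO
    check_type: str         # e.g. "Database Check", "OS Check"
    message: str
    target: Optional[str]   # e.g. "All Databases"; absent on some lines


def parse_finding(line: str) -> Optional[Finding]:
    """Parse one 'STATUS | Check Type | Message | Target |' line.

    Returns None for lines that are not findings.
    """
    fields = [f.strip() for f in line.split("|")]
    # The trailing '|' leaves an empty last field; drop trailing empties only.
    while fields and not fields[-1]:
        fields.pop()
    if len(fields) < 3 or fields[0].upper() not in STATUSES:
        return None
    target = fields[3] if len(fields) > 3 else None
    return Finding(fields[0].upper(), fields[1], fields[2], target)


def parse_findings(text: str) -> List[Finding]:
    """Parse every finding line in a block of report text."""
    parsed = (parse_finding(ln) for ln in text.splitlines())
    return [f for f in parsed if f is not None]


if __name__ == "__main__":
    sample = (
        "FAIL | Database Check | Database control files are not configured as recommended | All Databases |\n"
        "WARNING | OS Check | Free space in root(/) filesystem is less than recommended. |"
    )
    for f in parse_findings(sample):
        print(f.status, "/", f.check_type, "/", f.target)
    # FAIL / Database Check / All Databases
    # WARNING / OS Check / None
```

Collecting findings this way makes it easy to filter by status or check type, but deciding which ones actually block a patch is still the DBA's call.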
Now, what about this one:
WARNING | OS Check | Free space in root(/) filesystem is less than recommended. |
Note that this is a warning and not a failure, which seems a bit odd to me. What would you do if you saw this message? Would you apply your patches? No, of course not. Since the compute nodes are among the things we are going to patch, we want to make sure that there is plenty of space on the root file system. We are also going to back up the boot partition on these boxes, and we want it as clean as possible before we back it up or patch it. So, in my mind, even though this is reported as just a warning, it's really something important to look into.
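Before patching, it is worth confirming free space on / on each compute node yourself rather than relying only on the warning. Here is a minimal Python sketch using the standard library's `shutil.disk_usage`; the 10 GiB threshold is a hypothetical placeholder, so substitute the requirement stated in the patch README or in the Exacheck finding details.

```python
import shutil

# Hypothetical placeholder: replace with the free-space requirement from
# the patch README or the Exacheck finding details.
MIN_FREE_GB = 10


def free_gb(path: str = "/") -> float:
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def root_ready_for_patching(path: str = "/", min_free_gb: float = MIN_FREE_GB) -> bool:
    """True if the filesystem holding `path` has at least `min_free_gb` free."""
    return free_gb(path) >= min_free_gb


if __name__ == "__main__":
    gb = free_gb("/")
    status = "OK" if gb >= MIN_FREE_GB else "NOT READY: clean up before patching"
    print(f"/ has {gb:.1f} GiB free -> {status}")
```

Run it on every compute node; a single node short on space is enough to hold up the patch.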
What do you do if you get a result and you don't know how to react to it? Please, please contact Oracle Support and ask them! Don't assume it's OK, or that it's not important. If you don't understand it, take steps to understand it and then determine whether you need to resolve it.
So, what are the issues that I worry the most about?
- Network issues. Cabling, configuration, all are important.
- ILOM issues. If an ILOM is flaky, you have a real problem. DO NOT PATCH IF YOU HAVE AN ILOM ISSUE. PERIOD!
- Storage cell issues. The storage cells should all be consistently error free. If an error at the storage cell layer is flagged, pay attention to it. My personal rule: don't patch anything unless the storage cells are 100% healthy without question.
- Any database server OS checks that fail due to configuration, InfiniBand or other infrastructure-related errors.
- Any database server verify-topology checks that fail.
- Any database errors indicating that versions differ or are incorrect, and other errors of that type.
- If the Exacheck report includes the following: WARNING! The data collection activity appears to be incomplete for this exachk run. Please review the "Killed Processes" and / or "Skipped Checks" section and refer to "Appendix A - Troubleshooting Scenarios" of the "Exachk User Guide" for corrective actions.
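The list above can be turned into a rough first-pass filter. The sketch below is a hypothetical heuristic, not an exachk feature: it assumes pipe-delimited finding lines like the two quoted earlier in this post, treats FAIL, WARNING and ERROR findings whose check type or message mention the areas above (plus root free space) as blockers, and treats the incomplete-collection banner as a blocker in its own right. The keyword list is an assumption and deliberately crude; it decides what you read first, it does not replace reading the whole report.

```python
from typing import Dict, List

# Hypothetical keyword rules reflecting the priority list above (network and
# cabling, InfiniBand, ILOM, storage cells, verify-topology, versions) plus
# root free space; tune them to the wording your own reports actually use.
BLOCKER_KEYWORDS = [
    "network", "cabl", "infiniband", "ilom", "storage server", "cell",
    "verify-topology", "version", "free space",
]
INCOMPLETE_BANNER = "data collection activity appears to be incomplete"
FINDING_STATUSES = {"FAIL", "WARNING", "ERROR", "INFO"}


def triage(report_text: str) -> Dict[str, List[str]]:
    """Sort pipe-delimited findings into 'blocker' and 'review' buckets.

    blocker: FAIL/WARNING/ERROR findings whose check type or message matches
             a keyword, plus the incomplete-collection banner (the run
             itself cannot be trusted, so rerun exachk first).
    review:  every other finding; still read it, just not first.
    """
    result: Dict[str, List[str]] = {"blocker": [], "review": []}
    if INCOMPLETE_BANNER in report_text.lower():
        result["blocker"].append("exachk data collection incomplete: rerun before patching")
    for line in report_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3 or fields[0].upper() not in FINDING_STATUSES:
            continue
        status, check_type, message = fields[0].upper(), fields[1], fields[2]
        haystack = f"{check_type} {message}".lower()
        is_blocker = status != "INFO" and any(k in haystack for k in BLOCKER_KEYWORDS)
        bucket = "blocker" if is_blocker else "review"
        result[bucket].append(f"{status}: {check_type}: {message}")
    return result


if __name__ == "__main__":
    sample = (
        "FAIL | Database Check | Database control files are not configured as recommended | All Databases |\n"
        "WARNING | OS Check | Free space in root(/) filesystem is less than recommended. |"
    )
    for bucket, items in triage(sample).items():
        print(bucket, items)
```

With the two findings quoted earlier, the root free space warning lands in `blocker` and the control file failure lands in `review`, which matches the reasoning in this post.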
FAIL, WARNING, ERROR and INFO finding details should be reviewed in the context of your environment.
Finally, once you have run Exacheck, reviewed its findings and corrected those that need correction, it's a good idea to document the final Exacheck results for future reference.
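Documenting the final results is also easy to automate. This sketch, again assuming findings are available as pipe-delimited lines, builds a JSON-serializable snapshot of one run (counts per status plus the normalized finding lines) and compares two snapshots to show which findings are new and which were resolved, which is one way to spot configuration changes that slipped through between runs. The function names and snapshot layout are hypothetical; keep the original Exacheck output alongside the snapshot.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List

FINDING_STATUSES = {"FAIL", "WARNING", "ERROR", "INFO"}


def finding_lines(report_text: str) -> List[str]:
    """Return normalized 'STATUS | Check | Message[ | Target]' lines."""
    lines = []
    for raw in report_text.splitlines():
        fields = [f.strip() for f in raw.split("|")]
        while fields and not fields[-1]:   # drop the trailing empty field
            fields.pop()
        if len(fields) >= 3 and fields[0].upper() in FINDING_STATUSES:
            fields[0] = fields[0].upper()
            lines.append(" | ".join(fields))
    return lines


def baseline_record(report_text: str, run_date: str) -> Dict:
    """Build a JSON-serializable snapshot of one Exacheck run."""
    findings = finding_lines(report_text)
    counts = Counter(line.split(" | ", 1)[0] for line in findings)
    return {"run_date": run_date, "counts": dict(counts), "findings": findings}


def save(record: Dict, path: Path) -> None:
    """Write a snapshot to disk as indented JSON."""
    path.write_text(json.dumps(record, indent=2))


def compare(old: Dict, new: Dict) -> Dict[str, List[str]]:
    """Findings that appeared or disappeared between two snapshots."""
    before, after = set(old["findings"]), set(new["findings"])
    return {"new": sorted(after - before), "resolved": sorted(before - after)}


if __name__ == "__main__":
    first = baseline_record(
        "FAIL | Database Check | Database control files are not configured as recommended | All Databases |\n"
        "WARNING | OS Check | Free space in root(/) filesystem is less than recommended. |",
        "2024-03-01")
    second = baseline_record(
        "FAIL | Database Check | Database control files are not configured as recommended | All Databases |",
        "2024-03-15")
    print(json.dumps(second, indent=2))
    print(compare(first, second))
```

Here the root free space warning shows up under `resolved`, and anything listed under `new` on a later run is worth investigating before the next patch.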
So, now you have run Exacheck and corrected all the problems that need attention! We are one step closer to being ready to apply that patch!
Well, I hope this blog entry was helpful. In the next entry I'll discuss some Exacheck results I've seen in the past and their impact. Then we will talk about other quarterly bundle patch topics. Your comments about your experience with Exacheck are most welcome! Have a great weekend!
1 comment:
Thanks Robert. A healthy post on Exachk.