Wednesday, July 24, 2013

Exadata Patching - Part Four...

So... you have downloaded the patch. Now what do you do? Good question.

Installing the Exadata patch sets *usually* involves the following:

1. Patch the cell servers.
2. Patch the compute node OS.
3. Patch the GRID_HOME and ORACLE_HOME locations.
4. Run the database upgrade script on each database.

I say usually because there could be additional steps to perform, such as firmware updates, in any given patch set. There are also often updates to Enterprise Manager included in the patch sets, so you will want to consider those updates as well.

So, let's take a look at patching the cell servers in a bit more detail. In later posts I'll address the other steps.

I can't stress enough how important it is to install these patches from a location with reliable networking. While the internal Exadata network is very fast, redundant and reliable, it's an unfortunate truth that the networks outside the Exadata box may not always be the same. What you don't want is a slow or unreliable network connection that disconnects you in the middle of the application of the patch. If your network is not reliable, you should consider other options such as the KVM, or directly connecting to the switch between the Exadata rack and your network if possible.
Patching the Cell Servers
The first thing we will want to do is patch the cell servers. Applying the cell server patch will update the entire cell server including OS updates, firmware updates and any other updates that are needed on the cell server (such as configuration related updates). This makes patching the cell servers much easier and it also ensures that the server images remain consistent.

Before you start to apply the cell server patch you will want to decide whether you want to keep your Exadata rack up and available during the patch (which requires a rolling patch application) or whether you want to take the rack down for about two hours to apply the patches.

The downside of doing a rolling patch is that you are looking at anywhere between one and two hours, per cell server, to apply the patch. Of course, the rack is up and available during this time, so the users don't notice anything. From an administrative point of view, though, this can represent a significant investment of time to upgrade the rack. If you are running a half rack with 7 cell servers, for example, you are looking at anywhere from 7 to 14 hours to upgrade all of the cell servers. That's a long time for someone to sit and monitor the progress of the patch application!

The benefit of the rolling patch is clear though. If the patch fails, it only fails on one node, and that failure should not impact your availability. The loss of one cell server generally isn't something that's going to take the entire rack down. 

On the other hand, a non-rolling upgrade patches all of the cells at once. This has the benefit of shortening the time to upgrade the entire rack of cell servers - say 1 to 2 hours (I'd plan on 2 hours). On the other hand, upgrading all of the cell servers at once can be a bit of a frightening experience. This is because you are sitting there, hopeful, that the patch will apply successfully on all of the cells, that you have not made any mistakes and that unicorns fart rainbows.

If you can take the outage, my personal preference is to run the upgrade on one cell server first (which allows the system to remain up) and then take the outage and upgrade the remaining cells all at once. I just like to see the patch apply successfully once so I get warm fuzzies, and generally I like to reduce the overall time it takes to apply patches for various reasons. Of course, if uptime is your principal goal, then you will want to do a rolling patch.

You can find the specific instructions for applying the current (July) patch to the cells on MOS (you will need a My Oracle Support account to access the note). Note that each patch has its own install instructions, so you should review them to make sure that nothing has changed in the install process.

Cell Server Patching Prerequisites
While there are a number of steps that you will want to perform, there are some that I consider critical to applying the patch (at this time). They are:

1. Make sure you can access the ILOM on each cell server. This is a critical step, so verify connectivity to each ILOM before you begin.
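As a sketch of that check, suppose the ILOM hostnames are collected in a small text file (the hostnames below are invented - substitute your rack's naming scheme); a quick reachability loop might look like:

```shell
# Hypothetical file of ILOM hostnames, one per line (names are examples).
cat > /tmp/ilom_group <<'EOF'
exa1cel01-ilom
exa1cel02-ilom
exa1cel03-ilom
EOF

# Confirm each ILOM answers on the network before you start patching.
while read -r ilom; do
  if ping -c 1 -W 2 "$ilom" > /dev/null 2>&1; then
    echo "$ilom reachable"
  else
    echo "$ilom NOT reachable - fix this before patching"
  fi
done < /tmp/ilom_group
```

A ping only proves the network path; you will still want to actually log in to each ILOM as described in the patch note.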

2. I usually like to run the patches from one of the compute nodes. Determine which compute node you want to install the patch from and check SSH connectivity from that node to each of the cell servers. Really, if SSH wasn't working right the rack would probably be having problems anyway, but it's always a good idea to double check. The following command can be used to check for SSH connectivity (and user equivalence):

dcli -g cell_group -l root 'hostname -i'

dcli is a nice utility that provides the ability to run commands across a set of nodes via SSH. 
You might wonder what the cell_group business is about. This is a text file (usually in /home/oracle) that lists all the cell servers in the rack. Normally this file should have been created when the Exadata rack was installed.
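For illustration, a cell_group file is nothing fancy - just one storage cell hostname per line (the hostnames here are made up; yours were set when the rack was installed):

```shell
# A minimal cell_group file - one storage cell hostname per line.
# (Hostnames are examples only.)
cat > /tmp/cell_group <<'EOF'
exa1cel01
exa1cel02
exa1cel03
EOF

# With the file in place, the user-equivalence check from above becomes:
# dcli -g /tmp/cell_group -l root 'hostname -i'
```

If each cell echoes back its IP address without prompting for a password, user equivalence is working.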

Also, make sure you check the ciphers as indicated in the install documentation.

3. Check the prerequisites that apply to the type of patch you are applying. For example, if you are doing a rolling patch you will need to adjust the disk_repair_time attribute for the ASM disk groups. If you are patching all the cells at once, you will be shutting down CRS and all associated services at this time.
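For the rolling case, the ASM change is a one-line attribute per disk group. A sketch, assuming a disk group named DATA and a 24-hour window (both are examples - size the window to comfortably exceed your per-cell patch time):

```shell
# Sketch only - run as the grid user, connected to the ASM instance:
#
#   sqlplus / as sysasm
#   SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '24h';
#
# Repeat for each disk group, and set the attribute back to its original
# value (the default is 3.6h) after all of the cells have been patched.
```

The point of raising disk_repair_time is to keep ASM from dropping a cell's disks while that cell is down for patching.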

4. While the patching program (patchmgr) will check for sufficient disk space, I like to manually check the available disk space on each cell server just to be sure.
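To sketch that manual check (the filesystem and the dcli form are illustrative), the rack-wide version runs df on every cell via dcli, and the per-host measurement is plain df:

```shell
# Across all cells (requires the Exadata dcli utility and a cell_group file):
# dcli -g cell_group -l root 'df -h /'

# The same measurement on a single host - available KB on the root filesystem:
avail_kb=$(df -Pk / | awk 'NR==2 {print $4}')
echo "available on /: ${avail_kb} KB"
```

Compare what each cell reports against the space requirement stated in the patch README.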

Of course, this is just a few of the steps that you need to execute before you apply the patch. As I said before, the best thing in my mind is to create a checklist that clearly calls out all the steps you intend to execute. While the documentation is good, it mixes instructions for several different scenarios, and I feel it's a much easier and clearer process to follow a checklist first, referencing the documentation when required.

Applying the Patch
Once you have completed the prerequisites, it's time to patch the cells. To perform the upgrade you use the patchmgr utility, which provides an automated way of applying the patches to the storage cells. Usually you will run it from a compute node while logged in as root. Oracle specifically says not to use the ILOM for applying patches, which makes sense if you think about it. :)

When using the patchmgr utility you will first clean up any old patching-related files, then run the prerequisite checks, and then start your patching. patchmgr will be in the root of the directory where you unzipped the patch files to. The basic commands you would run are:

./patchmgr -cells cell_group -cleanup
This will clean up any previous patch runs on all nodes in the cell_group group.
./patchmgr -cells cell_group -patch_check_prereq [-rolling]
This executes the prerequisite patch check. It will check the node(s) for space availability, current patch compatibility and so on. Note that this will apply to all nodes in the cell_group list. If any errors appear when running the prerequisites, they should be corrected before proceeding, and then you should run the patchmgr cleanup and prerequisite check process again.

./patchmgr -cells cell_group -patch [-rolling]
This executes the actual patch. Using the -rolling parameter will start a rolling patch upgrade. This will apply to all nodes in the cell_group list.

There have been times that I've had to fully qualify the path to the cell_group file.

The biggest stress I find during the whole process is the rebooting of the cell servers. You really kind of find yourself praying that they will come back up. It can take time, so be patient.

Bottom Line
The bottom line is that installing the patches is generally a pretty straightforward process, but you should carefully follow the instructions provided by Oracle.

Next post, I'll talk about upgrading the Compute Node OS.
