Tuesday, July 05, 2011

DBA 3.0 (The Holistic DBA) – Part Three

It’s been a little while since I wrote part two of my DBA 3.0 series. It’s 3 a.m., I can’t sleep, I don’t feel well and I’m cold.

A perfect time to write.

First, I’ve decided that my concept of DBA 3.0, though well intended, is not very descriptive. There is one comment on my previous post where someone mentioned that they used the moniker DBA 3.0 in a work to describe something. Now, DBA 3.0 is generic enough that I don’t feel all that bad about adopting the name, but in my mind that is not descriptive of the concept. So, I’m rechristening the idea as the Holistic DBA (and maybe to make it extra cool we could add something like – The Next Generation to the title…. naaahhhhhhh).

In exploring the idea of the Holistic DBA in my previous posts, I lamented that many are entrenched in their cubicles, that many do not speak the language of business, that many are too busy writing scripts or digging into the next new technology. All of these things, in their proper proportion, are good things, no doubt. It’s when the priorities get out of whack that what we do (or, as often, what management tells us to do) can get us into trouble. In my world view, the holistic DBA is not a specialized DBA. They don’t spout out 10046 translations without using TKPROF, they don’t dump the contents of datafile headers and use bbed to magically open the database (and if they did I’d question if they were really an expert at all).

Let me say, first and foremost, that not every DBA needs to be a holistic DBA. Also, not every organization needs to have every DBA be a holistic DBA. There is a need in the technical world for specialization if your organization can afford it and if the need can be justified based on cost, meeting customer expectations and principally (in my mind), meeting established Service Level Agreements (SLA), Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). I will also counter that every specialist DBA needs to have the skill sets and countenance to be a holistic DBA, especially that of communication.

There will always be a need for Lewis, Kyte, Millsap and that breed of expert (forgive me if I left your name out, you are all much smarter than me and I freely admit it).

And now, I’ll let the shock wear off your face with the notion of SLAs, RTOs and RPOs. You might be asking, “What, do organizations really have those?” Indeed, the best run, most successful data organizations do have those, somewhere.... often in a form that is pretty much useless, and not updated since the Great Depression.

The organizations that have repeatable success have those. Organizations that fly by the “seat of their pants” typically do not, and that is one reason why those types of data organizations fail, eventually.

Eventually is a key word here too. Your hamstrung, baling-wire-and-spit technical marvel may not have failed yet, but be sure that it will someday fail. When that day comes the song will be “Who’s got the job today? Oh Yeah! Oh Yeah!” as opposed to, “Information, I need the number for the unemployment office”.

If you think that it’s just the little boys who fail, and that the big boys and your big boy organization are immune from failure because of its vast size, number of DBAs and its wonderful glowing talk about procedure, process and all the like, you are waiting to stand in the unemployment line too (or you should be already there). No, my friends, the evil angel of death is an equal opportunity player when it comes to data center death and subsequent total mismanagement of the world. In fact, I would suggest that the bigger you are, the more likely you are to fall, and fall hard. Just because it hasn’t happened yet does not mean that the grim reaper does not have you on his visiting list… he is, after all, not omnipresent. The stories I could tell of the foolish big boys and the key assumptions they made that caused massive failures.

So… What do we do to fix these problems? First of all, we have to realize that we have some principal goals as DBAs. I’d like you to consider the following items as part of a list of these principal goals:

  1. Be able to reliably, consistently and efficiently back up and recover all databases in accordance with SLAs, RPOs and RTOs.
  2. Ensure database uptime with respect to all SLAs, RPOs and RTOs.
  3. Ensure all databases are secure.
  4. Monitor all databases in a consistent and reliable manner.
  5. Communicate in an effective manner.
  6. Help users to help themselves.
  7. Work to understand and correct bad behaviors with respect to the organization’s data policies.
  8. Work to understand and correct bad behaviors with respect to the organization’s data designs.
  9. Work to understand and correct bad behaviors with respect to the organization’s application designs.
  10. As befits the organization, your abilities and your time, keep up with current data-related technologies.

Do you notice anything about being the hero of the day on this list? No? Good, because you should never need to be the hero of the day. Yeah, it feels good when it works, but the unemployment line feels even worse when it does not.

Do you notice irritating details with respect to policies and procedures and doing backup and recovery testing? You didn’t? Really? Look closer at the list my friend. It’s in there.
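To make the "testing" part of goal 1 a little more concrete, here is a minimal sketch of how one recovery drill might be scored against RTO and RPO targets. Everything in it is an assumption for illustration: the class names, the function `check_restore` and the four-hour and fifteen-minute targets are made up, and the timestamps would in practice come from whatever your backup tooling records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class RecoveryTarget:
    """Targets taken from the SLA (the values used below are made up)."""
    rto: timedelta  # Recovery Time Objective: longest acceptable outage
    rpo: timedelta  # Recovery Point Objective: largest acceptable window of lost data


@dataclass
class RestoreTest:
    """Timestamps recorded during one recovery drill."""
    failure_time: datetime      # when the simulated failure occurred
    last_recoverable: datetime  # newest point in time the restore reached
    service_restored: datetime  # when the database was usable again


def check_restore(test: RestoreTest, target: RecoveryTarget) -> List[str]:
    """Return a list of violations; an empty list means the drill passed."""
    problems = []
    downtime = test.service_restored - test.failure_time
    data_loss = test.failure_time - test.last_recoverable
    if downtime > target.rto:
        problems.append(f"RTO missed: down {downtime}, allowed {target.rto}")
    if data_loss > target.rpo:
        problems.append(f"RPO missed: lost {data_loss}, allowed {target.rpo}")
    return problems


# Hypothetical drill: 10 minutes of data lost, 5.5 hours to restore service.
target = RecoveryTarget(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
drill = RestoreTest(
    failure_time=datetime(2011, 7, 5, 3, 0),
    last_recoverable=datetime(2011, 7, 5, 2, 50),
    service_restored=datetime(2011, 7, 5, 8, 30),
)
for problem in check_restore(drill, target):
    print(problem)
```

In this made-up drill the data loss is inside the RPO but the outage runs past the RTO, which is exactly the kind of thing you want to find out in a test and not on the day it counts.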

Now…. I ordered this list in a very specific order for a specific reason and I wonder if you can tell me why? What is it about this list that, when done in the order listed, makes for a holistic DBA?

First, I think they are arguably in some order of importance. You might shift a couple of them around, say 2 and 3 or 8 and 9 (or you might consider them one and the same), but generally they are in order of priority (in my eyes) to any data organization. What else, though, is magical about this list?

Automation, replication, consistent execution. Look carefully at 1 – 4. Holy cow, they are all something that can be documented, easily. They can be implemented rather easily one time, fire and kinda-forget, within the constraints of an established policy (for example, occasional maintenance, testing and the like thrown into the schedule) and automation.

Something else I’ve noticed about the whole automation thing: we really like re-inventing the wheel, don’t we? Why do we feel like we have to spend hours of time re-inventing the wheel by writing Korn shell scripts, throwing them into cron and having them run all over creation? Oh, here’s a smart guy, he put them all on a shared NFS drive. UGH. Do I really need to mention the problem here? Do I really need to say this is the wrong way to be doing things?

Case in point – again the characters, places and all identifying information have been properly scrubbed. Maxwell Smart was a DBA in a rather large DBA organization. Now Maxwell was tasked with the assignment of getting RMAN “up and running” and replacing the stock of old, spaghetti code backup scripts in the process. Maxwell is a pretty smart guy, but he really needs guidance and he really didn’t get much on this assignment. He felt the weight of this assignment, quite broad and heavy, on his shoulders. As a result, Maxwell replaced the old spaghetti code backup scripts with a new set of equally spectacular spaghetti code RMAN backup scripts. Essentially we changed nothing in the process, except the tool that did the backups. The overall architecture was not considered, the total non-supportability of the scripts in the future was not considered, and the result was a product that was no better than what it replaced.

The point, as we continue to define what and who the holistic DBA is (in my own limited understanding), is that we would hope Maxwell Smart had moved past steps 1-4 long ago, and that he had moved on to the steps that offer the greater good, #5 in particular, and then steps 6 through 10. If Maxwell were a holistic DBA he might have come to understand some principles of architecture and design that he didn’t possess when he re-created hell, and he might well have created a backup haven.

Perhaps your retort to using technologies is something like “Well, Grid Control didn’t work with backups n versions ago and so I gave up on it.” Really. How very odd. The one product that Oracle gives you to make your life easier and, at the first hint of failure, you give up on it. Instead you resort to Korn shell, Perl, or heaven knows what scripting language is your favorite and you spit out this cool, really dense code. WOW…. Meet Dr. Frankenstein. You have your creation and it’s a monster. Do I really need to tell you why it’s a monster? The bottom line is the holistic DBA puts his or her ego and desire to raise the dead on hold and does it the right way. He/She does it in a way that is automated, centrally manageable and easy to use.

So, your retort: “What if it is broken?” Then, my friend, it’s time to remember your priorities above, call Oracle support and get the damn thing fixed.

Then you retort, “They never work on my SR, they never call me back”….. my goodness, someone call the wambulance and rush ahead to priority number 5 and get it down fast. The bottom line is that the holistic DBA cannot be a passive “Yes man”. You don’t have to be a jerk, you don’t have to be evil incarnate to get what you want, but you do have to persevere. You have to pursue excellence, even if those around you seem not to be following the same path. Heavens, this is leading me down the path to calling this the Black Belt DBA but I’ll avoid the temptation.

If you are not getting the support you desire from Oracle (or any other vendor) then whose responsibility is it to get the level of support you need? YOURS! It is your job to make these things work, and to bust dams (and perhaps even utter one or two aloud) in your pursuit of a solution. Oracle support can be very responsive if you know how to use it, and if you are persistent. If you don’t know how to use it, beyond opening an SR on Metalink, there are some great training tools available to teach you how to use the support system. Working with any support organization is a bit like working with an old car with an engine that tends to flood way too often. You have to figure out how it works, what knobs to pull and just how to pull them. Once you figure that out, the car will start up pretty well just about every time. Again though, I put the onus back on you to make support work for you. If you are going to sit and wait for support to come looking for you then the fault, in my opinion, lies within yourself.

So, we’ve figured out that 1-4 are pretty easy to take care of (though the implementation may take some time and major initial effort, which can frustrate those who love to do things by the seat of their pants). Do you notice something else about #1-4? First, they don’t require you to know how to do 10046 tracing. They don’t require that you know how to do explain plans. They don’t require that you be able to do data file header dumps or a triple somersault with a half-gainer pike on the database (is that possible?). You might protest, “But I need to know how to do these things! I need to be able to look like I know what I’m doing.” I agree, you need to look like you know what you are doing. Remember that this list is a list of priorities. Few people will remember you for your marvelous once-every-six-months 10046 trace that fixed one-year-old SQL code that was not designed to be scalable. They just won’t. They will remember that your database was restored flawlessly and on time and that the business never even noticed things were down. Better yet, they won’t even remember because they didn’t notice and there was no need to. If you have a good boss, he will remember. If you get off the wambulance and show some persistence in your reviews, he will certainly remember.

Here is another point about the over-arching nature of numbers 1 through 4. They can be costly if not done right. If you find yourself iterating only between these various areas of DBA Ville, you are costing your employer vast sums of money that really do not need to be spent.

Another point about 1 through 4 is about automation and simplification. You need to realize that these are just lines on the page and that there are other needs in this section that you have to “read between the lines” to find. Automate, automate, simplify, simplify. For example, if you find that you are creating lots of databases or lots of schemas, and that takes a lot of your time, find a way to automate that workflow. Make it user serviceable. The tools are out there to allow you to do basic, simple, automated provisioning. Rather than spending 6 hours to 5 days to provision a database, let’s spend 2 seconds as you click the approval email that is a part of the overall workflow. These are the kinds of things that you need to be doing, now.
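As a rough illustration of the shape such a workflow might take, here is a small sketch of "request, approve, provision" where the only manual action is the approval. It is a toy under stated assumptions, not a real provisioning tool: `ProvisionRequest`, `approve`, `provision` and the two step functions are hypothetical names, and the steps only return log messages where a real workflow would call your actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProvisionRequest:
    """A user's self-service request for a new database (hypothetical shape)."""
    requester: str
    db_name: str
    approved: bool = False
    log: List[str] = field(default_factory=list)


def approve(request: ProvisionRequest, approver: str) -> None:
    """The one manual step: the DBA clicks 'approve'."""
    request.approved = True
    request.log.append(f"approved by {approver}")


def provision(request: ProvisionRequest,
              steps: List[Callable[[ProvisionRequest], str]]) -> List[str]:
    """Run every automated step in order; refuse to run without approval."""
    if not request.approved:
        raise PermissionError(f"{request.db_name}: request has not been approved")
    for step in steps:
        request.log.append(step(request))
    return request.log


# Placeholder steps. In real life each would invoke your provisioning tool.
def create_database(request: ProvisionRequest) -> str:
    return f"created database {request.db_name}"


def register_backup_and_monitoring(request: ProvisionRequest) -> str:
    return f"registered {request.db_name} for backup and monitoring"


request = ProvisionRequest(requester="dev_team", db_name="devdb01")
approve(request, approver="dba_on_call")
for entry in provision(request, [create_database, register_backup_and_monitoring]):
    print(entry)
```

The design point is that backup and monitoring registration ride along with creation, so goals 1 and 4 are covered for every new database without anyone having to remember them.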

Plenty has been said about re-inventing the wheel, about the costs of re-inventing the wheel and the dangers of re-inventing the wheel. Use the damn wheel, why don’t you, and quit finding little reasons not to use the wheel and build your own wheel. “But there are bugs in the wheel”, you say. All wheels have bugs. The problem is, the only person who can fix the bugs in your own wheel is you, and if you leave, then who is it going to be? Sure, the Oracle wheel might have a bug or “undocumented” feature in there, but can you really cost-justify the time and money spent re-inventing the wheel rather than taking the support organization to task and making them fix the problem? Also, keep in mind that the wheel they created has been tested many more times, and in many more ways, than the wheel you are creating.

The real point with numbers 1 through 4 is that they really can take up an inordinate amount of time if not done correctly. They can also cause an inordinate amount of damage to the organization if not done properly. We need to have these settled, on automatic pilot and flying reliably without the aid of an instructor. Then we have freed up time for tasks that are more strategic, and therefore quite important.

That leads us to that magical, mystical number 5. Let’s talk about that and the rest of them in my next post.


oscrub said...

I have a feeling a lot of people will have something to say about your views regarding who is responsible for the response times of MOS :) but I do see your point. Great article, will be passing this on to the DBAs (and non-DBAs) in our office.

Noons said...

Hear hear, Robert! Couldn't agree more.
KISS (Keep It Simple, st....) has always been part of my weapon arsenal.

I have tremendous fights with external "consultant" dbas who come in wanting to install RAC and Exadata in our site "because such and such bank is using it". No other reason!

I call this kind of people the "hit-and-run" mob. They come in, wreak havoc and disappear before anyone can pin the medium and long term consequences of their actions on them.

Last thing I want is wasting my time on install/patch fests of overly complex solutions that address none of our requirements and leave us with complex systems that only one person can maintain!

Don't get me wrong: I love reading about all those fantastic exploits of simulating "multiple RAC nodes in a single virtualized laptop" and other such.

But anyone thinking that bears any relationship to a normal production DBA's work has gotta have rocks in their head!

I'd like to propose an alternative name to "holistic", though. Just plain and simple, good old:

"Common Sense DBA"

Because that is exactly what you are talking about: a common sense approach to the job. And that is badly needed, in this day and age.

Well done and thanks for having the courage to put the "dots on the iis".

Robert Freeman said...

I agree with the KISS principle of course Noons, but sometimes ya gotta do what you gotta do.

I also agree with the concept that the architecture and the solution need to fit the mission requirements. You're right, RAC and Exadata are not always a requirement. However, in this day and age of high uptime requirements and user expectations (documented or, more often, undocumented), the additional complexity represented by bolt-ons like RAC, Exadata, Data Guard, Golden Gate, Replication, etc... is a fact of life.

You can't discount a technology just because it adds some additional layer of complexity. You weigh those complexities carefully, against the needs of the organization and decide what is required.

This is where those SLAs, etc... come in. If you need 0 downtime, if you need 5 9's uptime, then your solution is going to be more complex than a 9-to-5, 5-day-a-week, "we don't care if it goes down once in a while, just get it back up please" database.
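To put a number on that, the arithmetic is simple: the downtime an availability target allows is the length of the period times (1 - availability). A quick sketch follows; the function name is made up for illustration and it assumes a 365-day year of round-the-clock service.

```python
def allowed_downtime_minutes(availability_pct: float,
                             period_hours: float = 365 * 24) -> float:
    """Minutes of downtime permitted per period at the given availability."""
    return period_hours * 60 * (1 - availability_pct / 100)


# Compare two, three, four and five nines over one year.
for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes a year, which is why the conversation about budget tends to start right after the conversation about the SLA.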

Give me an SLA, tell me your budget and I can tell you if that SLA is manageable. If not, then the solution is simple. Re-negotiate the SLA (and I think that the DBAs need to be part of that process) or increase the budget.

I think it's that simple. Once you have the SLA vs. budget question answered, then it's a matter of choosing the technology and moving on.

The truth is that lots of folks are asking for those 5 9's.... sometimes they back down when they realize the cost, but more often they have a real (or sometimes imagined) justification for the hardware and the complexity.

I'll also say for the record that I think RAC has come a long way since the days of OPS. It's become much easier to install, manage and use. Just my opinion.

I see Exadata as somewhat of a deployment of the KISS principle. Look at what it gives you:

1. Pre-configured architecture, tested and supported by one vendor. They leave it configured and clustered for you, so you don't have to mess with that unless you just want to. They also leave you a database to start with.

2. No black box SAN here. You know where your disks are, what data is sitting where (given stripes and all that).

3. I dare say that it's a much more efficient solution than many "cobbled together" data center solutions from many points of view. It's also more scalable.

4. It's not for everyone.

That being said, having worked on Exadata, there is no reason that only one person can manage or access this box. It really is no different than a data center except the storage, the network, the cabling, the servers are sitting there in one cabinet.

I think that it's like anything big and imposing, it feels big and imposing until you look at it and realize that it's not really that big and imposing... rather it's just everything you pretty much already have, that is already big and imposing, all stuffed nicely in a box, right there, in your hot little hands ready to use.

I'm not really trying to sell Exadata here, that's not my goal. I think it fits many mission requirements, and of course, there are many it does not fit.

But that's where things come full circle and the tail and the head of the circle point back at us (or management if you like - whoever is making the final choice).

It's the salesman's job to sell us his wares. It's our job to learn about what is being sold to us and determine its value given our requirements and budget and the history, functionality and other attributes of the product.

Is there a character set with an undotted i (specifically in lower case)? ;)

Noons said...

I hear you and mostly agree.

But let me just point out a couple of things, based on our concrete example as clients.

As you say:

"Once you have the SLA vs. budget question answered, then it's a matter of choosing the technology and moving on".

Yes, without a doubt! What I do object to is an external consultant waltzing in and proceeding to tell me that what I need to do is choose the technology, and we'll sort out the SLA vs budget question later.

In a nutshell: reversing the order of the process you just described. All because bank XYZ has picked that technology and they were involved, they made a pile of $$$ on consultancy, and they want to "rinse and repeat" that process with us.

Hang on a tick: we are NOT even a bank, to start with!!

The second point: we have 12 Oracle instances in three virtual servers, most are development. We have 230 MSSQL databases in around 15 servers. Lost count of which are what.
ALL - MSSQL and Oracle - share the same SAN storage infrastructure.

Now: when was the last time that Exadata supported MSSQL?

Nuff said? ;-)

In simple terms: no matter how hard I might try to push for an Exadata solution here, there is simply no way in the world it will fit our needs, as those needs are at the moment.

Of course: one could argue we should not be using as much MSSQL and more Oracle.

Explain that to the people who make the apps that only run on MSSQL, and who sell them to our business.

See what I mean when I say: "Exadata simply will not fit our needs"?

Of course I'd love to get my mitts on one of those things!
Try and convince my management?
