Some companies seem to think that because they have implemented an HA solution everything is OK: if they have a failure, they can simply switch to the remote system and continue as if nothing happened! If they are unfortunate enough to lose a system during processing, they could be in for a big shock!
Questions you need to ask yourself!
1. When was the last time I reviewed the configuration to make sure it was replicating everything I need?
2. When was the last time I did a planned switch over?
3. Can I recover the data to a known starting position from any process failure? If so, how long will that take me?
4. If the system is running batch jobs, can they be restarted if the system fails mid-way through?
5. Is the process consistently keeping up with BOTH object and data replication? If not, how important is data-to-object synchronization?
6. If I get behind after my batch run, how long before it catches up?
All of these questions are important and need to be understood before you start your recovery, for the following reasons.
The systems we all run change constantly. A new program added by a developer and missed by the replication process could be fatal, and you may not have the source available on the target to create the new object. Sod’s Law says it’s going to be the one you missed that stops it all from working! A planned switch will not prove you have a recoverable system in an unplanned event; it’s only going to confirm that you have the right objects and data being replicated to the target system. Should an unplanned event occur, missing objects, down-level objects and incomplete data are all possible.
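To make the "missing and down-level objects" check concrete, here is a minimal sketch of comparing an object inventory taken on the source with one taken on the target. It assumes you can export both inventories by some means of your own, and that the change timestamps are comparable between the two systems, which may not hold for every replication product. The `ObjectInfo` type, the `compare_inventories` function and the sample object names are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ObjectInfo:
    """One row of a (hypothetical) exported object inventory."""
    library: str
    name: str
    obj_type: str
    changed: datetime  # assumed comparable between source and target


def compare_inventories(source, target):
    """Return (missing, down_level) lists of (library, name, type) keys.

    missing    : objects on the source that do not exist on the target
    down_level : objects whose target copy is older than the source copy
    """
    target_by_key = {(o.library, o.name, o.obj_type): o for o in target}
    missing, down_level = [], []
    for obj in source:
        key = (obj.library, obj.name, obj.obj_type)
        match = target_by_key.get(key)
        if match is None:
            missing.append(key)
        elif match.changed < obj.changed:
            down_level.append(key)
    return missing, down_level


if __name__ == "__main__":
    # Illustrative data only: NEWPGM was never replicated, ORDENT is stale.
    source = [
        ObjectInfo("PRODLIB", "ORDENT", "*PGM", datetime(2024, 1, 2)),
        ObjectInfo("PRODLIB", "NEWPGM", "*PGM", datetime(2024, 1, 3)),
        ObjectInfo("PRODLIB", "CUSTMAST", "*FILE", datetime(2024, 1, 1)),
    ]
    target = [
        ObjectInfo("PRODLIB", "ORDENT", "*PGM", datetime(2024, 1, 1)),
        ObjectInfo("PRODLIB", "CUSTMAST", "*FILE", datetime(2024, 1, 1)),
    ]
    missing, down_level = compare_inventories(source, target)
    print("Missing on target:", missing)
    print("Down level on target:", down_level)
```

Running something like this on a schedule, rather than only before a planned switch, is one way to catch the object a developer added last week before Sod’s Law does.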
If you have a batch process which produces lots of transactions for the database and it fails mid-way, how do you know exactly what data needs to be removed? Even if you know the data that needs to be removed, the HA products provide no method of removing it! The object and data transports provided by most HA solutions do not work in unison. The data (journalled object data) is replicated automatically by the Remote Journal function as changes occur, while objects tend to be replicated using save and restore technology or command submission, driven by the audit journal identifying changes. This means getting objects to the remote system is always going to be delayed in comparison to journal-entry-based changes. The object save process and the transmission to the remote system can have many bottlenecks, even with the best solutions. I don’t think object replication is as important as some would have you believe, but when implemented it does appear to be the major cause of replication backlogs and failures.
The level of journal and object replication backlog will fluctuate through the day; at times it will be right up to date and at other times hours behind. Your challenge is knowing how to recover regardless of the backlog. Remember, data and objects follow different paths, so they may have no consistency at all. One customer we worked with had a backlog of about 4 hours every night. The process normally caught up before the next batch ran, but after an object failure it got into a loop where it would never catch up! They had to do a full restore to the target just to get back to the normal backlog position.
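The "how long before it catches up" question comes down to simple arithmetic: the backlog only shrinks at the rate the target applies entries minus the rate new entries arrive. The sketch below is a rough model under that assumption, with constant rates that real workloads will not have. The function name, parameters and figures are illustrative, not measurements from any product.

```python
import math


def hours_to_catch_up(backlog_entries, apply_rate_per_hour, arrival_rate_per_hour):
    """Estimate hours until the replication backlog is cleared.

    Assumes constant rates. Returns math.inf when the target applies
    entries no faster than they arrive, i.e. it will never catch up.
    """
    if backlog_entries <= 0:
        return 0.0
    net_rate = apply_rate_per_hour - arrival_rate_per_hour
    if net_rate <= 0:
        return math.inf
    return backlog_entries / net_rate


if __name__ == "__main__":
    # Illustrative figures: 2,000,000 entries behind after the batch run,
    # applying 600,000/hour while 100,000/hour of new changes still arrive.
    print(hours_to_catch_up(2_000_000, 600_000, 100_000))  # 4.0

    # If the apply process slows to the arrival rate, it never catches up.
    print(hours_to_catch_up(2_000_000, 100_000, 100_000))  # inf
```

The second case is the loop that customer found themselves in: once the net rate is zero or negative, waiting does nothing and you are looking at a resynchronisation.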
HA is an ever-moving target; what appears to function today may soon become obsolete and not provide the cover you need. The challenge is keeping one step ahead! Unless you have the skills to manage the configuration and regularly test its suitability by doing planned switches, you may soon have a solution which no longer serves the purpose it was bought for. HA is not a finite art; all of us are still trying to deliver the perfect solution for the user, and every time we get a little closer we find something else we need to consider.
I still hold onto my belief that while HA can be used as a DR solution, the fact that the environment and configuration work in a planned (HA) switch will not guarantee success in a DR situation. You have to fully understand the fragility of the data-to-application match-up that the HA solutions provide to truly know your recoverability position. Having your data in a perfectly synchronized position on both systems does not guarantee the application will restart should the source (production system) fail! The biggest problem most will have to work through is how to know the application data is in such a state that no mismatch can occur. Remember, objects and journal data travel through different paths, so there will always be the opportunity for the data and objects to be out of sync! Add to this the fact that the majority of batch jobs out there have no recoverability built in, and should a batch job be mid-way when the failure occurs there could be significant work needed to recover. First you have to know the batch was running, then determine how to remove the data before restarting the batch job.
It may be easy to simply explain to the CEO or CFO that you have HA and you have recoverability; the only time you may be caught out is if you do have to recover your system during an unplanned event. Luckily for most, the possibility of that happening is fairly remote. Or is it? We are not saying HA is not worthwhile, but we are saying look at what you have before you bury your head in the sand! For some, HA is definitely overkill and simple data protection would be more than sufficient; others need something in between and can therefore justify HA as the only option available. Not setting expectations at the right level is going to catch you out in the long run!
Chris…