The Guide to Practical and Pragmatic IT Architecture Design

Disaster Recovery And Operations Test

Disaster recovery testing is the phase that ensures the system continues working in case of a disaster. It is also called business continuity testing, because it ensures that the business can continue when a disaster strikes.

A testing strategy for this phase has two parts: 1) scope and 2) required recovery time.

For the scope, one needs to understand the different solution components that can cause an outage or failure. The best approach is to take the architecture blueprint, go through each component, and ask what happens when that specific component fails. That could be a server, a network switch, or any other component that is part of the solution. One also needs to think well outside the box, as the case study below shows.

Case study:

A third-party datacenter hosted critical core systems for a large investment bank. A new design was delivered to make it fail-safe: all servers and network components were made redundant in case of an outage. It also included a second battery capacity in case of power failure, and if the battery dropped below 10% of its capacity, the servers would automatically shut down. However, failure still struck the datacenter. What happened?

The datacenter used air conditioners to maintain the temperature and remove the excess heat. At one point, one of the larger air conditioners broke down, and there was no backup unit with the same capacity to cool the datacenter. As a result, the servers overheated and crashed, with extensive impact on the bank's operations.
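
The "what happens when this component fails" question from the scope discussion can be sketched in code. The example below is only an illustration: the blueprint, the component names, and the impacted_by helper are all hypothetical and not taken from any real design. It models the blueprint as a dependency map and lists everything that is affected when a single component fails, including facility components such as cooling.

```python
from collections import deque

# Hypothetical blueprint: each component maps to the components it depends on.
# Facility components (power, cooling) are listed explicitly so they are not overlooked.
DEPENDS_ON = {
    "core_application": ["app_server", "database_server"],
    "app_server": ["network_switch", "power", "cooling"],
    "database_server": ["network_switch", "power", "cooling"],
    "network_switch": ["power", "cooling"],
    "power": [],
    "cooling": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map: for each component, who depends on it?
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk upwards from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


if __name__ == "__main__":
    # One line per component: the failure scenarios a DR test plan should consider.
    for component in DEPENDS_ON:
        print(component, "->", sorted(impacted_by(component, DEPENDS_ON)))
```

A component whose failure impacts many others, yet has no redundancy in the blueprint, is a candidate for its own DR test scenario.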


Granularity

Once all components have been identified, one needs to determine the required recovery timeframe for each of them in case of failure. Typically the business provides an overall recovery window requirement, which means the architect needs to analyze its impact on each of the underlying components.

The primary metrics for disaster recovery are typically the mean time between failures (MTBF) and the mean time to recover (MTTR). MTBF is the average length of time between major outages. More relevant for DR testing is the MTTR, which is the average time, in hours or minutes, to restore an IT component that failed.
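
As a minimal sketch of how these two metrics can be derived from an outage log, the example below computes both from a list of (start, end) timestamps. The outage data and function names are made up for illustration. It assumes MTBF is measured from the end of one outage to the start of the next; other conventions exist.

```python
from datetime import datetime, timedelta

# Hypothetical outage log for one component: (outage start, service restored).
OUTAGES = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10)),
    (datetime(2024, 1, 11, 9, 10), datetime(2024, 1, 11, 9, 30)),
    (datetime(2024, 1, 31, 9, 30), datetime(2024, 1, 31, 10, 0)),
]


def mttr(outages):
    """Mean time to recover: average duration of an outage."""
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)


def mtbf(outages):
    """Mean time between failures: average gap between consecutive outages.

    Assumption: the gap runs from the end of one outage to the start of the next.
    """
    ordered = sorted(outages)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    if not gaps:
        raise ValueError("at least two outages are needed to compute MTBF")
    return sum(gaps, timedelta()) / len(gaps)


if __name__ == "__main__":
    print("MTTR:", mttr(OUTAGES))  # 0:20:00
    print("MTBF:", mtbf(OUTAGES))  # 15 days, 0:00:00
```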

So the required recovery time for a network switch, as a fundamental underlying component, could be more critical than that of a file server, for instance. Once the MTTR for each component is defined, the components need to be tested as a whole or separately, depending on the conditions and reach of the DR testing.

Component | Required recovery time / MTTR
Application Server (Disk, memory, IO) | 8 hours during working hours
Database Server | xxx
Web Server | xxx
Network Switch | 10 minutes during working hours
WAN Connection | 10 minutes during working hours
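
Once a DR test has been run, the measured recovery times can be checked against the required values per component. The sketch below uses the three targets given in the table and leaves the unspecified ones empty. The measured values and the evaluate helper are hypothetical, for illustration only.

```python
from datetime import timedelta

# Required recovery times from the table; None marks targets not yet specified.
REQUIRED = {
    "Application Server": timedelta(hours=8),
    "Database Server": None,
    "Web Server": None,
    "Network Switch": timedelta(minutes=10),
    "WAN Connection": timedelta(minutes=10),
}

# Hypothetical recovery times measured during a DR test run.
MEASURED = {
    "Application Server": timedelta(hours=5, minutes=30),
    "Network Switch": timedelta(minutes=14),
    "WAN Connection": timedelta(minutes=7),
}


def evaluate(required, measured):
    """Compare measured recovery times with the required MTTR per component."""
    results = {}
    for component, limit in required.items():
        actual = measured.get(component)
        if limit is None:
            results[component] = "no target defined"
        elif actual is None:
            results[component] = "not tested"
        elif actual <= limit:
            results[component] = "pass"
        else:
            results[component] = "fail"
    return results


if __name__ == "__main__":
    for component, outcome in evaluate(REQUIRED, MEASURED).items():
        print(f"{component}: {outcome}")
```

Reporting "no target defined" and "not tested" separately from pass and fail makes gaps in the DR plan visible instead of hiding them.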


In most cases, companies test DR with only a few overall scenarios and some direct component testing.

Operations Testing

Operations testing validates that the tooling and processes used to maintain operations of a software platform work as specified. It validates the following areas:
  • Configuration
  • Monitoring
  • Logging & Auditing
  • Backup, Restore & Archiving 

[Figure: Technology Architecture Operations Testing]

For each of these areas, a simulation of the production environment is performed, and its results are used to validate that the operational areas work according to the functional and technical specifications.
