How to set up Disaster Recovery

In August, we held a webinar explaining the first step to implementing a successful Disaster Recovery solution.

There are three broad stages: Setup, Business as Usual and Invocation. This blog summarises the webinar’s key points on setting up Disaster Recovery.

Each stage has operational and technical aspects. The focus here is on the technical aspects of the setup.

1.) OPERATIONAL SET UP
• The Ideal Set Up
• Aims and Objectives
• Discovery Workshop

2.) TECHNICAL SET UP
• Software Installation
• Bandwidth
• Performance Assurance
• User Acceptance Testing
• Making Changes During the Set Up
• Business as Usual

Operational Set up

The Ideal Set up

There is an “ideal world” approach to setting up DR. The “Ideal Set up” looks like this:

1.) Identify Business Continuity Statement and scope of urgent business functions
2.) Select teams and determine responsibility
3a.) Determine impact on the business
3b.) Risk/threat identification
4.) Identify urgent functions – IT & other services
5a.) Implement mitigation strategies
5b.) Agree activation plans
6a.) Exercise & Test
6b.) Ongoing changes and maintenance

But there’s also the real-world experience of doing it.

For smaller businesses in particular, it doesn’t tend to follow this neat order. Usually, it all happens at once. Plus, you often have to make compromises. For example, what do you do if your BC planning isn’t finished before you start setting up?

A common problem we’ve seen is businesses starting with setting up Disaster Recovery. They then use their DR plan to feed into the Business Continuity plan. We would argue this is the wrong order – there is an easier way to do this.

Often businesses think “setting up disaster recovery” amounts to having a Business Continuity Plan, without doing any of the planning or assessment. A small business will tell an IT manager to “set up DR”, but the management won’t have done the preparatory work, such as defining recovery objectives or carrying out a Business Impact Analysis (BIA).

Aims & Objectives

There are common aims when IT teams set up disaster recovery or change their disaster recovery capability:

· Reduce downtime
· Close down a redundant data centre
· Reduce costs
· Improve DR capability by outsourcing

Discovery Workshop

Getting DR set up and working takes time.

When working with a service provider, such as Databarracks, you start each project with a discovery workshop, reviewing everything that’s already been agreed. Look into any assumptions and review all the systems to be replicated. Understand the dependencies between systems and the impact they have on recovery priorities.
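One way to record those dependencies during the workshop is as a simple graph, which can then be sorted into a recovery order. The sketch below is illustrative only: the system names are hypothetical, and it uses Python’s standard graphlib module rather than any particular DR product.

```python
from graphlib import TopologicalSorter

# Hypothetical systems. Each key depends on the systems in its set,
# so those systems must be recovered first.
DEPENDENCIES = {
    "domain-controller": set(),
    "database": {"domain-controller"},
    "app-server": {"database", "domain-controller"},
    "web-frontend": {"app-server"},
}


def recovery_order(dependencies):
    """Return systems in an order where every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, system in enumerate(recovery_order(DEPENDENCIES), start=1):
        print(f"{step}. {system}")
```

A circular dependency makes graphlib raise a CycleError, which is itself a useful finding to take back to the workshop.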

Issues tend to be uncovered at this point.

More systems need adding

You might be looking to protect only your most critical systems. For example, if you have 100 servers in total, you may be protecting only the 50 most critical.

At this stage, you may find there are other systems that need to be added in. Usually that’s fine. One concern can be around compatibility – if you’d not considered this early on when selecting technology, this may be an issue.

This happens often. Sometimes we can introduce another technology to solve the problem, but in other cases you need to completely redesign the solution.

Technical Set up

Software Installation

The webinar didn’t cover any one particular technology – so we’ll speak in general terms about the steps to go through here.

You may need agents on individual servers and then reboots after installations.

You will need a local “media agent” or management server.

If you’re using DRaaS, go through the installation on the source side.

If you’re doing it yourself, you also need to install the software at the target side, whether that’s in your own vCloud or Hyper-V environment, or in Azure or AWS. Again, it depends on the specific tech you’re using and the target.

You’ll need a “cloud appliance” (aka a virtual machine) for the management server, or servers on the DR side.

Please note that even a single technology, say Zerto or ASR, has different installation methods depending on both the source and the target side.

For more details on software installation and other considerations, check out the webinar from minute 15:00.

Bandwidth

The first data transfer used to be physical: copying data locally at the source side to disk, and then transporting it to the DR site. Connectivity now tends to be strong enough that this is a rarity.

As a rule of thumb, we recommend 2Mb/s of bandwidth per VM you’re replicating. The rate of data change is the real factor, so in some cases you may need more, but it’s a good place to start.
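As a rough illustration, the rule of thumb can be compared with the bandwidth your data change rate implies. This is a sketch under assumptions: it reads the figure above as 2 megabits per second per VM, and the change-rate calculation (shipping one day’s changed data evenly over 24 hours) is our simplification, not a formula from the webinar.

```python
SECONDS_PER_DAY = 86_400


def rule_of_thumb_mbps(vm_count, mbps_per_vm=2.0):
    """Starting estimate: a fixed allowance per replicated VM (assumed Mb/s)."""
    return vm_count * mbps_per_vm


def change_rate_mbps(daily_change_gb):
    """Average bandwidth needed to ship one day's changed data within a day."""
    megabits = daily_change_gb * 1000 * 8  # decimal GB -> megabits
    return megabits / SECONDS_PER_DAY


def starting_bandwidth_mbps(vm_count, daily_change_gb):
    """Take the larger of the two estimates as a place to start."""
    return max(rule_of_thumb_mbps(vm_count), change_rate_mbps(daily_change_gb))


if __name__ == "__main__":
    # Hypothetical estate: 50 VMs changing 2,160 GB of data per day.
    print(f"{starting_bandwidth_mbps(50, 2160):.0f} Mb/s")
```

Real change rates are bursty rather than evenly spread, so treat the result as a floor rather than a target.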

We have a basic bandwidth calculator on our website which gives you times for data transfer, based on different bandwidths.
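The arithmetic behind this kind of estimate is straightforward. The sketch below is not the calculator on our website; it is a minimal illustration, and the 80% efficiency default is an assumed allowance for protocol overhead and contention.

```python
def transfer_time_hours(data_gb, bandwidth_mbps, efficiency=0.8):
    """Estimate hours needed to move data_gb over a link of bandwidth_mbps.

    efficiency is the assumed fraction of the link that is usable
    for replication traffic (hypothetical default of 0.8).
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be between 0 and 1")
    megabits = data_gb * 1000 * 8  # decimal GB -> megabits
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical initial sync of 5,000 GB at different link speeds.
    for mbps in (100, 500, 1000):
        print(f"{mbps} Mb/s: {transfer_time_hours(5000, mbps):.1f} hours")
```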

If you do need to upgrade your connectivity, it’s best to get that process in place early. It often takes 3-6 months to make upgrades.

Performance Assurance

Now we get to the really important part – making sure it all works.

This stage is often neglected because a.) it takes time and b.) when you test, you find things that don’t work and have to fix them, which is time-consuming.

If you don’t test the recovery, you have no way of knowing if you can recover.

Here’s a recent example. We set up a customer with a particular technology. The DR target was Azure. According to the documentation, this technology should support the source environment and target environment.

The replication process worked as it should. But when we tested, we found a specific issue with scratch disks which meant those servers couldn’t be recovered. If you don’t test, you don’t find that out until it’s too late.

Another common issue is finding out IPs haven’t been brought across. This is often an issue if your source is Hyper-V and your target is VMware, or even Azure. For VMware destinations, you can install VMware Tools on the servers in your Hyper-V environment. It does nothing while they are there but comes alive when you fail over.

So the first test is to make sure the recovery works.
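A check like the missing-IP one can be scripted as part of the recovery test. The sketch below assumes you can export the expected settings and the recovered settings as simple dictionaries; the server names, addresses and data shape are hypothetical, and how you collect them depends on your technology.

```python
def find_recovery_issues(expected, recovered):
    """Compare expected server settings with what came up on the DR side."""
    issues = []
    for name, want in expected.items():
        got = recovered.get(name)
        if got is None:
            issues.append(f"{name}: not recovered")
            continue
        if got.get("ip") != want.get("ip"):
            issues.append(
                f"{name}: expected IP {want.get('ip')}, found {got.get('ip')}"
            )
    return issues


if __name__ == "__main__":
    # Hypothetical inventories for illustration only.
    expected = {
        "web01": {"ip": "10.0.0.10"},
        "db01": {"ip": "10.0.0.20"},
        "app01": {"ip": "10.0.0.30"},
    }
    recovered = {
        "web01": {"ip": "10.0.0.10"},
        "db01": {"ip": None},  # IP was not brought across
    }
    for issue in find_recovery_issues(expected, recovered):
        print(issue)
```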

* There is a distinction between “Testing” and “Exercising”. A test is something you can fail whereas an exercise is a check you go through. We do full recovery exercises and specific recovery tests.

Once we’ve tested that the workloads can be recovered, the next stage is to test the performance.

Even when you match the resources exactly in both environments, the differences between the hardware have an impact. So you need to do some fine-tuning. Some software will let you do this programmatically. You can create auto-tuning scripts that increase resources on the DR target side on recovery. This ensures the DR environment runs as it should.

That functionality may not be available in the technology you’re using. In this case, you can build it in as a manual process in your DR runbooks: “for this particular workload, after the recovery, make X changes”.
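Whether the tuning is scripted or written into a runbook, it helps to record the adjustments as data. The sketch below shows one possible shape for that, assuming per-workload multipliers found during performance testing. The workload names and factors are hypothetical, and applying the resulting size is a product-specific step that is not shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    vcpus: int
    memory_gb: int


# Hypothetical multipliers discovered during performance testing.
TUNING = {
    "sql01": {"vcpus": 2.0, "memory_gb": 1.5},
}


def tuned_size(name, production, tuning=TUNING):
    """Return the size a workload should have on the DR side after recovery."""
    factors = tuning.get(name, {})
    return VmSize(
        vcpus=math.ceil(production.vcpus * factors.get("vcpus", 1.0)),
        memory_gb=math.ceil(production.memory_gb * factors.get("memory_gb", 1.0)),
    )


if __name__ == "__main__":
    print(tuned_size("sql01", VmSize(vcpus=4, memory_gb=16)))
    print(tuned_size("web01", VmSize(vcpus=2, memory_gb=4)))
```

Workloads without an entry keep their production size, so the table only needs to list the exceptions.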

You might not want to exactly replicate the performance you have on your production side. You can choose to only operate your most critical and urgent services in the disaster and/or to work with a small skeleton team. In that case, work out the minimum requirements and work to that to keep your DR costs down.

User Acceptance Testing

Once the servers recover and you know performance is good, the final stage is user acceptance testing (UAT).

There’s overlap here with the performance testing stage. Users may be able to feed back more performance issues, but the most important point here is the human factor.

· Do the users know how to access the recovered systems? How did you tell them? How would they know in a real disaster?

How you communicate is key. If you’ve been through a disaster, there are huge demands on the team or person co-ordinating the recovery. The last thing you want is 50 messages coming from the team asking questions about how to log in to the DR site.

· Once they can log in, how does it all look? Are you using the same remote access method as they use in the office? If not, do they know how it all works?

The aim of UAT is to get your users to sign-off on the DR site as the final stage of testing. The most valuable things you learn are what you need to communicate to your users.

They need information ahead of any disaster: what the DR plan is (for example, that they go to another site or work from home), what they can and can’t do etc.
You also need to decide what information to supply at the time of the disaster, so they know when to put that plan into action.

*N.B. decide how you want the DR environment to interact with any other systems you connect with. If you’ve failed over, you may be testing and making updates in your database. You don’t want those changes affecting transactions in any partner systems, for instance. You may want to disable the usual firewall rules, lock it down and only allow the bare minimum for testing, like the licensing and access.

Business as Usual

There are three important things to do when you shift from active project to BAU:

Keep updating your DR as you do your production infrastructure

It can be easy to miss smaller changes. You need to build DR into your standard change control procedures so that, when changes are made in production, DR is updated and then retested.
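A simple drift check can back up the change control process. The sketch below assumes you can export two lists of server names: the production servers that are in scope for DR, and the servers currently protected. Both lists, and the names in them, are hypothetical.

```python
def find_drift(production_in_scope, protected):
    """Report servers missing from DR and DR entries with no production server."""
    production = set(production_in_scope)
    covered = set(protected)
    return {
        "unprotected": sorted(production - covered),
        "stale": sorted(covered - production),
    }


if __name__ == "__main__":
    # Hypothetical inventories for illustration only.
    drift = find_drift(
        production_in_scope=["web01", "web02", "db01", "app01"],
        protected=["web01", "db01", "app01", "old-fileserver"],
    )
    print("Not protected:", drift["unprotected"])
    print("No longer in production:", drift["stale"])
```

Anything reported as unprotected is a change that bypassed the procedure and needs adding and retesting.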

Alerting and reporting

Once you’ve done the setup, you shouldn’t need to do a lot of active work to maintain DR, major recoveries aside.

But you do need to make sure it works. Use alerts to tell you when replication is failing, or when other issues, such as insufficient bandwidth, are causing you to fall below your specified service levels. Again, a service provider will do the monitoring and error resolution day to day. If you’re doing it yourself, make sure you know if there are any issues so you can address them.
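If you are doing the monitoring yourself, the core of a replication alert is a comparison between the last successful replication time and your recovery point objective (RPO). The sketch below assumes those timestamps are available from your replication software. The server names and the 15-minute RPO are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_alerts(last_replicated, rpo, now=None):
    """Return an alert message for every server whose replication lag exceeds the RPO."""
    now = now or datetime.now(timezone.utc)
    rpo_minutes = int(rpo.total_seconds() // 60)
    alerts = []
    for name, timestamp in sorted(last_replicated.items()):
        lag_minutes = int((now - timestamp).total_seconds() // 60)
        if now - timestamp > rpo:
            alerts.append(
                f"{name}: last replicated {lag_minutes} min ago, "
                f"exceeds RPO of {rpo_minutes} min"
            )
    return alerts


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical replication timestamps for illustration only.
    status = {
        "web01": now - timedelta(minutes=5),
        "db01": now - timedelta(minutes=45),
    }
    for alert in replication_alerts(status, rpo=timedelta(minutes=15), now=now):
        print(alert)
```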

Documentation and writing the runbooks

During the testing and set up phase, you’ll have your experts working on it. You’re not rushing it, so it will be the same people who know the infrastructure very well working on it.

They need to document the recovery and write the runbook so anyone can execute that recovery. Anyone in the team should be able to pick it up and follow it. For businesses with small IT teams, that can mean writing the runbook for senior management to enact.

Hopefully, in these cases you will be using a DRaaS service. You can get the service provider to carry out their part of the recovery, while senior management handles the communication, movement of staff etc.