Recovery Time Objective (RTO) is a key metric for any company putting together a business recovery plan. It makes its way into business continuity plans, business impact assessments, and all discussions surrounding data protection. No wonder then that when evaluating vendors for data protection and recovery, a key element is understanding the RTO you can achieve with each vendor.
Surprisingly, most companies report not being able to achieve their desired RTOs, even though data protection and recovery has been a top IT priority among midsize companies for the past 5 years. If RTO is such a key criterion, and IT professionals use it to vet the technologies they deploy, what accounts for this apparent discrepancy?
There are three elements to consider in this RTO conundrum:
1. Single-instance recovery
2. Full site recovery
3. Recovery orchestration
Single-instance recovery means the recovery operation for a single instance of a physical or virtual machine that is being protected. Imagine a server that crashed and you have to spin up an image of that single server to continue working.
Full site recovery entails the recovery operation for your entire data center, including domain controllers, IP addresses, and all the elements that interconnect your network and keep your databases and applications accessible. Floods, power outages, viruses, and natural events could all create this disaster-like scenario and force you to start DR operations.
Recovery orchestration is the management of each type of recovery operation from the moment you identify the need until your failed systems are fully operational and business can resume as usual. This is the process of actually going through the motions of getting servers back online, network settings configured, and giving everyone the green light to get back ‘online’.
I am separating these three components of recovery because it is important for you to understand how to evaluate technology with all three in mind.
Smoke and Mirrors
Go to any data protection vendor’s website and you will see hefty claims. Instant recovery, 15-minute RTO, seconds to spin up a VM of your system, and more. Are these claims true? Yes, but you have to understand the context in which they are made. For example, is the ‘seconds RTO’ the time it takes for a VM image to spin up after everything has been configured and you are just pressing the “spin up VM” button? Does the recovery time apply to any image snapshot or only to those for certain operating systems? Is it for local or cloud/off-site recovery?
What we have seen is that companies are falling short when it comes to fully understanding vendor claims and mapping those claims back to their own business needs and processes. So here’s how you can use the three elements I described above to get the true RTO from a vendor.
First, understand what the recovery operation is like, as well as any and all caveats that might exist, including but not limited to operating system, data size, and whether it is a local or off-site recovery.
Second, ask the vendor to walk you through the process using a simple scenario (e.g. MS Exchange Server crashed, let’s bring it back online), and make sure to identify any particular characteristics that mimic your existing infrastructure (e.g. virtual or physical server, data size, and recovery from a different point in time).
Finally, map out how you would conduct the recovery operation for one and for multiple servers, from beginning to end. A vendor claiming RTO in seconds may be able to achieve that for one virtual image, but when spinning up a few dozen snapshots the RTO may increase considerably. A “bootable VM” may require you to have a spare server (or space in an existing host) available, so it starts in just seconds only after you have spent an hour or so transferring that image to new hardware. Add to that the ‘orchestration’ variable, meaning the ease of use of the product and the steps you would take to set up and execute recovery operations, and you can see that the initial claim may no longer hold true. This becomes especially important when simulating an entire site recovery, and even more so if you have to travel to an off-site location instead of just using the cloud, or if your cloud-based DR provider requires you to contact them for help with a full site disaster.
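To make that mapping concrete, here is a minimal sketch in Python of how the pieces might add up. Everything in it is an assumption for illustration: the function name estimate_rto_minutes, its parameters, and all the numbers are hypothetical and do not come from any vendor. It also assumes that servers within a batch recover in parallel and that batches run one after another, which may not match how a given product behaves.

```python
import math


def estimate_rto_minutes(servers, boot_min, transfer_min, network_setup_min,
                         declaration_wait_min=0.0, max_concurrent=None):
    """Rough end-to-end recovery time, in minutes, for `servers` machines.

    Simplifying assumptions (hypothetical model, not a vendor formula):
    - each server needs `transfer_min` to move its image plus `boot_min` to start
    - servers in the same batch recover in parallel; batches run sequentially
    - network setup and any disaster-declaration wait happen once, up front
    """
    if servers <= 0:
        raise ValueError("servers must be a positive number")
    if max_concurrent is not None and max_concurrent <= 0:
        raise ValueError("max_concurrent must be positive when given")

    per_server = transfer_min + boot_min
    batch_size = servers if max_concurrent is None else min(servers, max_concurrent)
    batches = math.ceil(servers / batch_size)
    return declaration_wait_min + network_setup_min + batches * per_server


# One server: the VM boots in 30 seconds, but only after a one-hour image transfer.
single = estimate_rto_minutes(servers=1, boot_min=0.5, transfer_min=60,
                              network_setup_min=0)

# Full site: 36 servers, at most 12 at a time, 45 minutes of network setup,
# and a 30-minute wait after declaring the disaster to the provider.
full_site = estimate_rto_minutes(servers=36, boot_min=0.5, transfer_min=60,
                                 network_setup_min=45, declaration_wait_min=30,
                                 max_concurrent=12)

print(single)     # 60.5
print(full_site)  # 256.5
```

With these made-up inputs, the 30-second boot is a small fraction of the total. Replace the numbers with the ones the vendor gives you during the walkthrough to see which step dominates in your case.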
The Magic Behind the Curtain
So here’s how you can get the answers that will determine what the true RTO really is. Start by asking the following questions:
– Can I fail over my entire data center to your cloud or off-site facility without having to contact you first? (Some vendors require you to fill out a ‘disaster declaration’ form and follow specific procedures.)
– Can I handle the failover process myself or do I have to request professional services or support assistance? If assistance is required, what is the SLA?
– Is there a limit to how many servers I can spin up or failover simultaneously?
– Am I able to set up a ‘virtual network’ that lets me configure DNS, IP addresses, port forwarding, and all other networking elements? How long does this process take?
– Will the servers that I have failed over or spun up be able to ‘talk’ to each other, or will they be isolated?
– Once I bring my entire data center ‘online’ after the failover process, how do I ensure data changes are being tracked for when I have to fail back?
– What are the fees associated with ‘lighting up’ my entire data center in your cloud or offsite facility?
This simple exercise will give you insight into the entire recovery process and allow you to validate RTO claims, as well as think about what your internal processes will require once you have the solution implemented. Another key aspect to consider is how easy it will be to actually test recovery operations without impacting production systems, which opens up another set of questions for the BCDR vendors you might be evaluating.
When it comes to Recovery Time Objective and vendor claims, it is important to take the full picture into consideration so you can accurately assess the real time to recover your data and applications.