It’s a sad story, one that’s been told many times:
Important data wasn’t backed up. Then a tragic event struck. It was a drive crash on a critical server, or a lost laptop, or a natural disaster that wiped out an entire office. All data was lost, which had dire consequences for the affected business.
Data loss is absolutely avoidable, yet an astonishing 66 percent of businesses don’t have a business continuity or disaster recovery strategy. (See The VAR’s Path to Managed Services for other interesting statistics about data loss and the cost of downtime.) But protection isn’t solely about running backups. A backup is really just a means to an end for ensuring the uptime of business-critical systems. It’s a mechanism that only partially answers the question, “Am I protected?” So, what else is needed?
A Matter of Tolerance
While a recovery point objective (RPO) defines how much recent data, measured in time, a business can afford to lose, and therefore how often backups must run, it’s important to understand one’s tolerance for falling outside the objective. For example, if a nightly backup of business records misses a day, is that acceptable? How about two days? Or a week?
The answer depends on the business’s needs and the criticality of the data in question. Consciously defining the tolerance for a backup being out of bounds establishes the thresholds at which corrective action must be taken.
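To make the idea of thresholds concrete, here is a minimal Python sketch that classifies a backup job by the age of its last successful run. The function name, the three status labels, and the example RPO and tolerance values are assumptions chosen for illustration; they are not taken from any particular backup product.

```python
from datetime import datetime, timedelta


def classify_backup(last_success: datetime, now: datetime,
                    rpo: timedelta, tolerance: timedelta) -> str:
    """Return 'green', 'yellow', or 'red' for a backup job.

    green:  the last good backup is within the RPO
    yellow: the RPO has been missed, but by no more than the tolerance
    red:    the tolerance has been exceeded; corrective action is due
    """
    age = now - last_success
    if age <= rpo:
        return "green"
    if age <= rpo + tolerance:
        return "yellow"
    return "red"


# Example (assumed values): nightly backups, so a 24-hour RPO,
# with a tolerance of two further days before action is required.
now = datetime(2024, 1, 10, 8, 0)
rpo = timedelta(hours=24)
tolerance = timedelta(days=2)

print(classify_backup(datetime(2024, 1, 10, 1, 0), now, rpo, tolerance))  # green
print(classify_backup(datetime(2024, 1, 8, 1, 0), now, rpo, tolerance))   # yellow
print(classify_backup(datetime(2024, 1, 3, 1, 0), now, rpo, tolerance))   # red
```

Different data sets would normally get different values: a transactional database might tolerate only a few hours out of bounds, while an archive share might tolerate a week.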
Management by Exception
A good way to stay informed when backups go out of bounds is to set up an alerting structure using the most appropriate mechanisms for a particular IT operation. There are several very effective techniques:
- For managed services providers (MSPs), Remote Monitoring and Management (RMM) tools offer customizable dashboards and reports for overseeing the health of multiple clients’ networks. You can configure RMM tools to reflect when protection is in or out of bounds, thereby providing a convenient “green-yellow-red” indicator of status.
- IT teams can extend infrastructure monitoring tools to fire an alert when a backup exception occurs. Using SNMP for this is the most common approach.
- Service ticket applications, such as those found in Professional Services Automation (PSA) tools, provide a means of initiating corrective actions. The idea is for the backup system to automatically open a service ticket when something requires attention.
- If you are not using any of the tools mentioned above, then you should at least take advantage of the native alerting and reporting tools within your chosen data protection system.
Check out Three’s Company: The Value of RMM and PSA Integrations for more discussion on managing backups by exception.
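As a rough illustration of management by exception, the sketch below takes the status of many backup jobs and produces a ticket only for those that are out of bounds. The `BackupStatus` record, the ticket fields, and the priority names are all hypothetical; a real RMM or PSA tool defines its own schema and API, which this example does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass
class BackupStatus:
    client: str
    job: str
    state: str  # "green", "yellow", or "red" (hypothetical labels)


def tickets_for_exceptions(statuses):
    """Build one ticket per out-of-bounds job; healthy jobs produce nothing."""
    priority = {"yellow": "normal", "red": "urgent"}
    tickets = []
    for status in statuses:
        if status.state == "green":
            continue  # in bounds: no one needs to look at this job
        tickets.append({
            "client": status.client,
            "summary": f"Backup job '{status.job}' is {status.state}",
            # Treat any unrecognized state as urgent rather than ignoring it.
            "priority": priority.get(status.state, "urgent"),
        })
    return tickets


statuses = [
    BackupStatus("Acme Dental", "file-server-nightly", "green"),
    BackupStatus("Acme Dental", "sql-hourly", "red"),
    BackupStatus("Birch Law", "laptops-daily", "yellow"),
]

tickets = tickets_for_exceptions(statuses)
for ticket in tickets:
    print(ticket["priority"], "-", ticket["client"], "-", ticket["summary"])
```

The point of the design is that the healthy job generates no output at all, so attention goes only to the two jobs that need it.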
Trust, but Verify
Backups are running within spec, and monitoring is set up for management by exception. Good. But are you certain that critical data can be restored when needed?
It’s wise to periodically verify that backups are behaving as expected through restoration testing. Think of it as a disaster preparedness drill. Periodic verification uncovers issues caused by changes in the IT environment, such as the movement of critical data to a new volume. Even spot-checking backups for expected data reveals common problems, including improperly configured network shares or file exclusion filters.
Additionally, testing server image failovers saves time when the call comes for a real failover. Keep in mind that simply booting to the login screen doesn’t guarantee a server image is healthy, since critical application data could be missing or damaged.
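One way to automate a spot check is to restore to a scratch location and compare a handful of known files against checksums recorded earlier. The sketch below assumes such a manifest exists, as a mapping of relative path to SHA-256 digest; the manifest format, function names, and file names are hypothetical and not tied to any backup product.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restore_root: Path, manifest: dict) -> list:
    """Compare restored files to a manifest of {relative_path: sha256}.

    Returns a list of problems; an empty list means the spot check passed.
    """
    problems = []
    for rel_path, expected in manifest.items():
        candidate = restore_root / rel_path
        if not candidate.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(candidate) != expected:
            problems.append(f"changed: {rel_path}")
    return problems


# Demo: a scratch "restore" that contains one of the two expected files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ledger.csv").write_bytes(b"id,amount\n1,100\n")
    manifest = {
        "ledger.csv": hashlib.sha256(b"id,amount\n1,100\n").hexdigest(),
        "payroll.csv": hashlib.sha256(b"placeholder").hexdigest(),
    }
    problems = verify_restore(root, manifest)

print(problems)  # ['missing: payroll.csv']
```

A checksum comparison only suits files that are not expected to change between the manifest and the backup. For data that changes constantly, checking that the file is present and opens in its application is a more realistic test.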
A Happy Story
This article opened with a sad story. Let’s close with a happy one:
Important data was backed up within tolerances, with no long-neglected issues. Critical server images and data were routinely verified to be in working order. Then a tragic event struck. It was a drive crash on a critical server, or a lost laptop, or a natural disaster that wiped out an entire office. Data was quickly restored and the affected business was up and running in no time.
Which story do you prefer?
Todd Scallan is Axcient Vice President of Products