Given the financial and reputational threats of unplanned data center outages, it’s time to start thinking about IT Resilience, an approach that John Morency, Research Vice President at Gartner, has described as “the new disaster recovery.” With IT Resilience, CIOs and IT staff evolve from Disaster Recovery, Business Continuity and Continuity of Operations models, which are inherently reactive, to one that is preventative and proactive, designed and executed so that organizations can ride through an adverse event.
Make no mistake. The IT decision-maker who interprets the difference between IT Resilience and Disaster Recovery as merely a question of semantics does so at great peril to his or her organization.
A 2016 Ponemon Institute report estimated that the mean cost of an unplanned data center outage is $740,000 with a maximum of approximately $2.4 million. Crunching numbers further, a study by the Aberdeen Group found that the cost of downtime per hour for medium-sized companies was $216,000 and for large enterprises $686,000.
If these statistics fail to grab your attention, you are not alone. Despite the economic fallout caused by unplanned outages due to natural or man-made catastrophic events, the Disaster Recovery Preparedness Council found that nearly three-quarters of organizations worldwide aren’t adequately protecting their data and systems.
But you might also want to consider this: According to the Federal Emergency Management Agency (FEMA), more than 40 percent of businesses never reopen after a disaster, and of those that do, only 29 percent are still operating after two years. As for those that lose their information technology for nine days or more after an adverse event? These companies enter bankruptcy within a year.
Business disruption, lost productivity and lost revenue, to say nothing of the injury to brand reputation compounded by 24-hour news cycles and social media, are all too frequent outcomes even when a company does manage to stay viable following an unplanned outage. When an 11-hour IT system outage occurred last December at Atlanta’s Hartsfield-Jackson International Airport, the mishap not only cost the affected airline $50 million, but forced it to cancel some 1,400 flights and lose an incalculable number of brand champions whose loyalty will likely never be recovered.
But whether the cause is equipment failure, cyberattack, extreme weather, human error, or a power outage, many companies have poor insight into whether they can fully recover from an extended outage, because their Disaster Recovery plan is rarely tested or scores low marks in preparedness.
According to a Disaster Recovery Preparedness Council Survey:
- 73% of survey participants worldwide scored ratings of either D or F in disaster readiness
- 60% of those surveyed do not have a fully documented Disaster Recovery plan
- 40% admitted that the Disaster Recovery plan they currently have did not prove useful
And even if your firm does have a well-planned and oft-rehearsed Disaster Recovery plan, Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs) can be damagingly expensive if not targeted within the optimal parameters.
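To make the RPO/RTO trade-off concrete, here is a minimal Python sketch. It assumes the usual definitions (RPO as the maximum tolerable window of data loss, RTO as the maximum tolerable time to restore service). The class, function names and sample plan are hypothetical, and the hourly downtime figure is simply the Aberdeen Group estimate for medium-sized companies quoted earlier.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_hours: float  # time between backups or replication points
    restore_time_hours: float     # estimated time to bring service back


def worst_case_data_loss_hours(plan: RecoveryPlan) -> float:
    # Anything written just after the last backup is lost if the failure
    # happens right before the next one, so the interval is the worst case.
    return plan.backup_interval_hours


def meets_objectives(plan: RecoveryPlan, rpo_hours: float, rto_hours: float) -> bool:
    # A plan is acceptable only if it satisfies both objectives.
    return (worst_case_data_loss_hours(plan) <= rpo_hours
            and plan.restore_time_hours <= rto_hours)


def downtime_cost(plan: RecoveryPlan, cost_per_hour: float) -> float:
    # Rough cost of one outage: restore time multiplied by hourly downtime cost.
    return plan.restore_time_hours * cost_per_hour


# Hypothetical plan: nightly backups, eight hours to restore.
nightly = RecoveryPlan(backup_interval_hours=24.0, restore_time_hours=8.0)

print(meets_objectives(nightly, rpo_hours=4.0, rto_hours=2.0))
print(downtime_cost(nightly, cost_per_hour=216_000))
```

Running a few candidate plans through a check like this shows where a tighter objective would call for more frequent replication or faster restore capability, and what an outage might cost if the objective is missed.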
Given the above-mentioned stakes, liabilities and pressures, it’s high time that CIOs prepare for unplanned outages and adverse events from a position that is both proactive and preventative. It’s time to implement a strategy of IT Resilience.
Let’s Level Set
Broadly defined, IT Resilience is an organization’s ability to maintain acceptable levels of service regardless of what challenges may occur. But before we look at IT Resilience in greater depth, let’s establish some agreed-upon definitions concerning Disaster Recovery, Business Continuity and Continuity of Operations to better understand where IT Resilience fits.
- Disaster Recovery (DR) is a set of policies, procedures and physical assets deployed by an organization that enable the recovery or continuation of vital technology infrastructure and systems following a disaster or negative event.
- Business Continuity (BC) as defined by the Business Continuity Institute is a plan to deal with difficult situations so an organization can continue to function with as little disruption as possible. A key component is the Business Continuity Plan, or BCP, which sets forth the policies, processes, procedures and instructions that enable a business to respond to a disaster.
- Continuity of Operations (COOP), which was developed in accordance with presidential and U.S. Department of Homeland Security directives, is an effort within individual executive departments and federal agencies to ensure that Primary Mission Essential Functions (PMEFs) continue to be performed during a wide range of adverse events, including localized acts of nature, accidents and technological or attack-related emergencies. The concept of COOP has since evolved beyond federal agencies and been adopted by non-governmental organizations and institutions, such as hospitals, colleges and universities.
In this framework, both Disaster Recovery and Continuity of Operations are considered to be subsets of Business Continuity.
IT Resilience, a Paradigm Shift
Be assured, unplanned data center and system outages and other adverse events are not a question of if, but when. That said, with a carefully and intelligently designed IT Resilience strategy that integrates distributed data centers, the physical security of the properties on which they sit, low-latency network connectivity, and data replication capabilities, these events needn’t become debilitating to an organization’s ability to continue doing business.
But IT Resilience isn’t just about implementing geographically diverse colocation sites for failover and deploying automated network configuration backups. It’s also about integrating data centers and advanced technology solutions with policies, processes and people working in concert at any point in the crisis cycle.
A useful analogy to better understand the differences between traditional Disaster Recovery and IT Resilience is to consider the difference between brittle materials and ductile materials. When a low load is applied to a brittle material, such as glass, the material will come back to its original shape after the load is lifted, but at moderately high loads, the fracture is permanent – glass breaks. In contrast, ductile materials, such as aluminum or steel, will go through various stages before fracture and can be bent or stretched into wire while maintaining elasticity.
The goal of IT Resilience is to factor in data from Business Impact Analysis (BIA) and determine the risks and relative costs associated with anticipated threats, and then design organizations and systems that can withstand an adverse event and thus avoid recovery. While no system has unlimited elastic resilience, the new paradigm is to enable people, processes and technology to undergo stress, stretch and “bend to the event,” but not break and cause permanent damage to the organization’s ability to continue doing business.
Another way to look at the difference between IT Resilience and Disaster Recovery is to consider these concepts in relation to the building engineering sector. With resiliency, engineers employ earthquake-resistant construction methods and materials to withstand seismic events, while recovery involves earthquake response teams scouting and assessing structural damage. IT Resilience, therefore, is prevention and the ability to ride through unexpected events when they occur. So, let’s take a look at some of the cornerstones of a resilient organization.
IT Resilience Closes the Gap Between Business Continuity and Disaster Recovery
An effective IT Resilience strategy guides CIOs and their IT teams to close gaps in existing BC and DR plans across various components of IT, from networks to data centers to applications, in a more deliberate, methodical and constructive way. Defending systems entails more than merely securing them; it also means taking measures that reduce the probability of system failure. These steps could include load balancing servers to prevent an overload, or providing redundant systems that eliminate single points of failure.
Other proactive and preventative measures that improve resiliency include real-time traffic analysis that allows IT to spin up workloads and draw down capacity on demand, container movement to protected service regions, and deploying VMware vMotion, which enables the live migration of running virtual machines from one physical server to another with zero downtime, continuous service availability and no disruption to end users.
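As a rough illustration of how load balancing and redundancy work together, the following Python sketch rotates requests across a pool of servers and skips any that have been marked unhealthy. It is a simplified, hypothetical model (the class name and server names are invented for the example), not a description of any particular load balancer product.

```python
class RoundRobinBalancer:
    """Rotate requests across redundant servers, skipping unhealthy ones."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._next = 0

    def mark_down(self, server):
        # Called when a health check fails for this server.
        self._healthy.discard(server)

    def mark_up(self, server):
        # Called when a previously failed server passes its health check.
        if server in self._servers:
            self._healthy.add(server)

    def pick(self):
        # Try each server at most once, starting from the rotation pointer.
        for _ in range(len(self._servers)):
            candidate = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_down("app-2")  # simulate a failed server
picks = [balancer.pick() for _ in range(4)]
print(picks)
```

Because the pool holds more than one server, the loss of "app-2" removes capacity but not the service itself; requests simply continue on the remaining servers.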
Early Detection Is Critical to Data Center Resilience
Many organizations have no effective tools or processes in place to alert IT staff to service disruptions in the data center. This is a major but remediable flaw, because the sooner IT staff are alerted that a system has gone down, the sooner they can remediate the problem. Beyond reporting a system outage, a monitoring solution that gauges the performance of physical servers and their specific applications and services can help IT staff understand and address problems before they cause a full disruption.
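The idea of catching degradation before it becomes an outage can be sketched in a few lines of Python. The metric names, thresholds and service names below are assumptions chosen for illustration; a real monitoring tool would collect these samples from live systems and route the alerts to on-call staff.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    service: str
    reachable: bool      # did the service answer its health check?
    cpu_percent: float   # CPU utilization of the host
    response_ms: float   # application response time


def evaluate(sample, cpu_warn=85.0, latency_warn_ms=500.0):
    """Classify one sample as 'critical', 'warning' or 'ok'."""
    if not sample.reachable:
        return "critical"  # the service is already down
    if sample.cpu_percent >= cpu_warn or sample.response_ms >= latency_warn_ms:
        return "warning"   # degraded: act before it becomes an outage
    return "ok"


def alerts(samples):
    """Return (service, level) pairs for every sample that is not 'ok'."""
    found = []
    for sample in samples:
        level = evaluate(sample)
        if level != "ok":
            found.append((sample.service, level))
    return found


# Hypothetical readings from three services.
readings = [
    Sample("billing-db", reachable=True, cpu_percent=40.0, response_ms=120.0),
    Sample("web-frontend", reachable=True, cpu_percent=92.0, response_ms=300.0),
    Sample("auth-service", reachable=False, cpu_percent=0.0, response_ms=0.0),
]
print(alerts(readings))
```

The "warning" level is the part that supports resilience: it surfaces a stressed but still running service while there is time to add capacity or move the workload.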
Moving Beyond Recovery to Ride Through
A detailed plan for addressing the effects of a disruption provides the foundation of IT Resilience. Historically, the focus of BC has been on IT Disaster Recovery: how to restore in the event of failure. But the ultimate goal of resilience is to ensure that when systems fail, IT can still provide essential services. Today, technology has evolved so that the focus is no longer on how to restore but on how to ride through events. As some industry experts have recognized, advanced solutions such as data replication, continuous data protection and snapshotting can help organizations enhance resiliency and proactively avoid recovery situations.
Additionally, organizations can achieve continuous availability even in the face of an unplanned outage by running “active-active” data centers, in which two data centers both service business-critical applications, and databases, storage and security policies are kept in sync across both facilities. If a server at a colocation site in the Atlanta data center goes down, a backup server at a data center facility in Chattanooga can take over almost instantaneously.
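A toy model of the active-active pattern may help. In the Python sketch below, every write is applied to both sites while they are up, so either one can answer a read if the other fails. The classes and the sample key are hypothetical, and the model deliberately leaves out what a production design must handle, such as resynchronizing a site that missed writes while it was down.

```python
class Site:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class ActiveActivePair:
    """Two sites that both serve traffic and hold the same data."""

    def __init__(self, site_a, site_b):
        self.sites = [site_a, site_b]

    def write(self, key, value):
        # Apply the write to every site that is currently up.
        live = [site for site in self.sites if site.up]
        if not live:
            raise RuntimeError("both sites are down")
        for site in live:
            site.data[key] = value

    def read(self, key, preferred):
        # Try the preferred site first, then fall back to the other one.
        ordered = sorted(self.sites, key=lambda site: site.name != preferred)
        for site in ordered:
            if site.up:
                return site.name, site.data.get(key)
        raise RuntimeError("both sites are down")


atlanta = Site("atlanta")
chattanooga = Site("chattanooga")
pair = ActiveActivePair(atlanta, chattanooga)

pair.write("order-42", "paid")
atlanta.up = False  # simulate an outage at the Atlanta site
served_by, value = pair.read("order-42", preferred="atlanta")
print(served_by, value)
```

Because the data was already present at the second site before the failure, the read is answered without any restore step, which is the sense in which the organization rides through the event rather than recovering from it.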
In an upcoming article, we’ll explore the essential elements of IT Resilience in both broader and more granular detail. We’ll also examine how DC BLOX leverages its colocation facilities, intelligent high-speed network, Cloud Storage and Cloud Ramp solutions to fortify IT Resilience, enabling our customers to effectively ride through an adverse event and carry on with their businesses.
DC BLOX builds Tier 3 data centers in growing, underserved markets in the Southeastern United States, connects them with a high-speed optical network, and hosts cloud services to enable area businesses to effectively serve their local customers, efficiently scale their infrastructure, and ensure business continuity. Current DC BLOX data center locations are in Atlanta and Chattanooga, Tenn., while its newest data center is under construction in Huntsville, Ala. The company plans to build three additional data centers through 2019 in potential markets including Birmingham, Ala., Greensboro/Winston-Salem, N.C., and Greenville, S.C.