Disaster Recovery


 

 

1.0       Disaster recovery (DR) is an organization's ability to respond to and recover from an event that negatively affects business operations.

Many smaller businesses are heavily reliant on technology to run daily operations in the workplace. Many of these organisations either do not have the budget required to implement a full-blown disaster recovery plan on their own, or they are unaware of the importance of having a disaster recovery plan in place.

If your business has become heavily reliant on high-availability technology, staff and customers will have little tolerance for disruption, especially when a disaster causes extensive downtime.

 

1.1       Disaster recovery (DR) is an organization's ability to restore access and functionality to IT infrastructure after a disaster event, whether natural or caused by human action (or error). It can also be described as the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. It employs policies, tools, and procedures.

Disaster recovery is also a planning framework that helps to ensure that a business can withstand a disaster. One example of a disaster recovery strategy is data backup, which helps businesses recover lost data after accidental deletion or a cyber attack.
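As a minimal sketch of the data backup strategy mentioned above, the following Python example copies a file to a backup folder under a timestamped name, verifies the copy with a checksum, and restores it on demand. The function names, the ".bak" naming scheme and the file names in the demo are assumptions made for illustration; a production backup would normally also be stored off-site.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir under a timestamped name and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.name}.{stamp}.bak"
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target


def restore_file(backup: Path, destination: Path) -> None:
    """Restore a backup copy over a lost or damaged file."""
    shutil.copy2(backup, destination)


if __name__ == "__main__":
    # Demo in a temporary folder: back up, simulate deletion, restore.
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "customers.csv"
        original.write_text("id,name\n1,Ada\n")
        copy = back_up_file(original, Path(tmp) / "backups")
        original.unlink()                      # accidental deletion
        restore_file(copy, original)
        print(original.read_text())
```

The checksum comparison is there because a backup that cannot be read back correctly offers no protection when it is needed.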

Disasters can be viewed as recurring events with four phases: mitigation, preparedness, response, and recovery. These common elements allow you to prepare for and protect yourself from disaster.

 

Every business disaster has one or more causes and effects. The causes can be natural, human or mechanical in origin, ranging from events such as the malfunction of a tiny hardware or software component to universally recognized events such as earthquakes, fire, and flood. Effects of disasters range from small interruptions to total business shutdown for days or months, even fatal damage to the business.

When a disaster strikes, the normal operations of the enterprise are suspended and replaced with operations spelled out in the disaster recovery plan. Figure 1 depicts the cycle of stages that lead through a disaster back to a state of normalcy.

 

           Figure 1.    Enterprise Operations Cycle of Disaster Recovery

 

The disaster recovery plan should (1) identify and classify the threats/risks that may lead to disasters, (2) define the resources and processes that ensure business continuity during the disaster, and (3) define the reconstitution mechanism to get the business back to normal from the disaster recovery state, after the effects of the disaster are mitigated. An effective disaster recovery plan plays its role in all stages of the operations as depicted above, and it is continuously improved by disaster recovery mock drills and feedback capture processes.

 

1.2       The objective of a disaster recovery (DR) plan is to ensure that an organization can respond to a disaster or other emergency that affects information systems, and to minimize the effect on business operations.

There are three major types of disaster recovery sites: cold, warm, and hot. An organization should select the one that best suits its needs and mission-critical business operations. Prevention, mitigation, preparedness, response and recovery are the five steps of emergency management.

The goal of disaster recovery methods is to enable the organization to regain use of critical systems and IT infrastructure as soon as possible after a disaster occurs. To prepare for this, organizations often perform an in-depth analysis of their systems and create a formal document to follow in times of crisis. This document is known as a disaster recovery plan.

1.2.1    Disaster

Disasters are inevitable but mostly unpredictable, and they vary in type and magnitude. The best strategy is to have some kind of disaster recovery plan in place, to return to normal after the disaster has struck. For an enterprise, a disaster means abrupt disruption of all or part of its business operations, which may directly result in revenue loss. To minimize disaster losses, it is very important to have a good disaster recovery plan for every business subsystem and operation within an enterprise.

Disasters are serious disruptions that are capable of crippling a business or its operations. The top causes of business disruption, in order, are: power failure, IT hardware failure, network failure, winter storm, human error, flood, IT software failure, fire, hurricane, tornado, earthquake, and terrorism.

The practice of DR revolves around events that are serious in nature. These events are often thought of in terms of natural disasters, but they can also be caused by systems or technical failure or by humans carrying out an intentional attack. They are significant enough to disrupt or completely stop critical business operations for a period of time.

Types of disaster include:

        ● Cyber attacks such as malware, DDoS and ransomware attacks

        ● Sabotage

        ● Power outages

        ● Equipment failure

        ● Epidemics or pandemics, such as COVID-19

        ● Terrorist attacks or threats

        ● Industrial accidents

        ● Hurricanes

        ● Tornadoes

        ● Earthquakes

        ● Floods

        ● Fires

1.3       Classes of Disaster

Although several nuanced classifications of disasters exist, disasters can be classified into two broad categories: natural disasters and man-made disasters.

• Natural disasters: these are very difficult to prevent and include earthquakes, smog, floods, tornadoes, and hurricanes. They can be very costly to businesses. However, risk management, good planning and precautionary measures such as avoiding areas prone to natural disasters can help to avoid significant losses.

• Man-made disasters: these are the more frequently occurring form of disaster. They can be intentional (e.g. cyber attacks, bio-terrorism) or unintentional (e.g. disastrous IT bugs, hazardous substance spills, industrial accidents).

Further, the term technological disaster has been used to describe any disaster that can be attributed, in part or entirely, to human intent, error or negligence, or that involves the failure of a man-made system.

From an IT perspective, disruptions or disasters fall into three major groups:

• Malicious behavior: bomb threats or blasts, biological and chemical attacks, civil unrest, computer viruses, hacking, sabotage, theft, workplace violence, espionage, logic bombs, etc.

• Infrastructure-related: burst pipes, blackouts, environmental hazards, power failures, epidemics, evacuations, etc.

2.0       Disaster Recovery Planning

This section explains the various procedures and methods involved in planning disaster recovery.

2.1       Identification and Analysis of Disaster Risks/Threats

The first step in planning recovery from unexpected disasters is to identify the threats or risks that can bring about disasters by carrying out a risk analysis covering threats to business continuity. Risk analysis (sometimes loosely called business impact analysis, although section 3.2 treats the two as separate assessments) involves evaluating existing physical and environmental security and control systems, and assessing their adequacy with respect to the potential threats.

The risk analysis process begins with a list of the essential functions of the business. This list will set priorities for addressing the risks. Essential functions are those whose interruption would considerably disrupt the operations of the business and may result in financial loss.

These essential functions should be prioritized based on their relative importance to business operations.

While evaluating the risks, it is also useful to consider the attributes of a risk (Figure 2).

 

Figure 2. Risk Attributes

The scope of a risk is determined by the possible damage, in terms of downtime or cost of lost opportunities. In evaluating a risk, it is essential to keep in mind the circumstances surrounding it, such as the time of day or day of the week, which can affect its scope.

The magnitude of a risk may be different considering the affected component, its location, and the time of occurrence. The effects of a disaster that strikes the entire enterprise are different from the effects of a disaster affecting a specific area, office, or utility within the company.
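To illustrate how the time of occurrence changes the scope of a risk, the sketch below estimates the cost of an outage hour by hour. The two hourly loss figures and the 09:00 to 17:00 weekday business window are hypothetical values chosen only for the example, not figures from any study.

```python
from datetime import datetime, timedelta

# Hypothetical figures: revenue lost per hour of downtime.
BUSINESS_HOURS_RATE = 1000.0   # weekdays, 09:00 to 17:00
OFF_HOURS_RATE = 100.0         # nights and weekends


def hourly_rate(moment: datetime) -> float:
    """Return the assumed loss for the hour that starts at `moment`."""
    is_weekday = moment.weekday() < 5          # Monday=0 ... Friday=4
    in_office_hours = 9 <= moment.hour < 17
    if is_weekday and in_office_hours:
        return BUSINESS_HOURS_RATE
    return OFF_HOURS_RATE


def downtime_cost(start: datetime, hours: int) -> float:
    """Sum the hourly loss over an outage of whole hours starting at `start`."""
    return sum(hourly_rate(start + timedelta(hours=h)) for h in range(hours))


if __name__ == "__main__":
    tuesday_morning = datetime(2024, 1, 2, 10, 0)    # a Tuesday
    saturday_morning = datetime(2024, 1, 6, 10, 0)   # a Saturday
    print(downtime_cost(tuesday_morning, 4))    # four business hours
    print(downtime_cost(saturday_morning, 4))   # four weekend hours
```

Under these assumed rates, the same four-hour outage costs ten times more on a weekday morning than at the weekend, which is why the timing of a risk belongs in its evaluation.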

 

2.2       Classification of Risks Based on Relative Weights

When evaluating risks, it is recommended to categorize them into different classes to accurately prioritize them. In general, risks can be classified into the following five categories.

 

2.2.1    External Risks

External risks are those that cannot be associated with a failure within the enterprise. They are very significant in that they are not directly under the control of the organization that faces the damages. External risks can be split into four subcategories:

Natural: These disasters are on top of the list in every disaster recovery plan. Typically they damage a large geographical area. To mitigate the risk of disruption of business operations, a recovery solution should involve disaster recovery facilities in a location away from the affected area.

Human caused: These disasters include acts of terrorism, sabotage, virus attacks, operations mistakes, crimes, and so on. They also include the risks resulting from man-made structures. They may be caused by both internal and external persons.

Civil: These risks typically are related to the location of the business facilities. Typical civil risks include labour disputes ending in strikes, communal riots, local political instability, and so on. These again may be internal to the company or external.

Supplier: These risks are tied to the capacity of suppliers to maintain their level of services in a disaster. It is appropriate that a backup supplier pool be maintained in case of emergency.

 

2.2.2    Facility Risks

Facility risks are risks that affect only local facilities. While evaluating these risks, the following essential utilities and commodities need to be considered.

 

Electricity: To analyze the power outage risk, it is important to study the frequency of power outages and the duration of each outage. It is also useful to determine how many power feeds operate within the facility and, if necessary, make the power system redundant.

Telephones: Telephones are a particularly crucial service during a disaster. A key factor in evaluating risks associated with telephone systems is to study the telephone architecture and determine if any additional infrastructure is required to mitigate the risk of losing the entire telecommunication service during a disaster.

Water: There are certain disaster scenarios where water outages must be considered very seriously, for instance the impact of a water cutoff on computer cooling systems.

Climate Control: Losing the air conditioning or heating system may produce different risks that change with the seasons.

Fire: Many factors affect the risk of fire, for instance the facility’s location, its materials, neighboring businesses and structures, and its distance from fire stations. All of these and more must be considered during risk evaluation.

Structural: Structural risks may be related to design flaws, defective material, or poor-quality construction or repairs.

Physical Security: Security risks have gained attention in recent years, and nowadays security is a mandatory 24-hour measure to protect each and every asset of the company from both outsiders and employees. Different secure access and authorization procedures, manual as well as automated ones, are enforced in enterprises. Factors such as workplace violence, bomb threats, trespassing, sabotage, and intellectual property loss are also considered during the security risk analysis.

 

2.2.3    Data Systems Risks

Data systems risks are those related to the use of shared infrastructure, such as networks, file servers, and software applications that could impact multiple departments. A key objective in analyzing these risks is to identify all single points of failure within the data systems architecture.
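One way to look for single points of failure is to model the shared infrastructure as a graph and test what happens when each component is removed. The sketch below does this with a breadth-first search; the topology and component names are hypothetical and stand in for a real inventory of networks, servers and storage.

```python
from collections import deque


def is_reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(graph, client, service):
    """Return components whose loss alone cuts the client off from the service."""
    return sorted(
        node for node in graph
        if node not in (client, service)
        and not is_reachable(graph, client, service, removed=node)
    )


# Hypothetical topology: two redundant switches, but only one file server.
TOPOLOGY = {
    "users": ["switch_a", "switch_b"],
    "switch_a": ["users", "file_server"],
    "switch_b": ["users", "file_server"],
    "file_server": ["switch_a", "switch_b", "storage_array"],
    "storage_array": ["file_server"],
}

if __name__ == "__main__":
    print(single_points_of_failure(TOPOLOGY, "users", "storage_array"))
```

In this assumed layout the switches are redundant, so only the file server is reported. Removing one node at a time is simple rather than fast, which is acceptable for the small graphs typical of a departmental review.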

Data systems risks can also be due to inappropriate operation processes. Operations that have run for a long period of time on obsolete hardware or software are a major risk given the lack of spares or support. Recovery from this type of failure may be lengthy and expensive due to the need to replace or update software and equipment and retrain personnel.

Data systems risks may be evaluated within the following subcategories:

       ● Data communication network

       ● Telecommunication systems and network

       ● Shared servers

       ● Virus

       ● Data backup/storage systems

       ● Software applications and bugs

 

2.2.4    Departmental Risks

Departmental risks are the failures within specific departments. These would be events such as a fire within an area where flammable liquids are stored, or a missing door key preventing a specific operation.

An effective departmental risk assessment needs to consider all the critical functions within that department, key operating equipment and vital records whose absence or loss will compromise operations. Unavailability of skilled personnel also can be a risk. The department should have necessary plans to have skilled backup personnel in place.

 

2.2.5    Desk-Level Risks

Desk-level risks are all the risks that would limit or stop the day-to-day personal work of an individual employee. The assessment at this layer may feel a little like an exercise in paranoia. Every process and tool that makes up an individual's job must be examined carefully and treated as essential.
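The five categories above can be combined with relative weights to rank risks. The sketch below is one possible scheme, not a standard: the category weights, the 1 to 5 likelihood and impact scales, and the sample risks are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical relative weights for the five risk categories.
CATEGORY_WEIGHTS = {
    "external": 5,
    "facility": 4,
    "data_systems": 3,
    "departmental": 2,
    "desk_level": 1,
}


@dataclass(frozen=True)
class Risk:
    name: str
    category: str
    likelihood: int   # assumed scale: 1 (rare) to 5 (frequent)
    impact: int       # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        """Weighted score: category weight x likelihood x impact."""
        return CATEGORY_WEIGHTS[self.category] * self.likelihood * self.impact


def prioritize(risks):
    """Return risks ordered from highest to lowest weighted score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


if __name__ == "__main__":
    register = [
        Risk("Regional flood", "external", likelihood=1, impact=5),
        Risk("Single file server failure", "data_systems", likelihood=3, impact=4),
        Risk("Missing door key", "departmental", likelihood=2, impact=2),
    ]
    for risk in prioritize(register):
        print(risk.score, risk.name)
```

A multiplicative score is only one choice. An organization may prefer to rank purely by the impact on its essential functions, as described in section 2.1.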

 

3.0       Why is disaster recovery important?

Disasters can inflict many types of damage with varying levels of severity, depending on the scenario. A brief network outage could result in frustrated customers and some loss of business to an e-commerce system. A hurricane or tornado could destroy an entire manufacturing facility, data center or office.

Additionally, many businesses are required to create and follow plans for disaster recovery, business continuity and data protection in order to meet compliance regulations. This is particularly important for organizations operating in financial, healthcare, manufacturing and government sectors. Thinking about disasters before they happen and creating a plan for how to respond can provide many benefits. It raises awareness about potential disruptions and helps an organization to prioritize its mission-critical functions. It also provides a forum for discussing these topics in a low-pressure setting and making careful decisions about how best to respond.

3.1       What is the difference between disaster recovery (DR) and business continuity (BC)?

BC is a proactive discipline intended to minimize risk and help ensure the business can continue to deliver its products and services no matter the circumstances. It focuses especially on how employees will continue to work and how the business will continue operations while a disaster is occurring. BC is also closely related to business resilience, crisis management and risk management, but each of these has different goals and parameters.

DR is a subset of business continuity that focuses on the IT systems that enable business functions. It addresses the specific steps an organization must take to resume technology operations following an event. DR is also a reactive process by nature. While planning for it must be done in advance, DR activity is not kicked off until a disaster actually occurs.

3.2       Elements of a disaster recovery strategy

Before an organization can determine its DR strategies, it must first analyze existing assets and priorities. Two different analyses typically factor into DR decision-making:

Risk analysis

Risk analysis or risk assessment is an evaluation of all the potential risks the business could face, as well as their outcomes. Risks can vary greatly depending on the industry the organization is in and its geographic location. The assessment should identify potential hazards, determine who or what these hazards would harm, and use the findings to create procedures that take these risks into account.

Business impact analysis

Business impact analysis (BIA) evaluates the effects of the risks identified above to business operations. A BIA can help predict and quantify costs, both financial and non-financial. It also examines the impact of different disasters on an organization's safety, finances, marketing, business reputation, legal compliance and quality assurance.

Understanding the difference between risk analysis and BIA and conducting both assessments can also help an organization define its goals when it comes to data protection and the need for backup. Organizations generally quantify these using measurements called recovery point objective (RPO) and recovery time objective (RTO).

Recovery point objective

RPO is the maximum age of files that an organization must recover from backup storage for normal operations to resume after a disaster. The RPO determines the minimum frequency of backups. For example, if an organization has an RPO of four hours, the system must back up at least every four hours.
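As a rough illustration of the four-hour example above, the following sketch checks whether a backup history satisfies an RPO by measuring the largest gap between consecutive backups. The function names and the sample timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta


def largest_gap(backup_times):
    """Return the longest interval between consecutive backups."""
    ordered = sorted(backup_times)
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return max(gaps, default=timedelta(0))


def meets_rpo(backup_times, rpo: timedelta) -> bool:
    """True if no two consecutive backups are further apart than the RPO."""
    return largest_gap(backup_times) <= rpo


if __name__ == "__main__":
    day = datetime(2024, 1, 2)
    every_four_hours = [day + timedelta(hours=h) for h in (0, 4, 8, 12)]
    with_a_missed_run = [day + timedelta(hours=h) for h in (0, 4, 10)]
    rpo = timedelta(hours=4)
    print(meets_rpo(every_four_hours, rpo))     # gaps of 4 hours
    print(meets_rpo(with_a_missed_run, rpo))    # one gap of 6 hours
```

The check looks only at the spacing of completed backups. It does not confirm that any individual backup can actually be restored.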

Recovery time objective

RTO refers to the amount of time an organization estimates its systems can be down without causing significant or irreparable damage to the business. In some cases, applications can be down for several days without severe consequences. In others, seconds can do substantial harm to the business.

RPO and RTO are both important elements in disaster recovery, but the metrics have different uses. RPOs are acted on before a disruptive event takes place to ensure data will be backed up, while RTOs come into play after an event occurs.

3.3       Disaster Recovery Plan

Once an organization has thoroughly reviewed its risk factors, recovery goals and technology environment, it can write a DR plan. The DR plan is the formal document that specifies these elements and outlines how the organization will respond when disruption or disaster occurs. The plan details recovery goals including RTO and RPO as well as the steps the organization will take to minimize the effects of the disaster.

The components of a DR plan should include:

         ● A DR policy statement, plan overview and main goals of the plan.

         ● Key personnel and DR team contact information.

         ● A step-by-step description of disaster response actions immediately following an incident.

         ● A diagram of the entire network and recovery site.

         ● Directions for how to reach the recovery site.

         ● A list of software and systems that staff will use in the recovery.

         ● Sample templates for a variety of technology recoveries, including technical documentation from vendors.

         ● A communication plan that includes internal and external contacts, as well as boilerplate for dealing with the media.

         ● Summary of insurance coverage.

         ● Proposed actions for dealing with financial and legal issues.

An organization should consider its DR plan a living document. Regular disaster recovery testing should be scheduled to ensure the plan is accurate and will work when a recovery is required. The plan should also be evaluated against consistent criteria whenever there are changes in the business or IT systems that could affect DR.
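The component list above can be turned into a simple completeness check that is run whenever the plan is revised. The sketch below assumes the plan is held as a dictionary keyed by section; the section names are hypothetical labels for the listed components, not a standard schema.

```python
# Section names are hypothetical labels for the components listed above.
REQUIRED_SECTIONS = (
    "policy_statement",
    "team_contacts",
    "response_steps",
    "network_diagram",
    "site_directions",
    "software_list",
    "recovery_templates",
    "communication_plan",
    "insurance_summary",
    "financial_legal_actions",
)


def missing_sections(plan: dict) -> list:
    """Return required sections that are absent or empty, in checklist order."""
    missing = []
    for name in REQUIRED_SECTIONS:
        value = plan.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    draft_plan = {
        "policy_statement": "Restore order processing within the agreed RTO.",
        "team_contacts": "DR lead: extension 100",
        "response_steps": "   ",   # present but blank, so still reported
    }
    for name in missing_sections(draft_plan):
        print("missing:", name)
```

A check like this only confirms that each section exists. Whether the content is accurate still has to be established through regular DR testing.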

3.4       How disaster recovery works

DR initiatives are more attainable by businesses of all sizes today due to widespread cloud adoption and the availability of virtualization technologies that make backup and replication easier. However, much of the terminology and best practices developed for DR were based on enterprise efforts to recreate large-scale physical data centers. This involved plans to transfer, or fail over, workloads from a primary data center to a secondary location or DR site in order to restore data and operations.

 

3.5       Disaster recovery sites

An organization uses a DR site to recover and restore its data, technology infrastructure and operations when its primary data center is unavailable. DR sites can be internal, external or cloud-based. An organization sets up and maintains an internal DR site. Organizations with large information requirements and aggressive RTOs are more likely to use an internal DR site, which is typically a second data center. When building an internal site, the business must consider hardware configuration, supporting equipment, power maintenance, heating and cooling of the site, layout design, location and staff.

An external disaster recovery site is owned and operated by a third-party provider. External sites can be hot, warm or cold.

A cloud recovery site is another option. An organization should consider site proximity, internal and external resources, operational risks, service-level agreements and cost when contracting with cloud providers to host their DR assets or outsourcing additional services.

3.6       Disaster recovery tiers

In addition to choosing the most appropriate DR site, it may be helpful for organizations to consult the tiers of disaster recovery. The tiers feature a variety of recovery options organizations can use as a blueprint to help determine the best DR approach depending on their business needs.

Another type of DR tiering involves assigning levels of importance to different types of data and applications and treating each tier differently based on the tolerance for data loss. This approach recognizes that some mission-critical functions may not be able to tolerate any data loss or downtime, while others can be offline for longer or have smaller sets of data restored.
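The importance-based tiering described above can be sketched as a rule that maps each application's tolerance for data loss and downtime to a tier. The three tiers, the 15-minute and 4-hour thresholds, and the sample applications are hypothetical; each organization would set its own values.

```python
from datetime import timedelta


def assign_tier(rpo: timedelta, rto: timedelta) -> int:
    """Map an application's tolerances to a tier (1 = most critical).

    The stricter of the two objectives decides the tier.
    Thresholds are illustrative assumptions, not a standard.
    """
    tolerance = min(rpo, rto)
    if tolerance <= timedelta(minutes=15):
        return 1      # near-zero loss or downtime tolerated
    if tolerance <= timedelta(hours=4):
        return 2      # short interruptions tolerated
    return 3          # can stay offline longer


if __name__ == "__main__":
    applications = {
        "payments": (timedelta(0), timedelta(minutes=5)),
        "email": (timedelta(hours=1), timedelta(hours=4)),
        "archive": (timedelta(hours=24), timedelta(days=3)),
    }
    for name, (rpo, rto) in applications.items():
        print(name, "-> tier", assign_tier(rpo, rto))
```

Using the stricter of RPO and RTO keeps an application in a high tier if it is sensitive to either data loss or downtime.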

3.7       Types of disaster recovery

In addition to choosing a DR site and considering DR tiers, IT and business leaders must evaluate the best way to put their DR plan into action. This will depend on the IT environment and the technology the business chooses to support its DR strategy.

Types of DR can vary, based on the IT infrastructure and assets that need protection as well as the method of backup and recovery the organization decides to use. Depending on the size and scope of the organization, it may have separate DR plans and implementation teams specific to departments such as data centers or networking. Major types of DR include:

Data center disaster recovery

Organizations that house their own data centers must have a DR strategy that considers the entire IT infrastructure within the data center as well as the physical facility. Backup to a failover site at a secondary data center or a colocation facility is often a large part of the plan. IT and business leaders should also document and make alternative arrangements for a wide range of facilities-related components including power systems, heating and cooling, fire safety and physical security.

Network disaster recovery

Network connectivity is essential for internal and external communication, data sharing and application access during a disaster. A network DR strategy must provide a plan for restoring network services, especially in terms of access to backup sites and data.

Virtualized disaster recovery

Virtualization enables DR by allowing organizations to replicate workloads in an alternate location or the cloud. The benefits of virtual DR include flexibility, ease of implementation, efficiency and speed. Virtualized workloads have a small IT footprint, replication can be done frequently, and failover can be initiated quickly. Several data protection vendors offer virtual backup and DR as a product.

Cloud disaster recovery

The widespread acceptance of cloud services allows organizations that traditionally used an alternate physical location for DR to host their recovery environment in the cloud instead. Cloud DR goes beyond simple backup to the cloud. It requires an IT team to set up automatic failover of workloads to a public cloud platform in the event of a disruption.
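The decision logic behind automatic failover can be sketched without reference to any particular cloud platform. The example below switches the active site after a number of consecutive failed health checks; the class name, the threshold of three failures and the site labels are assumptions, and a real deployment would rely on the monitoring and routing features of the chosen provider.

```python
class FailoverMonitor:
    """Tracks health checks and decides when to fail over to the cloud site."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active_site = "primary"

    def record_health_check(self, healthy: bool) -> str:
        """Record one check of the primary site and return the active site."""
        if self.active_site == "cloud":
            # Failing back to the primary site is left as a manual decision.
            return self.active_site
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "cloud"
        return self.active_site


if __name__ == "__main__":
    monitor = FailoverMonitor(failure_threshold=3)
    checks = [True, False, False, True, False, False, False, True]
    for result in checks:
        print(result, "->", monitor.record_health_check(result))
```

Requiring several consecutive failures avoids failing over on a single dropped check, at the cost of a slightly slower reaction to a real outage.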

Disaster recovery as a service (DRaaS)

DRaaS is the commercially available version of cloud DR. In DRaaS, a third party provides replication and hosting of an organization's physical and virtual servers. The provider assumes responsibility for implementing the DR plan when a crisis arises, based on a service-level agreement.

3.8       Disaster recovery services and vendors

Disaster recovery vendors can take many forms, as DR is more than just an IT issue. DR vendors include those selling backup and recovery software as well as those offering hosted or managed services. Because DR is also an element of organizational risk management, some vendors couple disaster recovery with other aspects of security planning, such as incident response and emergency planning. Options include:

Choosing the best option for an organization will ultimately depend on top-level business continuity plans and data protection goals, and which option best meets those needs along with budgetary goals.

Some of the major disaster recovery software and DRaaS providers include, but are not limited to:

Emergency communication vendors are also a key part of the recovery process, and include Everbridge Crisis Management, Cisco, Rave Alert, AlertMedia and BlackBerry AtHoc.

While some organizations may find it a challenge to invest in comprehensive disaster recovery planning, none can afford to ignore the concept when planning for long-term growth and sustainability. Additionally, if the worst were to happen, organizations that have prioritized DR will experience less downtime and be able to resume normal operations faster.

Conclusion

Disasters are disruptive events that are bound to occur, and organizations need to develop strategies to withstand their effects. IT organizations require a strategy or recovery plan that can help them rebound in the face of disasters. Disaster recovery plans (DRPs) are effective in helping organizations manage and bridge disasters. They should be developed around the organization’s business continuity needs and should be reviewed frequently, since business continuity needs evolve over time.

 

DISASTER RECOVERY

BY

(ENGR.) O. O. OWODOLU, MNSE, MAES, AMIRTE (UK), FNISET.