NWC Print
July 2008

Lateral View


 Dealing with Disaster

An eight step plan to fortify your data against disaster

 By Devendra Parulekar,
 Associate Director,
 Technology and Security Risk Services,
 Ernst & Young

In March 2000, a freak lightning bolt caused a fire at a factory of a leading RF chip company in the US. While the fire was put out in approximately ten minutes, it still impacted the production of radio frequency chips that were used by two leading mobile handset manufacturers.
The first handset manufacturer kicked its disaster recovery plan into high gear. It scoured the market for alternate suppliers and patched together a solution to ensure that its handset production did not suffer. This company not only met its production goals but also increased its market share from 27% to 30%. The other manufacturer, by contrast, did not realize the implications of the incident and was unable to mobilize alternate suppliers. In the end, it suffered losses close to a billion dollars that year and had to be rescued by linking with Sony to sell its handsets.
Although both the mobile handset manufacturers were hit by the same disaster, one company increased its market share while the other wound up losing significant parts of its business. This case clearly highlights the importance of Business Continuity and Disaster Recovery Planning.
Organizations these days are heavily dependent on their IT infrastructure for the majority of their business transactions. The continued availability and readiness of these systems are no longer optional or discretionary but vital.
A disaster may strike anytime, anywhere and often without warning. It is critical that organizations understand the degree of damage a disaster can cause to their business. The damage could include loss of revenue, customer faith and productivity. Gartner estimates that two out of five enterprises that experience a disaster go out of business within five years.
While the intensity or timing of a disaster cannot be predicted, Disaster Recovery planning helps organizations be better prepared by defining strategies and options for dealing with disaster scenarios.

 

Initial Planning & Groundwork
Even before an organization can come up with a Disaster Recovery Plan, it needs to have a strong Information Security Program. A well defined Information Security Program will consist of comprehensive policies and procedures that are uniformly implemented across the organization. It will involve regular reviews and audits of the organization’s Information Security posture as well as programs to increase awareness amongst the employees.
In this article, we assume that the foundation of an Information Security Program is already in place and that the Disaster Recovery Plan will be built on it.

Safety of Employees & Communication during a Disaster
In any disaster scenario, the organization's first concern should be to save human lives and reduce the possibility of physical harm to its employees. It should have a team dedicated to coordinating safety activities and overseeing communication with both internal employees and external entities such as the media. The procedures for building evacuation and an evacuation assembly area should be defined in advance.
In addition, the organization should clearly define notification procedures by which employees can alert it to any potentially disruptive events. Numbers and contact details of emergency personnel should be distributed to the employees. Recovery teams should have contact details of agencies such as the fire brigade, police and hospitals, as well as the organization's various hardware and software vendors.

Performing Initial Assessments
One of the activities that will need to be performed even before the disaster strikes is the Business Impact Assessment (BIA). The BIA is a quantitative and qualitative assessment of the financial and operational impacts which could occur if the organization were to be unable to perform business processes and support services.
As a part of this activity, the organization will also define the maximum period of time for which an interruption can be tolerated before it causes significant loss to the organization. This time period is known as the Recovery Time Objective (RTO). Defining the financial and operational impacts as well as the RTOs helps the organization identify its critical business processes and establish the relative priority of recovery activities.
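As a rough illustration of how RTOs can drive recovery priority, the sketch below sorts business processes so that the shortest RTO is recovered first, with ties broken by the larger estimated hourly loss. The process names, figures and the tie-breaking rule are hypothetical assumptions made for this example; they are not taken from any real BIA.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    rto_hours: float      # maximum tolerable interruption, in hours
    loss_per_hour: float  # estimated financial impact per hour of outage


def recovery_order(processes):
    """Shortest RTO first; ties broken by the highest hourly loss."""
    return sorted(processes, key=lambda p: (p.rto_hours, -p.loss_per_hour))


# Hypothetical example data, for illustration only.
processes = [
    BusinessProcess("payroll", rto_hours=72, loss_per_hour=500),
    BusinessProcess("order entry", rto_hours=4, loss_per_hour=20000),
    BusinessProcess("email", rto_hours=24, loss_per_hour=1000),
    BusinessProcess("customer support", rto_hours=4, loss_per_hour=5000),
]

ordered_names = [p.name for p in recovery_order(processes)]
print(ordered_names)
```

A real BIA would also weigh qualitative impacts such as reputation or regulatory exposure, which a single sort key cannot capture.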
In addition to the BIA, the organization needs to perform a Threat and Risk assessment. As a part of this assessment, the organization will identify the possible events it needs to respond to by defining the possible threats and vulnerabilities. Each of these threats and vulnerabilities will be rated to calculate a comprehensive risk measure and define the mitigation strategies.
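The article does not prescribe a particular rating scheme. One common convention, assumed here purely for illustration, is to rate each threat's likelihood and impact on a 1 to 5 scale and multiply the two. The threat names, ratings and the "high"/"medium"/"low" cut-offs below are hypothetical.

```python
def risk_score(likelihood, impact):
    """Multiply likelihood by impact (both rated 1-5) to get a score of 1-25."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    return likelihood * impact


def risk_level(score):
    """Map a score to a coarse level; the cut-offs are illustrative assumptions."""
    if score >= 15:
        return "high"
    if score >= 8:
        return "medium"
    return "low"


# Hypothetical threats as (likelihood, impact) pairs.
threats = {
    "fire": (2, 5),
    "power outage": (4, 3),
    "flood": (1, 4),
}

# Build a risk register, highest score first.
register = sorted(
    ((name, risk_score(l, i)) for name, (l, i) in threats.items()),
    key=lambda item: -item[1],
)

for name, score in register:
    print(f"{name}: score {score} ({risk_level(score)})")
```

Ranking threats this way gives a starting point for deciding which mitigation strategies to fund first.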

Disaster Prevention Strategies
In many cases, the outage impact identified via the BIA may be mitigated or eliminated through preventive measures. The organization should investigate the following options as a part of its Disaster Prevention Strategy:

  • Disk Redundancy: Redundant Array of Independent Disks (RAID) uses multiple disk devices to store redundant data that can be used to retrieve the original content in case of a hardware failure. RAID can be implemented both in hardware and in software. Software implementations of RAID, however, come with a performance overhead and are not recommended for mission critical applications.
  • Server Redundancy: Clustering is a powerful way of providing failover support. If one member of the cluster fails, the remaining members ensure that business operations remain unaffected. Based on its requirements, the organization can choose either an Active-Active cluster or an Active-Passive cluster. The diagram below shows a sample clustering option.
  • Network Redundancy: The organization should actively identify and eliminate single points of failure to increase system reliability and robustness. The figure below shows a sample network architecture with redundancy built in.

In some cases, the inevitable will happen and the organization will face a situation in which its business processes are interrupted. At such times, it will need to migrate its processes to an alternate DR site.
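To make the disk redundancy idea above concrete, the sketch below shows the XOR parity principle used by parity-based RAID levels such as RAID 5: a parity block is computed across the data blocks, and any single lost block can be rebuilt from the survivors. This is a toy model working on short byte strings, not a real RAID implementation, and the block contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" holding one block each (hypothetical contents).
data_blocks = [b"DISK", b"DATA", b"SAFE"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk: rebuild its block from the survivors.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, the same function computes the parity and performs the rebuild. The scheme tolerates the loss of only one block per stripe.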

 

Setting up an Alternate Data Center
If the organization faces a situation where it suffers from a loss of its primary data center for a period greater than the defined RTOs, it will need to consider the use of an alternate data center. Setting up an Alternate Data Center will require the organization to evaluate the following components:

  • Alternate Office Space: The organization can perform a city-wise analysis on the basis of parameters such as personnel and rental costs, air / rail / road connectivity, availability of bandwidth and seismic zone. For the alternate office space, it could enter into reciprocal agreements, contract commercial recovery space or make an outright purchase.
  • Alternate Computing Equipment: The organization should assess its hardware requirements for the Alternate Data Center. It can either lease the equipment or purchase it outright, based on a cost-benefit analysis.
  • Alternate Data Communications Facilities: To ensure continuity, the organization will have to procure the necessary communication links. Typically, the organization will require Point to Point Links as well as Internet Leased Lines. Based on the capacity planning done by the organization, adequate bandwidth for both types of link can be procured, ideally from a different vendor.

Disaster Recovery Site Options
Regardless of the option chosen, the facility must be able to support system operations as defined in the contingency plan. The alternate site types may be categorized in terms of their operational readiness. Based on this factor, the sites may be classified as:

  • Cold Site: Consists only of a facility with the space and infrastructure to support the systems. Infrastructure may include racks, power supply, cabling and environmental controls. Hardware is procured after the disaster via service level agreements with hardware vendors or disaster recovery specialists.
  • Warm Site: Partially equipped office space that contains some or all of the required hardware and applications along with connectivity, and is prepared in advance of a disaster. The site can become operational in a very short period of time.
  • Hot Site: Fully equipped and configured with the necessary hardware, supporting infrastructure and support personnel. Typically staffed 24x7 and often used for load balancing.

The table given below summarizes the criteria that can be employed by the organization to determine the DR site option most suitable to its environment:

 

Data Replication
The organization will also need to select a Data Replication method to ensure that the primary and alternate sites are in sync. Some of the options available to the organization are diagrammatically represented below and explained in detail in the sections that follow.

  • Tier 1: Backup Tapes: While backing up to a tape library is a cost-effective and tested method, the disadvantage is that it is not possible to restore quickly from such a backup. In addition, there could be significant data loss, as backup operations are performed once a day.
  • Tier 2: Automated Electronic Vaulting: Electronic vaulting consists of transmitting backup data over the network to tape devices located at the alternate Data Center. The disadvantage of this option is that, like the Tier 1 option, it reflects the data at a specific point in time, and any changes to the data after that point will be lost when the backup is restored.
  • Tier 3: Asynchronous Replication: Asynchronous replication is a process of duplicating primary data volumes over an IP connection to a storage subsystem at an alternate location. The source and target of a replication are usually separated by a significant distance to safeguard data from disasters that affect a specific geographic location, such as a region-wide power outage. This option provides a high level of protection against data loss but results in higher costs.
  • Tier 4: Synchronous Replication: Synchronous replication is a process of real time duplication of primary data volumes over an IP connection to a storage subsystem at an alternate location. This option, sometimes also known as mirroring, ensures that the primary site and alternate site data are always in sync. While this ensures there is no data loss, this option may require changes to applications and is the costliest of the options discussed here.

Based on the RTOs, the organization can combine some of these options to form an ideal mix for its recovery strategy.
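The difference between the asynchronous and synchronous options above can be sketched with a toy in-memory model. In synchronous mode every write reaches the replica before it is acknowledged; in asynchronous mode writes are queued and shipped later, so anything still queued when the primary fails is lost. The class and method names are hypothetical and no real storage product or network is modelled.

```python
class ReplicatedVolume:
    """Toy model of a primary volume with one replica at an alternate site."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []  # writes not yet shipped to the replica (async only)

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # The write is acknowledged only after the replica has it.
            self.replica.append(record)
        else:
            # The write is acknowledged immediately and shipped later.
            self.pending.append(record)

    def ship_pending(self):
        """Asynchronous transfer of queued writes to the replica."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def lost_on_primary_failure(self):
        """Records that exist only at the primary site."""
        return list(self.pending)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)

for vol in (sync_vol, async_vol):
    vol.write("txn-1")
    vol.write("txn-2")

async_vol.ship_pending()  # the replica catches up

for vol in (sync_vol, async_vol):
    vol.write("txn-3")    # the primary fails right after this write

print(sync_vol.lost_on_primary_failure())
print(async_vol.lost_on_primary_failure())
```

The model ignores the cost side of the trade-off: in a real deployment, synchronous writes must wait for the remote acknowledgement, which adds latency to every transaction.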

 

Defining the Disaster Recovery Plan
The development of an effective Disaster Recovery Plan affords an organization the ability to respond rapidly and effectively to an adverse business scenario. It documents the steps involved in identifying the crisis, planning a response to the crisis and resolving the crisis. In addition, this plan defines the DR team structure and provides a functional overview of the DR team. The plan should ideally consist of the following sections:

  • DR Team: The DR organization should comprise a series of functional teams, each focused on a specific recovery-related task. The team composition, contact details, roles and responsibilities should be clearly defined as a part of the DR plan.
  • Emergency Operations Center: An Emergency Operations Center (EOC) allows a company’s management to reestablish organizational leadership, allocate resources, and focus on emergency containment and recovery. Ideally, the organization should identify a Primary EOC for short term contained disasters and a Secondary EOC for long term disasters that render the Primary EOC inoperable.
  • Recovery Procedures: The recovery procedures document activities that need to be performed to bring the systems back to a functional state. It should include procedures for recovery of Systems, Networks, Applications and Operations. In addition, DR activation procedures may be included in this section.

 

DR Awareness & Testing
Training and awareness regarding the activities of a Disaster Recovery Plan (DRP) are essential. This awareness is achieved through formal education and training sessions conducted on a regular basis, which ensure that the personnel responsible for maintaining the DRP are aware of the plan and understand it.
Similarly, testing is intended to verify the accuracy of the collected supporting documentation, as well as to confirm the functioning of the DRP and the alternate DR Site. Testing can and should identify vulnerabilities as well as changes to the organizational environment that require the plan to be updated. The full DRP should be tested once a year. Individual components of the DRP, such as notification, declaration of disaster and recovery procedures, should be tested once every six months.
A disaster, typically, is unexpected and disruptive. As such, its occurrence can cause panic and confusion. While a disaster recovery plan cannot guarantee complete resumption of business operations, having such a plan tips the odds in favor of survival and recovery. Having trained staff follow a predetermined course of action will significantly reduce the amount of time lost and therefore increase the possibility of salvaging the maximum amount of material.
