Business Continuity in an Age of Terror
By Eldad Galker.
September, 2001
The author of this article is the owner and General Manager of the Chief Group, a group of companies founded in the 1980s that has dealt ever since with data survivability in computerized systems.
One of the highest goals in computing is service continuity (business continuity). The money and other resources now invested in backup, survival and recovery systems are enormous. According to IDC estimates, based on an analysis of the yearly sales revenues of the companies supplying backup and recovery solutions, yearly investment in this market segment exceeds $2.7 billion and will exceed $4.7 billion in 2005. The recent terrorist attacks in the USA (9/11) will certainly affect those estimates in both the short and the long term.
The fact that Morgan Stanley, whose offices were located in the New York World Trade Center, managed to return to operation in less than 24 hours is purely the result of having built a suitable preventive backup system, designed according to a correct risk assessment. That risk assessment was based on understanding the need to prepare an alternative backup infrastructure for the possible case of a computer systems collapse.
Although Morgan Stanley’s running monthly expenses on those backup systems exceed $100,000, it is possible to find suitable solutions for the more modest budgets of medium-sized and even small companies.
To understand how an accurate risk assessment should be done, one must note the difference between the computer technician’s understanding of the computer system and the end user’s understanding of the computer, and also between those in general and the term “system” in particular.
The computer in itself is a tool, and as such its mission is to aid in the creation, storage, finding and quick retrieval of information when needed.
The computer, like a simple pencil, makes it possible to transfer ideas, thoughts, data and general information from the human mind to a written medium. Like the pencil, it allows the written data to be erased and rewritten. The computer, its operating system and the software installed in it have no importance by themselves without a human being transferring his or her thoughts, exactly as the pencil has no importance by itself.
Continuing the analogy, the computer allows documents to be filed as in a bookcase, in different file holders, and sorted by name, date of creation, etc., so that the files can be located and opened when needed. The computer does so in different ways through the operating system, “adaptors of hardware components” (drivers), and specific software programs.
For the computer technician, the computer system is worthless until the central processing unit (CPU), which determines the capacities and performance of the system, has been installed. The technician can configure the computer according to his experience and understanding, through changes to the BIOS settings. In the eyes of the systems technician, the computer hardware is worthless until the operating system and the utility software have been installed. Those, as well, can be adapted to different needs by changes to the configuration files or the Registry.
In each case, computer technicians assume that the computer’s purpose is to function properly, and they invest the best of their efforts to achieve this.
Technicians believe that the highest goal is the correct installation of the hardware, the operating system and the utility software, so that all the hardware components function optimally without interfering with each other. Beyond this, they rarely show additional interest in the system. For the technicians, “system” means a hardware complex with a sound and properly working operating system, which allows them to apply their judgment and experience to affect performance through the changes they introduce.
In contrast, for the end user, “system” means being able to click on an icon with the mouse and thereby get instant access to the mechanism through which they can transfer information from their mind to the databases in the computer, or quickly retrieve it when needed. A smoothly working system is the basis for all their computer-related activities. Since the one and only purpose of the computer is to let the user transfer information to and from the system, any system that complies with all the standards and conditions set by the technical experts but does not allow the end user to handle it according to his needs, experience, knowledge or skills is definitely neither a sound nor an effective system.
Computer systems are binary systems, always composed of two components: SYSTEM and DATA. Every system by itself allows the user, through the user interface, to define, modify or process data created or collected by human beings.
When examining how huge databases are handled, even worse confusion can be found: in the eyes of the system technicians responsible for the backup of the database, the whole database represents DATA, which has to be backed up. They see no difference between the database system and the data stored in it, which was created by the users. Therefore, they sometimes cause deficient service or even service interruptions for their customers: the end users.
The existing backup and recovery methods also distinguish between the treatment of the data and the treatment of the systems.
Backup Solutions Versus Fault Tolerant Solutions
To protect the system, methods called Fault Tolerance are used. These methods allow the system to keep functioning even after a hardware error or even a system failure.
Fault Tolerance solutions are neither expected nor able to cope with software errors or data problems. Among the more popular solutions of this kind are RAID and mirroring. With these methods, if one of the hard drives in the system stops working for any reason, the system’s function and service go on unaffected. The system generally cannot cope with the collapse of more than one hard drive at a time. Systems of this kind are completely “indifferent” to the type and content of the data stored in them: the data can be totally deleted, infected with viruses or scrambled in any possible way without any alert or protection from the computer system.
To protect data, the Backup method is used. Backup means keeping a copy of the previous information in a different way. “A different way” means transferring the information to another location, another computer, another hard drive, or another kind of media such as magnetic tape or optical media, or even printing it on paper. The more historical versions of the information that are kept, the better the chances of retrieval and the quality of the retrieved information.
We meet organizations daily in which damaged data has been backed up with no awareness of the damage that made the data useless. Keeping a number of historical versions of the data can often help reconstruct the desired information.
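As a minimal sketch of what keeping historical versions can look like, the following Python fragment copies a file into a backup folder as a new numbered generation and keeps only the newest few. The function name backup_with_versions, the four-digit numbering scheme and the retention count are assumptions made for this illustration; they do not describe any particular backup product.

```python
import glob
import shutil
import tempfile
from pathlib import Path


def backup_with_versions(source: Path, backup_dir: Path, keep: int = 5) -> Path:
    """Copy `source` into `backup_dir` as a new numbered generation.

    Generations are named <name>.0001, <name>.0002, ... and only the
    newest `keep` generations are retained (hypothetical scheme).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)
    pattern = glob.escape(source.name) + ".[0-9][0-9][0-9][0-9]"
    existing = sorted(backup_dir.glob(pattern))
    next_number = int(existing[-1].suffix[1:]) + 1 if existing else 1
    target = backup_dir / f"{source.name}.{next_number:04d}"
    shutil.copy2(source, target)
    # Prune the oldest generations beyond the retention limit.
    for old in (existing + [target])[:-keep]:
        old.unlink()
    return target


def demo() -> list[str]:
    """Back up four successive states of one file, keeping three generations."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "report.txt"
        backups = Path(tmp) / "backups"
        for text in ["draft 1", "draft 2", "corrupted", "corrupted again"]:
            source.write_text(text)
            backup_with_versions(source, backups, keep=3)
        return [p.read_text() for p in sorted(backups.iterdir())]
```

In the demo, the two most recent generations already contain damaged content, yet an older sound generation ("draft 2") is still available. With a single overwritten backup copy, only the damaged state would remain.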
The difference between these two methods is clear and sharp. The survival of the system is ensured only by Fault Tolerant solutions, while the data is protected only by Backup solutions. It is of course possible to defend data from damage through Fault Tolerant solutions. However, we must be aware that this kind of solution protects the availability of the data, but neither its contents nor its validity. It is also possible, of course, to protect the system through backup solutions, as long as we are aware that this kind of method only allows the definition files to be retrieved onto a sound system.
Protecting data by means of a fault tolerant solution is not effective, because any damage to the original data instantly affects the alternative copy in the same way. As said before, fault tolerant solutions are completely “indifferent” to the data content, and any act performed on the data, such as modification or complete deletion, is perfectly legal as far as this type of solution is concerned.
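The point can be illustrated with a toy model. The sketch below is not a real RAID or backup implementation; the classes MirroredStore and VersionedBackup are hypothetical and exist only to show that a mirror faithfully replicates a bad write to both disks, while a backup generation taken earlier still holds the sound data.

```python
class MirroredStore:
    """Toy two-disk mirror: every write lands on every working disk."""

    def __init__(self) -> None:
        self.disks = [{}, {}]  # None marks a failed disk

    def write(self, name: str, content: str) -> None:
        for disk in self.disks:
            if disk is not None:
                disk[name] = content

    def fail_disk(self, index: int) -> None:
        self.disks[index] = None  # simulated hardware failure

    def read(self, name: str) -> str:
        for disk in self.disks:
            if disk is not None:
                return disk[name]
        raise IOError("all disks failed")


class VersionedBackup:
    """Toy backup that keeps every generation it is asked to take."""

    def __init__(self) -> None:
        self.generations = []

    def snapshot(self, store: MirroredStore, names: list) -> None:
        self.generations.append({name: store.read(name) for name in names})

    def restore(self, name: str, generations_back: int = 1) -> str:
        return self.generations[-generations_back][name]


def demo():
    store, backup = MirroredStore(), VersionedBackup()
    store.write("ledger", "good data")
    backup.snapshot(store, ["ledger"])

    store.write("ledger", "scrambled")           # bad write hits both disks
    mirror_copies = [disk["ledger"] for disk in store.disks]

    store.fail_disk(0)                           # service survives the failure...
    served = store.read("ledger")                # ...but serves the damaged data

    restored = backup.restore("ledger")          # sound data from the backup
    return mirror_copies, served, restored
```

The mirror keeps the service available after the disk failure, but what it serves is the scrambled content; only the separate backup generation returns the original data.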
Protecting the system by means of a backup solution is ineffective to the same degree, given that a backup procedure means copying the files and the information. Copies of this kind do not make it possible to boot the system after a crash; the files can be retrieved from the backup only after reinstalling the whole system anew and installing the proper backup software.
Fault Tolerant solutions run mostly on hard drives. Additional solutions provide alternative computer systems on line, and even alternative locations containing everything needed to continue corporate operation. Lately, some manufacturers have made it possible to actively back up the system to magnetic tape in a way that allows it to be retrieved without an operating system or any backup and recovery software, similar to the method still used in minicomputers.
Backup solutions were traditionally implemented on magnetic tapes, which can be carried out of the backup site or placed in safes.
These backups are available and convenient, but not always reliable. Therefore, they have to be created according to strict work procedures of regular tape replacement, tape head cleaning, quality tests of the data on tape, and optimal storage conditions. These solutions provide a relatively simple backup, while the retrieval process is often slow, complicated or inconvenient.
Data can also be backed up to floppy disks, CD or DVD. In all these options the data volume is an obstacle. Lately, some manufacturers of backup systems have also made it possible to use hard drives as part of the backup process, but not always as part of the retrieval process.
Combined solutions exist which offer the virtues of all methods, such as RAIT (RAID on tapes), which records to a number of tapes simultaneously and thus significantly increases both the tape read/write speed and the survival chances of the data.
In database backup the considerations have to be the same as in the backup of servers: the database system has to be backed up to a sound and functional copy which can be activated immediately in case of damage, as in Fault Tolerance. Separately, the data accumulated in it has to be preventively backed up for the probable case in which the database has to be rebuilt and the last data version retrieved.
Data backups should preferably be kept in the simplest possible form for data recovery or rescue. Keeping backed up data compressed, encrypted, or in a non-standard format complicates, delays and raises the cost of the whole recovery process.
Down Time 
When taking into consideration the corporate backup processes, there are a number of critical factors affecting the corporate decisions related to those processes:
    1. DT: Down Time, or period of time (in hours) of expected service interruption.
    2. DT$: All-inclusive cost of each DT hour.*
    3. T: Expected time (in hours) elapsed between consecutive DT events.
    4. Q: Quality of the retrieved data (percentage retrieved from the lost data) after the service interruption.
    5. $: All-inclusive cost per hour (derived from the yearly cost) of the corporate data survival protective measures.
*According to the Contingency Planning Research (http://www.contingencyplanningresearch.com) survey for the year 2001, 46% of the companies reported that the cost of one DT hour can reach up to $50K, and 28% of the companies estimated the same cost at between $51K and $250K. The survey also shows that 40% of the companies are not in existential danger within the range of 72 DT hours, and 21% of them within 48 DT hours.
The optimal monetary investment ($) should increase the time (T) elapsed between DT events, decrease the DT value, and increase the quality (Q) of the data available at the end of each DT event. It should do so at minimal cost relative to the organization’s needs and its degree of flexibility on each of these parameters, as determined by a cost-effectiveness analysis of the financial investment needed to prevent the direct and indirect damage caused by a probable service interruption.
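One possible way to put the five factors into a single comparable number is sketched below. The formula is an assumption made for this illustration, not a method stated in the article: the cost of one DT event (DT times DT$, plus a hypothetical cost for the share of data that is not retrieved) is spread over the T hours between events and added to the hourly protection cost $. All figures in the two example plans are invented.

```python
from dataclasses import dataclass


@dataclass
class ProtectionPlan:
    name: str
    dt_hours: float                  # DT: hours of service interruption per event
    dt_cost_per_hour: float          # DT$: all-inclusive cost of one DT hour
    t_hours: float                   # T: hours between consecutive DT events
    q_percent: float                 # Q: percentage of lost data retrieved
    protection_cost_per_hour: float  # $: hourly cost of protective measures
    full_loss_cost: float = 0.0      # hypothetical cost of an event with Q = 0

    def hourly_cost(self) -> float:
        """Protection cost plus the event cost averaged over T (assumed model)."""
        event_cost = self.dt_hours * self.dt_cost_per_hour
        event_cost += (1 - self.q_percent / 100) * self.full_loss_cost
        return self.protection_cost_per_hour + event_cost / self.t_hours


HOURS_IN_4_YEARS = 4 * 365 * 24  # T for one crash in about four years

# Invented figures, for illustration only.
unprotected = ProtectionPlan("no backup", dt_hours=72, dt_cost_per_hour=50_000,
                             t_hours=HOURS_IN_4_YEARS, q_percent=0,
                             protection_cost_per_hour=0, full_loss_cost=1_000_000)
daily_tape = ProtectionPlan("daily tape", dt_hours=8, dt_cost_per_hour=50_000,
                            t_hours=HOURS_IN_4_YEARS, q_percent=90,
                            protection_cost_per_hour=5, full_loss_cost=1_000_000)

if __name__ == "__main__":
    for plan in (unprotected, daily_tape):
        print(f"{plan.name}: {plan.hourly_cost():.2f} per hour")
```

Under these invented figures the protected plan costs less per hour than the unprotected one, even though its $ factor is higher. The value of the sketch is the comparison between plans, not the absolute numbers.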
Theoretically, an ideal situation in which no computer system mishap ever happens is impossible to reach. Expressed mathematically, when Q=100, DT=0, and T=∞, then $=∞. Therefore, an ideal solution is a utopia, and only an optimally focused solution is practical. Optimal solutions do not always require financial investment. Mostly they require, first of all, serious thought and attention to the users’ needs, in order to find the issues on which it is possible to compromise and so reach optimal performance.
Examples:
        A. An isolated RAID system with no backups, in which a single hard drive has collapsed. The system does not interrupt the supply of services, so DT=0 and Q=100%. In the same system, when two hard drives happen to collapse and the interruption of services is total, DT=∞ (“infinite”) and Q=0. The only way out of such a situation is to submit the system to a data recovery laboratory in order to decrease DT and increase Q. In systems of this kind, it is compulsory to add backup to magnetic tapes.
        B. Assume that the average lifetime of a computer system is about 4 years (between crashes), and that the organization can tolerate DT=1 hour once in 4 years and a Q equal to the last sound data backup, up to 24 hours old. In such a case, a crash would make the organization lose all the new data created or updated during the last workday. The proper advice here is to install a tape backup system to back up the important data daily. The restrictions, however, will be:
     1. No individual tape will be used more than 20 times, and in any case it will not stay in rotation for over 6 months.
     2. Tape heads will be cleaned once every 20 backup sessions, and at least once a month.
     3. The backed up data has to undergo a sample recovery test from each tape at least once in the lifetime of the tape, etc.
     4. Once a quarter, a system crash and recovery simulation test will be carried out.
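The first two restrictions lend themselves to an automatic check. The sketch below is only an illustration under stated assumptions: “6 months” is approximated as 183 days and “once a month” as 31 days, and the function names are hypothetical.

```python
from datetime import date, timedelta

MAX_TAPE_USES = 20
MAX_TAPE_AGE = timedelta(days=183)             # "6 months", approximated
MAX_SESSIONS_BETWEEN_CLEANINGS = 20
MAX_TIME_BETWEEN_CLEANINGS = timedelta(days=31)  # "once a month", approximated


def tape_must_retire(uses: int, first_used: date, today: date) -> bool:
    """Rule 1: retire after 20 uses or after 6 months in rotation."""
    return uses >= MAX_TAPE_USES or today - first_used > MAX_TAPE_AGE


def heads_need_cleaning(sessions_since_cleaning: int,
                        last_cleaned: date, today: date) -> bool:
    """Rule 2: clean every 20 sessions, and at least once a month."""
    return (sessions_since_cleaning >= MAX_SESSIONS_BETWEEN_CLEANINGS
            or today - last_cleaned >= MAX_TIME_BETWEEN_CLEANINGS)


if __name__ == "__main__":
    today = date(2001, 9, 1)
    print(tape_must_retire(uses=12, first_used=date(2001, 1, 15), today=today))
    print(heads_need_cleaning(sessions_since_cleaning=4,
                              last_cleaned=date(2001, 8, 20), today=today))
```

A check of this kind could run before each backup session, so that an overused tape or an overdue head cleaning is flagged before the backup depends on it.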
The most common solution to decrease DT and increase Q is the addition of magnetic tape for data backup. This increases the $ factor. In this situation DT equals a number of hours, and Q falls short by the data created or updated since the last sound backup session.
An additional solution to the RAID problem is to install two RAID 5 arrays that mirror each other (referred to here as RAID 10, although this nested layout is more commonly called RAID 5+1 or RAID 51). This solution doubles the expense, 2X$, but decreases DT to 0 while raising Q to 100%. Yet this holds only as long as hard drives in both arrays have not collapsed simultaneously beyond what each array can tolerate. In such a case, which we have already witnessed, the situation is the same as in the former structure.
In organizations where a 24-hour-old Q represents real direct or indirect damage and the accumulated loss of work hours is relatively high, much more creative combinations have to be considered, such as systems that combine the backup and security elements without overloading the data traffic in the organization.
The first element to identify in the organizational risk analysis is the hourly price of service interruption for each of the systems.
Personal Down Time (PDT)
In any survival and recovery process, one of the sometimes unnoticed components is a concept for which we propose a new term here: Personal Down Time (PDT). The term refers to the time lost by one single corporate user. For instance, an employee who invests great effort in creating a file and later deletes or scrambles it by mistake incurs PDT equal to the time needed to redo all the work, plus the hours by which his other tasks are delayed.
The damage seems insignificant, but closer analysis reveals that, on average, one out of 100 employees in a typical organization suffers 3 hours of PDT at least once a day. In an organization of 100 employees working 23 days a month, the time lost would be 69 hours a month, or 8.6 workdays, which represent 37% of a full-time job. In an organization with 266 employees, this means paying a full salary to one extra (virtual) employee called PDT. This statement can be easily checked. Hasn’t it happened to every one of us to load a file, change it and then save it with “Save” instead of “Save As”, and later need many hours to rewrite the original file? Doesn’t this kind of insignificant human error happen to us at least two or three times a year (approximately once in 100 days)?
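The arithmetic behind these figures can be reproduced in a few lines. The sketch below assumes an 8-hour workday, which the article does not state explicitly but which is implied by 69 hours being about 8.6 workdays; the function names are hypothetical.

```python
def pdt_hours_per_month(employees: int,
                        incidents_per_100_employee_days: float = 1.0,
                        hours_per_incident: float = 3.0,
                        workdays_per_month: int = 23) -> float:
    """Monthly hours lost to PDT across the whole organization."""
    incidents_per_day = employees * incidents_per_100_employee_days / 100
    return incidents_per_day * hours_per_incident * workdays_per_month


def pdt_share_of_full_job(employees: int,
                          hours_per_workday: float = 8.0,  # assumed
                          workdays_per_month: int = 23) -> float:
    """PDT hours as a fraction of one full-time job."""
    lost = pdt_hours_per_month(employees, workdays_per_month=workdays_per_month)
    return lost / (hours_per_workday * workdays_per_month)


if __name__ == "__main__":
    hours = pdt_hours_per_month(100)
    share = pdt_share_of_full_job(100)
    print(f"hours lost per month:      {hours:.1f}")
    print(f"workdays lost per month:   {hours / 8:.3f}")
    print(f"share of a full-time job:  {share:.1%}")
    print(f"employees per virtual job: {100 / share:.1f}")
```

With the article’s inputs this yields 69 hours a month, 8.625 eight-hour workdays and 37.5% of a full-time job, so one virtual salary is reached at roughly 267 employees. The text rounds these down to 8.6, 37% and 266.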
When the cost of that cumulative harm is compared with the damage caused by a DT event on the main server once in four years, it turns out that the PDT damage accumulated during those 4 years is many times more significant. To our surprise, we see that corporate investment in PDT prevention, or in improving Q at recovery from PDT, is close to 0.
In addition, we clearly see that conventional backup solutions are not designed to solve the PDT problem. Usually, recovering a single lost file from tape takes almost as long as recreating that file, if not longer.
It should be emphasized that this cost calculation does not take into account the total loss of creative work, or delays in scheduled projects or in service supply, all of which might be caused by PDT.
In organizations in which the creative work component is high, such as software houses, graphics departments, law offices or accounting offices, the loss of one day’s work can cause average cumulative damage of three days. In organizations of this kind it is compulsory to install backup solutions that copy the data every 30 to 60 minutes, in order to reduce the damage to a tolerable minimum.
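The effect of the backup interval can be estimated roughly as follows. The sketch assumes that the damage scales linearly with the amount of work lost, at the three-to-one ratio quoted above, and that a workday has 8 hours; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def worst_case_damage_hours(backup_interval_work_hours: float,
                            damage_multiplier: float = 3.0) -> float:
    """Upper bound on damage, in work hours, if everything since the
    last backup is lost (assumed linear model)."""
    return backup_interval_work_hours * damage_multiplier


if __name__ == "__main__":
    intervals = [
        ("daily backup (8-hour workday)", 8.0),
        ("backup every 60 minutes", 1.0),
        ("backup every 30 minutes", 0.5),
    ]
    for label, interval in intervals:
        damage = worst_case_damage_hours(interval)
        print(f"{label}: up to {damage:.1f} work hours of damage")
```

Under these assumptions, moving from a daily backup to a copy every 30 to 60 minutes reduces the worst case from three workdays of damage to between one and a half and three hours.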
Given that tape backup solutions are not designed to provide backup sessions as frequent as needed, the required solution in most organizations is backup of data versions on hard drives. The cost of hard drives decreases continuously while their available capacity continuously increases. Their write and read speed is higher than that of any other comparable storage method, and the location and extraction of data are immediate.
In conclusion:
Backup is not our highest goal, in the same way that a completely sound computer system that is unavailable to the user is not. Backup is only an aid to service continuity. To provide service continuity it is mandatory to incorporate quick recovery solutions, and not necessarily quick backup solutions.
Backup and survival solutions must suit the organization according to its needs, capacity, and reasonable threats to service continuity. Survival arrays have to fit the technical skills of the computer maintenance staff and the agreed work regulations within the organization. Work regulations that are correct but that the responsible computer staff cannot fully carry out are unacceptable. In such a case, the regulations have to be reconsidered and matched to what is attainable.
In addition to the various modern threats to service continuity such as system crash, backup system crash and power supply failures, the different terrorism threats must today be added to the list of reasonable, possible and probable threats and included in the corporate risk analysis.
For a small organization which can build a small alternative system in an alternative location (in some cases the owner’s own home), it is recommended to perform a complete backup to tape every day and to restore, daily, the full content of the tape to the alternative system.
By this method, the tape condition and the quality of the backed up data are tested daily, while an intact, up-to-date alternative system is kept ready for service in case the original computer system at the company’s premises is down or missing.
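The daily restore doubles as a test only if the restored content is actually compared with the original. A minimal sketch of such a comparison follows, assuming both systems are reachable as directory trees; the function names are hypothetical, and the copy step in the demo merely stands in for the real tape backup and restore.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each file path relative to `root` to the SHA-256 of its content."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(original: Path, restored: Path) -> list:
    """Return relative paths that are missing, extra, or different."""
    source, copy = tree_digests(original), tree_digests(restored)
    return sorted(name for name in source.keys() | copy.keys()
                  if source.get(name) != copy.get(name))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "original"
        restored = Path(tmp) / "restored"
        (original / "sub").mkdir(parents=True)
        (original / "a.txt").write_text("alpha")
        (original / "sub" / "b.txt").write_text("beta")
        shutil.copytree(original, restored)  # stands in for backup + restore
        clean = verify_restore(original, restored)
        (restored / "sub" / "b.txt").write_text("bet")  # simulated damage
        damaged = verify_restore(original, restored)
        return clean, damaged
```

An empty result means the alternative system holds an exact copy; any listed path points to a file that the tape or the restore process damaged, lost or added.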
Simple and relatively inexpensive solutions of this kind are also used, although at a different scale, in bigger, more branched and complex organizations.
Under the work conditions of modern organizations, in which the backup time window continuously shrinks and the importance of recovery quality continuously grows, matching backup and recovery systems have to be built that make possible high-frequency data backup, every few hours or even minutes, and immediate recovery of the different data generations, to and from hard drives. In addition, the data has to be backed up to magnetic tapes that are taken off site or placed in safes, to protect against fire, robbery, or any other kind of total damage. This combination is, today, the only complete solution that allows focused or complete data recovery according to needs, and at low cost.
Comments to this article will be welcome at eldad@chief-group.com
A backup and recovery solution for part of the problems presented in this article can be freely downloaded from the Chief Group site http://www.bos.co.il
More sites of interest are:
http://www.contingencyplanningresearch.com
http://www.infosecuritymag.com/articles/may00/departments1_note.shtml