Analyzing system reliability is a significant
activity in the design of large systems.
While reliability engineering is a sophisticated and often esoteric field,
its foundations are quite simple and in line with common sense. A system is composed of components which may
fail. If we know the rates at which
various components are likely to fail, and the effect of particular failures or
combinations of failures on system performance, then we can estimate the
probability that the system will fail as a consequence. Reliability analysis has the distinction of
offering one solution to all problems: redundancy. To see why, consider how the same components affect reliability
in redundant and non-redundant systems.
System redundancy is analogous to series and parallel circuits. In a non-redundant system any component failure is sufficient to cause the system to fail. In other words, everything must work or the system fails. In our example, with four components that each fail with probability 0.1, this probability is:

$$P(\text{failure}) = 1 - \prod_{i=1}^{4} (1 - p_i) = 1 - (0.9)^4 \approx 0.34$$

So, the non-redundant system will fail about one out of three times.
In the redundant system, on the other hand, all the components must fail simultaneously in order for the system to fail. In this case the probability of failure becomes:

$$P(\text{failure}) = \prod_{i=1}^{4} p_i = (0.1)^4 = 10^{-4}$$

Or about one in ten thousand (the symbol $\prod$ indicates an iterated product, just as $\sum$ indicates an iterated sum).
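The two cases can be checked numerically. The sketch below assumes four independent components, each failing with probability 0.1. These values are not stated outright in the text, but they are consistent with the one-in-three and one-in-ten-thousand figures quoted above. The function names are illustrative.

```python
from math import prod


def series_failure(p_fail):
    """Non-redundant system: it fails if any single component fails."""
    return 1.0 - prod(1.0 - p for p in p_fail)


def parallel_failure(p_fail):
    """Fully redundant system: it fails only if every component fails."""
    return prod(p_fail)


# Assumed example: four independent components, each with failure probability 0.1.
components = [0.1] * 4

print(f"non-redundant: {series_failure(components):.4f}")
print(f"redundant:     {parallel_failure(components):.1e}")
```

Both functions rely on the components failing independently, which is the assumption the rest of this discussion examines.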
It is this ability to take fallible components and build systems which are either practically impervious to failure or almost certain not to work that makes reliability analysis worth the effort. You can see the effects of this reasoning in redundant circuits added to low-yield chip-making processes to obtain viable acceptance rates, in redundant air surface controls in military and conventional aircraft, and in the proliferation of safety systems in industrial processes. The key to leveraging high-reliability systems out of low-reliability components lies in actually achieving the independence of failures assumed in our redundant model. Anything that breaches this independence, such as a common cause failure (a fire, for instance), or that allows failures to propagate, such as a component which overheats and induces failures in adjacent components despite electrical isolation, can make actual reliability orders of magnitude lower. It is this feature of well-engineered manned systems that poses a dilemma. If they are well engineered, their reliability will depend largely on carefully planned redundancy. Yet if they are manned, the personnel must operate and maintain equipment across subsystem boundaries. The subsystems are therefore no longer truly independent. It is this relationship between well-engineered systems and human error that makes human reliability critical to their operation.
Non-redundant system: any component failure causes system failure.
Redundant system: any component can provide system function and prevent failure.

THERP, the Technique for Human Error Rate Prediction, was developed by Alan Swain at Sandia National Laboratories in the 1950s as a quality control method for estimating errors in the assembly of nuclear warheads. Although the full-blown methodology incorporates distributional assumptions needed for sensitivity analysis, it is exceedingly simple in point-estimate form. Types of errors, such as misreading or omitting an instructional step, or choosing the wrong switch, are presumed to occur at constant rates. If the tasks that a person performs can be broken down into subtasks for which these types of errors can be predicted, then the probability of the successful completion of the overall task can be predicted.
As with other methods for reliability analysis, the usefulness of THERP depends on what you want to predict. Where tasks are rote, there is little stress, and each step is crucial to successful completion, THERP works very well. If any of these conditions is not met, it may produce estimates that deviate substantially from actual failure rates. THERP is often criticized for its assumption that human error rates can be accurately quantified and predicted. A glance at the diagrams should suggest why this criticism is less telling than it may seem. We could double or halve the component failure rate in these models without affecting their relative differences in reliability in any substantial way (doubling the rates, for instance, gives $1.6 \times 10^{-3}$ vs. 0.59). The tabled error rates used by THERP have evolved over a 30-year period based on a combination of statistical data and expert judgement and are presumed to be accurate within an order of magnitude. When studies have been conducted to verify this assumption, estimates are usually found to be much closer, varying by factors of 2 or 3 rather than 10. The modeling assumptions typically made by human reliability analysts are another matter. As in any form of reliability analysis, the validity of results depends crucially on the modeling of dependencies. Although it pays lip service to this nostrum, human reliability analysis as a practical tool almost always presumes independence unless there is overwhelming evidence to the contrary. The structure of the event tree suggests why.
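The claim that doubling or halving the component failure rate leaves the comparison essentially unchanged can be checked with a short sketch. It assumes four identical, independent components and a baseline failure probability of 0.1, values inferred from the figures quoted in the text rather than stated in it. The function names are illustrative.

```python
def nonredundant_failure(p, n):
    """Failure probability when any one of n identical components can fail the system."""
    return 1.0 - (1.0 - p) ** n


def redundant_failure(p, n):
    """Failure probability when all n identical components must fail together."""
    return p ** n


N_COMPONENTS = 4   # assumed, inferred from the quoted figures
BASELINE = 0.1     # assumed baseline component failure probability

for factor in (0.5, 1.0, 2.0):
    p = BASELINE * factor
    print(
        f"p = {p:.2f}: "
        f"non-redundant {nonredundant_failure(p, N_COMPONENTS):.2f}, "
        f"redundant {redundant_failure(p, N_COMPONENTS):.1e}"
    )
```

Across this range the redundant system stays orders of magnitude more reliable than the non-redundant one, so an error of a factor of two in the component rates does not change which design wins.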
Although the probability that a person will successfully complete a long task is a complex conditional probability involving a myriad of possible combinations of errors, it becomes exceptionally simple if these errors just happen to be independent. When this is the case, the conditional probabilities are just the same as the simple ones. If it also just so happens that every step is necessary to the successful completion of the task, then the probability of success is simply the joint probability that every step is successfully completed, and the probability of failure is one minus that joint probability. This very special case (although with slight modifications to accommodate dependence when it can't be avoided) forms the basis of THERP's event tree methodology. Tasks are represented as a tree of constituent task steps, each of which can either be successfully completed or be unsuccessful due to error. The error branches of the tree are usually left undeveloped, resulting in a tree having a single success path with errors at each of the steps represented by undeveloped leaf nodes. The probability of successful task completion can then be found simply by taking the product of the simple probabilities along the success path.
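Written out, the success-path calculation looks as follows. The symbols are introduced here for illustration only: $n$ is the number of steps and $e_i$ is the error probability at step $i$, with all steps assumed independent and all of them necessary.

$$P(\text{success}) = \prod_{i=1}^{n} (1 - e_i), \qquad P(\text{failure}) = 1 - \prod_{i=1}^{n} (1 - e_i)$$

As a hypothetical example, a task of three steps with $e_i = 0.01$ at each step gives $P(\text{success}) = (0.99)^3 \approx 0.970$ and $P(\text{failure}) \approx 0.030$.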
The only other prominent feature of THERP lies in modeling the recovery of errors. Just as people are presumed to make errors at fixed rates, they are presumed to fail to notice they have made errors at particular rates. So, for example, if you were setting a clock with a display which blinked until a particular step was successfully completed, then even if you committed the error of omitting that step, noticing the blinking digits would provide a second opportunity to correct your error. This type of error recovery plays an important role in reliable human performance. To see why, consider an error which is made at rate a and recovered at rate 1-b. The probability of failure at that step is now reduced to a*b. These error-recovery cycles in the tree can make persisting errors extremely rare for some steps. Another feature of administrative controls rewarded by this model of error recovery is the use of checklists and subsequent verifications. If these fail at a rate of about 0.1, each additional check reduces the probability of an unrecovered error by an order of magnitude.
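The effect of stacking recovery opportunities can be sketched as below. The error rate of 0.01 is a hypothetical value chosen for illustration; the recovery failure rate of 0.1 comes from the text. Treating successive checks as independent is an assumption of this sketch.

```python
def unrecovered_error(a, b, checks=1):
    """Probability that an error made with probability `a` survives `checks`
    independent recovery opportunities, each of which is missed with probability `b`."""
    return a * b ** checks


ERROR_RATE = 0.01       # hypothetical error probability for one step
RECOVERY_FAILURE = 0.1  # probability of failing to notice the error, from the text

for checks in range(4):
    p = unrecovered_error(ERROR_RATE, RECOVERY_FAILURE, checks)
    print(f"{checks} check(s): unrecovered error probability {p:.0e}")
```

With `checks=1` this is the a*b case described above, and each further check multiplies the result by b again.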
While the THERP methodology is relatively straightforward, there are a number of structural and notational conventions which need to be fixed arbitrarily. The conventions we will follow are:
1) The largest unit of analysis will be the independent step. An independent step is the smallest action or sequence of actions which can be analyzed independently of preceding or succeeding actions. This amounts to treating actions without opportunities for recovery as independent steps and grouping together sequences of actions with a common opportunity for recovery as a single independent step.
2) Where multiple sources of error affect a single action, they should be combined within the tree in a fashion reflecting the "logic" of the task. For example, if a step involves selecting a control, the user may fail by either neglecting to perform the step (p = 0.01) or selecting the wrong control (p = 0.05). Because these events are mutually exclusive and initiation of the step is logically prior to performing it incorrectly, the probability of failure for the step "select control" should be: 0.01 (fail to initiate) + 0.99 (not fail to initiate) * 0.05 (fail to select correct control) = 0.0595.
3) By grouping actions and sequences of actions into independent steps we produce a probability tree without dependencies between steps. The probability of successfully completing the task (that is, every step) is therefore simply the joint probability of the independent steps, which is their product. The probability of failing (somewhere) is then simply 1 - p(success). While this is an exceedingly simplistic model of human reliability, it suffices for many real tasks such as programming a VCR, for which failing to set the time properly, or to activate the record feature, or to load the tape, or to select the proper channel, or any of a myriad of other discrete errors can lead to the failure to record the desired programming.
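Conventions 2 and 3 can be sketched together in a few lines. The step probabilities 0.01 and 0.05 are the ones used in the "select control" example above. The three-step task at the end is hypothetical, and the function names are illustrative rather than part of THERP.

```python
from math import prod


def step_failure(p_omit, p_wrong):
    """Failure probability for a step that may be omitted, or initiated
    and then performed incorrectly (convention 2)."""
    return p_omit + (1.0 - p_omit) * p_wrong


def task_failure(step_failures):
    """Failure probability for a task in which every independent step
    must succeed (convention 3)."""
    p_success = prod(1.0 - p for p in step_failures)
    return 1.0 - p_success


# "Select control" step from the text: omit (0.01) or pick the wrong control (0.05).
select_control = step_failure(0.01, 0.05)
print(f"select control: {select_control:.4f}")

# Hypothetical task made of three such steps, assumed independent.
print(f"three-step task: {task_failure([select_control] * 3):.4f}")
```

Keeping the per-step combination separate from the task-level product mirrors the tree: each independent step is resolved to a single failure probability before the success path is multiplied out.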
VCR Reliability Analysis Assignment
Use the THERP methodology to find the probability of successfully setting up the sample VCR (instructions provided) to record a program at some later time. Use the human error probability (HEP) values below (loosely derived from nuclear power plant HEPs) to construct your tree.
| Potential Errors | HEP |
|---|---|
| Controls | |
| Fail to select well-labeled control | 0.003 |
| Fail to select ambiguously labeled control | 0.05 |
| Set a selector switch in wrong position | 0.01 |
| Operate spring-loaded switch until proper position reached | 0.003 |
| Recovery | |
| Fail to recover error when feedback is present | 0.1 |