# Preliminaries

Analyzing system reliability is a significant activity in the design of large systems. While reliability engineering is a sophisticated and often esoteric field, its foundations are quite simple and in line with common sense. A system is composed of components which may fail. If we know the rates at which various components are likely to fail, and the effect of particular failures or combinations of failures on system performance, then we can estimate the probability that the system will fail as a consequence. Reliability analysis has the distinction of offering one solution to all problems: redundancy. To see why, consider how the same components affect reliability in redundant and non-redundant systems.

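System redundancy is analogous to series and parallel circuits. In a non-redundant system, any single component failure is sufficient to cause the system to fail. In other words, everything must work or the system fails. For $n$ independent components with failure probabilities $p_i$, this probability is:

$$P(\text{system failure}) = 1 - \prod_{i=1}^{n} (1 - p_i)$$

With four components that each fail with probability 0.1 (the values consistent with the figures quoted in this section), this gives $1 - 0.9^4 \approx 0.34$.

So, the non-redundant system will fail about one out of three times.

In the redundant system, on the other hand, all the components must fail simultaneously in order for the system to fail. In this case the probability of failure becomes:

$$P(\text{system failure}) = \prod_{i=1}^{n} p_i = 0.1^4 = 10^{-4}$$

Or about one in ten thousand. (The symbol $\prod$ indicates an iterated product, just as $\sum$ indicates an iterated sum.)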
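The two cases above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the example of four independent components that each fail with probability 0.1 is an assumption chosen because it reproduces the "one in three" and "one in ten thousand" figures quoted in the text.

```python
from math import prod


def series_failure(p_fail):
    """Non-redundant system: fails if any one component fails."""
    return 1 - prod(1 - p for p in p_fail)


def parallel_failure(p_fail):
    """Redundant system: fails only if every component fails."""
    return prod(p_fail)


# Assumed example: four independent components, each failing with p = 0.1.
components = [0.1] * 4

print(round(series_failure(components), 4))   # 0.3439
print(f"{parallel_failure(components):.1e}")  # 1.0e-04
```

Both functions take a list of per-component failure probabilities, so components with unequal failure rates can be compared in the same way.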

It is this ability to take fallible components and build systems which are either practically impervious to failure or almost certain not to work that makes reliability analysis worth the effort. You can see the effects of this reasoning in the redundant circuits added to low-yield chip-making processes to obtain viable acceptance rates, in the redundant control surfaces of military and conventional aircraft, and in the proliferation of safety systems in industrial processes. The key to leveraging high-reliability systems out of low-reliability components lies in actually achieving the independence of failures assumed in our redundant model. Anything that breaches this independence can make actual reliability orders of magnitude lower. Examples include a common cause failure, such as a fire, and anything that allows failures to propagate, such as a component which overheats and induces failures in adjacent components despite electrical isolation.

It is this feature of well-engineered manned systems that poses a dilemma. If they are well engineered, their reliability will depend largely on carefully planned redundancy. Yet if they are manned, the personnel must operate and maintain equipment across subsystem boundaries. The subsystems therefore are no longer truly independent. It is this relationship between well-engineered systems and human error that makes human reliability critical to their operation.
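To see how quickly a breach of independence erodes the benefit of redundancy, consider the following rough sketch. It assumes a single shared event (a fire, say) that disables every component at once with probability `p_common`. That parameter, its value of 0.001, and the four components at 0.1 are illustrative assumptions, not figures from the text.

```python
def redundant_failure(p, n, p_common=0.0):
    """Failure probability of n redundant components.

    p        -- independent failure probability of each component
    p_common -- assumed probability of a shared event that disables
                every component at once (hypothetical parameter)
    """
    # The system fails if the common cause occurs, or if it does not
    # occur but all n components fail independently.
    return p_common + (1 - p_common) * p ** n


independent = redundant_failure(0.1, 4)
with_common_cause = redundant_failure(0.1, 4, p_common=0.001)

print(f"{independent:.1e}")        # 1.0e-04
print(f"{with_common_cause:.1e}")  # 1.1e-03
```

Under these assumptions, a common cause that occurs only once in a thousand trials makes the redundant system roughly ten times more likely to fail, and it dominates the result no matter how many further redundant components are added.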

Non-redundant system: any component failure causes system failure.

Redundant system: any component can provide system function and prevent failure.

## THERP

THERP, the Technique for Human Error Rate Prediction, was developed by Alan Swain at Sandia National Laboratories in the 1950s as a quality control method for estimating errors in the assembly of nuclear warheads. Although the full-blown methodology incorporates distributional assumptions needed for sensitivity analysis, it is exceedingly simple in point estimate form. Types of errors, such as misreading or omitting an instructional step, or choosing the wrong switch, are presumed to occur at constant rates. If the tasks that a person performs can be broken down into subtasks for which these types of errors can be predicted, then the probability of the successful completion of the overall task can be predicted.

As with other methods for reliability analysis, the usefulness of THERP depends on what you want to predict. Where tasks are rote, there is little stress, and each step is crucial to successful completion, THERP works very well. If any of these conditions is not met, it may produce estimates that deviate substantially from actual failures. THERP is often criticized for its assumption that human error rates can be accurately quantified and predicted. A glance at the diagrams should suggest why this criticism is less telling than it may seem. We could double or halve the component failure rates in these models without substantially affecting their relative differences in reliability (doubling the rates, for instance, gives $1.6 \times 10^{-3}$ vs. 0.59). The tabled error rates used by THERP have evolved over a 30-year period based on a combination of statistical data and expert judgement and are presumed to be accurate within an order of magnitude. When studies have been conducted to verify this assumption, estimates are usually found to be much closer, varying by factors of 2 or 3 rather than 10.

The modeling assumptions typically made by human reliability analysts are another matter. As in any form of reliability analysis, the validity of results depends crucially on the modeling of dependencies. Although it pays lip service to this nostrum, human reliability analysis as a practical tool almost always presumes independence unless there is overwhelming evidence to the contrary. The event tree on the next page suggests why.
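The claim that doubling or halving the component failure rates leaves the comparison essentially unchanged can be checked numerically. The sketch below assumes four identical, independent components with a baseline failure probability of 0.1. These values are inferred from the figures quoted in the text rather than stated in it, and the function names are hypothetical.

```python
def series_failure(p, n):
    """Non-redundant system of n identical components."""
    return 1 - (1 - p) ** n


def parallel_failure(p, n):
    """Redundant system of n identical components."""
    return p ** n


N = 4  # assumed number of components
for label, p in [("halved", 0.05), ("baseline", 0.1), ("doubled", 0.2)]:
    print(
        f"{label:>8}: redundant {parallel_failure(p, N):.1e}, "
        f"non-redundant {series_failure(p, N):.2f}"
    )
```

In the doubled case this reproduces the 1.6 x 10^-3 vs. 0.59 comparison from the text. In every row the redundant system remains orders of magnitude more reliable than the non-redundant one, which is why an error rate that is off by a factor of two or three rarely changes the conclusion.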

Although the probability that a person will successfully complete a long task is a complex conditional probability involving a myriad of possible combinations of errors, it becomes exceptionally simple if these errors just happen to be independent. When this is the case, the conditional probabilities are just the same as the simple ones. If it also just so happens that every step is necessary to the successful completion of the task, then the probability of success is simply the joint probability that every step is successfully completed, and the probability of failure is one minus that joint probability. This very special case (although with slight modifications to accommodate dependence when it can't be avoided) forms the basis of THERP's event tree methodology. Tasks are represented as a tree of constituent task steps, each of which can either be successfully completed or be unsuccessful due to error. The error branches of the tree are usually left undeveloped, resulting in a tree having a single success path with errors at each of the steps represented by undeveloped leaf nodes. The probability of successful task completion can then be found simply by taking the product of the simple probabilities along the success path.
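The single success path described above reduces to one product. The following minimal sketch assumes independent steps that are all required. The three error probabilities form a hypothetical task made up for illustration, and `task_success` is not a name taken from THERP itself.

```python
from math import prod


def task_success(step_error_probs):
    """Probability of following the single success path of the tree.

    Each entry is the error probability of one independent step;
    the error branches are left undeveloped.
    """
    return prod(1 - hep for hep in step_error_probs)


# Hypothetical three-step task.
heps = [0.003, 0.01, 0.05]
p_success = task_success(heps)
p_failure = 1 - p_success

print(round(p_success, 4))  # 0.9377
print(round(p_failure, 4))  # 0.0623
```

Note that the failure probability is close to, but slightly less than, the sum of the individual error probabilities (0.063), which is a useful sanity check when the rates are small.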

The only other prominent feature of THERP lies in modeling the recovery of errors. Just as people are presumed to make errors at fixed rates, they are presumed to fail to notice that they have made errors at particular rates. So, for example, if you were setting a clock with a display which blinked until a particular step was successfully completed, then even if you committed the error of omitting that step, noticing the blinking digits would provide a second opportunity to correct your error. This type of error recovery plays an important role in reliable human performance. To see why, consider an error which is made at rate a and recovered at rate 1-b. The probability of failure at that step is now reduced to a*b. These error-recovery cycles in the tree can make persisting errors extremely rare for some steps. Another feature of administrative controls rewarded by this model of error recovery is the use of checklists and subsequent verifications. If these fail at a rate of about .1, each additional independent check reduces the probability of an unrecovered error by an order of magnitude.
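The effect of stacking recovery opportunities can be sketched as follows. The code assumes that each check fails independently with the same probability b. The initial error rate of 0.01 and the function name are illustrative assumptions.

```python
def unrecovered_error(a, b, checks=1):
    """Probability that an error is made and survives every check.

    a      -- probability of making the error
    b      -- probability that a single check fails to catch it
    checks -- number of independent recovery opportunities (assumed)
    """
    return a * b ** checks


# Hypothetical error rate of 0.01 with checks that each fail at 0.1.
for k in range(4):
    print(k, f"{unrecovered_error(0.01, 0.1, k):.0e}")
# 0 1e-02
# 1 1e-03
# 2 1e-04
# 3 1e-05
```

The order-of-magnitude gain per check holds only while the checks are truly independent. A second verification performed by the same person under the same conditions may catch far less than this model suggests.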

## Building THERP trees for assignment

While the THERP methodology is relatively straightforward, there are a number of structural and notational conventions which need to be fixed arbitrarily. The conventions we will follow are:

1) The largest unit of analysis will be the independent step. An independent step is the smallest action or sequence of actions which can be analyzed independently of preceding or succeeding actions. This amounts to treating actions without opportunities for recovery as independent steps and grouping together sequences of actions with a common opportunity for recovery as a single independent step.

2) Where multiple sources of error affect a single action, they should be combined within the tree in a fashion reflecting the "logic" of the task. For example, if a step involves selecting a control, the user may fail by either neglecting to perform the step (p = .01) or selecting the wrong control (p = .05). Because these events are mutually exclusive and initiation of the step is logically prior to performing it incorrectly, the probability of failure for the step "select control" should be: .01 (fail to initiate) + .99 (not fail to initiate) * .05 (fail to select correct control) = .0595.

3) By grouping actions and sequences of actions into independent steps, we produce a probability tree without dependencies between steps. The probability of successfully completing the task (every step) is therefore simply the joint probability of the independent steps, which is their product. The probability of failing (somewhere) is then 1 - p(success). While this is an exceedingly simplistic model of human reliability, it suffices for many real tasks such as programming a VCR, for which failing to set the time properly, or to activate the record feature, or to load the tape, or to select the proper channel, or any of a myriad of other discrete errors can lead to the failure to record the desired programming.
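The three conventions above can be combined into a short sketch. The "select control" numbers come from convention 2. The second step, the `p_no_recovery` parameter, and the choice to apply recovery to the whole step are illustrative assumptions rather than part of the stated conventions.

```python
from math import prod


def step_failure(p_omit, p_commit, p_no_recovery=1.0):
    """Failure probability of one independent step.

    p_omit        -- probability the step is never initiated
    p_commit      -- probability it is initiated but done wrongly
    p_no_recovery -- probability the error is NOT recovered
                     (1.0 means no recovery opportunity; assumed
                     here to apply to the step as a whole)
    """
    p_error = p_omit + (1 - p_omit) * p_commit
    return p_error * p_no_recovery


# Convention 2 example: neglect the step (.01) or pick the wrong control (.05).
select_control = step_failure(0.01, 0.05)
print(round(select_control, 4))  # 0.0595

# Hypothetical second step: .003 selection error with feedback (recovery fails at .1).
press_button = step_failure(0.0, 0.003, p_no_recovery=0.1)

# Convention 3: task success is the product of the step successes.
p_success = prod(1 - f for f in [select_control, press_button])
print(round(p_success, 4))      # 0.9402
print(round(1 - p_success, 4))  # 0.0598
```

Building the tree for a real task amounts to listing its independent steps, computing a failure probability for each in this way, and taking the product of the success probabilities.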

## VCR Reliability Analysis Assignment

Use the THERP methodology to find the probability of successfully setting up the sample VCR (instructions provided) to record a program at some later time. Use the human error probability (HEP) values below (loosely derived from nuclear power plant HEPs) to construct your tree.

# HEP

## Controls

| Error | HEP |
| --- | --- |
| Fail to select well-labeled control | .003 |
| Fail to select ambiguously labeled control | .05 |
| Set a selector switch in wrong position | .01 |
| Operate spring-loaded switch until proper position reached | .003 |

## Recovery

| Error | HEP |
| --- | --- |
| Fail to recover error when feedback is present | .1 |