
Friday, January 4, 2013

Reliability, availability and dependability


One persistent shortcoming in the general topic of making computer systems that can survive component faults has been confusion over terms. Consequently, perfectly good words like “reliability” and “availability” have been abused over the years so that their precise meaning is unclear.

Defining Failure:
Laprie picked a new term, dependability, to have a clean slate to work with:
Computer system dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other system(s) interacting with this system’s users. Each module also has an ideal specified behavior, where a service specification is an agreed description of the expected behavior.
A system failure occurs when the actual behavior deviates from the specified behavior. The failure occurred because of an error, a defect in that module. The cause of an error is a fault.
When a fault occurs, it creates a latent error, which becomes effective when it is activated; when the error actually affects the delivered service, a failure occurs. The time between the occurrence of an error and the resulting failure is the error latency. Thus, an error is the manifestation in the system of a fault, and a failure is the manifestation on the service of an error.

To clarify, the relation between faults, errors, and failures (made concrete in the sketch after this list) is:
  • A fault creates one or more latent errors.
  • The properties of errors are
1) a latent error becomes effective once activated;
2) an error may cycle between its latent and effective states;
3) an effective error often propagates from one component to another, thereby creating new errors. Thus, an effective error is either a formerly latent error in that component, or one that has propagated from another error in that component or from elsewhere.
  • A component failure occurs when the error affects the delivered service.
  • These properties are recursive, and apply to any component in the system.
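
To make the chain concrete, here is a toy model (mine, not Laprie's): think of an alpha particle flipping a DRAM bit. The flip is the fault; the flipped bit is a latent error until the word is read; the read activates it; and if the wrong value reaches the program, the service deviates from its specification and a failure has occurred. With ECC, the read would correct the error and no failure would follow.

    # Toy model of the fault -> latent error -> effective error -> failure
    # chain. Class and method names here are illustrative only.
    class MemoryWord:
        def __init__(self, value):
            self.spec = value          # specified (correct) contents
            self.stored = value        # actual contents in the device

        def fault(self, bit):
            # A fault occurs: one stored bit flips. The resulting error
            # is latent -- nothing observable has happened yet.
            self.stored ^= (1 << bit)

        def read(self):
            # The latent error becomes effective when it is activated by
            # a read and the wrong value propagates to the caller.
            return self.stored

    word = MemoryWord(0b1010)
    word.fault(bit=0)                  # fault creates a latent error
    delivered = word.read()            # error activated: now effective
    if delivered != word.spec:
        # Failure: the delivered service deviates from the specification.
        print("failure: got", bin(delivered), "expected", bin(word.spec))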

Users perceive a system alternating between two states of delivered service with respect to the service specification:
1. Service accomplishment, where the service is delivered as specified,
2. Service interruption, where the delivered service is different from the specified service.

Transitions between these two states are caused by failures (from state 1 to state 2) or restorations (2 to 1). Quantifying these transitions leads to the two main measures of dependability:

1. Module reliability is a measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant. Hence, mean time to failure (MTTF) is a reliability measure. For a collection of modules with independent, exponentially distributed lifetimes, the overall failure rate of the collection is the sum of the failure rates of the modules. Service interruption is measured as Mean Time To Repair (MTTR).
2. Module availability is a measure of the service accomplishment with respect
to the alternation between the two states of accomplishment and interruption.
For non-redundant systems with repair, module availability is statistically
quantified as:

                    Module availability = MTTF / (MTTF + MTTR)

Note that reliability and availability are now quantifiable metrics, rather than synonyms for dependability. Mean Time Between Failures (MTBF) is simply the sum of MTTF and MTTR. Although MTBF is widely used, MTTF is often the more appropriate term.
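
To put numbers on these metrics (the figures below are made up for illustration), assume a collection of independent modules with exponentially distributed lifetimes, so the failure rates simply add:

    # Worked example with illustrative numbers: ten disks plus a
    # controller and a power supply, each with an assumed MTTF.
    mttf_hours = [1_000_000] * 10 + [500_000, 200_000]

    failure_rate = sum(1.0 / m for m in mttf_hours)    # failures per hour
    system_mttf = 1.0 / failure_rate                   # ~58,800 hours

    mttr = 24.0                                        # assumed repair time
    availability = system_mttf / (system_mttf + mttr)  # MTTF / (MTTF + MTTR)
    mtbf = system_mttf + mttr                          # MTBF = MTTF + MTTR

    print(f"system MTTF  = {system_mttf:,.0f} hours")
    print(f"availability = {availability:.6f}")
    print(f"MTBF         = {mtbf:,.0f} hours")

With these invented numbers, the system fails about every 58,800 hours and is available about 99.96% of the time.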

Classifying faults and fault tolerance techniques may aid understanding.
Gray and Siewiorek classify faults into four categories according to their cause:

1. Hardware faults: devices that fail.
2. Design faults: faults in software (usually) and hardware design (occasionally).
3. Operation faults: mistakes by operations and maintenance personnel.
4. Environmental faults: fire, flood, earthquake, power failure, and sabotage.

Faults are also classified by their duration into transient, intermittent, and permanent [Nelson 1990]. Transient faults exist for a limited time and are not recurring. Intermittent faults cause a system to oscillate between faulty and fault-free operation. Permanent faults do not correct themselves with the passing of time.
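
The distinction matters for recovery strategy. A minimal sketch (not from the post): a bounded retry loop masks transient faults, while a fault that persists across every retry behaves as permanent and must surface as a failure of the component:

    import random
    import time

    def flaky_read():
        # Stand-in for an operation with a transient fault mode, e.g. a
        # bus transfer occasionally hit by noise.
        if random.random() < 0.3:
            raise IOError("transient fault")
        return "data"

    def read_with_retries(op, attempts=3, backoff=0.01):
        for i in range(attempts):
            try:
                return op()
            except IOError:
                time.sleep(backoff * (2 ** i))  # brief exponential backoff
        # A fault surviving all retries acts permanent (or intermittent);
        # the error now propagates upward as a failure of this component.
        raise IOError("retries exhausted: treating fault as permanent")

    print(read_with_retries(flaky_read))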
      
Reliability improvements can be grouped into four methods:
1. Fault avoidance: how to prevent, by construction, fault occurrence;
2. Fault tolerance: how to provide, by redundancy, service complying with the service specification in spite of faults that have occurred or are occurring (a voting sketch follows this list);
3. Error removal: how to minimize, by verification, the presence of latent errors;
4. Error forecasting: how to estimate, by evaluation, the presence, creation, and
consequences of errors.
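
As one concrete instance of method 2, here is a sketch (illustrative, not from the post) of triple modular redundancy (TMR): run three replicas and deliver the majority answer, so a fault in any single replica never reaches the delivered service:

    from collections import Counter

    def tmr(replicas, *args):
        # Deliver the majority result of three redundant computations.
        results = [r(*args) for r in replicas]
        winner, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority: more than one replica faulty")
        return winner

    def good(x):
        return x * x

    def faulty(x):
        return x * x + 1    # replica with a (contrived) design fault

    print(tmr([good, good, faulty], 7))   # -> 49; faulty replica outvoted

The extra hardware or computation is the price of complying with the service specification despite an active fault.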
