
Friday, January 4, 2013

Reliability, availability and dependability


One persistent shortcoming in the general topic of making computer systems that can survive component faults has been confusion over terms. Consequently, perfectly good words like “reliability” and “availability” have been abused over the years so that their precise meaning is unclear.

Defining Failure:
Laprie picked a new term, dependability, to have a clean slate to work with:
Computer system dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other system(s) interacting with this system’s users. Each module also has an ideal specified behavior, where a service specification is an agreed description of the expected behavior.
A system failure occurs when the actual behavior deviates from the specified behavior. The failure occurred because of an error, a defect in that module. The cause of an error is a fault.
When a fault occurs, it creates a latent error, which becomes effective when it is activated; when the error actually affects the delivered service, a failure occurs. The time between the occurrence of an error and the resulting failure is the error latency. Thus, an error is the manifestation in the system of a fault, and a failure is the manifestation on the service of an error.

To clarify, the relation between faults, errors, and failures (made concrete in the sketch after this list) is:
  • A fault creates one or more latent errors.
  • The properties of errors are
1) a latent error becomes effective once activated;
2) an error may cycle between its latent and effective states;
3) an effective error often propagates from one component to another, thereby creating new errors. Thus, an effective error is either a formerly latent error in that component, or one that has propagated from another error in that component or from elsewhere.
  • A component failure occurs when the error affects the delivered service.
  • These properties are recursive, and apply to any component in the system.
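
To make the chain concrete, here is a toy model (mine, not Laprie's): think of an alpha particle flipping a DRAM bit. The flip is the fault; the flipped bit is a latent error until the word is read; the read activates it; and if the wrong value reaches the program, the service deviates from its specification and a failure has occurred. With ECC, the read would correct the error and no failure would follow.

    # Toy model of the fault -> latent error -> effective error -> failure
    # chain. Class and method names here are illustrative only.
    class MemoryWord:
        def __init__(self, value):
            self.spec = value          # specified (correct) contents
            self.stored = value        # actual contents in the device

        def fault(self, bit):
            # A fault occurs: one stored bit flips. The resulting error
            # is latent -- nothing observable has happened yet.
            self.stored ^= (1 << bit)

        def read(self):
            # The latent error becomes effective when it is activated by
            # a read and the wrong value propagates to the caller.
            return self.stored

    word = MemoryWord(0b1010)
    word.fault(bit=0)                  # fault creates a latent error
    delivered = word.read()            # error activated: now effective
    if delivered != word.spec:
        # Failure: the delivered service deviates from the specification.
        print("failure: got", bin(delivered), "expected", bin(word.spec))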

Users perceive a system alternating between two states of delivered service with respect to the service specification:
1. Service accomplishment, where the service is delivered as specified,
2. Service interruption, where the delivered service is different from the specified service.

Transitions between these two states are caused by failures (from state 1 to state 2) or restorations (2 to 1). Quantifying these transitions leads to the two main measures of dependability:

1. Module reliability is a measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant. Hence, mean time to failure (MTTF) is a reliability measure. For a collection of modules with independent, exponentially distributed lifetimes, the overall failure rate of the collection is the sum of the failure rates of the modules. Service interruption is measured as Mean Time To Repair (MTTR).
2. Module availability is a measure of the service accomplishment with respect
to the alternation between the two states of accomplishment and interruption.
For non-redundant systems with repair, module availability is statistically
quantified as:

                    Module availability = MTTF / (MTTF + MTTR)

Note that reliability and availability are now quantifiable metrics, rather than synonyms for dependability. Mean Time Between Failures (MTBF) is simply the sum of MTTF and MTTR. Although MTBF is widely used, MTTF is often the more appropriate term.
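
To put numbers on these metrics (the figures below are made up for illustration), assume a collection of independent modules with exponentially distributed lifetimes, so the failure rates simply add:

    # Worked example with illustrative numbers: ten disks plus a
    # controller and a power supply, each with an assumed MTTF.
    mttf_hours = [1_000_000] * 10 + [500_000, 200_000]

    failure_rate = sum(1.0 / m for m in mttf_hours)    # failures per hour
    system_mttf = 1.0 / failure_rate                   # ~58,800 hours

    mttr = 24.0                                        # assumed repair time
    availability = system_mttf / (system_mttf + mttr)  # MTTF / (MTTF + MTTR)
    mtbf = system_mttf + mttr                          # MTBF = MTTF + MTTR

    print(f"system MTTF  = {system_mttf:,.0f} hours")
    print(f"availability = {availability:.6f}")
    print(f"MTBF         = {mtbf:,.0f} hours")

With these invented numbers, the system fails about every 58,800 hours and is available about 99.96% of the time.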

Classifying faults and fault tolerance techniques may aid understanding.
Gray and Siewiorek classify faults into four categories according to their cause:

1. Hardware faults: devices that fail.
2. Design faults: faults in software (usually) and hardware design (occasionally).
3. Operation faults: mistakes by operations and maintenance personnel.
4. Environmental faults: fire, flood, earthquake, power failure, and sabotage.

Faults are also classified by their duration into transient, intermittent, and permanent [Nelson 1990]. Transient faults exist for a limited time and are not recurring. Intermittent faults cause a system to oscillate between faulty and fault-free operation. Permanent faults do not correct themselves with the passing of time.
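
The distinction matters for recovery strategy. A minimal sketch (not from the post): a bounded retry loop masks transient faults, while a fault that persists across every retry behaves as permanent and must surface as a failure of the component:

    import random
    import time

    def flaky_read():
        # Stand-in for an operation with a transient fault mode, e.g. a
        # bus transfer occasionally hit by noise.
        if random.random() < 0.3:
            raise IOError("transient fault")
        return "data"

    def read_with_retries(op, attempts=3, backoff=0.01):
        for i in range(attempts):
            try:
                return op()
            except IOError:
                time.sleep(backoff * (2 ** i))  # brief exponential backoff
        # A fault surviving all retries acts permanent (or intermittent);
        # the error now propagates upward as a failure of this component.
        raise IOError("retries exhausted: treating fault as permanent")

    print(read_with_retries(flaky_read))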
      
Reliability improvements can be grouped into four methods:
1. Fault avoidance: how to prevent, by construction, fault occurrence;
2. Fault tolerance: how to provide, by redundancy, service complying with the service specification in spite of faults that have occurred or are occurring (a voting sketch follows this list);
3. Error removal: how to minimize, by verification, the presence of latent errors;
4. Error forecasting: how to estimate, by evaluation, the presence, creation, and
consequences of errors.
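
As one concrete instance of method 2, here is a sketch (illustrative, not from the post) of triple modular redundancy (TMR): run three replicas and deliver the majority answer, so a fault in any single replica never reaches the delivered service:

    from collections import Counter

    def tmr(replicas, *args):
        # Deliver the majority result of three redundant computations.
        results = [r(*args) for r in replicas]
        winner, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority: more than one replica faulty")
        return winner

    def good(x):
        return x * x

    def faulty(x):
        return x * x + 1    # replica with a (contrived) design fault

    print(tmr([good, good, faulty], 7))   # -> 49; faulty replica outvoted

The extra hardware or computation is the price of complying with the service specification despite an active fault.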
