Reliability

Sriman Eshwarappa
Dec 20, 2021

Reliability is the ability of the system to "work correctly" even when things go wrong.

"Working correctly" can mean several things depending on the system in question. For example:

The system behaves as expected for a given set of inputs
The system protects itself and its data from corruption when it receives wrong inputs or hits unexpected conditions
The system doesn’t allow unauthorized actions on its data or modules

etc.

When all such conditions are met, we say the system is "working correctly".
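As a rough illustration of these conditions, here is a minimal Python sketch. The `Account` record, the `withdraw` function and the `InvalidInput` exception are all hypothetical names invented for this example; a real system would have its own rules for what counts as valid and authorized.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical record whose balance must never go negative."""
    owner: str
    balance: int = 0


class InvalidInput(ValueError):
    """Raised when an input would leave the data in a corrupt state."""


def withdraw(account: Account, amount: int, caller: str) -> None:
    # Reject unauthorized actions before anything else happens.
    if caller != account.owner:
        raise PermissionError(f"{caller} may not withdraw from this account")
    # Reject wrong inputs before touching any data.
    if type(amount) is not int or amount <= 0:
        raise InvalidInput("amount must be a positive integer")
    if amount > account.balance:
        raise InvalidInput("insufficient funds")
    # Valid, authorized input: behave as expected.
    account.balance -= amount
```

Because every check runs before the balance is modified, a rejected call leaves the account exactly as it was.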

The things that can go wrong are called faults. The goal of reliability is that faults should not result in failures. A system that achieves this is called fault-tolerant or resilient.

A fault is a component deviating from its spec, whereas a failure is the system as a whole no longer providing its service (for example, shutting down).

Types of Faults and Mechanisms to Avoid Them

Hardware Faults:

Examples: machines get turned off, hard disks crash, RAM gets corrupted, network cables get cut, etc.

How to avoid:

Redundancy – add extra components so that if one goes down, another takes over. The switchover can be done manually or programmatically (elastic).
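A programmatic switchover can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each redundant component is modelled as a plain function, and `call_with_failover`, `broken_disk` and `healthy_standby` are hypothetical names, not part of any real library.

```python
from typing import Callable, Sequence


class AllReplicasFailed(RuntimeError):
    """A failure: no redundant component could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:
            # A fault in one component: remember it and try the next one.
            errors.append(exc)
    # Only when every replica has faulted does the fault become a failure.
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def broken_disk(request: str) -> str:
    """Stand-in for a component with a hardware fault."""
    raise IOError("disk crashed")


def healthy_standby(request: str) -> str:
    """Stand-in for the redundant component that takes over."""
    return f"served {request}"
```

Calling `call_with_failover([broken_disk, healthy_standby], "read x")` returns the standby's response, so the caller never sees the disk fault.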

Software Faults:

Examples: bugs, runaway processes, etc.

How to avoid:

Design with all possible faults in mind

Test thoroughly and monitor continuously
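Continuous monitoring can start as simply as tracking the recent error rate and flagging it when it crosses a threshold. The sketch below assumes a sliding window of request outcomes; the `ErrorRateMonitor` class, its window size and its threshold are hypothetical values chosen for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the last `window` request outcomes and flags a high error rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        # deque with maxlen silently drops the oldest outcome when full.
        self._outcomes = deque(maxlen=window)
        self._threshold = threshold

    def record(self, ok: bool) -> None:
        self._outcomes.append(ok)

    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def should_alert(self) -> bool:
        # Alert only when the rate is strictly above the threshold.
        return self.error_rate() > self._threshold
```

In practice `should_alert()` would feed a paging or notification system; here it only returns a boolean so the logic stays easy to test.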

Human Errors:

Examples: Wrong configuration, forgetting something, etc.

How to avoid:

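Abstract out error-prone details behind well-designed interfaces

Decouple the places where people make mistakes from the places where those mistakes can cause failures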

Provide APIs/admin privileges to correct mistakes

Monitor continuously & Raise Alerts – Telemetry.

Practices and Processes
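One way to make mistakes correctable is to keep a history of changes so an admin can undo a bad one. The following is a minimal sketch, assuming configuration is a flat dictionary; `ConfigStore` and its methods are hypothetical names for this example.

```python
class ConfigStore:
    """Keeps every applied configuration so a wrong change can be rolled back."""

    def __init__(self, initial: dict):
        self._history = [dict(initial)]

    @property
    def current(self) -> dict:
        # Return a copy so callers cannot mutate the stored version.
        return dict(self._history[-1])

    def apply(self, changes: dict) -> None:
        # Store a new version instead of overwriting the old one.
        self._history.append({**self._history[-1], **changes})

    def rollback(self) -> None:
        # Undo the most recent change; the initial version is never dropped.
        if len(self._history) > 1:
            self._history.pop()
```

For example, after `store.apply({"timeout": 0})` turns out to be a wrong configuration, a single `store.rollback()` restores the previous values.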