Reliability
- Sriman Eshwarappa
- Dec 20, 2021
Reliability is the ability of the system to "work correctly" even when things go wrong.
"Working correctly" can mean several things, depending on the system in question. For example:
The system behaves as expected for a given set of inputs
The system protects itself and its data from corruption when it receives wrong inputs or encounters unexpected conditions
The system does not allow unauthorized actions on its data or modules
etc.
When all such conditions are met, we say the system is "working correctly".
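As a rough illustration of these conditions, here is a minimal Python sketch. The Account class, its fields, and the withdraw method are hypothetical names made up for this example; they do not come from any particular system.

```python
class Account:
    """Hypothetical account, used only to illustrate the conditions above."""

    def __init__(self, owner: str, balance: int) -> None:
        self.owner = owner
        self.balance = balance

    def withdraw(self, user: str, amount: int) -> int:
        # Unauthorized actions are rejected before anything else happens.
        if user != self.owner:
            raise PermissionError(f"{user} may not withdraw from this account")
        # Wrong inputs are rejected before they can corrupt the balance.
        if not isinstance(amount, int) or amount <= 0 or amount > self.balance:
            raise ValueError(f"invalid amount: {amount!r}")
        # Valid input from an authorized user produces the expected result.
        self.balance -= amount
        return self.balance
```

A rejected call raises an exception and leaves the balance unchanged, so bad input or an unauthorized caller cannot put the data into a corrupt state.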
The things that can go wrong are called faults. The goal of reliability is that faults do not result in failures. A system that achieves this is called fault-tolerant or resilient.
A fault is a component deviating from its spec, whereas a failure is the system as a whole shutting down and no longer providing its service.
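The difference between a fault and a failure can be sketched in a few lines of Python. This is only an illustration under assumed names: fetch_price, cache_lookup, and database_lookup are hypothetical, and the "cache" and "database" are just callables passed in by the caller.

```python
def fetch_price(item, cache_lookup, database_lookup):
    """Return the price of an item, tolerating a fault in the cache."""
    try:
        return cache_lookup(item)
    except ConnectionError:
        # Fault: one component (the cache) deviated from its spec.
        # It is tolerated by falling back to the database, so the
        # request as a whole still succeeds and there is no failure.
        return database_lookup(item)
```

If the database lookup also raised an error, nothing would be left to absorb it, and the fault would surface to the caller as a failure.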
Types of Faults and Mechanisms to Avoid Them
Hardware Faults:
Examples: machines get turned off, hard disks crash, RAM gets corrupted, network cables get cut, etc.
How to avoid:
Redundancy: add extra components so that if one fails, another takes over. The switchover can be done manually or programmatically (elastic).
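Programmatic redundancy can be sketched as a simple failover loop. This is a minimal sketch, assuming each redundant component is represented by a callable that raises OSError (or a subclass such as ConnectionError) when it is down; call_with_failover is a hypothetical helper name.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in order and return the first result."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except OSError as exc:
            # This replica is down; remember why and try the next one.
            last_error = exc
    # Every replica failed (or none were given): the fault becomes a failure.
    raise RuntimeError("all replicas failed") from last_error
```

With two replicas, a crash of the first one is invisible to the caller. Only when every replica is down does the caller see an error.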
Software Faults:
Examples: bugs, runaway processes, etc.
How to avoid:
Design with all possible faults in mind
Test thoroughly and monitor continuously
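One way to contain a runaway process is to give it a deadline. The sketch below uses the timeout parameter of Python's standard subprocess.run, which kills the child process when the timeout expires. The helper name run_with_deadline is hypothetical, and the deadline values are arbitrary choices for illustration.

```python
import subprocess
import sys


def run_with_deadline(args, seconds):
    """Return True if the command finished in time, False if it was killed."""
    try:
        subprocess.run(args, timeout=seconds, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return False


if __name__ == "__main__":
    # A well-behaved command finishes well within its deadline.
    print(run_with_deadline([sys.executable, "-c", "pass"], 10))
    # An infinite loop is cut off once the deadline passes.
    print(run_with_deadline([sys.executable, "-c", "while True: pass"], 1))
```

In a monitored system, the False branch would also be a natural place to record a metric or raise an alert.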
Human Errors:
Examples: wrong configuration, forgotten steps, etc.
How to avoid:
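Abstract out error-prone details behind well-designed interfaces
Decouple the places where people make mistakes from the places where those mistakes can cause failures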
Provide APIs or admin privileges that let operators correct mistakes
Monitor continuously and raise alerts (telemetry)
Establish good practices and processes
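Two of the ideas above, catching a wrong configuration before it takes effect and giving operators a way to correct a mistake, can be sketched as follows. The names validate_config and ConfigStore, and the "port" and "replicas" keys, are hypothetical and chosen only for this example.

```python
def validate_config(config: dict) -> list:
    """Return a list of problems found in the config (empty if it is valid)."""
    problems = []
    port = config.get("port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("port must be an integer between 1 and 65535")
    replicas = config.get("replicas")
    if not isinstance(replicas, int) or replicas < 1:
        problems.append("replicas must be an integer of at least 1")
    return problems


class ConfigStore:
    """Keeps a history of applied configs so a mistake can be undone."""

    def __init__(self, initial: dict) -> None:
        self._history = [initial]

    @property
    def current(self) -> dict:
        return self._history[-1]

    def apply(self, config: dict) -> None:
        # Reject a wrong configuration before it can take effect.
        problems = validate_config(config)
        if problems:
            raise ValueError("; ".join(problems))
        self._history.append(config)

    def rollback(self) -> None:
        # Let an operator return to the previous known-good config.
        if len(self._history) > 1:
            self._history.pop()
```

Validation catches the mistakes that can be detected up front, and rollback covers the ones that are only noticed after the change has been applied.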
