Definition
Fault tolerance refers to a computing functionality where a system is capable of continuing operations even after a component such as data center, server or connection has failed.
It is an essential aspect of system design to ensure operations continuity and minimize downtimes in the event of unforeseeable failures.
Components of Fault-tolerant Systems
- Hardware systems: Hardware systems use redundant components such as duplicate servers and RAID configurations to ensure continuity of operations when main systems fail.
- Software systems: Backing up databases using other programs makes them fault-tolerant. A database copy can be replicated onto another system, which comes alive after the primary database fails.
- Power sources: Infrastructures such as UPS units and backup generators are erected to make power sources fault-tolerant. These ensure an uninterrupted power supply for normal operations such as air conditioning, heating, and ventilation during power outages or failures.
Fault Tolerance Examples
- Redundant systems: System designs use redundancy to achieve fault tolerance. For instance, finding multiple servers or connections is common to ensure continuity if one fails.
- RAID storage: RAID is a type of data storage that duplicates data across different disks for easier recovery in the event of disk failure.