A distributed database failure is a situation in which a distributed database does not work as intended. Such failures make the distributed database less reliable to work with.
To design a reliable distributed system that can recover from failures, we first need to identify the types of failures the system has to deal with. These failures can be traced back to both hardware and software issues.
A distributed database system mainly faces four types of failures: transaction failures, system (site) failures, media (disk) failures, and communication failures.
Chain of Events Leading to System Failure
When the operations of a transaction cannot be committed in a distributed database, the transaction is said to have failed.
Reasons why transactions fail?
The usual approach in cases of transaction failure is to abort the transaction.
Aborting resets the database to its state prior to the start of the transaction.
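The abort behavior can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module on a single in-memory database to stand in for one site's local transaction; the `accounts` table, the `transfer` function, and the "no negative balance" rule are hypothetical and exist only for this example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; abort the transaction on any failure."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        # Hypothetical integrity rule: a balance may not go negative.
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.commit()
        return True
    except Exception:
        # Abort: undo every change made since the transaction began.
        conn.rollback()
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 500)  # fails the integrity rule
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, `balances` holds the same values as before the transaction started, which is the state the abort is meant to restore.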
System failures are also known as site failures. They can occur due to hardware failures, software failures, or both.
System failures are always assumed to result in the loss of main memory contents.
Any part of the database that was in the main memory buffer is lost as a result of a system failure. However, the part of the database stored in secondary storage is assumed to be safe and correct.
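The assumption about main memory and secondary storage can be pictured with a toy model. This is only a sketch: the `Site` class and its `buffer`, `disk`, `flush`, and `crash` names are hypothetical, and a real DBMS buffer manager is far more involved.

```python
class Site:
    """Toy model of one site: a volatile buffer in front of stable storage."""

    def __init__(self):
        self.buffer = {}  # main memory: lost on a system failure
        self.disk = {}    # secondary storage: assumed to survive

    def write(self, key, value):
        # Updates land in the main memory buffer first.
        self.buffer[key] = value

    def flush(self):
        # Copy buffered updates to secondary storage.
        self.disk.update(self.buffer)

    def read(self, key):
        # Prefer the buffered copy, fall back to secondary storage.
        return self.buffer.get(key, self.disk.get(key))

    def crash(self):
        # System failure: the contents of main memory are lost.
        self.buffer.clear()

site = Site()
site.write("x", 1)
site.flush()        # x = 1 is now on disk
site.write("x", 2)  # buffered only
site.write("y", 7)  # buffered only
site.crash()
```

After the crash, `site.read("x")` returns the flushed value and `site.read("y")` returns `None`, because the unflushed updates existed only in the buffer.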
There are two types of system failures:
Media failures are also known as disk failures. They refer to failures of the secondary storage devices that store the database.
Reasons why Media failures occur?
When a media failure occurs, all or part of the database on secondary storage is considered to be destroyed and inaccessible.
How to solve Media failures?
The three types of failures described above are common to both centralized and distributed DBMSs. Communication failures, however, are unique to distributed databases.
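The most common types of communication failures are errors in messages, improperly ordered messages, lost or undeliverable messages, and communication line failures.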
Handling message errors and improperly ordered messages is the responsibility of the computer network.
Lost or undeliverable messages are typically the consequence of communication line failures or (destination) site failures.
If a communication line fails, in addition to losing the message(s) in transit, it may also divide the network into two or more disjoint groups. This is called network partitioning. If the network is partitioned, the sites in each partition may continue to operate.
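Network partitioning can be sketched as a connected-components computation over the sites and the communication lines that still work. The site names and links below are made up for illustration, and `partitions` is a hypothetical helper, not part of any library.

```python
from collections import deque

def partitions(sites, links):
    """Group sites into sets that can still reach each other over `links`."""
    adjacency = {site: set() for site in sites}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for start in sites:
        if start in seen:
            continue
        # Breadth-first search collects every site reachable from `start`.
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            site = queue.popleft()
            group.add(site)
            for neighbor in adjacency[site]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(group)
    return groups

sites = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "C"), ("C", "D")]

before = partitions(sites, links)
# Simulate the failure of the communication line between B and C.
after = partitions(sites, [link for link in links if link != ("B", "C")])
```

With all lines working there is a single group containing every site. Once the B-C line fails, the sites fall into two disjoint groups, and the sites inside each group can still talk to one another.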
Undelivered messages must be detected, and reliability protocols must be used to ensure that messages are delivered successfully.
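One simple way to detect an undelivered message is to wait for an acknowledgment and resend when none arrives. The sketch below assumes that approach; the source does not name a specific reliability protocol, and `LossyChannel`, `send_reliably`, and the fixed retry limit are hypothetical. The timeout itself is modeled by `send` returning `False`.

```python
class LossyChannel:
    """Hypothetical channel that silently drops the first `drops` messages."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []

    def send(self, seq, payload):
        """Return True if an acknowledgment came back before the timeout."""
        if self.drops > 0:
            self.drops -= 1
            return False  # no acknowledgment: the message was lost
        self.delivered.append((seq, payload))
        return True

def send_reliably(channel, seq, payload, max_attempts=5):
    """Resend until acknowledged; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if channel.send(seq, payload):
            return attempt
    # Every attempt timed out: report the message as undeliverable.
    raise TimeoutError(f"message {seq} undelivered after {max_attempts} attempts")

channel = LossyChannel(drops=2)
attempts = send_reliably(channel, seq=1, payload="commit T1")
```

The sender cannot tell a lost message from a failed line or a failed destination site; all it observes is the missing acknowledgment, which is why the retry limit eventually gives up and raises an error.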