Failures and their types in distributed database

Failures in distributed database

Failure of distributed database is the situation when distributed databases do not work as they were intended. It makes the distributed database not so reliable to work with.

To design a reliable distributed system that can recover from failures, we need first to identify the types of failures with which the system has to deal. The reasons for these failures can be traced back due to both hardware and software issues.

Mainly, distributed database system has four types of failures:

Transaction failures
System failures
Media failures
Communication failures

Chain of Events Leading to System Failure

Chain of Events Leading to System Failure

Transaction failures

When instructions on a transaction are not committed in a distributed database, we can say that the transaction has failed.

Reasons why transactions fail?

Incorrect input data given to the system
While detecting Deadlock and preventing it
Few concurrency control algorithms ( which do not permit a transaction to proceed or even to wait if another transaction currently accesses the required data)

The usual approach to take in cases of transaction failure is to abort the transaction.

This resets the database to a state prior to the start of a transaction.

System failures

System failures are also known as site failures. It can occur due to hardware or software failures or even both in some cases.

System failures are always assumed to result in the loss of main memory contents.

Any part of the database that was in the main memory buffer is lost as a result of a system failure. However, the database that is stored in secondary storage is assumed to be safe and correct.

There are two types of system failures :

Total failure: failure of all sites in the distributed database system
Partial failure: failure of only some sites (others remain operational)

Media Failures

Media failures are also known as disk failures. It refers to failures of the secondary storage devices that store the database.

Reasons why Media failures occur?

Operating system (OS) errors
Hardware corruptions

Media failures assume all or part of the database that is on the secondary storage is considered to be destroyed and inaccessible.

How to solve Media failures?

Duplication of disk storage
Maintaining archival copies of the database

Communication Failures

The three types of failures described above are common to both centralized and distributed DBMSs. But Communication failures are unique to the distributed databases.

The most common types of communication failures are:

Errors in the messages
Improperly ordered messages
Lost (or undeliverable) messages
Communication line failures

Handling errors and improperly ordered messages are the responsibility of the computer network.

Lost or undeliverable messages are typically the consequence of communication line failures or (destination) site failures.

If a communication line fails, in addition to losing the message(s) in transit, it may also divide the network into two or more disjoint groups. This is called network partitioning. If the network is partitioned, the sites in each partition may continue to operate.

We need to detect the undelivered messages and use reliability protocols to ensure the messages are passed successfully.