FT Group / Gossip V2.0 (03/16/02)
Fault Tolerance and Resource Management in Heterogeneous Distributed Networks and Systems
One promising approach for scalable failure detection in large systems
is the use of gossiping and other epidemic techniques to disseminate
availability information. Gossip protocols and services provide a means
by which failures can be detected in large, distributed systems in an
asynchronous manner without the limitations associated with reliable
multicasting for group communications. Gossip v2.0 is one such failure
detection service that addresses the needs of high-performance
applications and high-availability compute centers. A comprehensive
analysis with a full implementation of this service on a cluster testbed
of approximately 160 nodes has been performed. Please see our
publications for more information on such analysis.
A few features of Gossip v2.0 include:
- Support for any number of layers, arbitrary system sizes and
- Support for dynamic insertion of a new node into the system. The new
node can be inserted into any group in the system.
- Maintains the system in a consistent state when dead nodes come back
alive and join the system out of sync.
- Live list of a group is passed to the nodes in other groups along with
the usual gossip messages. This method overcomes inconsistencies in the
system state when a node misses a consensus broadcast packet.
- The packet building and decomposing functions have been modified to
suit both flat and layered gossiping, making the code compiler friendly
- No recompilation is required for changes in system configuration.
- In earlier versions of Gossip, at least half the nodes in a group had
to be functional in order to come to consensus on failed ones in that
group. Gossip v2.0 supports dynamic reduction of group size down to as
small as size three as nodes incrementally fail, and thus can achieve a
quorum with a few as two nodes functioning.
Experiments indicate that optimum resource efficiency is achieved when
a flat service is used for systems with fewer than eight nodes, a
two-layered service is used for systems with fewer than 64 nodes, a
three-layered service for those with fewer than 512 nodes, a four-layered
for those with fewer than 4,096 nodes, and a five-layered for those with
fewer than 32,768 nodes. Failure detection time can be as low as 130ms
for systems of any size with a group size of eight. Bandwidth used by the
service can be kept as low as 11 Kbps even for systems as large as 25,000
nodes with a five-layered service.