The University of Florida 
High-performance Computing & Simulation Research Lab
home > project > ftgroup

    submenu »     project home | overview | downloads | publications | related links

FT Group / Gossip V2.0 (03/16/02)
Fault Tolerance and Resource Management in Heterogeneous Distributed Networks and Systems

One promising approach for scalable failure detection in large systems is the use of gossiping and other epidemic techniques to disseminate availability information. Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limitations associated with reliable multicasting for group communications. Gossip v2.0 is one such failure detection service that addresses the needs of high-performance applications and high-availability compute centers. A comprehensive analysis with a full implementation of this service on a cluster testbed of approximately 160 nodes has been performed. Please see our publications for more information on such analysis.

A few features of Gossip v2.0 include:

  • Support for any number of layers, arbitrary system sizes and configurations.
  • Support for dynamic insertion of a new node into the system. The new node can be inserted into any group in the system.
  • Maintains the system in a consistent state when dead nodes come back alive and join the system out of sync.
  • Live list of a group is passed to the nodes in other groups along with the usual gossip messages. This method overcomes inconsistencies in the system state when a node misses a consensus broadcast packet.
  • The packet building and decomposing functions have been modified to suit both flat and layered gossiping, making the code compiler friendly and comprehensible.
  • No recompilation is required for changes in system configuration.
  • In earlier versions of Gossip, at least half the nodes in a group had to be functional in order to come to consensus on failed ones in that group. Gossip v2.0 supports dynamic reduction of group size down to as small as size three as nodes incrementally fail, and thus can achieve a quorum with a few as two nodes functioning.

Experiments indicate that optimum resource efficiency is achieved when a flat service is used for systems with fewer than eight nodes, a two-layered service is used for systems with fewer than 64 nodes, a three-layered service for those with fewer than 512 nodes, a four-layered for those with fewer than 4,096 nodes, and a five-layered for those with fewer than 32,768 nodes. Failure detection time can be as low as 130ms for systems of any size with a group size of eight. Bandwidth used by the service can be kept as low as 11 Kbps even for systems as large as 25,000 nodes with a five-layered service.