The University of Florida 
High-performance Computing & Simulation Research Lab

FT Group / Gossip with GEMS V1.0
(released 10/15/02 for Linux, 02/25/03 for Tru64 and Solaris)
Fault Tolerance and Resource Management in Heterogeneous Distributed Networks and Systems

Gossip protocols provide a scalable, asynchronous means of failure detection and resource monitoring in heterogeneous distributed systems, without the limits associated with group communication. In addition to supporting all the features of the earlier release, Gossip v2.0, the Gossip service with GEMS v1.0 supports resource monitoring as an extension of the gossip-style failure detection protocol. The gossip-enabled monitoring service (GEMS) is implemented by piggybacking monitored data on the failure detection messages. This technique makes the combined service scalable, distributed, and fault-tolerant. The new version addresses two major challenges of clustering: failure detection and resource monitoring. The service can be used as middleware for system administration, scheduling, and load-balancing services.
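
The piggybacking idea can be sketched in a few lines of Python. This is only an illustrative model, not the GEMS implementation or its API: the class names, the message layout, and the `t_cleanup` timeout parameter are assumptions made for the example. Each node keeps a table of heartbeat counters, attaches its sensor readings to the same table it gossips, and suspects a peer whose heartbeat has not increased within the timeout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entry:
    heartbeat: int = 0        # highest heartbeat counter seen for the node
    last_update: float = 0.0  # local time at which the heartbeat last increased
    data: Dict[str, float] = field(default_factory=dict)  # piggybacked readings


class GossipNode:
    """Hypothetical node model; names and timeout are assumptions."""

    def __init__(self, name: str, t_cleanup: float = 10.0) -> None:
        self.name = name
        self.t_cleanup = t_cleanup
        self.table: Dict[str, Entry] = {name: Entry()}

    def tick(self, now: float, sensors: Dict[str, float]) -> None:
        """Advance this node's heartbeat and refresh its monitored data."""
        me = self.table[self.name]
        me.heartbeat += 1
        me.last_update = now
        me.data = dict(sensors)

    def make_message(self) -> Dict[str, Tuple[int, Dict[str, float]]]:
        """Build one gossip message: heartbeat plus data for every known node."""
        return {n: (e.heartbeat, dict(e.data)) for n, e in self.table.items()}

    def receive(self, message, now: float) -> None:
        """Merge a message, keeping the entry with the higher heartbeat."""
        for n, (hb, data) in message.items():
            entry = self.table.setdefault(n, Entry(heartbeat=-1))
            if hb > entry.heartbeat:
                entry.heartbeat = hb
                entry.last_update = now
                entry.data = dict(data)  # fresher heartbeat implies fresher data

    def suspected(self, now: float) -> List[str]:
        """Return peers whose heartbeat has not advanced within t_cleanup."""
        return sorted(
            n for n, e in self.table.items()
            if n != self.name and now - e.last_update > self.t_cleanup
        )
```

Because monitored data is accepted only together with a newer heartbeat, a stale message cannot overwrite fresher readings, which is one way to read the consistency property that the heartbeat mechanism provides.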

New features supported by Gossip service with GEMS v1.0 include:
  • Failure detection of groups through group consensus, in addition to failure detection of nodes
  • Fully distributed resource monitoring with an array of built-in sensors for load average, network utilization, etc.
  • Data consistency maintained through the heartbeat protocol of the gossip-style failure detection service
  • Aggregation of monitored parameters to present an aggregate view of groups of nodes, which also improves the scalability of the service through reduced resource utilization
  • Dissemination and aggregation of application data
  • Dynamic inclusion of user-defined aggregation functions
  • Simple API for retrieval and dissemination of monitored data
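
As a rough illustration of aggregation with user-defined functions, the sketch below reduces per-node readings to one value per parameter for a group. It does not reproduce the GEMS API: `AGGREGATORS`, `register_aggregator`, and `aggregate_group` are hypothetical names chosen for this example, and the parameter names are made up.

```python
from statistics import mean
from typing import Callable, Dict, List

# Built-in aggregation functions, keyed by name (hypothetical registry).
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "min": min,
    "max": max,
    "mean": mean,
}


def register_aggregator(name: str, func: Callable[[List[float]], float]) -> None:
    """Add a user-defined aggregation function at run time."""
    AGGREGATORS[name] = func


def aggregate_group(
    readings: Dict[str, Dict[str, float]],
    functions: Dict[str, str],
) -> Dict[str, float]:
    """Reduce per-node readings (node -> {param: value}) to one value per
    parameter, using the aggregator named for that parameter."""
    result: Dict[str, float] = {}
    for param, func_name in functions.items():
        values = [r[param] for r in readings.values() if param in r]
        if values:  # skip parameters that no node reported
            result[param] = AGGREGATORS[func_name](values)
    return result
```

For example, with readings from two nodes, `aggregate_group(readings, {"load": "mean", "net": "max"})` yields a single group-level record, so only the aggregate, rather than every node's data, would need to be passed beyond the group.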