Gossip for Failure Detection and Consensus
Researchers at UF have worked on gossip protocols for large-scale distributed systems for over four years. Initial work focused on gossip-based failure detection and consensus; more recently it has expanded to cover a broad class of resource liveness and performance monitoring for large clusters and grids. The research was initially conducted through modeling and simulation, and UDP-based implementations and experiments with them have emerged over the past several years.
Gossip-style information dissemination has opened a promising avenue of research on failure detection and resource monitoring for large systems. Gossip protocols and services provide a means of detecting failures in large distributed systems asynchronously, without the limitations associated with reliable multicasting for group communications. At every time step (e.g., every 10 ms), each node in the system sends a gossip message to another node (e.g., one chosen at random). The message carries the sender's view of the liveness status of all the other nodes in the system, based on the messages it has received earlier from other nodes. Each node maintains a table of heartbeat counters, one for every other node in the system, which is updated on receipt of each gossip message. After a node stops sending gossip messages (i.e., the node has failed), the heartbeat count corresponding to the failed node keeps growing at all other nodes until it exceeds a predetermined threshold, at which point the failure is detected by the other nodes in the system.
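The heartbeat mechanism described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the GEMS implementation: the names Node, tick, merge, suspects, and simulate are hypothetical, each counter is assumed to hold the number of gossip intervals since a node was last heard from, tables are assumed to be merged by keeping the smaller count, and the network is replaced by direct method calls rather than UDP.

```python
import random


class Node:
    """One participant; all names here are illustrative, not taken from GEMS."""

    def __init__(self, node_id, n_nodes, threshold):
        self.node_id = node_id
        self.threshold = threshold
        self.alive = True
        # heartbeat[j]: gossip intervals since node j was last heard from,
        # directly or through another node's gossip (assumed representation).
        self.heartbeat = [0] * n_nodes

    def tick(self):
        """Age every entry except our own by one gossip interval."""
        for j in range(len(self.heartbeat)):
            if j != self.node_id:
                self.heartbeat[j] += 1

    def merge(self, remote):
        """On receipt of a gossip message, keep the fresher (smaller) count."""
        self.heartbeat = [min(a, b) for a, b in zip(self.heartbeat, remote)]

    def suspects(self):
        """Nodes whose count has exceeded the detection threshold."""
        return {j for j, h in enumerate(self.heartbeat) if h > self.threshold}


def simulate(n_nodes=8, threshold=12, fail_node=3, fail_at=5, steps=60, seed=0):
    rng = random.Random(seed)
    nodes = [Node(i, n_nodes, threshold) for i in range(n_nodes)]
    for step in range(steps):
        if step == fail_at:
            nodes[fail_node].alive = False  # stops gossiping from now on
        live = [n for n in nodes if n.alive]
        for n in live:
            n.tick()
        for n in live:
            # Each live node gossips its table to one randomly chosen peer.
            target = rng.choice([m for m in nodes if m is not n])
            if target.alive:  # messages to a failed node are simply lost
                target.merge(n.heartbeat)
    return nodes
```

After simulate() returns, every live node's suspects() set contains the failed node, because no live node can ever refresh that entry. A low threshold relative to the group size can also cause a live node to be suspected by mistake, which is one reason a consensus step is useful.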
To obtain a consistent system-wide view of failures and to prevent false failure detections, all the nodes in the system must reach consensus on the status of a failed node. This is done with a distributed consensus protocol (included in GEMS) based on discovery and notification. To improve scalability, the system is divided into a logical hierarchy of multiple levels, each with multiple groups, in which communication is more frequent between nodes in the same group. Nodes within each group take part, at random, in communication with other groups. The consensus algorithm operates within each group, and notification of failures is disseminated throughout the system by members of the group.
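A group-level consensus check of this kind might be sketched as follows. The text does not spell out the agreement rule, so the one used here is an assumption: each node is taken to hold a matrix of suspicions collected through gossip, and a failure is considered agreed once every group member that the local node does not itself suspect also suspects the node in question. The function name consensus_failures and the matrix layout are hypothetical.

```python
def consensus_failures(suspect_matrix, me):
    """Return the node ids that the group agrees have failed.

    suspect_matrix[i][j] is True when node i is known to suspect node j.
    The check is made from the point of view of node `me`.
    """
    n = len(suspect_matrix)
    agreed = set()
    for j in range(n):
        # Voters: every node other than j that `me` does not itself suspect.
        voters = [i for i in range(n) if i != j and not suspect_matrix[me][i]]
        if voters and all(suspect_matrix[i][j] for i in voters):
            agreed.add(j)
    return agreed


# Four-node group in which node 2 has failed; its own row is stale.
matrix = [
    [False, False, True, False],   # node 0 suspects node 2
    [False, False, True, False],   # node 1 suspects node 2
    [False, False, False, False],  # node 2 (failed) reports nothing new
    [False, False, True, False],   # node 3 suspects node 2
]
failed = consensus_failures(matrix, me=0)
```

Here failed contains only node 2. If node 3 had not yet marked node 2 as suspect, the result would be empty and no notification would be sent until the remaining member agreed.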
Gossip-based failure detection has been demonstrated to be extremely resilient, efficient and scalable. It has been significantly extended to piggyback other system information along with liveness information, and the result is referred to as GEMS (Gossip-Enabled Monitoring Service). In addition to using the gossip-style failure detection service as a carrier for robust and responsive data dissemination, the service also uses the gossip heartbeat mechanism to maintain data consistency.
