Epidemiology, the branch of medicine that studies the causes, distribution, and control of disease in populations, has inspired the so-called ‘epidemic’ or ‘epidemiologic’ algorithms for reliable and scalable communication in large distributed systems. These algorithms were initially introduced for consistency management of replicated databases. More recently they have been used effectively as an alternative to traditional reliable broadcast and multicast protocols, and for resource monitoring.

The basic idea underlying epidemic protocols is that, in each time step, each node x in the system selects some other node y as a communication partner (by some predetermined rule) and exchanges information with y. Over time, the information spreads throughout the system in an epidemic fashion. Because this type of information dissemination closely resembles ‘rumor mongering’ or ‘gossiping’ among humans, epidemiologic algorithms have lately become better known as gossip-based algorithms. Their key advantages for large-scale distributed systems are scalability, responsiveness, and robustness.
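
The time-step behavior described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the GEMS implementation: the partner-selection rule is taken to be uniform random choice among the other nodes, the exchange is taken to be two-way (if either partner holds the information at the start of the step, both hold it afterwards), and the function name `gossip_rounds` and its parameters are hypothetical.

```python
import random


def gossip_rounds(n_nodes, seed=0, max_rounds=1000):
    """Count the time steps until every node holds the information.

    Assumptions (not taken from GEMS): partners are chosen uniformly at
    random, and an exchange leaves both partners informed if either one
    was informed at the start of the time step.
    """
    rng = random.Random(seed)
    informed = [False] * n_nodes
    informed[0] = True  # node 0 is the original source of the information
    rounds = 0
    while not all(informed) and rounds < max_rounds:
        snapshot = list(informed)  # state at the start of this time step
        for x in range(n_nodes):
            # Node x selects some other node y (never itself).
            y = rng.randrange(n_nodes - 1)
            if y >= x:
                y += 1
            # x and y exchange information.
            if snapshot[x] or snapshot[y]:
                informed[x] = True
                informed[y] = True
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, "nodes:", gossip_rounds(n, seed=1), "rounds")
```

Running the script for increasing system sizes gives a feel for the scalability claim: under this uniform random selection rule, the number of rounds needed is expected to grow roughly logarithmically with the number of nodes rather than linearly.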

The High-Performance Computing (HPC) group is responsible for the GEMS (Gossip-Enabled Monitoring Service) project and related HPC projects in the HCS Lab at the University of Florida. The GEMS project focuses on the development of key concepts and mechanisms in scalable failure detection, consensus, and resource performance monitoring and management for heterogeneous, distributed networks and systems and for related HPC applications. The HPC group addresses these issues for large-scale, heterogeneous clusters and grids, and works with the international iVDGL group at UF to adapt its methods for scalable resource health and performance monitoring to the needs of scientific data grids. The HPC group also works to provide high-performance parallel solutions to complex problems, including simulation of joint mechanics and undersea surveillance with sonar array technologies.

The GEMS concepts and mechanisms are now being applied to a wide array of research challenges. These challenges include failure detection and consensus in cluster, grid, and embedded systems, as well as performance monitoring of both node and network resources. New techniques are being developed for the monitoring of virtual and physical resources in virtual grids. Still others are under development for health and performance monitoring of broadly heterogeneous resources in reconfigurable clusters and grids. More information on these directions is accessible via the links in the table of contents on the left-hand side of this page.