
Epidemiology, the branch of medicine that studies the causes, distribution, and control of disease in populations, inspired the so-called ‘epidemic’ or ‘epidemiologic’ algorithms for reliable and scalable communication in large distributed systems. These algorithms were first introduced for consistency management of replicated databases; more recently they have been used effectively as an alternative to traditional reliable broadcast and multicast protocols and for resource monitoring.
The basic idea behind epidemic protocols is that, in each time step, each node x in the system selects some other node y as a communication partner (by some predetermined rule) and exchanges information with y. Over a period of time, the information spreads throughout the system in an epidemic fashion. Because this type of information dissemination closely resembles ‘rumor mongering’ or ‘gossiping’ among humans, epidemiologic algorithms have lately become better known as gossip-based algorithms. Their key advantages for large-scale distributed systems are scalability, responsiveness, and robustness.
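The round structure described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the GEMS implementation: the partner-selection rule is taken to be uniform random choice, rounds are synchronous, the "information" is a single rumor starting at node 0, and the name `simulate_gossip` and its parameters are hypothetical.

```python
import random


def simulate_gossip(num_nodes, seed=0, max_rounds=1000):
    """Simulate rumor spreading; return the informed-node count per round.

    Assumptions (not from the source): uniform random partner selection,
    synchronous rounds, and a single rumor that starts at node 0.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")

    rng = random.Random(seed)          # seeded for reproducible runs
    informed = [False] * num_nodes
    informed[0] = True                 # node 0 starts the rumor
    history = [1]                      # informed count before any round

    for _ in range(max_rounds):
        if all(informed):
            break
        # Decide exchanges from the state at the start of the round, so a
        # node that learns the rumor now cannot forward it until next round.
        snapshot = informed[:]
        for x in range(num_nodes):
            # Pick a partner y != x uniformly at random.
            y = rng.randrange(num_nodes - 1)
            if y >= x:
                y += 1
            # Exchange: if either side knew the rumor, both know it now.
            if snapshot[x] or snapshot[y]:
                informed[x] = True
                informed[y] = True
        history.append(sum(informed))

    return history


if __name__ == "__main__":
    counts = simulate_gossip(1000, seed=42)
    print("rounds needed:", len(counts) - 1)
    print("informed per round:", counts)
```

Running the script prints how many nodes know the rumor after each round. Under these assumptions the count typically grows quickly at first and then levels off as the last uninformed nodes are reached, which is the epidemic-style spread the text describes.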
The High-Performance Computing (HPC) group is responsible for the GEMS
(Gossip-Enabled Monitoring Service) project and related HPC projects in the
HCS Lab at the University of Florida. The GEMS project focuses on the
development of key
concepts and mechanisms in scalable failure detection, consensus, and resource
performance monitoring and management for heterogeneous, distributed networks
and systems and related HPC applications. The HPC group focuses on these issues
for large-scale, heterogeneous clusters and grids, and works with the
international iVDGL group at UF to adapt the
methods for scalable resource health and performance monitoring for the needs of
scientific data grids. The HPC group also works to provide high-performance
parallel solutions to complex problems including simulation of joint mechanics
and undersea surveillance with sonar array technologies.
The GEMS concepts and mechanisms are now being applied to a wide array
of research challenges. These challenges include failure detection and
consensus in cluster, grid, and embedded systems, as well as performance
monitoring of both node and network resources. New techniques are being
developed for the monitoring of virtual and physical resources in virtual
grids. Still others are under development for health and performance
monitoring of broadly heterogeneous resources in reconfigurable clusters
and grids.