HPC Group / Project Overview
Scalable and Dependable Applications and Infrastructure for
With the constantly increasing performance and decreasing cost of uniprocessor and multiprocessor servers and desktop computers, high-performance cluster systems such as Cplant at Sandia National Labs hold the potential to provide a formidable supercomputing platform for a variety of grand-challenge applications. However, for heterogeneous, high-performance clusters and grids to achieve this potential with parallel and distributed applications, a number of technical challenges must be overcome in scalable, fault-tolerant computing and efficient parallel algorithms. These challenges are made all the more complex by the increasing heterogeneity of network architectures, server architectures, operating systems, sharing strategies, storage subsystems, and related technologies. The group addresses these challenges through extensive research efforts spanning several projects.
One particular need for large-scale, heterogeneous clusters and grids based on open-systems strategies is the capability to achieve failure detection and consensus in a scalable fashion. Self-healing applications, perhaps employing checkpointing, process migration, and similar mechanisms, require such services so that jobs spanning many resources in a large-scale system can reach consensus on the state of the system as nodes dynamically leave and rejoin operation. Classical group communication protocols are unsuitable due to their inherent limits in scalability, and proprietary vendor solutions do not support the heterogeneous nature of such systems.
A particularly promising approach toward this goal leverages gossip communication. Among several options, random gossiping based on the pairwise exchange of liveness information between nodes has been investigated as a failure-detection approach for scalable heterogeneous systems. This interest stems from several advantages of gossip-style failure detectors. For example, gossiping is resilient and does not critically depend upon any single node or message. Moreover, such protocols make minimal assumptions about the characteristics of the networks and hosts, and hold the potential to scale with system size. The gossiping approach, when coupled with an effective strategy for system-wide consensus, has proven to be an efficient, effective, and scalable solution.
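The core idea can be sketched in a few lines of code. The following is a minimal, illustrative model of a gossip-style failure detector, not the project's actual protocol: timing constants, class and method names, and the suspicion rule are all assumptions for illustration. Each node keeps a heartbeat counter per peer, periodically gossips its heartbeat list to a random peer, and suspects any node whose heartbeat has not advanced for several rounds.

```python
import random

# Minimal sketch of a gossip-style failure detector (illustrative only;
# the constant below and all names are assumptions, not the project's
# actual protocol).

CLEANUP = 5  # gossip rounds without an update before a node is suspected

class GossipNode:
    def __init__(self, node_id, num_nodes):
        self.id = node_id
        # heartbeat counter per node; own entry incremented each round
        self.heartbeats = [0] * num_nodes
        # rounds elapsed since each node's heartbeat last advanced
        self.staleness = [0] * num_nodes

    def round(self, peers):
        """One gossip round: bump own heartbeat, age peers, gossip."""
        self.heartbeats[self.id] += 1
        for i in range(len(self.staleness)):
            if i != self.id:
                self.staleness[i] += 1
        target = random.choice([p for p in peers if p.id != self.id])
        target.receive(self.heartbeats)

    def receive(self, remote_heartbeats):
        """Merge a gossiped heartbeat list; reset staleness on progress."""
        for i, hb in enumerate(remote_heartbeats):
            if hb > self.heartbeats[i]:
                self.heartbeats[i] = hb
                self.staleness[i] = 0

    def suspects(self):
        """Nodes whose heartbeat has not advanced recently."""
        return [i for i, s in enumerate(self.staleness)
                if i != self.id and s > CLEANUP]
```

Because state spreads epidemically, no single message or node is critical: a failed node is eventually suspected by every live node once its heartbeat stops propagating, which matches the resilience property described above.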
Work began on the project in 1998 with the development and analysis of
new protocols and simulation models for failure detection and consensus
based on gossip schemes. Since 2000, the research has primarily been
experimental in nature, beginning with the first full implementation of
the failure detection and consensus service operating on Linux-based
computers via UDP packets and gossip daemons.
Recently, new concepts and protocol components associated with the gossip service for distributed failure detection and consensus have been developed. First, a multi-layered failure-detection service has been built to support gossip hierarchies of two, three, or more levels, targeting terascale systems, with additional features for dynamic node insertion and avoidance of potential deadlocks. Second, a new distributed resource monitoring service for clusters, GEMS (Gossip-Enabled Monitoring Service), has been developed. GEMS can serve higher-level services such as schedulers and load balancers as an efficient, low-overhead information provider. Work is underway on a user-friendly graphical interface for analyzing monitored data, along with facilities to store that data for prediction purposes.
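To illustrate how a GEMS-style monitor can piggyback resource information on liveness gossip, the sketch below shows a versioned per-node metrics cache that nodes merge when they exchange messages, letting a scheduler or load balancer query a locally cached system view. This is a hypothetical illustration: the class, field names, and versioning rule are assumptions, not GEMS internals.

```python
# Illustrative sketch of a GEMS-style resource monitor: gossip messages
# carry per-node resource metrics alongside liveness data, so higher-level
# services (schedulers, load balancers) can query a local system view.
# All names and the last-writer-wins merge rule are assumptions.

class ResourceMonitor:
    def __init__(self):
        # node_id -> (version, metrics dict); higher version is newer
        self.view = {}

    def update_local(self, node_id, metrics):
        """Record fresh metrics for a node, bumping its version."""
        version = self.view.get(node_id, (0, {}))[0] + 1
        self.view[node_id] = (version, dict(metrics))

    def merge(self, remote_view):
        """Keep the newest version of each node's metrics."""
        for node_id, (version, metrics) in remote_view.items():
            if version > self.view.get(node_id, (0, {}))[0]:
                self.view[node_id] = (version, dict(metrics))

    def least_loaded(self):
        """Example consumer query: node with the lowest reported CPU load."""
        return min(self.view, key=lambda n: self.view[n][1]["cpu_load"])
```

Since the metrics ride on messages the failure detector already sends, the monitoring overhead stays low, which is the property that makes such a service attractive as an information provider for higher-level services.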
The most recent focus of this research is the development of new concepts for the gossip service for distributed failure detection and resource monitoring in several key directions. First, protocol extensions are being designed, developed, and analyzed to extend the monitoring service to grids. Second, the GEMS service is being extended to interoperate with and support other services such as MonALISA (a monitoring service), Sphinx (a grid scheduler built at UF), and In-VIGO (grid computing middleware). Third, additional features are being investigated and developed to optimize the monitoring methods and the data monitored. Finally, a preliminary investigation is underway into promising ways of exploiting the principles of the gossip service in new directions of key relevance to the mission at Sandia, such as scheduling, management, load balancing, and support for middleware like MPI.
More information on GEMS can be found here.
OTHER HPC Group PROJECTS
Reconfigurable Computing (RC) Hardware Empowered Grid Computing
Parallel and Distributed Computing for Fault-tolerant Sonar Arrays
Computational Framework for Simulating Joint Mechanics