MS Group / Project Overview
Modeling and Simulation for Performance and Tradeoff Analysis in High-Performance Computing and Networking
As processing power and networking capabilities increase, so do the demands of the applications that push these technologies to their limits. To satisfy the nearly insatiable need for more performance, applications are parallelized and executed on parallel machines such as clusters, which are collections of computers connected by conventional or, in many cases, high-speed networks. The potential heterogeneity of high-performance systems raises many issues that affect the efficiency of hardware usage, the execution time of a job, and the overhead involved in running a parallel or distributed program. Finding the optimal combination of hardware components, software components, and task distribution for a specific application by experiment alone is impractical and often nearly impossible.
Simulation tools therefore offer an affordable and effective means of determining system configurations for particular applications. Researchers and industry currently use these tools to study high-level aspects of systems such as resource allocation, network congestion, and job completion time. Many of these tools sacrifice fidelity for speed or vice versa, and this trade-off raises several questions when developing a mission-level simulation environment. What level of fidelity should be used to model the components of a system so that simulation times stay reasonable while the system is still portrayed accurately? Is it feasible to use high-fidelity models for some components while sacrificing accuracy in others to speed up simulations? Which aspects of a system matter most in determining the best configuration on which to run an application?
Research conducted by members of the MS Group, specifically work on the Fast and Accurate Simulation Environment (FASE), aims to answer these questions while providing an environment for simulating high-performance systems that run mission-critical applications.
The ultimate goal of FASE is to provide a robust simulation environment that allows a user to create customized systems of any size and topology from a variety of components, such as clusters composed of symmetric multiprocessors, reconfigurable devices, and high-speed interconnects.
The first iteration of FASE focuses on the design and implementation of the key elements involved in executing a parallel program. In general, a parallel application's execution time can be broken down into computation and communication. Computation is abstracted through simple timing functions that are activated and deactivated between communication events. The measured times are then scaled to model other computational units. Communication, by contrast, is modeled at a higher fidelity. Parameters such as source, destination, and message size are collected from each communication event during the application's execution. Currently, the communication events must come from a selected subset of the MPI or SHMEM library functions; future iterations will support more of the MPI and SHMEM functions and will add UPC support.
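The split between timed computation and traced communication described above can be illustrated with a small sketch. It is not FASE code: the event classes, the `speed_ratio` scaling factor, and the latency-plus-bandwidth transfer model are all assumptions chosen for illustration, and the transfer model is far simpler than the network models FASE uses.

```python
from dataclasses import dataclass


@dataclass
class ComputeEvent:
    """Time measured between two communication events (hypothetical record)."""
    rank: int
    measured_seconds: float


@dataclass
class CommEvent:
    """Parameters collected from one communication call (hypothetical record)."""
    source: int
    destination: int
    message_bytes: int


def scaled_compute_time(measured_seconds, speed_ratio):
    # Assumption: the target processor is `speed_ratio` times faster than
    # the machine on which the timings were measured.
    return measured_seconds / speed_ratio


def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    # Assumption: a simple latency + size/bandwidth model stands in for
    # a detailed interconnect model.
    return latency_s + message_bytes / bandwidth_bytes_per_s


def replay(trace, speed_ratio, latency_s, bandwidth_bytes_per_s):
    """Sum the estimated time of a sequential trace of events."""
    total = 0.0
    for event in trace:
        if isinstance(event, ComputeEvent):
            total += scaled_compute_time(event.measured_seconds, speed_ratio)
        else:
            total += transfer_time(
                event.message_bytes, latency_s, bandwidth_bytes_per_s
            )
    return total


if __name__ == "__main__":
    trace = [
        ComputeEvent(rank=0, measured_seconds=2.0),
        CommEvent(source=0, destination=1, message_bytes=1_000_000),
        ComputeEvent(rank=0, measured_seconds=1.0),
    ]
    estimate = replay(trace, speed_ratio=2.0, latency_s=0.001,
                      bandwidth_bytes_per_s=1e9)
    print(f"estimated time: {estimate:.4f} s")
```

The sketch treats the trace as a single sequential stream. A real replay would track each rank separately and let the network model decide when each message is delivered.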
These events drive network models that accurately portray the actual interconnect. The current network library includes models of InfiniBand, RapidIO, Scalable Coherent Interface (SCI), Ethernet, and TCP/IP. These models have been developed using the simulation tool Mission-Level Designer (MLD), which provides the foundation for FASE. MLD is a commercial, block-oriented, discrete-event simulator based on C++ that allows systems to be modeled at a wide range of fidelity levels. Future iterations of FASE will focus on advancing the modeling and simulation of the computation in an application. This work will consist of identifying pre-simulation methods that capture relevant details of the computational blocks in an architecture-independent manner, as well as simulation methodologies that use the pre-simulation information to accurately represent the results produced by the virtual processor(s).
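For readers unfamiliar with discrete-event simulation, the following toy loop shows how communication events can drive a network model. It does not use or imitate MLD: the function `simulate_link`, its parameters, and the single shared link are hypothetical, chosen only to show events being processed in timestamp order and scheduling later events.

```python
import heapq

SEND, DELIVER = 0, 1  # event kinds; SEND sorts first on timestamp ties


def simulate_link(messages, latency_s, bandwidth_bytes_per_s):
    """Toy discrete-event model of one shared link (illustrative only).

    messages: list of (send_time_s, message_bytes) tuples.
    Returns the delivery time of each message, in input order.
    """
    queue = []
    for index, (send_time, size) in enumerate(messages):
        heapq.heappush(queue, (send_time, SEND, index, size))

    link_free_at = 0.0   # time at which the link finishes its current message
    delivered = {}

    while queue:
        now, kind, index, size = heapq.heappop(queue)
        if kind == SEND:
            # The message waits if the link is still busy, then occupies
            # the link for its serialization time.
            start = max(now, link_free_at)
            link_free_at = start + size / bandwidth_bytes_per_s
            # Schedule the matching delivery event after propagation latency.
            heapq.heappush(queue, (link_free_at + latency_s, DELIVER, index, size))
        else:
            delivered[index] = now

    return [delivered[i] for i in range(len(messages))]


if __name__ == "__main__":
    times = simulate_link([(0.0, 1000), (0.0, 1000)],
                          latency_s=1.0, bandwidth_bytes_per_s=1000)
    print(times)
```

Because simulated time jumps from one event to the next, the cost of a run depends on the number of events rather than on the simulated duration. This is one reason the level of detail in each model has such a direct effect on simulation speed.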
The primary focus of the MS Group is modeling and simulation, with an emphasis on the balance between fidelity and simulation speed, to support performance analysis and projections for clusters operating in a mission-critical, scientific computing environment. In addition to broad goals within this focus, our research consists of simulative investigations organized around three key themes.
The first theme focuses on virtual prototyping of advanced space system architectures based on RapidIO, through work sponsored by Honeywell Space Systems. This work uses computer-based simulation to investigate how best to develop such architectures. In addition to the vast design space of node and board architectures in such an embedded processing system, the RapidIO design space itself offers many options, such as link speed, link width, and level of flow control. This project uses simulation to gain a thorough understanding of the tradeoffs associated with these design options when supporting several space-based radar algorithms on a space-based cluster of embedded processing nodes. Most recently, we have been studying issues related to fault tolerance in RapidIO-based space systems. Previous space-based networks mostly relied on bus technology, and the potential for a bus failure required a completely separate backup network. Switched RapidIO networks open up the fault-tolerant design space, with many potential avenues to explore.
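The design-space exploration described above can be pictured as a sweep over combinations of options. The sketch below only enumerates configurations and computes a raw per-link signaling rate. The listed speeds, widths, and flow-control labels are illustrative placeholders, not values taken from this project, and in a real study each configuration would be passed to a simulation model rather than to a one-line formula.

```python
from itertools import product

# Illustrative placeholder values; not taken from the project itself.
LINK_SPEEDS_GBAUD = [1.25, 2.5, 3.125]
LINK_WIDTHS_LANES = [1, 4]
FLOW_CONTROL_LEVELS = ["level_a", "level_b"]  # hypothetical labels


def raw_link_rate_gbaud(speed_gbaud, lanes):
    # Raw signaling rate only; encoding and protocol overhead are ignored.
    return speed_gbaud * lanes


def enumerate_design_space():
    """Return one dictionary per combination of design options."""
    configs = []
    for speed, lanes, flow_control in product(
        LINK_SPEEDS_GBAUD, LINK_WIDTHS_LANES, FLOW_CONTROL_LEVELS
    ):
        configs.append({
            "link_speed_gbaud": speed,
            "lanes": lanes,
            "flow_control": flow_control,
            "raw_rate_gbaud": raw_link_rate_gbaud(speed, lanes),
        })
    return configs


if __name__ == "__main__":
    for config in enumerate_design_space():
        print(config)
```

Even this small example yields twelve configurations. Adding node and board architecture options multiplies the count quickly, which is why simulation speed matters as much as fidelity in such studies.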
Another theme deals with the modeling and analysis of new forms of optical avionics networks using Wavelength Division Multiplexing (WDM). We are collaborating with the Naval Air Systems Command (NAVAIR) to support work towards a new standard for optical LANs in aircraft. This project builds on previous work performed in the Optical Networking (OPN) Group of the HCS Lab. The OPN Group originally worked with Rockwell Collins to produce an optical component library (LION, the Library for Integrated Optical Networking), which we will use and enhance. This library will enable us to create a variety of complex optical networks and to evaluate the trade-offs involved in employing different topologies, upper-layer protocols, and network components.
Yet another theme focuses on network, architecture, and system simulations for large-scale data grids related to the iVDGL project. The MS Group is active in research on simulating and analyzing the end-to-end performance of long-haul grid networks and the systems they interconnect for data-intensive scientific computing. A significant challenge is that elements of the network and system architectures, from storage systems to network control and resource-sharing algorithms, may not scale to this regime. The goal of this activity is to understand these limits and to formulate potential solutions for achieving end-to-end performance. To that end, we use carefully crafted simulation models that balance fidelity and speed, with an emphasis on computer engineering issues at and below the transport layer.
In summary, the MS Group focuses on several related but distinct research topics. The first is FASE, which aims to provide a modeling environment in which large, heterogeneous clusters can be simulated effectively for critical applications. The key challenges are first to understand the limits of the diverse and constantly evolving technologies involved, from storage devices to wide-area optical networks, and then to achieve an efficient and effective balance between simulation fidelity and speed, so that simulative research yields insight in a timely manner with reasonable accuracy. Another key topic is the design and analysis of space-based systems using the RapidIO interconnect. Simulation is used to explore design tradeoffs for RapidIO-based systems executing space-based radar and other cutting-edge applications. In addition, we are studying fault-tolerance issues related to RapidIO networks in space. A third activity is the design and analysis of optical avionics networks for military networking applications. The key challenge for this project is to configure the topology, network characteristics, and upper-layer protocols of an optical network for the data traffic patterns and environmental conditions found in military aircraft. The grid-level simulation research is a fourth topic of interest in the group. This work parallels the FASE project, although it deals with systems and datasets on a much larger scale. As with FASE, the main difficulty will be balancing simulation speed and fidelity while producing accurate results, but this research will also incorporate other factors such as resource management and technological limitations at all layers of the communication stack.