This Flash video shows our prototype support for GPSHMEM in TAU and KOJAK in action. It was run on Adam's single-processor Linux workstation named "lazyboy". Please be patient, as the video is very large and may take a long time to load. Also note that you need the Macromedia Flash v7.0 plugin installed (you can get it from here). This demo was recorded at 1024x768 resolution and is best viewed at a resolution larger than that.
For more information on TAU and KOJAK, see their respective evaluations: TAU, KOJAK, Jumpshot.
The sequence of steps shown by the demo is as follows:
The demo starts off in an xterm. All relevant software (KOJAK, TAU, Jumpshot, GPSHMEM, and ARMCI) has already been installed in the ~/work/usr directory. The ~/work/code directory contains a Makefile setup and several example GPSHMEM codes from the GPSHMEM distribution, along with the TAU and KOJAK GPSHMEM wrapper code. The application used in this demo is the fox.c program that comes distributed with GPSHMEM itself as an example program. It uses Fox's algorithm to multiply two square matrices. The fox.c code used in the demo is almost entirely unmodified, with the exception that a tiny amount of instrumentation (approx. 3 lines) was needed to get it to work with KOJAK.
After a few ls commands, you'll see the Makefile being opened in vim. It is a standard Makefile, with CC, CFLAGS, and other variables set by the included Makefiles (Makefile.tau, Makefile.tautrace, and Makefile.kojak). After a quick view of the Makefile, you will see Makefile.tau in vim.
After make is run, you will see the compilation and linking process for TAU in profile mode as specified in the Makefile.
After the compilation is done, you will see the fox executable being run. Since GPSHMEM uses MPI, the executable must be run with mpirun. Note that because it is being run with 4 processes on a single-processor machine, the application runs slowly. Also, since the machine is simultaneously recording video, the timing between runs fluctuates, so the overall times shown in the demo do not give a good indication of the relative overhead of each performance analysis tool.
Using TAU in profile mode results in several .profile files being created after the program is done running. This demo will show TAU's utilities for showing the information contained in those files. First, the pprof command is run and some text output is displayed showing the relative amount of time taken by overall application code and GPSHMEM functions. Then, the paraprof command is run, which shows the same information graphically.
Next, the demo will show TAU being run in trace mode. Steps 2-4 (viewing the Makefile, building with make, and running fox with mpirun) are repeated, except that the Makefile uses Makefile.tautrace instead of Makefile.tau. After the fox program is run, several .trc files are created. These are merged and exported to SLOG-2 format, and then displayed with Jumpshot-4. Browsing the SLOG-2 file with Jumpshot, we see that the application does not spend much time in communication, and we also see where each GPSHMEM call occurs on the timeline. Note that since we are actually running this on a single-processor machine, some of the timings between function calls (i.e., the GPSHMEM barrier call) seem a bit off. This can be attributed to the context switches that occur because the 4 processes timeshare the single CPU.
Finally, the demo will show fox being used with KOJAK. Steps 2-4 (viewing the Makefile, building with make, and running fox with mpirun) are repeated once more, this time with the Makefile using Makefile.kojak. After the fox program is run, the kanal program is used to analyze the resulting EPILOG trace, and the results are then shown in the CUBE GUI. Since GPSHMEM uses some MPI calls internally (such as in its initialization code), CUBE shows MPI calls as children of GPSHMEM calls in the center window pane. KOJAK identifies a few "Late Sender" and "Wait at Barrier" bottlenecks, and these are displayed by CUBE. CUBE also shows that for this run, the overall execution time was dominated by computation and not by GPSHMEM calls.
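For readers unfamiliar with the algorithm that fox.c uses, the following Python sketch simulates Fox's algorithm serially on a q x q grid of matrix blocks. It is an illustration only and is not the fox.c code from the GPSHMEM distribution. All function names (fox_multiply, split_blocks, and so on) are made up for this sketch, and the row broadcast and upward shift that a real parallel run would perform with communication calls are replaced here by plain list operations.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add_into(c, d):
    """Accumulate matrix d into matrix c in place."""
    for i, row in enumerate(d):
        for j, value in enumerate(row):
            c[i][j] += value


def split_blocks(m, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    s = len(m) // q
    return [[[row[bj * s:(bj + 1) * s] for row in m[bi * s:(bi + 1) * s]]
             for bj in range(q)] for bi in range(q)]


def join_blocks(blocks):
    """Reassemble a q x q grid of blocks into one matrix."""
    q, s = len(blocks), len(blocks[0][0])
    return [sum((blocks[bi][bj][r] for bj in range(q)), [])
            for bi in range(q) for r in range(s)]


def fox_multiply(a, b, q):
    """Multiply square matrices a and b, simulating a q x q process grid."""
    n = len(a)
    if n % q != 0:
        raise ValueError("matrix size must be divisible by the grid size q")
    s = n // q
    a_blk = split_blocks(a, q)
    b_blk = split_blocks(b, q)  # b_blk[i][j] is what "process" (i, j) holds
    c_blk = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]

    for stage in range(q):
        for i in range(q):
            # In grid row i, the owner of column (i + stage) mod q
            # "broadcasts" its A block to every process in that row.
            a_bcast = a_blk[i][(i + stage) % q]
            for j in range(q):
                add_into(c_blk[i][j], matmul(a_bcast, b_blk[i][j]))
        # Every process passes its B block to the neighbour above it
        # (with wrap-around), so (i, j) now holds the block from (i+1, j).
        b_blk = [[b_blk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return join_blocks(c_blk)
```

With 4 processes, as in the demo run, the grid would be 2 x 2 (q = 2), so the algorithm takes two broadcast, multiply, and shift stages. In this sketch, calling fox_multiply(a, b, 2) on two 4 x 4 matrices should give the same result as matmul(a, b).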
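To make the two bottleneck types reported in the KOJAK step more concrete, here is a small Python sketch of how their waiting times are commonly defined. It assumes the usual meaning of the patterns: at a barrier, a process waits from the moment it enters until the last process arrives, and in a late-sender situation, a receiver waits from the moment it posts its receive until the matching send starts. The function names and timestamps are hypothetical, and this is not how kanal itself is implemented.

```python
def wait_at_barrier(enter_times):
    """Per-process waiting time at one barrier.

    enter_times[i] is the (hypothetical) time at which process i entered
    the barrier. Nobody can leave before the last process has arrived,
    so each process waits for the difference to the latest arrival.
    """
    last_arrival = max(enter_times)
    return [last_arrival - t for t in enter_times]


def late_sender_wait(recv_enter, send_enter):
    """Time a receiver spends blocked because the send started late.

    If the send started before the receive was posted, the receiver
    does not wait for the sender and the result is zero.
    """
    return max(0, send_enter - recv_enter)


# Four processes reach a barrier at different (made-up) times.
barrier_waits = wait_at_barrier([10, 40, 25, 40])

# A receive posted at t=5 whose matching send only starts at t=12.
blocked_time = late_sender_wait(5, 12)
```

On a single-processor machine such as the one used in the demo, the 4 processes cannot all be running at once, so some waiting time of this kind is to be expected regardless of how the application is written.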