UPC oddities

This page is an attempt to catalogue the strange things I have seen with UPC.

-T or -fthreads compilation option

Both UPC compilers let you specify the number of threads at either compile time or run time. The catch with waiting until run time (to avoid recompiling) is that your shared arrays have to be declared with sizes that are multiples of THREADS, meaning that

	shared int A[100];
needs to be written as
	shared int A[THREADS * 100];
This becomes annoying, especially because the compiler will complain if THREADS appears more than once in an array declaration.

This behavior is actually listed in the UPC language spec (section 6.4.3.2). I suppose the idea is to let the compiler figure out affinity for you when you choose the number of threads to use at runtime. I wasn't able to get the best performance when the compiler chose this for me, so I've always stuck to specifying the number of threads at compile time and manually figuring out the best affinity to use for arrays.

Gasnet MPI conduit

Don't use it. I had some strange problems with a lock used like this:

	int j;
	upc_forall(j = 0; j < ABIGNUMBER; j++; j)
	{
		if (something true with low probability)
		{
			upc_lock(alock);
			// do something
			upc_unlock(alock);
		}
	}
Acquiring the lock would sometimes take over 5 seconds (!!), even with only 6 or 7 threads. Throughput also isn't that great compared to the VAPI conduit, so avoid the MPI conduit at all costs.

pthreads and VAPI

I ran into a strange situation where compiling using the VAPI conduit in conjunction with the pthreads library slowed down portions of code that were purely doing math computations. I haven't been able to narrow down what triggers this condition, but it definitely is present in the camel cryptanalysis program I used in my independent study (code here in the camel subdirectory).

My advice is to use the pthreads library when possible, because in most cases it does result in slightly faster code. However, use some timing code (gettimeofday or something better) to make sure the parallel version of the code isn't being slowed down by linking with the pthreads library.

Again, this is a weird thing that I haven't been able to fully explain, but you should keep an eye out for it or it will screw up your speedup calculations.

Marvel weirdness in execution times

links to code below: timer.h test.upc test.c

Update 1/11/2005 -- This weirdness is probably due to extraneous processes running for the Tru64 cluster management software. See "Identifying and Eliminating the Performance Variability on ASCI Q" by Fabrizio Petrini et al.

Marvel has been giving strange execution times lately, even with code that does not use any communication at all. For example, the following UPC program:

	#include "upc.h"
	#include <math.h>
	#include <stdio.h>
	#include "timer.h"
	
	int main()
	{
		long long s = get_cycles();
		double a = 44.4;
		/* pure computation: no shared data, no communication */
		for (int i = 0; i < 10000000; i++)
		{
			a = sin(a + i * 3.14/2 - 33.0);
		}
		long long e = get_cycles();
		printf("%f\n", a);
		printf("Time :%f\n", cycles_to_ns(e - s) / 1e9);
		return 0;
	}
Given that it has no communication at all, each thread should take nearly the same time to execute. However, running it four times gave the following output:
	marvel-1.hcs.ufl.edu> upcrun -n4 ./a.out
	-0.281251
	Time :1.901068
	-0.281251
	Time :1.898971
	-0.281251
	Time :2.694840
	-0.281251
	Time :2.623537
	marvel-1.hcs.ufl.edu> upcrun -n4 ./a.out
	-0.281251
	Time :1.876951
	-0.281251
	Time :1.886389
	-0.281251
	Time :1.877999
	-0.281251
	Time :1.879048
	marvel-1.hcs.ufl.edu> upcrun -n4 ./a.out
	-0.281251
	Time :1.874854
	-0.281251
	Time :1.875902
	-0.281251
	Time :1.885339
	-0.281251
	Time :1.876951
	marvel-1.hcs.ufl.edu> upcrun -n4 ./a.out
	-0.281251
	Time :1.876951
	-0.281251
	Time :2.518680
	-0.281251
	Time :1.889534
	-0.281251
	Time :2.564816
	marvel-1.hcs.ufl.edu>
Notice that in two of the four runs, all four threads finished in about 1.9 seconds (measured with gettimeofday), while in the other two runs, two of the threads took 2.5 to 2.7 seconds. I have no explanation for this yet. I tried disabling all services running on Marvel, but that had no effect. For now, I have just been throwing away the "bad" runs.

One possible problem is that one of Marvel's memory banks has slightly less memory available to it than the other banks. However, this program does not use that much memory, so I don't see how that could affect anything. I have tried messing around with different versions of the HP/Compaq UPC runtime and compiler and I still get the same thing.

What's strange is that compiling the silly program above with gcc gives consistent execution times from run to run, and running with one UPC thread also has no problems. I will try to figure this one out sometime, but for now you should be aware of this odd behavior.