Parallel SpGEMM
From Cs240aproject
This is the Spring 2007 project of Aydin Buluc and Fenglin Liao.
Download Final Paper from here (pdf format)(ps format)
Download Final Presentation from here (ppt format)
Download Progress report from here (ps format)
Project Description
The project is to write a 2D parallel implementation of the sparse matrix times sparse matrix routine. It will be implemented in C++ using MPI. The project will have multiple iterations:
1) The simplest decomposition is a regular block checkerboard partitioning, with a SUMMA-like algorithm on top of it. Here we assume the existence of a fast sequential algorithm for sparse matrix multiplication that is used as a black box.
2) The second way to decompose the data is to try assigning an equal number of edges to each processor. The initial idea is to map each nonzero of the matrix to a point in the rectangular (m x n) region and use a quadtree [1] data structure. Here one may or may not use a sequential matrix multiplication as a black box [yet to be explored].
3) The third and the hardest way is to assign edges to processors such that the following metric is minimized.
min max_i ( communication(i) + computation(i) )
In other words, we try to balance the load across the processors. Load balancing is generally done using graph partitioning [2] [3], but it is unclear how to do it for parallel SpGEMM. Whether this load balancing will be done statically or dynamically is yet to be determined.
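Iteration 1 above treats the sequential kernel as a black box. As a concrete reference point, here is a minimal sketch of what such a kernel might look like: a Gustavson-style row-by-row SpGEMM over a simple CSR type. The `Csr` struct and `spgemm` name are illustrative, not the project's actual data structures.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Minimal CSR sparse matrix; illustrative only, not the project's actual type.
struct Csr {
    int rows, cols;
    std::vector<int> rowptr;    // size rows+1
    std::vector<int> colind;    // column index per nonzero
    std::vector<double> val;    // value per nonzero
};

// Gustavson's row-by-row SpGEMM: C = A * B. This is the kind of fast
// sequential kernel the SUMMA-like outer algorithm treats as a black box.
Csr spgemm(const Csr& A, const Csr& B) {
    assert(A.cols == B.rows);
    Csr C{A.rows, B.cols, {0}, {}, {}};
    for (int i = 0; i < A.rows; ++i) {
        std::map<int, double> row;   // sparse accumulator for row i of C
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
            int j = A.colind[k];
            double a = A.val[k];
            for (int t = B.rowptr[j]; t < B.rowptr[j + 1]; ++t)
                row[B.colind[t]] += a * B.val[t];
        }
        for (auto& [col, v] : row) {
            C.colind.push_back(col);
            C.val.push_back(v);
        }
        C.rowptr.push_back((int)C.colind.size());
    }
    return C;
}
```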
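For iteration 2, one common way to group nonzeros by quadtree quadrant is a Z-order (Morton) key: interleaving the bits of a nonzero's (row, col) coordinates gives a key whose prefix identifies its quadrant at every level of the tree, so sorting nonzeros by key and cutting the sorted list into equal chunks assigns roughly equal numbers of edges per processor. This is a hypothetical sketch; the project's quadtree may differ.

```cpp
#include <cstdint>

// Interleave the bits of (row, col) into a Z-order (Morton) key.
// Bit b of row lands at key bit 2b+1, bit b of col at key bit 2b,
// so consecutive keys walk the matrix quadrant by quadrant.
uint64_t mortonKey(uint32_t row, uint32_t col) {
    uint64_t key = 0;
    for (int b = 0; b < 32; ++b) {
        key |= (uint64_t)((row >> b) & 1) << (2 * b + 1);
        key |= (uint64_t)((col >> b) & 1) << (2 * b);
    }
    return key;
}
```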
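The min-max metric of iteration 3 can be made concrete: for any fixed assignment of edges to processors, the quantity being minimized is the load of the busiest processor, since the parallel finishing time is governed by it. A small illustrative helper that evaluates that bottleneck cost (names are assumptions, not project code):

```cpp
#include <algorithm>
#include <vector>

// For a fixed assignment, the objective value is the maximum over
// processors of communication(i) + computation(i); partitioning then
// searches for the assignment that minimizes this value.
double bottleneckCost(const std::vector<double>& communication,
                      const std::vector<double>& computation) {
    double worst = 0.0;
    for (size_t i = 0; i < communication.size(); ++i)
        worst = std::max(worst, communication[i] + computation[i]);
    return worst;
}
```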
Resources
Boost.MPI [4] Note that the data pointed to by pointers in a class is transmitted automatically due to the 4th requirement of the serialization library, which is "Deep pointer save and restore. That is, save and restore of pointers saves and restores the data pointed to"
Boost Build v2 [5]
Serialization of Derived Pointers
SDSC Resources [6]
Caveats
1) When mixing MPI with threads, you will be oversubscribing the nodes (running more runnable threads than processors) without OpenMPI figuring this out. For example, if each of your processes has 3 threads, 2 of which issue MPI_SEND/MPI_RECV operations, you would want a thread to yield() when it blocks on an MPI_RECV or MPI_SEND. However, by default, MPI runs in aggressive mode when the number of slots = the number of mpi_processes [7].
So, you want to set mpi_yield_when_idle to 1 manually when oversubscribing implicitly:
mpirun --mca mpi_yield_when_idle 1 -np 16 ./testpar
Warning: This may degrade the performance when processors = threads ! [not always though]
mpi_yield_when_idle 1 --> 4.13 sec
mpi_yield_when_idle 0 --> 2.95 sec
2) OpenMPI installed on Neumann has the following default:
ompi_info | grep Thread
Thread support: posix (mpi: no, progress: no)
Meaning that the build doesn't support MPI_THREAD_MULTIPLE !
Solution: OpenMPI has thread support disabled by default. You have to re-build OpenMPI from source and pass '--enable-mpi-threads' to 'configure'.
However, OpenMPI's thread support is only lightly tested and there are many complaints about it on the web such as "it hangs". Therefore, I am installing MPICH2 for my own sake using:
./mpich2-1.0.6/configure -prefix=/usr/local/bin/mpich2 --enable-threads=multiple --with-thread-package
Now that the system has two different MPI implementations (thus two different mpirun, mpicc, etc.), we need to tell the system which one we want to use by:
export PATH=/usr/local/bin/mpich2/bin:$PATH --> this will give precedence to MPICH2
However, MPICH system has the following problem (http://www.mpi-softtech.com/article.php?id=r1037051037):
"MPICH and its derivatives rely on polling for synchronization and notification, which leaves little space for exploiting programming techniques that would achieve computation and communication overlapping and optimal resource utilization [13]. Although polling can ultimately achieve the lowest message passing latency, it wastes CPU cycles for spinning on notification flags. A user thread that polls for synchronization does not yield the CPU until this thread is de-scheduled by the OS, thus reducing the CPU time for useful computation. Latency is generally not considered as a realistic measure of performance for applications that can overlap computation and communication."
Again, this might not be true either. MPICH2 installation guide says:
"build the channel of your choice. The options are sock, shm, and ssm. The shm channel is for small numbers of processes that will run on a single machine using shared memory. The shm channel should not be used for more than about 8 processes. The ssm (sock shared memory) channel is for clusters of smp nodes. This channel should not be used if you plan to over-subscribe the CPUs. If you plan on launching more processes than you have processors you should use the default sock channel. The ssm channel uses a polling progress engine that can perform poorly when multiple processes compete for individual processors."
This sounds as if sock doesn't use a polling progress engine?
3) Forcing the processes to be launched on a subset of processors only:
mpirun -np 16 taskset -c 12-15 ./testpar --> 100 sec.
This might also be helpful in the case of processor affinity.
So, combine the two: "mpirun -np 16 --mca mpi_yield_when_idle 1 taskset -c 12-15 ./testpar " --> 7.7 sec. only
MPICH2's MPD's solution to the problem: "If you tell the mpd how many cpus it has to work with by using the --ncpus argument, as in
mpd --ncpus=2
then the number of processes started the first time the startup message circles the ring will be determined by this argument."
4) Check to see the compiler options for mpic++ by:
mpic++ --showme:compile
mpic++ --showme:link
5) Difference between boost::timer and boost::mpi::timer
boost::mpi::timer is a thin wrapper around MPI_Wtime, so it provides much finer-grained timings (around 5 digits of precision), whereas boost::timer only provides 2 digits of precision, such as "0.04 sec".
6) Using gprof2dot to get call graphs of your program:
- Compile with -pg option
- Run that executable with whatever input you want (./exp p256), that'll create the gmon.out file
- gprof expnogas > explite.txt
- python gprof2dot.py -e 0.00 -n 0.00 explite.txt > gvexplite (includes all the nodes and edges)
- python xdot.py gvexplite
Regular Installation of Boost with MPICH2
0) If you have another Boost installation that you want to keep, first remove the contents of "boost/bin.v2/libs"
1) Add the following lines to $(HOME)/user-config.jam:
using gcc : : : <cxxflags>-DMPICH_IGNORE_CXX_SEEK ;
using mpi : $(PATH_MPICH2)/mpicxx ;
2) Then execute the following command from the top boost directory:
sudo ./bjam --with-mpi --with-thread
3) Finally the installation:
sudo ./bjam --prefix=/usr/local/boostmpich2 --with-mpi --with-thread install
4) If you still insist on keeping the old MPI installation, then either explicitly point to the new mpicxx/mpirun on each call, or put the following line in your .bashrc file so that the new mpicxx/mpirun is found during calls:
export PATH=/usr/local/bin/mpich2/bin:$PATH
Threading Issues
Data copying from SparseDColumn<T> ** M[i][j] to local matrix by each thread takes the following times:
Without Hoard. Using kinner=i (no contention avoidance) :
Data copy took 0.134991 seconds
Data copy took 0.074193 seconds
Data copy took 0.083097 seconds
Data copy took 0.088695 seconds
For a total of 0.379 seconds.
Without Hoard. Using kinner= (i + ((dimx+dimy)% gridy)) % gridy (contention avoidance) :
Data copy took 0.090011 seconds
Data copy took 0.059552 seconds
Data copy took 0.069522 seconds
Data copy took 0.072168 seconds
For a total of 0.290 seconds.
With Hoard. Using kinner=i (no contention avoidance) :
Data copy took 0.217126 seconds
Data copy took 0.125259 seconds
Data copy took 0.142138 seconds
Data copy took 0.175438 seconds
For a total of 0.659 seconds
With Hoard. Using kinner= (i + ((dimx+dimy)% gridy)) % gridy (contention avoidance) :
Data copy took 0.217977 seconds
Data copy took 0.054324 seconds
Data copy took 0.124619 seconds
Data copy took 0.107576 seconds
For a total of 0.502 seconds
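The contention-avoidance variant above staggers each thread's starting block index so that threads with different grid coordinates do not all touch the same block of M at the same moment. A small sketch of that index computation, with variable names taken from the formula in the timings above:

```cpp
// kinner = (i + ((dimx + dimy) % gridy)) % gridy
// Each process/thread offsets its iteration order by a value derived
// from its grid coordinates (dimx, dimy), so simultaneous copies are
// spread across different blocks instead of colliding on block 0.
int staggeredIndex(int i, int dimx, int dimy, int gridy) {
    return (i + ((dimx + dimy) % gridy)) % gridy;
}
```

With gridy = 4, a thread at (dimx, dimy) = (1, 2) visits blocks 3, 0, 1, 2 while a thread at (0, 0) visits 0, 1, 2, 3, so their first touches differ.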
Building GASNET C++ Clients
Change the .../smp-conduit/smp-par.mak file (and any .../xxx-conduit/xxx-par.mak file you want to use):
- In two places, there is /usr/bin/gcc; change it to g++
- In one place, there is -lgcc; change it to -lstdc++
- Remove the -Winline flag used in compilation.
Starting them:
> export PATH=$PATH:/home/aydin/localinstall/gasnet/bin
> amudprun -spawn 'L' -np 4 ./comp input1 input2_1
Checking memory errors in them:
> export GASNET_SPAWNFN='L'
> valgrind --trace-children=yes ./testgas 4
To see the details:
> GASNET_VERBOSEENV=1
LONESTAR (VAPI/IBV) Details:
> GASNET_VAPI_SPAWNER (set to "mpi" or "ssh") can override the value set at configuration time.
> GASNET_TRACEFILE - specifies a file name to receive the trace output; may also be "stdout" or "stderr" (or "-" to indicate stderr). Each node may have its output directed to a separate file, and any '%' character in the value is replaced by the node number at runtime (e.g. GASNET_TRACEFILE="mytrace-%"). Unsetting this environment variable (or setting it to empty) disables tracing output (although the trace code still has a performance impact).
> For some reason, the mpi-spawner of gasnetrun_ibv does not get along well with the batch environment of Lonestar. It requires a machinefile list. Luckily, the ssh-spawner automatically looks at $LSB_HOSTS if it cannot find the $GASNET_SSH_NODEFILE variable.
bsub -I -n 4 -W 0:05 -q development gasnetrun_ibv -n 4 -spawner=ssh ./testgas
Installing BOOST.MPI to DataStar
1) Download boost into Project directory.
2) Make sure mpCC works.
3) Go to "../boost/tools/jam/src" folder
4) Type "./build.sh vacpp"
5) Go to "../boost/tools/jam/src/bin.aixppc" folder and copy the "bjam" executable to "../boost" directory. (i.e. top-level boost directory)
6) Copy "../boost/tools/build/v2/user-config.jam" to $HOME and add line "using mpi; "
7) "using mpi ;" will probably fail, so you might need to configure MPI yourself. In order to do that, you need to know which libraries are related to MPI. Such libraries are inside the PE (parallel environment) folder of dspoe: "/usr/lpp/ppe.poe/lib"
using mpi : : <find-shared-library>library1 <find-shared-library>library2 <find-shared-library>library3 ;
8) Type "bjam --with-mpi --toolset=vacpp" in your top-level boost directory.
to see what is going on: "bjam --with-mpi --toolset=vacpp --debug-configuration 2 > debugconferr.txt"
C++ Lessons Learned
1) Don't use operator BT() inside the composition closure object MMult. SparseDColumn (or whatever template is used for BT) is going to take care of implementing the necessary operation through the assignment operator & the constructor accepting MMult<T> &
2) Distinguish very clearly between assigning a pointer and assigning the value a pointer points to. If you are going to malloc a pointer inside a function, you should pass it as a reference to a pointer:
void allocate(int * & array);
But if you are only going to change the contents (for example, write NULL to the memory location), passing the pointer by value is enough:
int * array = (int*) malloc(100 * sizeof(int));
*((MemoryPool**) array) = NULL;
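The two cases above can be sketched as a complete example (function names are illustrative): `allocate` must take the pointer by reference because it changes the pointer itself, while `clearFirst` only writes through the pointer, so pass-by-value suffices.

```cpp
#include <cstdlib>

// Changes the pointer itself, so the pointer is passed by reference;
// without the '&', the caller's pointer would stay unchanged.
void allocate(int*& array, size_t n) {
    array = (int*)std::malloc(n * sizeof(int));
}

// Only changes the pointed-to contents, so pass-by-value is enough.
void clearFirst(int* array) {
    array[0] = 0;
}
```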
CUDA Stuff
Running decuda:
>> make data/dlur.cubin
>> python decuda dlur.cubin > mygemm.decuda
Start-up script:
Inside /etc/rc.local, you'll see the following line:
./root/nvidia.sh start