Parallel SpGEMM

From Cs240aproject

Revision as of 21:36, 17 June 2008 by Aydozz (Talk | contribs)

This is the Spring 2007 project of Aydin Buluc and Fenglin Liao.

Download Final Paper from here (pdf format)(ps format)

Download Final Presentation from here (ppt format)

Download Progress report from here (ps format)




Project Description

The project is to write a 2D parallel implementation of the sparse matrix times sparse matrix routine. It is to be implemented in C++ using MPI. The project will proceed in multiple iterations:

1) The simplest decomposition is a regular block checkerboard partitioning, with a SUMMA-like algorithm on top of it. Here we assume the existence of a fast sequential algorithm for sparse matrix multiplication that is used as a black box.

2) The second way to decompose the data is to assign an equal number of edges (nonzeros) to each processor. The initial idea is to map each nonzero of the matrix to a point in the rectangular (m x n) region and use a quadtree [1] data structure. Here one may or may not use a sequential matrix multiplication as a black box [yet to be explored].

3) The third and hardest way is to assign edges to processors such that the following metric is minimized:

   min( max_i( communication(i) + computation(i) ) )


In other words, we try to load balance across the processors. Load balancing is generally done using graph partitioning [2] [3], but it is unclear how to do it for parallel SpGEMM. Whether this load balancing will be done statically or dynamically is yet to be determined.
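The bottleneck metric above can be evaluated directly for any candidate assignment; the sketch below is illustrative (the cost vectors and the function name are assumptions, not project code). A partitioner would search for the assignment minimizing this value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Evaluate max_i(communication(i) + computation(i)) for one candidate
// edge-to-processor assignment, given per-processor cost estimates.
// A partitioner would search for the assignment minimizing this value.
double bottleneck_cost(const std::vector<double>& communication,
                       const std::vector<double>& computation) {
    double worst = 0.0;
    for (std::size_t i = 0; i < communication.size(); ++i)
        worst = std::max(worst, communication[i] + computation[i]);
    return worst;
}
```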
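For decomposition (1), the SUMMA-like schedule can be sketched serially as below; dense blocks stand in for sparse submatrices, and multiply_block plays the role of the sequential kernel used as a black box (all names and the dense representation are illustrative, not the project code).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One block of the checkerboard partition; dense here for simplicity,
// standing in for a sparse submatrix.
using Block = std::vector<std::vector<double>>;

// C += A * B on square blocks: the sequential kernel used as a black box.
void multiply_block(const Block& A, const Block& B, Block& C) {
    std::size_t n = A.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
}

// Serial rendering of the SUMMA stages on a g x g grid: at stage k, the
// parallel version broadcasts A[i][k] along processor row i and B[k][j]
// along processor column j, then every processor calls the local kernel.
void summa(const std::vector<std::vector<Block>>& A,
           const std::vector<std::vector<Block>>& B,
           std::vector<std::vector<Block>>& C, int g) {
    for (int k = 0; k < g; ++k)
        for (int i = 0; i < g; ++i)
            for (int j = 0; j < g; ++j)
                multiply_block(A[i][k], B[k][j], C[i][j]);
}
```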



Resources

Boost.MPI [4]. Note that the data pointed to by pointers in a class is transmitted automatically, due to the 4th requirement of the serialization library: "Deep pointer save and restore. That is, save and restore of pointers saves and restores the data pointed to."

Boost Build v2 [5]

Serialization of Derived Pointers

SDSC Resources [6]


Caveats

1) When running MPI with threads, you may be oversubscribing the nodes (effectively running more threads than processors) without OpenMPI figuring this out. For example, if each of your processes has 3 threads, 2 of which contain MPI_SEND and MPI_RECV operations, you would want a thread to yield() if it blocks on MPI_RECV or MPI_SEND. However, by default, OpenMPI runs in aggressive mode when the number of slots equals the number of MPI processes [7].

So, set mpi_yield_when_idle to 1 manually when oversubscribing implicitly:

   mpirun --mca mpi_yield_when_idle 1 -np 16 ./testpar


Warning: This may degrade performance when processors = threads! [not always, though]

mpi_yield_when_idle 1 --> 4.13 sec

mpi_yield_when_idle 0 --> 2.95 sec


2) OpenMPI installed on Neumann has the following default:

   ompi_info | grep Thread
  Thread support: posix (mpi: no, progress: no)  

Meaning that the build doesn't support MPI_THREAD_MULTIPLE !

Solution: OpenMPI has thread support disabled by default. You have to re-build OpenMPI from source and use '--enable-mpi-threads' during 'configure'.

However, OpenMPI's thread support is only lightly tested, and there are many complaints about it on the web, such as "it hangs". Therefore, I am installing MPICH2 for my own sake using:

   ./mpich2-1.0.6/configure -prefix=/usr/local/bin/mpich2  --enable-threads=multiple --with-thread-package

Now that the system has two different MPI implementations (and thus two different mpirun, mpicc, etc.), we need to tell it which one we want to use:

   export PATH=/usr/local/bin/mpich2/bin:$PATH --> this will give precedence to MPICH2

However, MPICH system has the following problem (http://www.mpi-softtech.com/article.php?id=r1037051037):

"MPICH and its derivatives rely on polling for synchronization and notification, which leaves little space for exploiting programming techniques that would achieve computation and communication overlapping and optimal resource utilization [13]. Although polling can ultimately achieve the lowest message passing latency, it wastes CPU cycles for spinning on notification flags. A user thread that polls for synchronization does not yield the CPU until this thread is de-scheduled by the OS, thus reducing the CPU time for useful computation. Latency is generally not considered as a realistic measure of performance for applications that can overlap computation and communication."

Again, this might not be true either. The MPICH2 installation guide says:

"build the channel of your choice. The options are sock, shm, and ssm. The shm channel is for small numbers of processes that will run on a single machine using shared memory. The shm channel should not be used for more than about 8 processes. The ssm (sock shared memory) channel is for clusters of smp nodes. This channel should not be used if you plan to over-subscribe the CPUs. If you plan on launching more processes than you have processors you should use the default sock channel. The ssm channel uses a polling progress engine that can perform poorly when multiple processes compete for individual processors."

This sounds as if sock doesn't use a polling progress engine?



3) Forcing the processes to be launched on a subset of processors only:

   mpirun -np 16 taskset -c 12-15 ./testpar   --> 100 sec.

This might also be helpful in the case of processor affinity.

So, combine the two: "mpirun -np 16 --mca mpi_yield_when_idle 1 taskset -c 12-15 ./testpar " --> 7.7 sec. only

MPICH2's MPD's solution to the problem: "If you tell the mpd how many cpus it has to work with by using the --ncpus argument, as in

   mpd --ncpus=2

then the number of processes started the first time the startup message circles the ring will be determined by this argument."

4) Check the compiler and linker options used by mpic++ with:

   mpic++ --showme:compile
   mpic++ --showme:link


5) Difference between boost::timer and boost::mpi::timer

boost::mpi::timer is a wrapper around MPI_Wtime, so it provides much finer-grained timings, with about five digits of precision, whereas boost::timer only provides two digits of precision, such as "0.04 sec".

6) Using gprof2dot to get call graphs of your program:

- Compile with -pg option

- Run that executable with whatever input you want (e.g. ./exp p256); that will create the gmon.out file

- gprof expnogas > explite.txt

- python gprof2dot.py -e 0.00 -n 0.00 explite.txt > gvexplite (includes all the nodes and edges)

- python xdot.py gvexplite


Regular Installation of Boost with MPICH2

0) If you have another Boost installation that you want to keep, first remove the contents of "boost/bin.v2/libs"

1) Add the following lines to $(HOME)/user-config.jam:

using gcc : : : <cxxflags>-DMPICH_IGNORE_CXX_SEEK ;

using mpi : $(PATH_MPICH2)/mpicxx ;

2) Then execute the following command from the top boost directory:

sudo ./bjam --with-mpi --with-thread

3) Finally the installation:

sudo ./bjam --prefix=/usr/local/boostmpich2 --with-mpi --with-thread install

4) If you still want to keep the old MPI installation, then either explicitly specify the path to the new mpicxx/mpirun for each call, or put the following line in your .bashrc file so that the new mpicxx/mpirun are found first:

export PATH=/usr/local/bin/mpich2/bin:$PATH


Threading Issues

Data copying from SparseDColumn<T> ** M[i][j] to the local matrix by each thread takes the following times (four copies per run, in seconds):

   Setting                                                 Copy 1    Copy 2    Copy 3    Copy 4    Total
   Without Hoard, kinner = i (no contention avoidance)     0.134991  0.074193  0.083097  0.088695  0.379
   Without Hoard, kinner = (i+((dimx+dimy)%gridy))%gridy   0.090011  0.059552  0.069522  0.072168  0.290
   With Hoard, kinner = i (no contention avoidance)        0.217126  0.125259  0.142138  0.175438  0.659
   With Hoard, kinner = (i+((dimx+dimy)%gridy))%gridy      0.217977  0.054324  0.124619  0.107576  0.502
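The effect of the skewed kinner schedule can be sketched as follows: at a given step i, threads along a grid row start from staggered k offsets instead of all copying the same block. The variable names follow the text; the check itself is illustrative.

```cpp
#include <cassert>
#include <set>

// kinner for the contention-avoiding schedule: the thread at grid
// position (dimx, dimy) copies block (i + ((dimx + dimy) % gridy)) % gridy
// at step i, so threads along a row touch distinct blocks simultaneously.
int kinner_skewed(int i, int dimx, int dimy, int gridy) {
    return (i + ((dimx + dimy) % gridy)) % gridy;
}

// The naive schedule: every thread copies block i at step i, competing
// for the same source data.
int kinner_naive(int i, int, int, int) {
    return i;
}
```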



Building GASNET C++ Clients

Change the .../smp-conduit/smp-par.mak file (and any .../xxx-conduit/xxx-par.mak file you want to use):

- In two places there is /usr/bin/gcc; change it to g++

- In one place there is -lgcc; change it to -lstdc++

- Remove the -Winline flag used in compilation.


Starting them:

> export PATH=$PATH:/home/aydin/localinstall/gasnet/bin

> amudprun -spawn 'L' -np 4 ./comp input1 input2_1


Checking memory errors in them:

> export GASNET_SPAWNFN='L'

> valgrind --trace-children=yes ./testgas 4


To see the details:

> GASNET_VERBOSEENV=1


LONESTAR (VAPI/IBV) Details:

> GASNET_VAPI_SPAWNER (set to "mpi" or "ssh") can override the value set at configuration time.

> GASNET_TRACEFILE - specify a file name to receive the trace output. It may also be "stdout" or "stderr" (or "-" to indicate stderr). Each node may have its output directed to a separate file, and any '%' character in the value is replaced by the node number at runtime (e.g. GASNET_TRACEFILE="mytrace-%"). Unsetting this environment variable (or setting it to empty) disables tracing output (although the trace code still has a performance impact).

> For some reason, the mpi-spawner of gasnetrun_ibv does not get along well with the batch environment of Lonestar; it requires a machine file list. Luckily, the ssh-spawner automatically looks at $LSB_HOSTS if it cannot find the $GASNET_SSH_NODEFILE variable.

bsub -I -n 4 -W 0:05 -q development gasnetrun_ibv -n 4 -spawner=ssh ./testgas

Installing BOOST.MPI to DataStar

1) Download boost into Project directory.

2) Make sure mpCC works.

3) Go to "../boost/tools/jam/src" folder

4) Type "./build.sh vacpp"

5) Go to "../boost/tools/jam/src/bin.aixppc" folder and copy the "bjam" executable to "../boost" directory. (i.e. top-level boost directory)

6) Copy "../boost/tools/build/v2/user-config.jam" to $HOME and add line "using mpi; "

7) "using mpi; " will probably fail. Thus you might need to configure MPI yourself. In order to do that, you need to know which libraries are related to mpi. Such libraries are inside the PE (parallel environment) folder of dspoe: "/usr/lpp/ppe.poe/lib"

using mpi : : <find-shared-library>library1 <find-shared-library>library2 <find-shared-library>library3 ;

8) Type "bjam --with-mpi --toolset=vacpp" in your top-level boost directory.

to see what is going on: "bjam --with-mpi --toolset=vacpp --debug-configuration 2 > debugconferr.txt"


C++ Lessons Learned

1) Don't use operator BT() inside the composition closure object MMul. SparseDColumn (or whatever template is used for BT) will take care of implementing the necessary operation through the assignment operator and the constructor accepting MMult<T> &.

2) Distinguish clearly between assigning a pointer and assigning the value a pointer points to. If you are going to malloc a pointer inside a function, you should pass it as a reference to a pointer:

void f(int * & array);

But if you are only going to change the contents (for example, write NULL to the memory location), passing the pointer by value is enough:

int * array = (int *) malloc(100 * sizeof(int));

   *((MemoryPool**) array) = NULL;
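The distinction can be exercised with a small sketch (allocate_array and clear_first are illustrative names, not project code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Allocating inside a function: pass a reference to the pointer so the
// caller's own pointer is updated.
void allocate_array(int*& array, std::size_t n) {
    array = (int*) std::malloc(n * sizeof(int));
}

// Changing only the pointed-to contents: passing the pointer by value
// is enough, since we never reseat the caller's pointer.
void clear_first(int* array) {
    array[0] = 0;
}
```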


CUDA Stuff

Running decuda:

>> make data/dlur.cubin

>> python decuda dlur.cubin > mygemm.decuda


Start-up script:

Inside /etc/rc.local, you'll see the following line:

./root/nvidia.sh start
