Parallel SpGEMM

From Cs240aproject

(Difference between revisions)

Revision as of 21:00, 6 October 2008

This page is mostly for my own notes (to myself) regarding the implementation of Parallel SpGEMM library.

http://gauss.cs.ucsb.edu/~aydin

Project Description

The project is to write a 2D parallel implementation of the sparse matrix times sparse matrix routine. It is projected to be implemented in C++ using MPI. The project will have multiple iterations:

1) The simplest decomposition is the regular block checkerboard partitioning and to use a SUMMA-like algorithm on top of that. Here we assume the existence of a fast sequential algorithm for sparse matrix multiplication that is used as a block box.

2) The second way to decompose to data is to try assigning equal number of edges to each processor. The initial idea is to map each non-zero of the matrix to a point in the rectangular (m x n) region and use a quadtree [1] data structure. Here one may or may not use a sequential matrix multiplication as block box [Yet to be explored].

3) The third and the hardest way is to assign edges to processors such that the following metric is minimized.

 Failed to parse (Can't write to or create math temp directory): min(max_i(communication(i) + computation(i)))

In other words, we try to load balance across the processors. The load balancing is generally done using graph partitioning [2] [3] but it is unclear how to do it for Parallel SpGEMM. Whether this load balancing is going to be done statically or dynamically is yet to be determined.

Resources

Boost.MPI [4] Note that the data pointed by the pointers in a class are transmitted automatically due to the 4th requirement of the serialization library, which is "Deep pointer save and restore. That is, save and restore of pointers saves and restores the data pointed to"

Boost Build v2 [5]

Serizalization of Derived Pointers

SDSC Resources [6]

Caveats

1) When implementing MPI with threads, you will be oversubscribing (running more processes than processors) the nodes without OpenMPI figuring this out. For example, if each of your processes has 3 threads, where 2 of these contain MPI_SEND, MPI_RECV operators, you would want your thread to yield() if it blocks on a MPI_RECV or MPI_SEND. However, by default, MPI will run on aggressive mode when the number of slots = the number of mpi_processes [7].

So, you wanna set mpi_yield_when_idle manually to 1 when oversubscribing implicitly.

   mpirun --mca mpi_yield_when_idle 1 -np 16 ./testpar

Warning: This may degrade the performance when processors = threads ! [not always though]

mpi_yield_when_idle 1 --> 4.13 sec

mpi_yield_when_idle 0 --> 2.95 sec

2) OpenMPI installed on Neumann has the following default:

   ompi_info | grep Thread

  Thread support: posix (mpi: no, progress: no)

Meaning that the build doesn't support MPI_THREAD_MULTIPLE !

Solution: OpenMPI has thread-support disabled on default. You have to re-build OpenMPI from source and use '--enable-mpi-threads' during 'configure'.

However, OpenMPI's thread support is only lightly tested and there are many complaints about it on the web such as "it hangs". Therefore, I am installing MPICH2 for my own sake using:

   ./mpich2-1.0.6/configure -prefix=/usr/local/bin/mpich2  --enable-threads=multiple --with-thread-package

Now, the system has two different MPI implementations (thus two different mpirun, mpicc, etc) we need to tell the system which one we want to use by:

   export PATH=/usr/local/bin/mpich2/bin:$PATH --> this will give precedence to MPICH2

However, MPICH system has the following problem (http://www.mpi-softtech.com/article.php?id=r1037051037):

"MPICH and its derivatives rely on polling for synchronization and notification, which leaves little space for exploiting programming techniques that would achieve computation and communication overlapping and optimal resource utilization [13]. Although polling can ultimately achieve the lowest message passing latency, it wastes CPU cycles for spinning on notification flags. A user thread that polls for synchronization does not yield the CPU until this thread is de-scheduled by the OS, thus reducing the CPU time for useful computation. Latency is generally not considered as a realistic measure of performance for applications that can overlap computation and communication."

Again, this might not be true either. MPICH2 installation guide says:

"build the channel of your choice. The options are sock, shm, and ssm. The shm channel is for small numbers of processes that will run on a single machine using shared memory. The shm channel should not be used for more than about 8 processes. The ssm (sock shared memory) channel is for clusters of smp nodes. This channel should not be used if you plan to over-subscribe the CPUÃ�ï¿½Ã�Â¢Ã�Â¯Ã�Â¿Ã�Â½Ã�Â¯Ã�Â¿Ã�Â½s. If you plan on launching more processes than you have processors you should use the default sock channel. The ssm channel uses a polling progress engine that can perform poorly when multiple processes compete for individual processors."

This sounds as if sock doesn't use a polling progress engine?

3) Forcing the processes to be launched on a subset of processors only:

   mpirun -np 16 taskset -c 12-15 ./testpar   --> 100 sec.

This might also be helpful in the case of processor affinity.

So, combine the two: "mpirun -np 16 --mca mpi_yield_when_idle 1 taskset -c 12-15 ./testpar " --> 7.7 sec. only

MPICH2's MPD's solution to the problem: "If you tell the mpd how many cpus it has to work with by using the --ncpus argument, as in

   mpd --ncpus=2

then the number of processes started the first time the startup message circles the ring will be determined by this argument.

4) Check to see the compiler options for mpic++ by:

mpic++ --showme:compile

  mpic++ --showme:link

5) Difference between boost::timer and boost::mpi::timer

boost::mpi::timer is a wrapper around the MPI_TIMER so it provides much finer grain timings up to 5 digits precision whereas boost::timer only provides 2 digits precision such as "0.04 sec".

6) Using gprof2dot to get call graphs of your program:

- Compile with -pg option

- Run that executable with whatever input you want (./exp p256), that'll create the gmon.out file

- gprof expnogas > explite.txt

- python gprof2dot.py -e 0.00 -n 0.00 explite.txt > gvexplite (includes all the nodes and edges)

- python xdot.py gvexplite

Regular Installation of Boost with MPICH2

0) If you have another boost installed before and you want to keep the installation, first remove "boost/bin.v2/libs" contents

1) Add the following lines to $(HOME)/user-config.jam:

using gcc : : : <cxxflags>-DMPICH_IGNORE_CXX_SEEK ;

using mpi : $(PATH_MPICH2)/mpicxx ;

2) Then execute the following command from the top boost directory:

sudo ./bjam --with-mpi --with-thread

3) Finally the installation:

sudo ./bjam --prefix=/usr/local/boostmpich2 --with-mpi --with-thread install

4) If you still insist on keeping the old MPI installation, then you either explicitly point out the path to new mpicxx/mpirun for each call, or put the following line to your .bashrc file so that it finds the new mpicxx/mpirun during calls.

export PATH=/usr/local/bin/mpich2/bin:$PATH

Threading Issues

Data copying from SparseDColumn<T> ** M[i][j] to local matrix by each thread takes the following times:

Without Hoard. Using kinner=i (no contention avoidance) :

Data copy took 0.134991 seconds

Data copy took 0.074193 seconds

Data copy took 0.083097 seconds

Data copy took 0.088695 seconds

For a total of 0.379 seconds.

Without Hoard. Using kinner= (i + ((dimx+dimy)% gridy)) % gridy (contention avoidance) :

Data copy took 0.090011 seconds

Data copy took 0.059552 seconds

Data copy took 0.069522 seconds

Data copy took 0.072168 seconds

For a total of 0.290 seconds.

With Hoard. Using kinner=i (no contention avoidance) :

Data copy took 0.217126 seconds

Data copy took 0.125259 seconds

Data copy took 0.142138 seconds

Data copy took 0.175438 seconds

For a total of 0.659 seconds

With Hoard. Using kinner= (i + ((dimx+dimy)% gridy)) % gridy (contention avoidance) :

Data copy took 0.217977 seconds

Data copy took 0.054324 seconds

Data copy took 0.124619 seconds

Data copy took 0.107576 seconds

For a total of 0.502 seconds

Building GASNET C++ Clients

Change .../smp-conduit/smp-par.mak file (and any .../xxx-conduit/xxx-par.mak file you wanna use):

- In two places, there are /usr/bin/gcc, make them g++

- In one place, there is -lgcc, make it -lstdc++

- Remove the -Winline flag used in compilation.

Starting them:

> export PATH=$PATH:/home/aydin/localinstall/gasnet/bin

> amudprun -spawn 'L' -np 4 ./comp input1 input2_1

Checking memory errors in them:

> export GASNET_SPAWNFN='L'

> valgrind --trace-children=yes ./testgas 4

To see the details:

> GASNET_VERBOSEENV=1

LONESTAR (VAPI/IBV) Details:

> GASNET_VAPI_SPAWNER (set to "mpi" or "ssh") can override the value set at configuration time.

> GASNET_TRACEFILE - specify a file name to recieve the trace output may also be "stdout" or "stderr", (or "-" to indicate stderr) each node may have its output directed to a separate file, and any '%' character in the value is replaced by the node number at runtime (e.g. GASNET_TRACEFILE="mytrace-%") unsetting this environment variable (or setting it to empty) disables tracing output (although the trace code still has performance impact)

> For some reason, mpi-spawner of gasnetrun_ibv do not get along well with the batch environment of lonestar. It requires a machinefile list. Luckily, ssh-spawner automatically looks at $LSB_HOSTS after if cannot find $GASNET_SSH_NODEFILE variable.

bsub -I -n 4 -W 0:05 -q development gasnetrun_ibv -n 4 -spawner=ssh ./testgas

Installing BOOST.MPI to DataStar

1) Download boost into Project directory.

2) Make sure mpCC works.

3) Go to "../boost/tools/jam/src" folder

4) Type "./build.sh vacpp"

5) Go to "../boost/tools/jam/src/bin.aixppc" folder and copy the "bjam" executable to "../boost" directory. (i.e. top-level boost directory)

6) Copy "../boost/tools/build/v2/user-config.jam" to $HOME and add line "using mpi; "

7) "using mpi; " will probably fail. Thus you might need to configure MPI yourself. In order to do that, you need to know which libraries are related to mpi. Such libraries are inside the PE (parallel environment) folder of dspoe: "/usr/lpp/ppe.poe/lib"

using mpi : : <find-shared-library>library1 <find-shared-library>library2 <find-shared-library>library3 ;

8) Type "bjam --with-mpi --toolset=vacpp" in your top-level boost directory.

to see what is going on: "bjam --with-mpi --toolset=vacpp --debug-configuration 2 > debugconferr.txt"

C++ Lessons Learned

1) Don't use operator BT() inside the composition closure object MMul. SparseDColumn (or whatever template is used for BT) is going to take care of implementing the necessary operation through the assignment & constructor accepting MMult<T> &

2) Distinguish between assigning a pointer or assigning the value of a pointer very clearly. If you're gonna malloc a pointer inside a function, you should pass it as a reference to a pointer:

void (int * & array)

But if you're gonna just change the contents (for example write NULL to the memory location)

int * array = malloc(100 * sizeof(int));

*((MemoryPool**) array) = NULL;

3) Optional parameters for members of a class:

Inside header: SparseDColumn (const SparseTriplets<T> & rhs, bool transpose, MemoryPool * mpool = NULL);

Inside cpp: SparseDColumn<T>::SparseDColumn(const SparseTriplets<T> & rhs, bool transpose, MemoryPool * mpool){...}

The reason for this is natural. Other parts of the code that uses this class should know that mpool is an optional argument, and it is OK not to supply it. We can only achieve that effect by putting it inside the header file.

CUDA Stuff

Running decuda:

>> make data/dlur.cubin

>> python decuda dlur.cubin > mygemm.decuda

Start-up script:

Inside /etc/rc.local, you'll see the following line:

./root/nvidia.sh start

CUDA SDK 1.1 caveat:

>> make USECUBLAS=1 verbose=1

Check the version of the driver:

>> cat /proc/driver/nvidia/version

No asshole rule: Don't try to mix incompatible Nvidia device drivers with Cuda SDK's. For example, driver 17x.xx only works with CUDA 2.0, not CUDA 1.1. Look here: [[8]]

Our NVIDIA 8800 Ultra is installed on a 4x PC-I Express port. It should be switched to a 16x port. Here is how to get extra info:

[root@neumann cpufreq]# nvclock -i

-- General info --

Card: nVidia Geforce 8800Ultra

Architecture: NV50/G80 A3

PCI id: 0x194

GPU clock: 648.000 MHz

Bustype: PCI-Express

-- Shader info --

Clock: 1512.000 MHz

Stream units: 128 (11111111b)

ROP units: 24 (111111b)

-- Memory info --

Amount: 768 MB

Type: 384 bit DDR3

Clock: 1152.000 MHz

-- PCI-Express info --

Current Rate: 4X

Maximum rate: 16X

-- Sensor info --

Sensor: Analog Devices ADT7473

Board temperature: 48C

GPU temperature: 57C

Fanspeed: 82 RPM

Fanspeed mode: manual

PWM duty cycle: 60.0%

-- VideoBios information --

Version: 60.80.18.00.12

Signon message: G80 P355 SKU 0002 VGA BIOS

Performance level 0: gpu 660MHz/shader 1512MHz/memory 1150MHz/1.35V/100%

VID mask: 3

Voltage level 0: 1.10V, VID: 0

Voltage level 1: 1.20V, VID: 1

Voltage level 2: 1.35V, VID: 2

Painful X11VNC Connections

- Before installing x11vnc, make sure your system has XTERM. Otherwise keyboard strokes won't work. x11vnc ./configure script will actually warn you if you don't have xterm.

./configure > myconfig.txt

configure: WARNING:

A working build environment for the XTEST extension was not found

(libXtst). An x11vnc built this way will be only barely usable.

You will be able to move the mouse but not click or type. There can

also be deadlocks if an application grabs the X server.

It is recommended that you install the necessary development packages

for XTEST (perhaps it is named something like libxtst-dev) and run

configure again.

- So you do:

>> yum install libXtst-devel

- You'll also need a dummy frame buffer (that you can do after installing X11VNC too)

>> yum install xorg-x11-server-Xvfb

- Then start x11vnc server (-create option opens up a new X for you)

>> x11vnc -create

- Connect from client (actually you're tunneling through ssh and connecting VNC through localhost:0 of the remote machine)

>> ./ssvnc -ssh cs290N@neumann.cs.ucsb.edu:0

- Now, you'll see that you need an x-server... Something missing on Neumann

>> yum install gdm

>> whereis gdm

- For those of you who wanna cry, Neumann probably doesn't have an "X" either

>> yum grouplist

>> yum groupinstall "X Window System" "GNOME Desktop Environment"

- Now the problem seems to be the following: X Window System on the remote machine doesn't use the NVIDIA drivers. That might be due to the lack of monitor (impossible to fix) or we may find a way out... Just in case keep in mind that it's possible to see a "normal X session" via X forwarding VNC (but not with X forwarding).

1) Xvfb :1 -screen 0 800x600x16 -ac & (creates an x-server with a 800x600 screen on display :15 and disable access controls). Check if it really went to background by typing "jobs"

2) nvidia-settings --display=:15 (starts nvidia-settings in display : 15)

3) x11vnc -create -localhost -display :15

4) Connect through ssh-vnc viewer... (cs290N@neumann.cs.ucsb.edu:0)

@@ Line 455: / Line 455: @@
-- Then start x11vnc server (-create options opens up a new X for you)
+- Then start x11vnc server (-create option opens up a new X for you)
 >> x11vnc -create
@@ Line 481: / Line 481: @@
 - Now the problem seems to be the following: X Window System on the remote machine doesn't use the NVIDIA drivers. That might be due to the lack of monitor (impossible to fix) or we may find a way out... Just in case keep in mind that it's possible to see a "normal X session" via X forwarding VNC (but not with X forwarding).
-) ''Xvfb :1 -screen 0 800x600x16 -ac &''  (creates an x-server with a 800x600 screen on display :15 and disable access controls). Check if it really went o background by typing "jobs"
+) ''Xvfb :1 -screen 0 800x600x16 -ac &''  (creates an x-server with a 800x600 screen on display :15 and disable access controls). Check if it really went to background by typing "jobs"
 ) ''nvidia-settings --display=:15''  (starts nvidia-settings in display : 15)

Parallel SpGEMM

From Cs240aproject

Revision as of 21:00, 6 October 2008

Contents

Project Description

Resources

Caveats

Regular Installation of Boost with MPICH2

Threading Issues

Building GASNET C++ Clients

Installing BOOST.MPI to DataStar

C++ Lessons Learned

CUDA Stuff

Painful X11VNC Connections

Views

Personal tools

Navigation

Search

EditThis.info tools

Toolbox

Other sites