Parallel SpGEMM
From Cs240aproject
Revision as of 15:48, 30 December 2008
This page is mostly for my own notes regarding the implementation of the Parallel SpGEMM library.
http://gauss.cs.ucsb.edu/~aydin
Project Description
The project is to write a 2D parallel implementation of the sparse matrix times sparse matrix routine. It is planned to be implemented in C++ using MPI. The project will have multiple iterations:
1) The simplest decomposition is the regular block checkerboard partitioning, with a SUMMA-like algorithm on top of it. Here we assume the existence of a fast sequential algorithm for sparse matrix multiplication that is used as a black box.
2) The second way to decompose the data is to try assigning an equal number of edges to each processor. The initial idea is to map each non-zero of the matrix to a point in the rectangular (m x n) region and use a quadtree [1] data structure. Here one may or may not use a sequential matrix multiplication as a black box [Yet to be explored].
3) The third and the hardest way is to assign edges to processors such that the following metric is minimized.
$\min\left(\max_i\left(\mathrm{communication}(i) + \mathrm{computation}(i)\right)\right)$
In other words, we try to load balance across the processors. The load balancing is generally done using graph partitioning [2] [3] but it is unclear how to do it for Parallel SpGEMM. Whether this load balancing is going to be done statically or dynamically is yet to be determined.
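As a rough illustration of the first iteration, here is a minimal sketch of how a regular block checkerboard partitioning could map a nonzero to its owning process, and how one might count the nonzeros per process to see the imbalance that the second and third iterations try to reduce. The names ProcGrid, block_owner and nnz_per_process are hypothetical and not part of the library; the row-major rank order is also just an assumption.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical description of a pr x pc process grid (not from the notes).
struct ProcGrid {
    std::size_t pr;  // number of process rows
    std::size_t pc;  // number of process columns
};

// Rank (row-major, an assumption) of the process that owns entry (i, j) of
// an m x n matrix under a regular block checkerboard partitioning.
// Blocks have ceil(m/pr) rows and ceil(n/pc) columns; the last block in
// each dimension may be smaller, or even empty for very small matrices.
std::size_t block_owner(std::size_t i, std::size_t j,
                        std::size_t m, std::size_t n, ProcGrid g)
{
    const std::size_t rows_per_block = (m + g.pr - 1) / g.pr;
    const std::size_t cols_per_block = (n + g.pc - 1) / g.pc;
    return (i / rows_per_block) * g.pc + (j / cols_per_block);
}

// Counts how many nonzeros each process receives. A large spread between
// the counts is the load imbalance discussed above.
std::vector<std::size_t> nnz_per_process(
    const std::vector<std::pair<std::size_t, std::size_t> > & nonzeros,
    std::size_t m, std::size_t n, ProcGrid g)
{
    std::vector<std::size_t> counts(g.pr * g.pc, 0);
    for (std::size_t k = 0; k < nonzeros.size(); ++k)
        ++counts[block_owner(nonzeros[k].first, nonzeros[k].second, m, n, g)];
    return counts;
}
```

Counting nonzeros is only a proxy for computation(i); the actual cost of SpGEMM also depends on how the nonzeros of the two operands line up.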
Resources
Boost.MPI [4] Note that the data pointed to by the pointers in a class are transmitted automatically due to the 4th requirement of the serialization library, which is "Deep pointer save and restore. That is, save and restore of pointers saves and restores the data pointed to"
Boost Build v2 [5]
Serialization of Derived Pointers
SDSC Resources [6]
Caveats
1) When implementing MPI with threads, you will be oversubscribing the nodes (running more processes than processors) without OpenMPI figuring this out. For example, if each of your processes has 3 threads, 2 of which contain MPI_Send/MPI_Recv calls, you would want a thread to yield() if it blocks on an MPI_Recv or MPI_Send. However, by default, MPI will run in aggressive mode when the number of slots = the number of MPI processes [7].
So, you wanna set mpi_yield_when_idle manually to 1 when oversubscribing implicitly.
mpirun --mca mpi_yield_when_idle 1 -np 16 ./testpar
Warning: This may degrade the performance when processors = threads ! [not always though]
mpi_yield_when_idle 1 --> 4.13 sec
mpi_yield_when_idle 0 --> 2.95 sec
2) OpenMPI installed on Neumann has the following default:
ompi_info | grep Thread
Thread support: posix (mpi: no, progress: no)
Meaning that the build doesn't support MPI_THREAD_MULTIPLE !
Solution: OpenMPI has thread support disabled by default. You have to re-build OpenMPI from source and use '--enable-mpi-threads' during 'configure'.
However, OpenMPI's thread support is only lightly tested and there are many complaints about it on the web such as "it hangs". Therefore, I am installing MPICH2 for my own sake using:
./mpich2-1.0.6/configure -prefix=/usr/local/bin/mpich2 --enable-threads=multiple --with-thread-package
Now that the system has two different MPI implementations (and thus two different mpirun, mpicc, etc.), we need to tell the system which one we want to use:
export PATH=/usr/local/bin/mpich2/bin:$PATH --> this will give precedence to MPICH2
However, MPICH system has the following problem (http://www.mpi-softtech.com/article.php?id=r1037051037):
"MPICH and its derivatives rely on polling for synchronization and notification, which leaves little space for exploiting programming techniques that would achieve computation and communication overlapping and optimal resource utilization [13]. Although polling can ultimately achieve the lowest message passing latency, it wastes CPU cycles for spinning on notification flags. A user thread that polls for synchronization does not yield the CPU until this thread is de-scheduled by the OS, thus reducing the CPU time for useful computation. Latency is generally not considered as a realistic measure of performance for applications that can overlap computation and communication."
Then again, this might not be true either. The MPICH2 installation guide says:
"build the channel of your choice. The options are sock, shm, and ssm. The shm channel is for small numbers of processes that will run on a single machine using shared memory. The shm channel should not be used for more than about 8 processes. The ssm (sock shared memory) channel is for clusters of smp nodes. This channel should not be used if you plan to over-subscribe the CPU's. If you plan on launching more processes than you have processors you should use the default sock channel. The ssm channel uses a polling progress engine that can perform poorly when multiple processes compete for individual processors."
This sounds as if sock doesn't use a polling progress engine?
3) Forcing the processes to be launched on a subset of processors only: 
mpirun -np 16 taskset -c 12-15 ./testpar --> 100 sec.
This might also be helpful in the case of processor affinity.
So, combine the two: "mpirun -np 16 --mca mpi_yield_when_idle 1 taskset -c 12-15 ./testpar " --> 7.7 sec. only
MPICH2's MPD's solution to the problem: "If you tell the mpd how many cpus it has to work with by using the --ncpus argument, as in
mpd --ncpus=2
then the number of processes started the first time the startup message circles the ring will be determined by this argument."
4) Check to see the compiler options for mpic++ by:
mpic++ --showme:compile
mpic++ --showme:link
5) Difference between boost::timer and boost::mpi::timer
boost::mpi::timer is a wrapper around MPI_Wtime, so it provides much finer-grained timings (up to 5 digits of precision), whereas boost::timer only provides 2 digits of precision, such as "0.04 sec".
6) Using gprof2dot to get call graphs of your program:
- Compile with -pg option
- Run that executable with whatever input you want (./exp p256), that'll create the gmon.out file
- gprof expnogas > explite.txt
- python gprof2dot.py -e 0.00 -n 0.00 explite.txt > gvexplite (includes all the nodes and edges)
- python xdot.py gvexplite
7) Generating annotated assembly with g++:
g++ $(INCADD) -Wa,-ahls=main.lst,-L -g -c Bit2DSparse.cpp
Regular Installation of Boost with MPICH2
0) If you have another boost installed before and you want to keep the installation, first remove "boost/bin.v2/libs" contents
1) Add the following lines to $(HOME)/user-config.jam:
using gcc : : : <cxxflags>-DMPICH_IGNORE_CXX_SEEK ;
using mpi : $(PATH_MPICH2)/mpicxx ;
2) Then execute the following command from the top boost directory:
sudo ./bjam --with-mpi --with-thread
3) Finally the installation:
sudo ./bjam --prefix=/usr/local/boostmpich2 --with-mpi --with-thread install
4) If you still insist on keeping the old MPI installation, then you either explicitly give the path to the new mpicxx/mpirun for each call, or put the following line in your .bashrc file so that it finds the new mpicxx/mpirun during calls.
export PATH=/usr/local/bin/mpich2/bin:$PATH
Threading Issues
Data copying from SparseDColumn<T> ** M[i][j] to local matrix by each thread takes the following times:
Without Hoard. Using kinner=i (no contention avoidance) :
Data copy took 0.134991 seconds
Data copy took 0.074193 seconds
Data copy took 0.083097 seconds
Data copy took 0.088695 seconds
For a total of 0.379 seconds.
Without Hoard. Using kinner= (i + ((dimx+dimy)% gridy)) % gridy (contention avoidance) :
Data copy took 0.090011 seconds
Data copy took 0.059552 seconds
Data copy took 0.069522 seconds
Data copy took 0.072168 seconds
For a total of 0.290 seconds.
With Hoard. Using kinner=i (no contention avoidance) : 
Data copy took 0.217126 seconds
Data copy took 0.125259 seconds
Data copy took 0.142138 seconds
Data copy took 0.175438 seconds
For a total of 0.659 seconds
With Hoard. Using kinner= (i + ((dimx+dimy)% gridy)) % gridy (contention avoidance) :
Data copy took 0.217977 seconds
Data copy took 0.054324 seconds
Data copy took 0.124619 seconds
Data copy took 0.107576 seconds
For a total of 0.502 seconds
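The contention-avoiding index above can be sketched as a small function. This is only an illustration under an assumption the notes do not state: that (dimx, dimy) is the position of the calling thread in the thread grid and gridy is the number of blocks along the inner dimension. The function name staggered_kinner is hypothetical.

```cpp
#include <cstddef>

// Inner block index visited at step i by the thread at grid position
// (dimx, dimy), using the formula from the notes. Threads with different
// (dimx + dimy) % gridy start at different blocks, so they do not all read
// the same block at the same step.
std::size_t staggered_kinner(std::size_t i, std::size_t dimx,
                             std::size_t dimy, std::size_t gridy)
{
    return (i + ((dimx + dimy) % gridy)) % gridy;
}

// The baseline without contention avoidance: every thread visits block i
// at step i.
std::size_t plain_kinner(std::size_t i)
{
    return i;
}
```

For i = 0 .. gridy-1 each thread still visits every inner block exactly once; only the starting point is rotated.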
Building GASNET C++ Clients
Change .../smp-conduit/smp-par.mak file (and any .../xxx-conduit/xxx-par.mak file you wanna use):
- In two places, there are /usr/bin/gcc, make them g++
- In one place, there is -lgcc, make it -lstdc++
- Remove the -Winline flag used in compilation.
Starting them:
> export PATH=$PATH:/home/aydin/localinstall/gasnet/bin
> amudprun -spawn 'L' -np 4 ./comp input1 input2_1
Checking memory errors in them:
> export GASNET_SPAWNFN='L'
> valgrind --trace-children=yes ./testgas 4
To see the details:
> GASNET_VERBOSEENV=1
 LONESTAR (VAPI/IBV) Details:
> GASNET_VAPI_SPAWNER (set to "mpi" or "ssh") can override the value set at configuration time.
> GASNET_TRACEFILE - specify a file name to receive the trace output. It may also be "stdout" or "stderr" (or "-" to indicate stderr). Each node may have its output directed to a separate file: any '%' character in the value is replaced by the node number at runtime (e.g. GASNET_TRACEFILE="mytrace-%"). Unsetting this environment variable (or setting it to empty) disables tracing output (although the trace code still has a performance impact).
> For some reason, the mpi-spawner of gasnetrun_ibv does not get along well with the batch environment of Lonestar. It requires a machinefile list. Luckily, the ssh-spawner automatically looks at $LSB_HOSTS if it cannot find the $GASNET_SSH_NODEFILE variable.
bsub -I -n 4 -W 0:05 -q development gasnetrun_ibv -n 4 -spawner=ssh ./testgas
Installing BOOST.MPI to DataStar
1) Download boost into Project directory.
2) Make sure mpCC works.
3) Go to "../boost/tools/jam/src" folder
4) Type "./build.sh vacpp"
5) Go to "../boost/tools/jam/src/bin.aixppc" folder and copy the "bjam" executable to "../boost" directory. (i.e. top-level boost directory)
6) Copy "../boost/tools/build/v2/user-config.jam" to $HOME and add the line "using mpi ;" (the space before the semicolon matters in jam files).
7) "using mpi ;" will probably fail. Thus you might need to configure MPI yourself. In order to do that, you need to know which libraries are related to MPI. Such libraries are inside the PE (parallel environment) folder of dspoe: "/usr/lpp/ppe.poe/lib"
using mpi : : <find-shared-library>library1 <find-shared-library>library2 <find-shared-library>library3 ;
8) Type "bjam --with-mpi --toolset=vacpp" in your top-level boost directory.
to see what is going on: "bjam --with-mpi --toolset=vacpp --debug-configuration 2 > debugconferr.txt"
C++ Lessons Learned
1) Don't use operator BT() inside the composition closure object MMult. SparseDColumn (or whatever template is used for BT) is going to take care of implementing the necessary operation through the assignment operator and the constructor accepting MMult<T> &.
2) Distinguish very clearly between assigning a pointer and assigning the value it points to. If you're gonna malloc a pointer inside a function, you should pass it as a reference to a pointer:
void f(int * & array)   // f is just a placeholder name
But if you're gonna just change the contents (for example write NULL to the memory location), a plain pointer is enough:
int * array = (int *) malloc(100 * sizeof(int));   // C++ needs the cast from void*
*((MemoryPool**) array) = NULL;
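A minimal sketch of the difference, with hypothetical function names (none of them come from the library): allocating through a reference to a pointer updates the caller's pointer, allocating through a by-value pointer does not, and changing only the contents needs nothing more than a plain pointer.

```cpp
#include <cstddef>
#include <cstdlib>

// Takes the pointer by reference, so the caller's pointer is updated.
void allocate_ints(int * & array, std::size_t count)
{
    array = static_cast<int *>(std::malloc(count * sizeof(int)));
}

// Takes the pointer by value: only the local copy is changed and the
// caller's pointer stays as it was. The memory is freed right away so that
// this sketch does not leak.
void allocate_ints_by_value(int * array, std::size_t count)
{
    array = static_cast<int *>(std::malloc(count * sizeof(int)));
    std::free(array);
}

// Changes only the contents, so a plain pointer is enough.
void zero_first(int * array)
{
    array[0] = 0;
}
```

A caller would start with int * p = NULL, call allocate_ints(p, 100), use p, and release it with std::free(p).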
3) Optional parameters for members of a class:
Inside header: SparseDColumn (const SparseTriplets<T> & rhs, bool transpose, MemoryPool * mpool = NULL);
Inside cpp: SparseDColumn<T>::SparseDColumn(const SparseTriplets<T> & rhs, bool transpose, MemoryPool * mpool){...}
The reason for this is natural: other parts of the code that use this class should know that mpool is an optional argument and that it is OK not to supply it. We can only achieve that effect by putting the default value in the header file.
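The same pattern in a compilable form, as a sketch only: Column is a hypothetical stand-in for SparseDColumn<T>, and MemoryPool is an empty placeholder type. The default value appears once, in the declaration, and is not repeated in the definition.

```cpp
#include <cstddef>

struct MemoryPool {};  // empty placeholder for the real memory pool type

// What would go into the header: the default value is visible to callers.
class Column {
public:
    Column(std::size_t nnz, bool transpose, MemoryPool * mpool = NULL);
    bool uses_pool() const;
    std::size_t nnz() const;
    bool transposed() const;
private:
    std::size_t nnz_;
    bool transpose_;
    MemoryPool * mpool_;
};

// What would go into the .cpp file: the default value is NOT repeated here.
Column::Column(std::size_t nnz, bool transpose, MemoryPool * mpool)
    : nnz_(nnz), transpose_(transpose), mpool_(mpool) {}

bool Column::uses_pool() const { return mpool_ != NULL; }
std::size_t Column::nnz() const { return nnz_; }
bool Column::transposed() const { return transpose_; }
```

Callers can then write either Column a(10, false) or Column b(10, true, &pool). Repeating the default in the definition as well is a compile error.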
4) This is about printing pointer addresses using cout.
If you're gonna do any pointer arithmetic, it should be on a pointer to a well-defined type (you can't do arithmetic on void*).
Yet, if you're gonna print an address, first cast it to void*. Otherwise, the overloaded operator<< of cout (and of any ofstream) will pick a different overload and print something else. In the case of char*, for example, it will try to print the string itself instead of the address.
       // char * begin, size_t size 
       outfile << "chunk " << i << " starts:" <<  (void*) begin << " and ends: " << (void*) (begin + size) << endl;
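A small self-contained sketch of the same idea, writing into a std::ostringstream instead of a file so the result can be inspected. The function names are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Formats the address range [begin, begin + size) of a chunk. The
// arithmetic is done on char*, the printing on void*.
std::string describe_chunk(int i, const char * begin, std::size_t size)
{
    std::ostringstream out;
    out << "chunk " << i << " starts: "
        << static_cast<const void *>(begin)
        << " and ends: "
        << static_cast<const void *>(begin + size);
    return out.str();
}

// Without the cast, operator<< treats the char* as a C string and prints
// the characters it points to.
std::string print_without_cast(const char * begin)
{
    std::ostringstream out;
    out << begin;
    return out.str();
}
```

The exact text of the addresses depends on the platform, so only the string case has a predictable output.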
5) If you're calling a C library from C++ code, you have to wrap the declarations of the C functions you're calling in an extern "C" {...} block (or, if you're sure that the whole library is C++ free, you can wrap the whole #include <myclib.h> directive). More info: http://www.parashift.com/c++-faq-lite/mixing-c-and-cpp.html
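A sketch of what the extern "C" wrapping for a C library looks like. The function myclib_add is hypothetical; in a real project its definition would live in a .c file compiled by a C compiler, and it is defined here with C linkage only to keep the example self-contained.

```cpp
// What a C header that is safe to include from C++ typically contains.
// The #ifdef guards make the same header usable from plain C as well.
#ifdef __cplusplus
extern "C" {
#endif

int myclib_add(int a, int b);  /* hypothetical C function */

#ifdef __cplusplus
}
#endif

// Stand-in for the C implementation (normally in a separate .c file).
extern "C" int myclib_add(int a, int b)
{
    return a + b;
}

// C++ code can now call the function; the linker looks for the unmangled
// C symbol name instead of a C++ mangled one.
int call_from_cpp()
{
    return myclib_add(2, 3);
}
```

If the header itself has no such guards, wrapping the #include in extern "C" { ... } from the C++ side has the same effect.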
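6) Difference between the various types of quotes in the Linux shell. More info: http://blackfin.uclinux.org/gf/project/bfin-colinux/forum/?action=ForumBrowse&forum_id=116&_forum_action=ForumMessageBrowse&thread_id=1410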
CUDA Stuff
Running decuda:
>> make data/dlur.cubin
>> python decuda dlur.cubin > mygemm.decuda
Start-up script:
Inside /etc/rc.local, you'll see the following line:
./root/nvidia.sh start
CUDA SDK 1.1 caveat:
>> make USECUBLAS=1 verbose=1
Check the version of the driver:
>> cat /proc/driver/nvidia/version
Rule: Don't try to mix incompatible NVIDIA device drivers with CUDA SDKs.
For example, driver 17x.xx only works with CUDA 2.0, not CUDA 1.1.
Look here: [[8]]
Our NVIDIA 8800 Ultra is installed on a 4x PCI Express port. It should be switched to a 16x port.
Here is how to get extra info:
[root@neumann cpufreq]# nvclock -i
-- General info --
Card: nVidia Geforce 8800Ultra
Architecture: NV50/G80 A3
PCI id: 0x194
GPU clock: 648.000 MHz
Bustype: PCI-Express
-- Shader info --
Clock: 1512.000 MHz
Stream units: 128 (11111111b)
ROP units: 24 (111111b)
-- Memory info --
Amount: 768 MB
Type: 384 bit DDR3
Clock: 1152.000 MHz
-- PCI-Express info --
Current Rate: 4X
Maximum rate: 16X
-- Sensor info --
Sensor: Analog Devices ADT7473
Board temperature: 48C
GPU temperature: 57C
Fanspeed: 82 RPM
Fanspeed mode: manual
PWM duty cycle: 60.0%
-- VideoBios information --
Version: 60.80.18.00.12
Signon message: G80 P355 SKU 0002 VGA BIOS
Performance level 0: gpu 660MHz/shader 1512MHz/memory 1150MHz/1.35V/100%
VID mask: 3
Voltage level 0: 1.10V, VID: 0
Voltage level 1: 1.20V, VID: 1
Voltage level 2: 1.35V, VID: 2
Once you have done everything listed in the next section (X11VNC), you can tweak the clocks as follows:
1) Log into Neumann, and start X11VNC (with the option to create a real physical X session)
2) Connect with ssvnc.exe to neumann:0
3) Adjust with (as root):
>> nvclock -b coolbits -n 500
YES !!!
Painful X11VNC Connections
- Before installing x11vnc, make sure your system has the XTEST extension (libXtst). Otherwise keyboard strokes won't work. The x11vnc ./configure script will actually warn you if you don't have it.
>> ./configure > myconfig.txt
       configure: WARNING:
       A working build environment for the XTEST extension was not found 
       (libXtst).  An x11vnc built this way will be only barely usable.
       You will be able to move the mouse but not click or type.  There can
       also be deadlocks if an application grabs the X server.
       It is recommended that you install the necessary development packages
       for XTEST (perhaps it is named something like libxtst-dev) and run
       configure again.
- So you do:
>> yum install libXtst-devel
- You'll also need a dummy frame buffer (that you can do after installing X11VNC too)
>> yum install xorg-x11-server-Xvfb
- Then start x11vnc server (-create option opens up a new X for you)
>> x11vnc -create
Note that the -create option is an alias for "-display WAIT:cmd=FINDCREATEDISPLAY-Xvfb".
- Connect from client (actually you're tunneling through ssh and connecting VNC through localhost:0 of the remote machine)
>> ./ssvnc -ssh cs290N@neumann.cs.ucsb.edu:0
- Now, you'll see that you need an x-server... Something missing on Neumann
>> yum install gdm
>> whereis gdm
- For those of you who wanna cry, Neumann probably doesn't have an "X" either
>> yum grouplist
>> yum groupinstall "X Window System" "GNOME Desktop Environment"
- Now the problem seems to be the following: 
X Window System on the remote machine doesn't use the NVIDIA drivers. That might be due to the lack of monitor (impossible to fix) or we may find a way out... Just in case keep in mind that it's possible to see a "normal X session" via X forwarding VNC (but not with X forwarding). 
1) Xvfb :15 -screen 0 800x600x16 -ac & (creates an X server with an 800x600 screen on display :15 and disables access controls). Check if it really went to the background by typing "jobs"
2) nvidia-settings --display=:15 (starts nvidia-settings in display : 15)
3) x11vnc -create -localhost -display :15
4) Connect through ssh-vnc viewer... (cs290N@neumann.cs.ucsb.edu:0)
- To see what's going on with X11
>> more /var/log/Xorg.0.log
Which says:
>> (WW) NVIDIA: No matching Device section for instance (BusID PCI:4:0:0) found
But "lspci" says:
>> 04:00.0 VGA compatible controller: nVidia Corporation Unknown device 0194
So, there is indeed a device on that Bus, why does it complain? Because that information is missing from "/etc/X11/xorg.conf"
>> nvidia-xconfig --cool-bits=1
This will update xorg.conf, but you also need to add an extra line for the second GPU (NVIDIA 8800 Ultra):
       Section "Device"
               Identifier      "NVIDIA 8800 Card"
               Driver          "nvidia"
               BusID           "PCI:4:0:0"
               Option          "TwinViewXineramaInfoOrder" "DFP-0"
       EndSection
Finally, update the "Screen" section accordingly (should have "NVIDIA 8800 Card" instead of the other).
- To see supported extensions (make sure NV-Control is listed):
>> xdpyinfo | more
No! Not done (yet). This is also important: Headless server, http://www.karlrunge.com/x11vnc/#faq-headless
>> x11vnc -display WAIT:cmd=FINDCREATEDISPLAY-X --> also does startx ! but it doesn't seem to read xorg.conf, and doesn't give any command line in vncviewer
- Ok, now when X starts, we get the following error:
could not open default font 'fixed'
Problem is that your font server (xfs) is not running by default. So, start it (as root):
>> /etc/init.d/xfs start
