NR Group logo

Numerical Relativity at UT Brownsville

Installing Cactus WaveDemo in the Lobizon cluster

1.1 Compiling program files

Follow the instructions on installing Cactus and other utilities to install and compile Cactus on the master computer (in this case, Lobizon). The following steps assume that the main directory of Cactus (the directory that's created after issuing the command "gmake WaveDemo") is: /home/Cactus. You have to compile WaveDemo (or whatever other program you want to use) before distributing it to the cluster.

1.2 Distributing WaveDemo in the cluster

The easiest way to distribute (install) WaveDemo to all nodes is by making a perl script in which you can put commands as if installing to a single node. There is a complete version of the script on the master computer of Lobizon cluster. The following code on perl is for reference only. If you want to install WaveDemo on a different cluster or make changes in it, you have to acquire the complete version at Lobizon. We present a portion of the code so you can understand what it does. The numbers on the left are for reference only.
 cd /home/Cactus
 vi CopyWaveDemotoAll
this will start vi in the Cactus directory and create a new file named CopyWaveDemotoAll (remember this is only a portion of the file). In it, put:
 1---$start=0;
 2---$end=96;
 3---$filename1="/home/Cactus/WaveDemo.par";
 4---$filename2="/home/Cactus/exe/cactus_WaveDemo";
 5---
 6---for($i=$start;$i<$end;$i++)
 7---{
 8---system("echo ...Copying $filename1 and $filename2 to $ip[$i]");
 9---system("rcp $filename1 $ip[$i]:/Cactus/");
 10--system("rcp $filename2 $ip[$i]:/Cactus/exe");
 11--
 12--if($i!=end)
 13--{sleep 0;
 14--}
Lines 1 and 2 are to specify the starting and ending nodes. Lines 3 and 4 contain variables were you can assign files to it, as if they were pointers to them. Make sure you specify the absolute path. The command in line 8 prints to the screen the current iteration in the for cycle. You can use it to keep track of the progress of the script while it is executing. Lines 9 and 10 copy the files you specified to the current remote node. Instead of working with variables, you can use absolute paths, when handling the files if you want. The success on running WaveDemo depends only in the executable file of WaveDemo and on the parameter file. This is why you just have to copy these files to all nodes in the cluster. To finally install these files to all nodes simply put in your shell (at the master computer):
 perl CopyWaveDemotoAll
Once you've installed WaveDemo, change the security options on the directory containing Cactus on the master and all the nodes, so that every node in the group can read, execute and write on remote nodes. This is done by the command:
 chmod -R 777 Cactus
on the directory containing Cactus.

2. Run WaveDemo

First of all, you have to create a file that contains all the names of the machines you want to run WaveDemo, eg. all nodes. On the Lobizon cluster the nodes can be specified by their relative position in the cluster: from n000 to n437. The first of them meaning the rack in where they are, the second number is the level and the third number is the machine number. Each rack has 3 levels and each level has 8 machines. Running WaveDemo in parallel is as directed in section 7 of installing Cactus and other utilities, but this time, there's no simulation, it's the real deal. You may put X as 1, 2 or any number you want up to 96. Don't forget to run lamboot before running in parallel. You must login to one node (generally the first one) of the cluster from Lobizon itself (the master node) using the following command:
 rlogin n000
Then, you must login as a user:
 su myusername
and go to Cactus directory:
 cd /home/Cactus
You can now issue the command:
 lamboot -v machines
where machines is the name of the text file you used to put all the nodes. As you see, we can run any application from a node, not necessarily the master. You now can run:
 mpirun -np X ./exe/cactus_WaveDemo WaveDemo.par
with X being the number of nodes you want to use. Before you run WaveDemo in the cluster you would probably want to read the next section if you want efficient results.

3. WaveDemo results in the cluster.

Before we show you the results of running WaveDemo in the cluster, we'll show how to alter the parameter file to view the necessary results. First of all, the parameter file was changed in to ways:

-Signal the program to output time data
-Modify the parameter file for specific run options

The most important being the first.

-Signal the program to output time data
This means that you have to add to the parameter files the following lines:
 
pugh::timer_output              	 ="yes"
cactus::cctk_timer_output               ="full"
The first line indicates WaveDemo to display two numbers, communication time in:
-gettimeofday
-getrusage
The line corresponding to gettimeofday shows the time spent in communication above ALL processes (or nodes). The line corresponding to getrusage shows the time spent in communication for the spawning node (the one you've used to run WaveDemo).
The second line puts a nice number of data containing, in detail, how long the program took to finish and under what circumstances in (again) two different columns: -gettimeofday
-getrusage
gettimeofday( ) returns the time (in seconds and msecs) since 01/01/1970 and thus can be used to measure differences between calls to this function. Therefore, it is used to measure the total (real) time spent for running WaveDemo. getrusage( ) counts the CPU time that the calling process has used since startup. So this is used by Cactus to denote how much CPU time that a Cactus job was using (eg. for calculations) not including any time spent for I/O.

-Modify the parameter file for specific run options
You can modify the parameter file to accommodate to your needs how the program will be run. For example, the default number of iterations of WaveDemo is 50000, you can change this if you want the application not to delay to long. Another example is how many grid points you want to include in your simulation, this also affects directly the time the application will run.

And now we'll show you some results of running WaveToyDemo in the Lobizon cluster. To actually see efficient results in the cluster it was necessary to change the parameter file. By default, the parameter file puts the algorithm to work on a single "chunk" of space for each job, a permanent set of grid points in which nodes evolve. In other words, the grid points are set to "local" variables. This means that if you change the number of nodes working on the problem, the simulated "space" also grows in the same proportion, regardless. For example, let's imagine the grid points are set to 40 (in only one dimension, to ease matters) and we run the demo in one single node. When we run the application in two nodes the real time is more than the run on one node, because the grid points are now 80 for both nodes (40 points for each one). The solution to this problem was to change these lines in the parameter file:

       
driver::local_nx                
 = 40
    driver::local_ny                = 40
    driver::local_nz                = 40

to:

       
driver::global_nx                
 = 40
    driver::global_ny                = 40
    driver::global_nz                = 40

These changes tell the application to work on the same size of "space" (or grid points) regardless the number of nodes working on them, almost equivalently distributing the grid points among all nodes. There are some other more or less important details to change in the parameter file that influence the time a job is finished and the efficiency of it. We'll explain one by one in the proper time. We are now ready to run WaveToyDemo efficiently in parallel.We have some testing graphs of the results of running WaveToyDemo in the Lobizon cluster. The very first thing you would probably want to know is if your cluster runs correctly so you can use it to run programs of high-demanding computing time. The question is: will the cluster run more quickly an application with 2 nodes than one? This question is answered as simply as checking the time spent for a job to complete itself. Logicaly, the more working nodes we use, the faster the application will run. This is not a general rule. It involves many things such as, which problem one wants to solve (not all problems are easily or even solely parallelizable), how efficient the algorithm is at solving the problem, how much communication is necessary in the problem, etc.
There are two general types of parallel problems: CPU and I/O bound. If  one application is CPU bound, it means that it is necessary more processing power to resolve it and it does not need to much of a communication between processing units or nodes to solve the problem. At the contrary, if one application is I/O bound, it means that the processing units spend to much of its time communicating with other nodes, rather than processing power. It's clearly visible that an I/O application is not intended to be run in distributed "cluster" computing, especially if its communication architecture is not even comparable to one of  a supercomputer, because the machine will spend more time trying to communicate that the time "saved" in the parallelization process. WaveToy is somewhere in between these concepts, depending on the parameters you use to run it. For example, the grid size of the variables before shown is 40. We will have more of an CPU bound problem if we for example, change the grid size to 100. The reason is that by increasing the grid size of the problem, we'll add more points that have to be CPU processed and thus increase the CPU/I/O ratio. Another thing we changed in the parameter file was to comment out the variables that permit output (periodically placed results) to files leaving only the basic ones. By this we neglected more the use of communication between nodes, and thus permitting WaveDemo becoming much more of a CPU bound problem. In the next link we put the complete parameter file we've changed so you can compare it to the default: file:archives/WaveDemo.par
And now we present and explain the next graph:


Speedup is the ratio between the real time WaveDemo finished sequentially (in 1 node) and the time it finished in X number of nodes . For example if a job was run sequentially in 10 sec. and it was run in 6 sec. when run in two nodes, the "speedup" was of 1.67, meaning that it was run 1.67 times faster than sequentially. Now you have probably noted 4 different lines in the graph. The reason for this is simple. The distribution of the simulated space among nodes, depends on how well they fit to the most efficient topology for the given number of them. There are certain number of nodes which are more efficient than others at run time. Therefore, we've classified them in 4 different classes (be wary it's only applicable to WaveDemo). Primes, least symmetric pairs, first order pairs (having anything in common with "first order") and highly symmetric pairs. In that order, the topology of each of these classes, is such as to increasing performance at run time. This is because, the WaveDemo algorithm needs neighbor to neighbor communication to evolve. This is of great importance at frontiers between the spaces corresponding at adjacent nodes and the more information one node needs from another the more time it will require in communication time. In prime number of nodes used to work on WaveDemo the distribution is such that overcomes the communication needs compared to the other 3 classes. At the contrary, the "highly symmetric pairs" class is the most efficient at this respect. This is why the smoothed lines representing speedup are better for "highly symmetric pairs" than for "prime" number of nodes. The "prime" class is defined by the usual sense of the word: if the resulting sum of nodes involved at run time is prime, they are fit to the "primes" class. One example of the primes class is when using 11 processing units. This number cannot be divided in any way so, the distribution is such as to leave maximum space of communication between frontiers, for example, 11 divisions placed one one dimension and no division on the other 2 dimensions. The "highly symmetric pairs" class contains those unions of nodes that when used are distributed in such a way as to minimize the communication time between processes, they are called so because of the symmetry in their topology: for example 8 working nodes are part of this class because the topology is 2 nodes in one dimension, 2 on another and 2 on the final one. Imagine a cube with 40 "grid" points on each side so, with 8 nodes, the cube will torn in 8 equal pieces, with each node having it's own cube of size 20*20*20 leaving minimum frontiers for needed communication. Highly symmetric pairs are characterized be either having equal distributions among nodes to form lesser cubes (topologies of X*X*X divisions, for example 8) or symmetric topologies of Y*X*X divisions (for example 18 nodes, 2*3*3). It's easier now to understand the "least symmetric pairs" class. This is based on all groupings which nodes sum to a pair number, but have no symmetric topology at all. For example, using 51 nodes results in a 1*3*17 anti-symmetric topology. The "first order pairs" class is all that is left by now, for example 20: 2*2*5. The majority of groups in the "first order pairs" are comparable in performance to that of "highly symmetric pairs" ones, but it is guaranteed that the last one always has the better performance of all. In the next graph, we present the efficiency nodes for the same test. Efficiency is defined as the speedup achieved by X nodes, divided by X.




As you can easily see, the efficiency of the "primes" class decreases rapidly as more number of nodes is added. The behavior of the "least symmetric pairs" class is better but, erratic, and the efficiency of the "highly symmetric pairs" is the most predominant of all, just barely above to that of the "first order pairs" class. As said before, the main reason for these differences is the time a job has to spend in I/O. Although it was intended to do it so, one can never fully change the behavior of WaveDemo to CPU predominant, possibly because the computation of evolving grid points are not heavy enough, so cases like these among similar programs it is mandatory to use the machinery as efficiently and reliable as possible. Reliability is affected when using the "primes" class above 43 nodes, because we had problems running WaveDemo at this level. The next graph demonstrates this:



As clearly stated, this shows the percentage of time a job spent using communication. At the most severe cases, this went up to more than 70% of the time. Apparently, a CPU bound problem should not exceed more than 30% of their time spent in communication. This can be done, as said before, by choosing the right and most efficient solving algorithms to one problem. And as the general rule says, "conventional cluster computing performance is better achieved by using CPU bound problems". There is one more important remark to end this lecture. To achieve the best performance possible with given parameters, one has to exploit as maximum as possible the existent cluster hardware. By this we mean using each node's real memory as far as possible to a given simulation, maybe 95%. For example, the past 3 graphs where made with 100 grid points each at 10000 iterations (other parameters neglected for minimal importance). The next graph was made from 180 to 230 grid points at 50 iterations each run. The number of iterations greatly affects the CPU/I/O ratio. Having more iterations means better chance of increasing this ratio.



It clearly shows that by increasing the grid points betters the performance achieved. We remember the speedup with 2 nodes with 100 grid points and 10000 iterations was 1.7277, but the best speedup shown in the graph is 2.8379! We did not even hesitate to draw any more points to the graph because it would fall far to above! to give you an example, the speedup at 240 grid points was that of 36.084! Why is it that changing the grid points might affect the overall performance? you might ask, and the reason couldn't be more simpler. The number of grid points affects directly, the space in real memory needed to store all data.100 grid points means a computer must compute 1000000 (100*100*100) points at each passing iteration, because there's 100 points for each dimension. There would be a limit, though, in the space of real memory available for storing such immense number of points, and where's no more available space in town, for example when managing 240 grid points (13824000 points) in one node, the computer then switches to "virtual" memory, to allocate the necessary space to store all data. Virtual memory is resident in the hard disk, and the time it takes for the CPU to reach data in the hard disk is far more time spending than reaching some data on real memory. This is why it takes the computer far more time with increasing numbers of points, and the main reason for speedups higher than 2 (for 2 nodes) when the theoretical limit should be 2. Isn't that interesting?
Now that we have demonstrated the need to put as much grid points distributed among all nodes as possible in order to greatly increase the performance, we'll show you the neatest thing. In the next graph:



we have estimated the needed grid size for maximum memory usage and thus, for maximum cluster performance up to 64 nodes. Experimenting, we saw the limit for 1 node was 200 and the limit for two nodes, exceeding 260. As before, the most logical thing to think is that by doubling the number of nodes, it doubles the space available for computation, which again is not a general rule. The thing it doubles though is the mean "points per byte per node". For example, under real memory-only usage, there are (an average taken from 180-200 grid points) 17.693 points per byte under one node and, there are an average (taken from 180-260 grid points) of  34.884 points per byte per node with 2 working processing units, obviously this are not the actual points per byte per node (or KB), but it is a helping figure. This is taking into account that each node has 512Mb RAM. The past graph was made by the next equation:

figuring an almost lineal dependence of the KB's with changing nodes.