Installing Cactus WaveDemo in the Lobizon cluster
1.1 Compiling program files
Follow the instructions on
installing Cactus and other
utilities to install and compile Cactus on the master computer (in
this case, Lobizon). The following steps assume that the main
directory of Cactus (the directory that's created after issuing the
command "gmake WaveDemo") is: /home/Cactus. You have to compile
WaveDemo (or whatever other program you want to use) before
distributing it to the cluster.
1.2 Distributing WaveDemo in the cluster
The easiest way to distribute (install)
WaveDemo to all nodes is by making a perl script in which you can put
commands as if installing to a single node. There is a complete
version of the script on the master computer of Lobizon cluster. The
following code on perl is for reference only. If you want to install
WaveDemo on a different cluster or make changes in it, you have to
acquire the complete version at Lobizon. We present a portion of the
code so you can understand what it does. The numbers on the left are
for reference only.
cd /home/Cactus
vi CopyWaveDemotoAll
this will start vi in the Cactus directory and create
a new file named CopyWaveDemotoAll (remember this is only a portion of
the file). In it, put:
1---$start=0;
2---$end=96;
3---$filename1="/home/Cactus/WaveDemo.par";
4---$filename2="/home/Cactus/exe/cactus_WaveDemo";
5---
6---for($i=$start;$i<$end;$i++)
7---{
8---system("echo ...Copying $filename1 and $filename2 to $ip[$i]");
9---system("rcp $filename1 $ip[$i]:/Cactus/");
10--system("rcp $filename2 $ip[$i]:/Cactus/exe");
11--
12--if($i!=end)
13--{sleep 0;
14--}
Lines 1 and 2 are to specify the starting and
ending nodes. Lines 3 and 4 contain variables were you can assign
files to it, as if they were pointers to them. Make sure you specify
the absolute path. The command in line 8 prints to the screen the
current iteration in the for cycle. You can use it to keep track of
the progress of the script while it is executing. Lines 9 and 10
copy the files you specified to the current remote node. Instead of
working with variables, you can use absolute paths, when handling
the files if you want. The success on running WaveDemo depends only
in the executable file of WaveDemo and on the parameter file. This
is why you just have to copy these files to all nodes in the
cluster. To finally install these files to all nodes simply put in
your shell (at the master computer):
perl CopyWaveDemotoAll
Once you've installed WaveDemo, change the security
options on the directory containing Cactus on the master and all the nodes,
so that every node in the group can read, execute and write on remote nodes.
This is done by the command:
chmod -R 777 Cactus
on the directory containing Cactus.
2. Run WaveDemo
First of all, you have to create a file that contains
all the names of the machines you want to run WaveDemo, eg. all nodes.
On the Lobizon cluster the nodes can be specified by their relative position
in the cluster: from n000 to n437. The first of them meaning the
rack in where they are, the second number is the level and the third
number is the machine number. Each rack has 3 levels and each level
has 8 machines. Running WaveDemo in parallel is as directed in section
7 of
installing Cactus and other
utilities, but this time, there's no simulation, it's the real
deal. You may put X as 1, 2 or any number you want up to 96. Don't
forget to run lamboot before running in parallel. You must login to
one node (generally the first one) of the cluster from Lobizon itself
(the master node) using the following command:
rlogin n000
Then, you must login as a user:
su myusername
and go to Cactus directory:
cd /home/Cactus
You can now issue the command:
lamboot -v machines
where machines is the name of the text file you used
to put all the nodes. As you see, we can run any application from a node,
not necessarily the master. You now can run:
mpirun -np X ./exe/cactus_WaveDemo WaveDemo.par
with X being the number of nodes you want to use.
Before you run WaveDemo in the cluster you would probably want to read the
next section if you want efficient results.
3. WaveDemo results in the cluster.
Before we show you the results of running WaveDemo
in the cluster, we'll show how to alter the parameter file to view the
necessary results. First of all, the parameter file was changed in to ways:
-Signal the program to output time data
-Modify the parameter file for specific run options
The most important being the first.
-Signal the program to output time data
This means that you have to add to the parameter files
the following lines:
pugh::timer_output ="yes"
cactus::cctk_timer_output ="full"
The first line indicates WaveDemo to display two
numbers, communication time in:
-gettimeofday
-getrusage
The line corresponding to gettimeofday shows the time spent in communication
above ALL processes (or nodes). The line corresponding to getrusage shows
the time spent in communication for the spawning node (the one you've
used to run WaveDemo).
The second line puts a nice number of data containing, in detail, how
long the program took to finish and under what circumstances in (again) two
different columns: -gettimeofday
-getrusage
gettimeofday( ) returns the time (in seconds and msecs) since 01/01/1970
and thus can be used to measure differences between calls to this function.
Therefore, it is used to measure the total (real) time spent for running
WaveDemo. getrusage( ) counts the CPU time that the calling process has
used since startup. So this is used by Cactus to denote how much CPU time
that a Cactus job was using (eg. for calculations) not including any time
spent for I/O.
-Modify the parameter file for specific run options
You can modify the parameter file to accommodate to your needs how the
program will be run. For example, the default number of iterations of WaveDemo
is 50000, you can change this if you want the application not to delay to
long. Another example is how many grid points you want to include in your
simulation, this also affects directly the time the application will run.
And now we'll show you some results of running WaveToyDemo in the
Lobizon cluster. To actually see efficient results in the cluster it was necessary
to change the parameter file. By default, the parameter file puts the algorithm
to work on a single "chunk" of space for each job, a permanent set of grid
points in which nodes evolve. In other words, the grid points are set to
"local" variables. This means that if you change the number of nodes working
on the problem, the simulated "space" also grows in the same proportion,
regardless.
For example, let's imagine the grid points are set to 40 (in only one
dimension, to ease matters) and we run the demo in one single node. When
we run the application in two nodes the real time is more than the run on
one node, because the grid points are now 80 for both nodes (40 points for
each one). The solution to this problem was to change these lines in the
parameter file:
driver::local_nx
= 40
driver::local_ny
= 40
driver::local_nz
= 40
to:
driver::global_nx
= 40
driver::global_ny
= 40
driver::global_nz
= 40
These changes tell the application to work on the same size of "space"
(or grid points) regardless the number of nodes working on them, almost
equivalently
distributing the grid points among all nodes. There are some other more
or less important details to change in the parameter file that influence
the time a job is finished and the efficiency of it. We'll explain one
by one in the proper time. We are now ready to run WaveToyDemo efficiently
in parallel.We have some testing graphs of the results of running WaveToyDemo
in the Lobizon cluster. The very first thing you would probably want to know
is if your cluster runs correctly so you can use it to run programs of
high-demanding
computing time. The question is: will the cluster run more quickly an
application
with 2 nodes than one? This question is answered as simply as checking the
time spent for a job to complete itself. Logicaly, the more working nodes
we use, the faster the application will run. This is not a general rule.
It involves many things such as, which problem one wants to solve (not all
problems are easily or even solely parallelizable), how efficient the algorithm
is at solving the problem, how much communication is necessary in the problem,
etc.
There are two general types of parallel problems: CPU and I/O bound.
If one application is CPU bound, it means that it is necessary more
processing power to resolve it and it does not need to much of a communication
between processing units or nodes to solve the problem. At the contrary,
if one application is I/O bound, it means that the processing units spend
to much of its time communicating with other nodes, rather than processing
power. It's clearly visible that an I/O application is not intended
to be run in distributed "cluster" computing, especially if its communication
architecture is not even comparable to one of a supercomputer, because
the machine will spend more time trying to communicate that the time "saved"
in the parallelization process. WaveToy is somewhere in between these concepts,
depending on the parameters you use to run it. For example, the grid size
of the variables before shown is 40. We will have more of an CPU bound problem
if we for example, change the grid size to 100. The reason is that by increasing
the grid size of the problem, we'll add more points that have to be
CPU processed and thus increase the CPU/I/O ratio. Another thing we changed
in the parameter file was to comment out the variables that permit output
(periodically placed results) to files leaving only the basic ones. By this
we neglected more the use of communication between nodes, and thus permitting
WaveDemo becoming much more of a CPU bound problem. In the next link we put
the complete parameter file we've changed so you can compare it to
the default:
file:archives/WaveDemo.par
And now we present and explain the next graph:
Speedup is the ratio between the real time WaveDemo
finished sequentially (in 1 node) and the time it finished in X number of
nodes . For example if a job was run sequentially in 10 sec. and it was run
in 6 sec. when run in two nodes, the "speedup" was of 1.67, meaning that
it was run 1.67 times faster than sequentially. Now you have probably noted
4 different lines in the graph. The reason for this is simple. The distribution
of the simulated space among nodes, depends on how well they fit to the most
efficient topology for the given number of them. There are certain number
of nodes which are more efficient than others at run time. Therefore, we've
classified them in 4 different classes (be wary it's only applicable
to WaveDemo). Primes, least symmetric pairs, first order pairs (having anything
in common with "first order") and highly symmetric pairs. In that order, the
topology of each of these classes, is such as to increasing performance at
run time. This is because, the WaveDemo algorithm needs neighbor to neighbor
communication to evolve. This is of great importance at frontiers between
the spaces corresponding at adjacent nodes and the more information one node
needs from another the more time it will require in communication time. In
prime number of nodes used to work on WaveDemo the distribution is such that
overcomes the communication needs compared to the other 3 classes. At the
contrary, the "highly symmetric pairs" class is the most efficient at this
respect. This is why the smoothed lines representing speedup are better for
"highly symmetric pairs" than for "prime" number of nodes. The "prime" class
is defined by the usual sense of the word: if the resulting sum of nodes
involved at run time is prime, they are fit to the "primes" class. One example
of the primes class is when using 11 processing units. This number cannot
be divided in any way so, the distribution is such as to leave maximum space
of communication between frontiers, for example, 11 divisions placed one
one dimension and no division on the other 2 dimensions. The "highly symmetric
pairs" class contains those unions of nodes that when used are distributed
in such a way as to minimize the communication time between processes, they
are called so because of the symmetry in their topology: for example 8 working
nodes are part of this class because the topology is 2 nodes in one dimension,
2 on another and 2 on the final one. Imagine a cube with 40 "grid" points
on each side so, with 8 nodes, the cube will torn in 8 equal pieces, with
each node having it's own cube of size 20*20*20 leaving minimum frontiers
for needed communication. Highly symmetric pairs are characterized be either
having equal distributions among nodes to form lesser cubes (topologies of
X*X*X divisions, for example 8) or symmetric topologies of Y*X*X divisions
(for example 18 nodes, 2*3*3). It's easier now to understand the "least
symmetric pairs" class. This is based on all groupings which nodes sum to
a pair number, but have no symmetric topology at all. For example, using 51
nodes results in a 1*3*17 anti-symmetric topology. The "first order pairs"
class is all that is left by now, for example 20: 2*2*5. The majority of
groups in the "first order pairs" are comparable in performance to that of
"highly symmetric pairs" ones, but it is guaranteed that the last one always
has the better performance of all. In the next graph, we present the efficiency
nodes for the same test. Efficiency is defined as the speedup achieved by
X nodes, divided by X.
As you can easily see, the efficiency of the "primes"
class decreases rapidly as more number of nodes is added. The behavior
of the "least symmetric pairs" class is better but, erratic, and the efficiency
of the "highly symmetric pairs" is the most predominant of all, just barely
above to that of the "first order pairs" class. As said before, the main
reason for these differences is the time a job has to spend in I/O. Although
it was intended to do it so, one can never fully change the behavior of
WaveDemo to CPU predominant, possibly because the computation of evolving
grid points are not heavy enough, so cases like these among similar programs
it is mandatory to use the machinery as efficiently and reliable as possible.
Reliability is affected when using the "primes" class above 43 nodes, because
we had problems running WaveDemo at this level. The next graph demonstrates
this:
As clearly stated, this shows the percentage of time
a job spent using communication. At the most severe cases, this went up to
more than 70% of the time. Apparently, a CPU bound problem should not exceed
more than 30% of their time spent in communication. This can be done, as
said before, by choosing the right and most efficient solving algorithms to
one problem. And as the general rule says, "conventional cluster computing
performance is better achieved by using CPU bound problems". There is one
more important remark to end this lecture. To achieve the best performance
possible with given parameters, one has to exploit as maximum as possible
the existent cluster hardware. By this we mean using each node's real
memory as far as possible to a given simulation, maybe 95%. For example, the
past 3 graphs where made with 100 grid points each at 10000 iterations (other
parameters neglected for minimal importance). The next graph was made from
180 to 230 grid points at 50 iterations each run. The number of iterations
greatly affects the CPU/I/O ratio. Having more iterations means better chance
of increasing this ratio.
It clearly shows that by increasing the grid points
betters the performance achieved. We remember the speedup with 2 nodes with
100 grid points and 10000 iterations was 1.7277, but the best speedup shown
in the graph is 2.8379! We did not even hesitate to draw any more points
to the graph because it would fall far to above! to give you an example,
the speedup at 240 grid points was that of 36.084! Why is it that changing
the grid points might affect the overall performance? you might ask, and
the reason couldn't be more simpler. The number of grid points affects
directly, the space in real memory needed to store all data.100 grid points
means a computer must compute 1000000 (100*100*100) points at each passing
iteration, because there's 100 points for each dimension. There would
be a limit, though, in the space of real memory available for storing such
immense number of points, and where's no more available space in town,
for example when managing 240 grid points (13824000 points) in one node,
the computer then switches to "virtual" memory, to allocate the necessary space
to store all data. Virtual memory is resident in the hard disk, and the time
it takes for the CPU to reach data in the hard disk is far more time spending
than reaching some data on real memory. This is why it takes the computer
far more time with increasing numbers of points, and the main reason for
speedups higher than 2 (for 2 nodes) when the theoretical limit should be
2. Isn't that interesting?
Now that we have demonstrated the need to put as much grid points distributed
among all nodes as possible in order to greatly increase the performance,
we'll show you the neatest thing. In the next graph:
we have estimated the needed grid size for maximum
memory usage and thus, for maximum cluster performance up to 64 nodes.
Experimenting,
we saw the limit for 1 node was 200 and the limit for two nodes, exceeding
260. As before, the most logical thing to think is that by doubling the number
of nodes, it doubles the space available for computation, which again is
not a general rule. The thing it doubles though is the mean "points per byte
per node". For example, under real memory-only usage, there are (an average
taken from 180-200 grid points) 17.693 points per byte under one node and,
there are an average (taken from 180-260 grid points) of 34.884 points
per byte per node with 2 working processing units, obviously this are not
the actual points per byte per node (or KB), but it is a helping figure.
This is taking into account that each node has 512Mb RAM. The past graph
was made by the next equation:
figuring an almost lineal dependence of the KB's
with changing nodes.