The ParGAP (Parallel GAP) package provides a way of writing parallel programs using the GAP language. Former names of the package were ParGAP/MPI and GAP/MPI; the word MPI refers to Message Passing Interface, a well-known standard for parallelism. ParGAP is based on the MPI standard, and this distribution includes a subset implementation of MPI, to provide a portable layer with a high level interface to BSD sockets. Since knowledge of MPI is not required for use of this software, we now refer to the package as simply ParGAP. For more information visit the author's ParGAP home page at: http://www.ccs.neu.edu/home/gene/pargap.html
For some background reading, see Coo95 and Coo97.
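This first chapter is intended to help a new user set up ParGAP and run through some quick examples. Note that ParGAP cannot be started simply by calling LoadPackage from an ordinary GAP session; see Section Installing ParGAP.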
The later chapters present detailed explanations of the facilities of ParGAP. Because parallel programming is sufficiently different from sequential programming, this author recommends printing out at least Chapters 1 through MasterSlave Tutorial, and skimming through those chapters for areas of interest, before returning to the terminal to try out some of the ideas. This document can be found in .../pkg/pargap/doc/manual.dvi of the software distribution. You may also want to print the index at the end of manual.dvi. In particular, the heading example in the index, or ??example from within GAP, should be useful. If you prefer postscript, the UNIX command dvips will convert that file to postscript form.
The development of ParGAP was partially supported by National Science Foundation grants CCR-9509783 and CCR-9732330.
ParGAP is installed on top of an existing GAP installation. It comes with its own subset MPI implementation (currently functional only on UNIX installations), or it can use your system MPI libraries, if present. See Section Installing ParGAP for instructions on installation of ParGAP. At the time that ParGAP is invoked, a special file or command line parameter must be used to tell ParGAP how many local processes or which remote machines to use for slave processors. See section Running ParGAP for instructions on invoking ParGAP. If there are questions or bugs concerning ParGAP, please write to: gene@ccs.neu.edu
If one wishes only to try out the parallel features, the first five pages of this manual (through the section on the slave listener) will suffice for installing and using ParGAP. For the more advanced user who wishes to design new parallel algorithms or port old sequential code to a parallel environment, it is strongly recommended to also read the sections following on from Section Basic Concepts for the TOP-C model (MasterSlave).
ParGAP should be invoked via the script bin/pargap.sh created by the installation process, which invokes GAP_ROOT_DIR/bin/ARCH/pargapmpi, where ARCH depends on your system but is the same directory in which the gap binary is found. MPI and the higher layers will not be available if the binary is invoked in the standard way as gap. This is a feature, since a single binary and source distribution serves both for the standard GAP and for ParGAP.
ParGAP is implemented in three layers: 1) MPI, 2) Slave Listener, and 3) Master Slave (TOP-C abstraction). Most users will find that the two highest layers (Slave Listener and Master Slave) meet all their needs.
1) MPI:
MPI calls enter an Error break in the presence of errors. MPI_Init() (see MPI_Init) and MPI_Finalize() (see MPI_Finalize) are invoked automatically by ParGAP.
Type MPI_ followed by tab tab to see all implemented MPI functions and variables. Typing the symbol name alone (e.g.: MPI_Send;) will display its calling syntax. The same information is displayed after an incorrect call. The return value is typically obvious. MPI is implemented in src/pargap.c. ParGAP will use a system MPI implementation if one is present. The distribution also includes two versions of a simple subset implementation of MPI, in pkg/gapmpi/mpinu/ and pkg/gapmpi/mpinu2/, built on top of a standard sockets interface, which can be used instead.
2) Slave Listener:
The functions of this layer have names matching *Msg*, e.g. SendMsg() (see SendMsg), RecvMsg() (see RecvMsg), ProbeMsg() (see ProbeMsg). Since the slave is in a receive-eval-send loop, every SendMsg(cmd) on the master must be balanced by a later RecvMsg(). SendRecvMsg() (see SendRecvMsg) is provided to combine these steps. A few parallel utilities are also included, such as ParRead() (see ParRead), ParList() (see ParList), ParEval() (see ParEval), etc.
Arguments to SendMsg() or ParEval() would be evaluated locally before being sent across the network. For this reason, arguments can also be given as strings, to delay evaluation until reaching the destination process. Hence, real strings must be quoted:
  ParEval("x:=\"abc\";");
Additionally, multiple commands are valid, and the final `;' of the string is optional. So, one can write:
  BroadcastMsg("x:=\"abc\"; Print(Length(x), \"\\n\")");;
3) Master Slave:
If you are using Linux and wish to try out ParGAP quickly, you can skip this section and let the ParGAP build process choose an MPI library for you. If you have a little more time, or are running on a different system, please read on.
ParGAP uses MPI, a standard Message Passing Interface for communicating between processes. Since the details of inter-process communication are system-specific, ParGAP relies on an external library to provide its MPI functions. An implementation of a sufficient subset of MPI, which runs on Linux and OS X, is included with ParGAP. Alternatively, an MPI library can be installed on your system before building ParGAP. Two popular MPI implementations are Open MPI and MPICH.
The MPINU library included with ParGAP provides the MPI functionality that ParGAP needs by using Unix sockets. This implementation is sufficient for basic ParGAP usage, but does not scale to larger systems as well as the alternative system libraries. It is better suited to interactive ParGAP sessions, since system MPI implementations can result in problems with line editing in ParGAP. When built with MPINU, ParGAP also enables two commands, ParReset() and FlushAllMsgs(), which can be useful when developing parallel programs. See Section Problems Running ParGAP with a System MPI Implementation for details of these known issues with system MPI implementations. Two versions of MPINU are included with ParGAP: the original MPINU and a newer version, called MPINU2.
On Linux machines, we recommend that you use ParGAP with a system MPI implementation instead of MPINU, if possible. These implementations provide better performance and fault tolerance, and are compatible with a wider range of operating systems and hardware, including high speed networks and proprietary high-end computing systems.
On Macs, we recommend using the original MPINU since there are currently some problems running ParGAP with both a system MPI implementation and MPINU2. Both these issues will hopefully be resolved in a future release.
By default, the ParGAP build process (see Section Installing ParGAP) tries to use a system MPI implementation if it can find one. If not, it will use MPINU. Two versions of MPINU are included with this release of ParGAP. The recommended choice is MPINU2, but the original MPINU is included as a backup in case there are problems building or running MPINU2.
Installing ParGAP should be relatively simple. However, since there are many interactions both with the GAP kernel and with the UNIX operating system, in a minority of cases, manual intervention will be necessary. If you are part of this minority, please see the section Problems Installing or Invoking ParGAP. The most common problem is the local security policy; ParGAP is more pleasant to use when you don't have to manually provide the password for each slave. See section Problems with Passwords (Getting Around Security) for suggestions in this respect.
To install the ParGAP package, move the file pargap-XXX.zoo or pargap-XXX.tar.gz (for some version number XXX of ParGAP) into the pkg directory in which you plan to install ParGAP. Usually, this will be the directory pkg in the hierarchy of your version of GAP. (In fact, currently it is not possible to have the pkg directory separate from GAP's pkg directory; we hope to remedy this in future versions of ParGAP so that it will also be possible to keep an additional pkg directory in your private directories. Section Installing a GAP Package of the GAP reference manual gives details on how to do this, when it's possible.)
Now change into the pkg directory in which you plan to install ParGAP. If you got a .zoo file, unpack it with:
  unzoo -x pargap-XXX
If you got a .tar.gz file and your tar command supports the z option, unpack it with:
  tar zxf pargap-XXX.tar.gz
or otherwise unpack in two steps with:
  gunzip pargap-XXX.tar.gz
  tar xvf pargap-XXX.tar
Whether you got the .zoo or .tar.gz archive you should now have a new directory pargap. As for a generic GAP package, do:
  cd pargap
  ./configure
  make
This builds the ParGAP files. ParGAP also needs to rebuild parts of GAP to enable the MPI hooks. It may also need to re-run the GAP configure if you have a dedicated MPI compiler. By default, the ParGAP configure will prompt you to do this by hand if necessary, and then to restart the ParGAP build. If you are happy for the ParGAP build process to run the GAP configure for you if needed, with no arguments, then run ParGAP's configure with
  ./configure --with-basic-gap-configure
The configure script will attempt to find a system MPI implementation that it can use. If it cannot find one, it will use MPINU2, the more recent of the two MPINU subset implementations included with the ParGAP package. You can use the --with-mpi= configure option to specify a different behaviour, and you can also set your own MPI compiler and options if you wish. See the help text provided by ./configure -h for full details.
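The unpacking rules above can be summarised as a small dry-run helper. This is only a sketch: the function name unpack_commands is hypothetical, the archive name is a placeholder, and the helper merely prints the commands described in this section instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that would unpack a ParGAP
# archive, following the steps described in this section. Nothing is run.
unpack_commands() {
    archive=$1
    case "$archive" in
        *.zoo)
            # Mirrors the command shown above: unzoo -x pargap-XXX
            echo "unzoo -x ${archive%.zoo}"
            ;;
        *.tar.gz)
            # Two-step form, for tar commands without the z option
            echo "gunzip $archive"
            echo "tar xvf ${archive%.gz}"
            ;;
        *)
            echo "unknown archive type: $archive" >&2
            return 1
            ;;
    esac
}

# Placeholder archive name; substitute the file you downloaded.
unpack_commands pargap-XXX.tar.gz
```

After unpacking, the cd pargap, ./configure and make steps follow as described above.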
After doing the configure and make steps of ParGAP's installation process (see Section Installing ParGAP), you should find in ParGAP's bin subdirectory a script pargap.sh which you should use to start ParGAP. (ParGAP cannot be started by starting GAP 4 in the usual way and using LoadPackage; doing so will result in Info-ed advice to read this section.) Edit the pargap.sh script if necessary, copy it to a standard path and rename it according to how you intend to call ParGAP (e.g. rename it: pargap).

Note: The script pargap.sh defines the program that runs ParGAP as pargapmpi. In fact, after installation pargapmpi is a symbolic link to the GAP binary named gap. The same binary runs both GAP and ParGAP: when the binary is invoked as gap, GAP runs in the usual way without any parallel features; only when the binary is invoked as pargapmpi are the parallel features incorporated. See Section Modifying the GAP kernel for more details.

Your ParGAP should now be ready to use. Now read the next section, which describes how to run ParGAP (if you are reading this from GAP's on-line help, type: ?>).
After a successful build, you will see a message saying that ParGAP is ready to use, and confirmation of whether a system MPI library or MPINU will be used. The method of running ParGAP depends on this MPI choice, and the MPI library is auto-detected, or can be specified, in configure, as described in Section Installing ParGAP. The pros and cons of the two different library variants are discussed in Section Choosing an MPI Library. We will assume that you have copied the pargap.sh script to a location on your search path and renamed it as pargap, as suggested in Section Installing ParGAP.
If you are using a system MPI library:
ParGAP should be started using an MPI launcher script. The name and syntax of the command to start MPI processes can vary, and you should check your system MPI documentation for details. However, one common launcher is mpiexec, and the following command should work with both Open MPI and MPICH, and most other MPI-2 implementations:
  mpiexec -n 3 pargap
This will start three copies of ParGAP: one master and two slaves. These processes will all run on your local machine. See Section Invoking ParGAP with Remote Slaves (when using a system MPI library) for how to configure and run processes on remote slaves.
If you are using MPINU:
In ParGAP's bin subdirectory you should find a procgroup file which defines the master and slave processes that will be used by ParGAP. When ParGAP is started, the MPINU library looks for a file called procgroup in the current directory, unless the -p4pg option is used. Thus if you renamed your shell script pargap, the following are valid ways of starting ParGAP:
  pargap
(if the current directory contains the file procgroup), or
  pargap -p4pg myprocgroupfile
(where myprocgroupfile is the complete path of your procgroup file; there is no restriction on how you name it). The default procgroup file defines one master and two slaves on the local machine. For instructions on how to run remote slaves, see Section Invoking ParGAP with Remote Slaves (when using MPINU).
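The two ways of starting ParGAP can be put side by side in a small dry-run helper. This is a sketch under stated assumptions: the function name launch_command and the labels system and mpinu are invented for illustration, and the script is assumed to have been installed under the name pargap as suggested above. The helper only prints the command line.

```shell
#!/bin/sh
# Hypothetical helper: print the command line for starting ParGAP.
#   $1  MPI variant: "system" or "mpinu" (labels chosen for this sketch)
#   $2  total number of processes (used only for a system MPI library)
#   $3  optional procgroup path (used only for MPINU)
launch_command() {
    variant=${1:-}
    nprocs=${2:-}
    procgroup=${3:-}
    case "$variant" in
        system)
            echo "mpiexec -n $nprocs pargap"
            ;;
        mpinu)
            if [ -n "$procgroup" ]; then
                echo "pargap -p4pg $procgroup"
            else
                # MPINU then looks for ./procgroup
                echo "pargap"
            fi
            ;;
        *)
            echo "unknown MPI variant: $variant" >&2
            return 1
            ;;
    esac
}

launch_command system 3
launch_command mpinu "" /home/joe/procgroup
```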
If you had trouble installing or starting ParGAP, see the section Problems Installing or Invoking ParGAP. Otherwise you are ready to test your installation. Try the example in the following section (if you are reading this from GAP's on-line help, type: ?>).
After installation, try it out. Invoke ParGAP as described in Section Running ParGAP and try the example below (but substitute your own program where you see "/home/gene/myprogram.g"). The commands in this first example are also found in the README file. So, you may wish to copy text from the README file and paste it into a ParGAP session.
If you have not specified any additional machines to the MPI launcher, or you are using the unmodified procgroup file, then your remote slaves will be other processes on your local machine. It is a good idea to run only on your local machine for your first experiments and while you are debugging parallel programs. When you wish to experiment with using remote machines, you can then proceed to section Invoking ParGAP with Remote Slaves (when using a system MPI library) or section Invoking ParGAP with Remote Slaves (when using MPINU) depending on which MPI library ParGAP has been built to use.
  gap> # This assumes your procgroup file includes two slave processes.
  gap> PingSlave(1); #a `true' response indicates Slave 1 is alive
  true
  gap> # Print() on slave appears on standard output
  gap> # i.e. after the master's prompt.
  gap> SendMsg( "Print(3+4)" );
  gap> 7
  gap> # A <return> was input above to get a fresh prompt.
  gap> #
  gap> # To get special characters (including newline: `\n')
  gap> # into a string, escape them with a `\'.
  gap> SendMsg( "Print(3+4,\"\\n\")" );
  gap> 7

  gap> # Again, a <return> was input above after the 7 and new-line
  gap> # were printed to get a fresh prompt.
  gap> #
  gap> # Each SendMsg() is normally balanced by a RecvMsg().
  gap> SendMsg( "3+4", 2);
  gap> RecvMsg( 2 );
  7
  gap> # The following is equivalent to the two previous commands.
  gap> SendRecvMsg( "3+4", 2);
  7
  gap> # The two SendMsg() commands that were sent to Slave 1 earlier have
  gap> # responses that are waiting in the message queue from that slave.
  gap> # Check that there is a message waiting. With some MPI implementations
  gap> # the message is not immediately available, but when ProbeMsg() does
  gap> # return true then RecvMsg() is guaranteed to succeed.
  gap> ProbeMsgNonBlocking( 1 );
  false
  gap> ProbeMsgNonBlocking( 1 );
  true
  gap> # Print() is a `no-value' function, and so the result of a RecvMsg()
  gap> # in both these cases is "<no_return_val>".
  gap> RecvMsg( 1 );
  "<no_return_val>"
  gap> RecvMsg( 1 );
  "<no_return_val>"
  gap> # As with Print() the result of Exec() appears on standard
  gap> # output, and the result is "<no_return_val>".
  gap> SendRecvMsg( "Exec(\"pwd\")" ); # Your pwd will differ :-)
  /home/gene
  "<no_return_val>"
  gap> # Define a variable on a slave
  gap> SendRecvMsg( "a:=45; 3+4", 1 );
  7
  gap> # Note "a" is defined on slave 1, not slave 2.
  gap> SendMsg( "a", 2 ); # Slave prints error, output on master
  gap> Variable: 'a' must have a value
  gap> # <return> entered to get fresh prompt.
  gap> RecvMsg( 2 ); # No value for last SendMsg() command
  "<no_return_val>"
  gap> RecvMsg( 1 );
  45
  gap> # Execute analogue of GAP's List() in parallel on slaves.
  gap> squares := ParList( [1..100], x->x^2 );
  [ 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289,
    324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961,
    1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849,
    1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025,
    3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489,
    4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241,
    6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281,
    8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801, 10000 ]
  gap> # Send a large, local (non-remote) data structure to a slave
  gap> Concatenation("x := ", PrintToString([1..10]*2));
  "x := [ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 ]\n\000"
  gap> SendMsg( Concatenation("x := ", PrintToString([1..10]*2)) );
  gap> RecvMsg();
  [ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 ]
  gap> # Send a local (non-remote) function to a slave
  gap> myfnc := function() return 42; end;;
  gap> # Use PrintToString() to define myfnc on all slave processes
  gap> BroadcastMsg( PrintToString( "myfnc := ", myfnc ) );
  gap> SendRecvMsg( "myfnc()", 1 );
  42
  gap> # Ensure problem shared data is read into master and slaves.
  gap> # Try one of your GAP program files instead.
  gap> ParRead( "/home/gene/myprogram.g");
Now that you have done a fairly rudimentary test of ParGAP you should be ready to do something a little bit more interesting:
  gap> ParInstallTOPCGlobalFunction( "MyParList",
  > function( list, fnc )
  >   local result, iter;
  >   result := [];
  >   iter := Iterator(list);
  >   MasterSlave( function() if IsDoneIterator(iter) then return NOTASK;
  >                else return NextIterator(iter); fi; end,
  >                fnc,
  >                function(input,output) result[input] := output;
  >                                       return NO_ACTION; end,
  >                Error
  >              );
  >   return result;
  > end );
  gap> MyParList( [1..25], x->x^3 );
  master -> 1: 1
  master -> 2: 2
  2 -> master: 8
  1 -> master: 1
  master -> 1: 3
  master -> 2: 4
  2 -> master: 64
  1 -> master: 27
  master -> 1: 5
  master -> 2: 6
  2 -> master: 216
  1 -> master: 125
  master -> 1: 7
  master -> 2: 8
  2 -> master: 512
  1 -> master: 343
  master -> 1: 9
  master -> 2: 10
  2 -> master: 1000
  1 -> master: 729
  master -> 1: 11
  master -> 2: 12
  2 -> master: 1728
  1 -> master: 1331
  master -> 1: 13
  master -> 2: 14
  2 -> master: 2744
  1 -> master: 2197
  master -> 1: 15
  master -> 2: 16
  2 -> master: 4096
  1 -> master: 3375
  master -> 1: 17
  master -> 2: 18
  2 -> master: 5832
  1 -> master: 4913
  master -> 1: 19
  master -> 2: 20
  2 -> master: 8000
  1 -> master: 6859
  master -> 1: 21
  master -> 2: 22
  2 -> master: 10648
  1 -> master: 9261
  master -> 1: 23
  master -> 2: 24
  2 -> master: 13824
  1 -> master: 12167
  master -> 1: 25
  1 -> master: 15625
  [ 1, 8, 27, 64, 125, 216, 343, 512, 729, 1000, 1331, 1728, 2197, 2744,
    3375, 4096, 4913, 5832, 6859, 8000, 9261, 10648, 12167, 13824, 15625 ]
  gap> ParInstallTOPCGlobalFunction( "MyParListWithAglom",
  > function( list, fnc, aglomCount )
  >   local result, iter;
  >   result := [];
  >   iter := Iterator(list);
  >   MasterSlave( function() if IsDoneIterator(iter) then return NOTASK;
  >                else return NextIterator(iter); fi; end,
  >                fnc,
  >                function(input,output)
  >                  local i;
  >                  for i in [1..Length(input)] do
  >                    result[input[i]] := output[i];
  >                  od;
  >                  return NO_ACTION;
  >                end,
  >                Error, # Never called, can specify anything
  >                aglomCount
  >              );
  >   return result;
  > end );
  gap> MyParListWithAglom( [1..25], x->x^3, 4 );
  master -> 1: (AGGLOM_TASK): [ 1, 2, 3, 4 ]
  master -> 2: (AGGLOM_TASK): [ 5, 6, 7, 8 ]
  1 -> master: [ 1, 8, 27, 64 ]
  2 -> master: [ 125, 216, 343, 512 ]
  master -> 1: (AGGLOM_TASK): [ 9, 10, 11, 12 ]
  master -> 2: (AGGLOM_TASK): [ 13, 14, 15, 16 ]
  1 -> master: [ 729, 1000, 1331, 1728 ]
  2 -> master: [ 2197, 2744, 3375, 4096 ]
  master -> 1: (AGGLOM_TASK): [ 17, 18, 19, 20 ]
  master -> 2: (AGGLOM_TASK): [ 21, 22, 23, 24 ]
  1 -> master: [ 4913, 5832, 6859, 8000 ]
  2 -> master: [ 9261, 10648, 12167, 13824 ]
  master -> 1: (AGGLOM_TASK): [ 25 ]
  1 -> master: [ 15625 ]
  [ 1, 8, 27, 64, 125, 216, 343, 512, 729, 1000, 1331, 1728, 2197, 2744,
    3375, 4096, 4913, 5832, 6859, 8000, 9261, 10648, 12167, 13824, 15625 ]
If you wish an accelerated introduction to the models of parallel programming provided here, you might wish to read the beginning of Chapter Slave Listener through section Slave Listener Commands, and then proceed immediately to Chapter Basic Concepts for the TOP-C model (MasterSlave).
The ParGAP package was designed and written by Gene Cooperman, College of Computer Science, Northeastern University, Boston, MA, U.S.A.
If you use ParGAP to solve a problem then please send a short email to gene@ccs.neu.edu about it, and cite the ParGAP package as follows:
\bibitem[Coo99]{Coo99} Cooperman, Gene, {\sl Parallel GAP/MPI (ParGAP/MPI)}, Version 1, College of Computer Science, Northeastern University, 1999, \verb+http://www.ccs.neu.edu/home/gene/pargap.html+.
ParGAP can be built to use either a system MPI library, or the included MPINU library. The command to run ParGAP is different in the two cases. If ParGAP has been built using MPINU then you should skip this section and proceed to section Invoking ParGAP with Remote Slaves (when using MPINU). Otherwise, please read on.
After ParGAP has been installed, a script bin/pargap.sh will have been created which (after any changes you needed to make; see Section Installing ParGAP) you should use to invoke ParGAP. Installers are encouraged to treat pargap.sh in analogy to gap.sh. For example, if your site has copied gap.sh to /usr/local/bin/gap, then you should also look for the pargap.sh script as /usr/local/bin/pargap. It simplifies the remote slave configuration if ParGAP can be found on the standard path on each machine, and we'll assume in this section that ParGAP can be invoked simply as pargap.
When built with a system MPI installation, ParGAP must be invoked using the system's MPI launcher. This may go under several names, but the command name mpiexec is suggested in the MPI-2 specification, and is supported by both Open MPI and MPICH, two common implementations of that specification. The basic usage is
  mpiexec -n num pargap
to launch num copies of ParGAP (i.e. one master and (num−1) slaves). With no other parameters, these will all be launched on the host machine.
A configuration file can be used to specify hosts for remote slaves. The syntax of this file differs between Open MPI and MPICH, but in both cases the configuration file is a text file listing the host names and the number of processes to run on each host, one per line. The default number of processes per node is one.
When using Open MPI, an example hostfile is
  # Example Open MPI hostfile. Comments begin with #
  #
  # The following node is a single processor machine:
  foo.example.com
  # The following two nodes are dual-processor machines:
  bar.example.com slots=2
  yow.example.com slots=2
This hostfile is passed to mpiexec using
  mpiexec -n num -hostfile hostfile pargap
Processes are allocated round-robin style. For example, if we choose num to be seven then the first process (the master) will run on foo. The slaves will run two on bar, two on yow and a further one each on foo and bar.
When using MPICH, the equivalent machinefile is
  # Example MPICH machinefile. Comments begin with #
  #
  # The following node is a single processor machine:
  foo.example.com
  # The following two nodes are dual-processor machines:
  bar.example.com:2
  yow.example.com:2
and the command to start ParGAP using these hosts will be
  mpiexec -n num -machinefile machinefile pargap
For further information, such as specifying hosts on the command line, or finer control of how processes are distributed between hosts, or if you have a different MPI implementation, then please see your MPI documentation.
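Since the two host-list formats differ only in how the process count is written, one list of hosts can be converted to either form. The following is a minimal sketch, assuming an input of one "hostname count" pair per line; the function names are hypothetical and the host names are the example hosts used above.

```shell
#!/bin/sh
# Hypothetical converters: read "hostname count" lines on standard input
# and write them in the Open MPI or MPICH format shown in this section.
# A count of 1 (the default) is left implicit.
to_openmpi_hostfile() {
    awk '{ if ($2 + 0 > 1) print $1 " slots=" $2; else print $1 }'
}

to_mpich_machinefile() {
    awk '{ if ($2 + 0 > 1) print $1 ":" $2; else print $1 }'
}

# The example hosts from this section.
hosts='foo.example.com 1
bar.example.com 2
yow.example.com 2'

echo "$hosts" | to_openmpi_hostfile
echo "$hosts" | to_mpich_machinefile
```

The output of either function can be redirected to a file and passed to mpiexec with -hostfile or -machinefile as described above.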
Unless you have any problems with the installation or running ParGAP, you can skip the rest of this chapter and move on to Chapter Slave Listener.
If ParGAP has been built to use the supplied MPINU library then ParGAP includes the facility (on Linux) to start up and manage remote slaves without needing an external MPI launcher. If ParGAP is built using a system MPI library then please read section Invoking ParGAP with Remote Slaves (when using a system MPI library) instead.
We'll assume that when ParGAP was built the script bin/pargap.sh was copied to /usr/local/bin/pargap (see Section Installing ParGAP). ParGAP can then be run by calling pargap. In addition, there must be a file, procgroup, in the current directory. Alternatively, if you wish to use a single procgroup file for all jobs, and that procgroup file is in /home/joe, then you can alias pargap to pargap -p4pg /home/joe/procgroup.
The procgroup file has a simple syntax, taken from the MPICH (not MPICH2) implementation of MPI. A # in column 1 introduces a comment line. The first non-comment line should be local 0, verbatim. This line declares the master process as the local process. Other lines are of the form:
  host-machine 1 pargap-script
e.g.
  regulus.ccs.neu.edu 1 /usr/local/bin/pargap
The first field is the hostname for a remote process. The second field specifies one thread per process. (ParGAP recognizes only the value 1 for the second field.) The third field is an absolute pathname for ParGAP, as it would be called on the remote process. Note that you can repeat the same line twice if you want two remote ParGAP processes on the same processor. The default procgroup provided in the distribution will have lines of the form:
  localhost 1 path-of-provided-pargap.sh
If you change path-of-provided-pargap.sh to just, say, pargap, this will work only if pargap is in your path on the remote machine (localhost in this case), using your default shell. On most machines, localhost is an alias for the local processor. This is a good default for debugging, so that you don't disturb users on other machines.
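As an illustration of this syntax, the sketch below writes a procgroup file for a given number of slaves on the local machine. The function name make_procgroup is hypothetical, and the path /usr/local/bin/pargap is only the example location assumed earlier in this section.

```shell
#!/bin/sh
# Hypothetical helper: print a procgroup file with one master and
# N local slaves, using the syntax described above.
#   $1  number of slave processes
#   $2  absolute path of the ParGAP script
make_procgroup() {
    nslaves=$1
    pargap_path=$2
    echo "# generated procgroup: 1 master, $nslaves local slaves"
    echo "local 0"
    i=0
    while [ "$i" -lt "$nslaves" ]; do
        # One line per slave; the second field is always 1.
        echo "localhost 1 $pargap_path"
        i=$((i + 1))
    done
}

make_procgroup 2 /usr/local/bin/pargap
```

Redirecting the output to a file named procgroup in the current directory gives a file of the kind MPINU looks for at startup.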
MPI will use a line
  host-machine 1 pargap-script
to create a UNIX subprocess executing:
  ssh host-machine pargap-script
Suppose host-machine is regulus.ccs.neu.edu and pargap-script is /usr/local/bin/pargap as in the above example, and we were to have trouble invoking ParGAP. Then it would be a good idea to try invoking ssh regulus.ccs.neu.edu from a UNIX prompt and, if that succeeds, to then try executing the full ssh command.
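To make this check less error-prone, the ssh commands corresponding to a procgroup file can be listed mechanically. The following is a sketch only: the function name ssh_check_commands is hypothetical, and it prints the commands for you to try by hand instead of executing them.

```shell
#!/bin/sh
# Hypothetical helper: print the ssh command corresponding to each
# remote line of a procgroup file. Nothing is executed.
#   $1  path to a procgroup file
ssh_check_commands() {
    awk '
        /^#/          { next }   # comment line (# in column 1)
        $1 == "local" { next }   # the master line, "local 0"
        NF >= 3       { print "ssh " $1 " " $3 }
    ' "$1"
}

# Example with a temporary procgroup file.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
# example procgroup
local 0
regulus.ccs.neu.edu 1 /usr/local/bin/pargap
EOF
ssh_check_commands "$tmpfile"
rm -f "$tmpfile"
```

Each printed line can then be pasted at a UNIX prompt to see whether the login and the remote script both work.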
A typical problem is that the remote processor requires a password to login. MPI requires a login without passwords. This can be set up for ssh. See man ssh. Sometimes, PAM is also used for user authentication (see /etc/pam.conf). Consult your system staff for further analysis. If your site uses an alternative to ssh, there is a solution here: add the lines
  #############################################################################
  ##
  ##  SSH . . . . . . . . . . . . . . . . . . . .  remote shell used by ParGAP
  ##
  SSH=myssh
  export SSH
before the GAP block with the exec line. (Of course, the # lines are not needed; they are comments.)
Note that the remote ParGAP process will not read from standard input, although signals such as SIGINT (^C) may be received by the remote process. However, the remote ParGAP process will write to standard output, which is relayed to the local process. So,
  gap> SendMsg("Exec(\"hostname\")", 2);
will execute and print from the remote process.
If you still have problems, here is a list of things to check. This section considers general problems when installing or running ParGAP. The two sections after this one consider problems specific to using MPINU or a system MPI library respectively.
./configure --with-mpi=MPINU
top. The Linux version of top sorts by memory usage if you type M.
make tries to automatically create pkg/pargap/bin/pargap.sh from GAP_ROOT/bin/gap.sh. GAP_ROOT was specified when you executed ./configure GAP_ROOT to install ParGAP. This can be error-prone if your site has an unusual setup. If you execute GAP_ROOT/bin/gap.sh, does gap come up? If so, compare it with pargap.sh and check for correct settings in .../pkg/pargap/bin/pargap.sh.
ssh remote-hostname to see if the issue is with security. If your site uses an alternative to ssh, or ssh asks for a password, read Section Problems with Passwords (Getting Around Security), and possibly man sshd.
man ssh tells you the security model at your site. Then read Section Problems with Passwords (Getting Around Security).
Is pargap listed in .../pkg/ALLPKG? [It's needed to autostart slaves.]
gap> MPI_Initialized();
cd to a directory of the same name as your local directory. Check your assumptions about the remote machine. Try:
  gap> SendRecvMsg("Exec(\"pwd\")"); SendRecvMsg("UNIX_Hostname()");
  gap> SendRecvMsg("UNIX_Getpid()");
-b and/or -q switches to ParGAP when it starts, to disable the banner or all messages respectively. See Section Ref:Command Line Options of the GAP Reference Manual for further details.
If you have problems running ParGAP, and ParGAP is built to use the supplied MPINU library, then this section lists some things to check, in addition to the general issues listed in the previous section. If you are using a system MPI implementation instead of MPINU, this section can be ignored, but you should read the next section instead.
procgroup file? [It looks in the current directory for procgroup, or for: ... -p4pg PATH/procgroup]
procgroup file in your current directory set correctly? Test it. If you are calling it on a remote host, manually type:
  ssh HOSTNAME ParGAP
(with ParGAP as given in procgroup), e.g.
  ssh denali.ccs.neu.edu /usr/local/gap4r3/bin/pargap.sh
exec is used to save process overhead. Also try:
  ssh HOSTNAME exec ParGAP
/tmp/pargapmpi-ssh.xx, where xx is replaced by the process id of the ParGAP process.
SO_KEEPALIVE and variants. (See man setsockopt.) This periodically sends null messages so the remote machine does not think that the originating machine is dead. However, if the remote machine fails to reply, the local process sends a SIGPIPE signal to notify current processes of a broken socket, even though there might have been only a temporary lapse in connectivity. ssh specifies KeepAlive yes by default, but setting KeepAlive no might get you through some transient lapses in connectivity due to high congestion. You may also want to experiment with: setenv SSH "ssh -n"
CALLBACK_HOST, as in the example below.
  # [ in sh/bash/... ]
  CALLBACK_HOST=denali.ccs.neu.edu; export CALLBACK_HOST
  # [ in csh/tcsh/... ]
  setenv CALLBACK_HOST denali.ccs.neu.edu
sh) somewhere between the first and last line of .../pkg/pargap/bin/pargap.sh.
./configure --with-mpi=MPINU
Here is a list of known issues when using a system MPI library with ParGAP, and some solutions or workarounds. Not all of these issues will manifest themselves on all architectures and all MPI implementations. If you are having problems building or running ParGAP, you should check this section as well as Section Problems Installing or Invoking ParGAP.
rlwrap utility, if available. For example, if ParGAP is run using mpiexec, then try
  rlwrap mpiexec -n 3 pargap
rlwrap has already seen you use. For more information, try man rlwrap.
FlushAllMsgs() (see FlushAllMsgs) is not available when using a system MPI implementation, since tests show that ProbeMsgNonBlocking(), which it uses (see ProbeMsgNonBlocking), cannot be relied upon to always return true the first time that it is called after a message has been sent. If your system MPI implementation does exhibit this desired behaviour for ProbeMsgNonBlocking() then you can install your own local copy of FlushAllMsgs() by copying the code for this function from lib/slavelist.g, removing the if statement and renaming the function.
ParReset() (see ParReset) is not available when using a system MPI implementation. When using an MPINU library, the slaves are launched by ParGAP itself and so can be contacted and restarted, but with a system MPI library the slaves are launched by mpiexec (or whichever MPI launcher you use) and so cannot be reset from within ParGAP. There is no known workaround for this.
malloc to allocate their own memory. MPINU avoids the use of malloc as much as possible, but system MPI implementations may not be as careful. This can be resolved by starting ParGAP with the -s command-line switch, which asks ParGAP to pre-allocate memory before it starts. You can safely pre-allocate more memory than you will actually need since physical memory will only be mapped when it is actually used, so for example you could allocate 3 GB:
  mpiexec -n 3 pargap -s 3g
The -a and -m switches can also be used to control memory usage. See Section Ref:Command Line Options of the GAP Reference Manual for further information.
News of any other issues or solutions would be gratefully accepted.
There is a simple test to see if you need to read this section. Pick a remote machine, HOSTNAME, that you wish to execute on, and type: ssh HOSTNAME. If you were asked for your password, then you and your system administrator may need to talk about security policy. If you were successful with an alternative to ssh then set the environment variable, SSH, to the alternative value, as described in item 3 below.
.shosts file to your home directory (for ssh).
ssh to start remote processes. However, if the environment variable SSH is set, the script uses the value of the environment variable instead of ssh. This may be useful if you have your own script, myssh, that automatically gets around the security issues. Then just type:
  SSH=myssh; export SSH     # [ in sh/bash/... ]
  setenv SSH myssh          # [ in csh/tcsh/... ]
sh) somewhere between the first and last line of .../pkg/pargap/bin/pargap.sh. (The example for ssh was given earlier.)
ssh: man ssh mentions some possibilities for giving the password the first time, and then having ssh remember that future logins to that machine are authorized for the duration of the session. Don't overlook the use of $HOME/.ssh/config to set special parameters, such as specifying a different login name on the remote machine. Some parameters of interest might be KeepAlive, RSAAuthentication, UseRsh. You may also find useful information in man sshd.
/tmp/pargapmpi-ssh.$$
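The role of the SSH environment variable described in this section can be illustrated with a small shell sketch. This is not the code of pargap.sh; the function names are hypothetical, and the sketch only shows the general pattern of falling back to ssh when SSH is unset or empty.

```shell
#!/bin/sh
# Hypothetical illustration of honouring the SSH environment variable:
# use its value as the remote shell if set, otherwise fall back to ssh.
remote_shell() {
    echo "${SSH:-ssh}"
}

# Print (without running it) the command that would start a remote process.
#   $1  host name
#   $2  path of the ParGAP script on that host
remote_command() {
    echo "$(remote_shell) $1 $2"
}

remote_command regulus.ccs.neu.edu /usr/local/bin/pargap
```

With SSH set to myssh in the environment, the same call prints a command starting with myssh in place of ssh.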
Note that this package modifies the GAP src and bin files, and creates a new GAP kernel. This new GAP kernel can be shared by traditional users of the old, sequential GAP kernel, and by those doing parallel processing. The GAP kernel will have identical behavior to the old GAP kernel when invoked through the gap.sh script or the bin/@GAParch@/gap binary. The new ParGAP variables will appear to the end user ONLY if the GAP binary was invoked as pargapmpi: a symbolic link to the actual GAP binary. The script, pargap.sh, does this.
So, in a multi-user environment, traditional users can continue to use gap.sh without noticing any difference. Only an invocation of pargap.sh will add the new features.
In a future version of GAP, it is hoped that the GAP kernel will have enough ``hooks'', so that no modification of the GAP kernel is required. At that time, it will also be possible to speed up the startup time for ParGAP. Much of the startup time is caused by waiting for GAP to read its library files. It will be possible to use the GAP function, SaveWorkspace(), to save a version with the GAP library pre-loaded. That saved version can then be used to start up ParGAP. This is not currently possible, because ParGAP needs to get at the command line of GAP before the GAP kernel sees it.
Comments and contributions to a ParGAP user library, or any other type of assistance, are gratefully accepted.
Gene Cooperman gene@ccs.neu.edu
ParGAP manual