% $Header$
% File: /[MITgcm]/manual/s_software/text/sarch.tex
% Revision 1.18 by afe, Tue Mar 23 16:47:05 2004 UTC

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects simply starting from an example code and adapting it to suit a
particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating system
features that are employed.

\section{Overall architectural goals}

% ...

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in detail in
section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
% ...
\end{enumerate}
% ...

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in
MITgcm is a software superstructure and substructure collectively
called WRAPPER (Wrappable Application Parallel Programming
Environment Resource). All numerical and support code in MITgcm is written
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
% ...
and operating systems. This allows numerical code to be easily retargeted.

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
% ...
optimized for that platform.}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad as possible a range of computer
systems. The original development of the WRAPPER took place on a
% ...
uniprocessor and multi-processor Sun systems with both uniform memory access
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also routinely used on
large scale MPP systems (for example T3E systems and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain
native optimizations for a particular system.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMP's, PVP's) and distributed memory
parallel systems (for example x86 clusters and large MPP systems). This is one
% ...
class of machines (for example Parallel Vector Processor Systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.

\subsection{Machine model parallelism}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is assumed to
consist of one or more logical processors that can compute concurrently.
Computational work is divided among the logical
processors by allocating ``ownership'' to
each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific
% ...
computational phases a processor will reference data in overlap regions
whenever it requires values that lie outside the domain it owns. Periodically
processors will make calls to WRAPPER functions to communicate data between
tiles, in order to keep the overlap regions up to date (see section
\ref{sect:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.

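The role of these overlap (halo) updates can be sketched in a few lines of
Python. This is an illustration of the idea only, not WRAPPER code: the tile
layout, the function name {\tt exchange\_x}, and the one-column overlap are
assumptions made for the example.

```python
# Minimal sketch of an overlap ("halo") update between two tiles.
# Each tile owns an interior of width SNX and carries OLX overlap
# columns on each side that mirror its neighbor's edge data.
SNX, OLX = 4, 1                     # interior width, overlap width

def make_tile(value):
    # A tile row: [west overlap | interior | east overlap]
    return [None] * OLX + [value] * SNX + [None] * OLX

def exchange_x(tile_a, tile_b):
    # Copy tile A's eastern interior edge into tile B's western overlap,
    # and tile B's western interior edge into tile A's eastern overlap.
    tile_b[:OLX] = tile_a[-2 * OLX:-OLX]
    tile_a[-OLX:] = tile_b[OLX:2 * OLX]

a, b = make_tile(1.0), make_tile(2.0)
exchange_x(a, b)
print(a)  # eastern overlap of A now holds B's edge value
print(b)  # western overlap of B now holds A's edge value
```

After the exchange each tile can compute over its interior using only local
data, which is the purpose of keeping the overlap regions up to date.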
\begin{figure}
% ...
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication independent CPU's are operating
on the exact same global address space at the application level.
% ...
the system's main-memory interconnect. This makes
communication very efficient provided it is used appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between
multiple processors the WRAPPER level shields user applications from
% ...
memory, the WRAPPER provides a place to
ensure memory consistency for a particular platform.

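The kind of guarantee involved can be illustrated with ordinary Python
threads, where the synchronization primitive itself acts as the consistency
point. This is an analogy only; the WRAPPER's actual consistency calls are
platform specific and are not shown here.

```python
# Sketch: a consumer thread must not read shared data until the
# producer's writes are guaranteed visible.  The Event plays the role
# of the "consistency point", analogous to the platform-specific
# barrier inserted on machines with weak memory ordering.
import threading

shared = {"value": None}
ready = threading.Event()

def producer():
    shared["value"] = 42      # write shared data ...
    ready.set()               # ... then publish: set() is the sync point

def consumer(out):
    ready.wait()              # wait at the consistency point
    out.append(shared["value"])  # safe to read after the barrier

result = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(result,))
t2.start(); t1.start()
t1.join(); t2.join()
print(result)  # [42]
```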
\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have memory caches local to each processor
which contain mirrored copies of main memory. Automatic cache-coherence
% ...
in an application are potentially visible to all threads. Using multiple
threads operating within a single process is the standard mechanism for
supporting shared memory that the WRAPPER utilizes. Configuring and launching
code to run in multi-threaded mode on specific platforms is discussed in
section \ref{sect:running_with_threads}. However, on many systems, potentially
very efficient mechanisms for using shared memory communication between
multiple processes (in contrast to multiple threads within a single
process) also exist. In most cases this works by making a limited region of
% ...
nature.

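Python's standard library happens to expose an operating-system facility of
exactly this kind, which makes the idea easy to demonstrate. The sketch below
is illustrative only and is unrelated to the WRAPPER's own implementation;
both handles live in one process here, standing in for two separate processes.

```python
# Sketch: a named shared-memory segment that two attachments (which
# could belong to different processes) both see.  Illustrates the
# "limited region of memory visible to more than one process" idea.
from multiprocessing import shared_memory

seg = shared_memory.SharedMemory(create=True, size=8)
try:
    # A second handle attaches to the same region by name, exactly as
    # a separate process would.
    view = shared_memory.SharedMemory(name=seg.name)
    seg.buf[0] = 123              # "process A" writes one byte
    value = view.buf[0]           # "process B" reads the same byte
    print(value)
    view.close()
finally:
    seg.close()
    seg.unlink()                  # release the OS resource
```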
\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example cluster systems consist of individual computers
% ...
described in \ref{hoe-hill:99} substituted for a
highly optimized library.

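The send/receive style of communication used on such systems can be sketched
with Python's {\tt multiprocessing} pipes standing in for MPI messages. This
is an illustration of message passing in general, not of the WRAPPER's MPI
layer; all names here are ours, and the POSIX ``fork'' start method is
assumed to keep the sketch self-contained.

```python
# Sketch: two processes exchanging values by explicit messages, the
# style of communication (send/receive) used on distributed memory
# systems.  Stand-in only; the WRAPPER itself uses MPI.
from multiprocessing import get_context

def worker(conn):
    edge = conn.recv()        # receive the neighbor's edge value
    conn.send(edge + 1)       # send back a reply message
    conn.close()

ctx = get_context("fork")     # POSIX-only start method, for brevity
parent, child = ctx.Pipe()
p = ctx.Process(target=worker, args=(child,))
p.start()
parent.send(41)               # message out ...
reply = parent.recv()         # ... and the neighbor's reply
p.join()
print(reply)  # 42
```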
\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
% ...
\includegraphics{part4/comm-primm.eps}
\end{center}
\caption{Three performance critical parallel primitives are provided
by the WRAPPER. These primitives are always used to communicate data
between tiles. The figure shows four tiles. The curved arrows indicate
exchange primitives which transfer data between the overlap regions at tile
edges and interior regions for nearest-neighbor tiles.
% ...
}
\end{figure}

% ...
\begin{enumerate}
\item % ...
WRAPPER are
computing CPU's.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in which
a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_a_code} describes practical details of running codes
in various different parallel modes on contemporary computer systems.
Section \ref{sect:controlling_communication} explains the internal information
that the WRAPPER uses to control how information is communicated between
tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
% ...
be created within a single process. Each of these tiles will have internal
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
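The way these size parameters combine can be sketched numerically. The values
below are illustrative only and are not taken from any particular {\em
SIZE.h}:

```python
# Sketch: how WRAPPER-style size parameters combine.  sNx/sNy are the
# interior tile extents, OLx/OLy the overlap widths, nSx/nSy the tiles
# per process, and nPx/nPy the processes.  Values are illustrative.
sNx, sNy = 30, 20        # interior grid points per tile
OLx, OLy = 2, 2          # overlap (halo) widths
nSx, nSy = 3, 1          # tiles per process in x and y
nPx, nPy = 1, 2          # processes in x and y

# Global (owned) domain size covered by the full decomposition:
Nx = sNx * nSx * nPx
Ny = sNy * nSy * nPy

# The memory footprint of one tile includes its overlap regions:
tile_shape = (sNx + 2 * OLx, sNy + 2 * OLy)

print(Nx, Ny)        # 90 40
print(tile_shape)    # (34, 24)
```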
% ...
Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

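The loop structure being described can be sketched with a Python stand-in for
the Fortran {\em bi}, {\em bj} loops (illustration only; the per-tile kernel
is a placeholder):

```python
# Sketch of the bi,bj loop structure: every tile owned by a process is
# visited; with one thread the tiles are computed sequentially, while
# several threads would each take a subset.
nSx, nSy = 2, 2                      # tiles per process (illustrative)

def compute_tile(bi, bj):
    # Placeholder for the per-tile numerical kernel.
    return (bi, bj)

visited = []
for bj in range(1, nSy + 1):         # Fortran loops run 1..nSy, 1..nSx
    for bi in range(1, nSx + 1):
        visited.append(compute_tile(bi, bj))

print(visited)  # [(1, 1), (2, 1), (1, 2), (2, 2)]
```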
An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs from
1,{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
% ...
\begin{enumerate}
\item % ...
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two domains in y may be computed concurrently.
\item
\begin{verbatim}
      PARAMETER (
     ...
\end{verbatim}
There are six tiles allocated to six separate processes.
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}
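As a quick arithmetic check, the cube sphere example can be expressed in a
few lines. The parameter arrangement shown is one choice consistent with the
description above (six processes, one $32 \times 32$ face each); the actual
{\em SIZE.h} values are elided in this extract.

```python
# Sketch: the cube sphere example as numbers.  Six 32x32 tiles, one
# per process; this particular tile/process split is an assumption.
sNx, sNy = 32, 32        # one cube face per tile
nSx, nSy = 1, 1          # one tile per process
nPx, nPy = 6, 1          # six processes, one face each

n_tiles = nSx * nSy * nPx * nPy
face_points = sNx * sNy

print(n_tiles)       # 6
print(face_points)   # 1024
```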

\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
% ...
by the application code. The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN

       ...

\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the
following steps.\\

% ...
\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
% ...
Parameter:  {\em nTy}
\end{minipage}
} \\

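The startup pattern described above, one call of {\em THE\_MODEL\_MAIN()} per
thread with only the thread number passed in, can be sketched with Python
threads standing in for the WRAPPER's coarse-grain threads (illustration
only):

```python
# Sketch: each coarse-grain thread runs the same entry routine and
# receives only its thread number (myThid).  Python stand-in for the
# WRAPPER's once-per-thread call of THE_MODEL_MAIN.
import threading

nThreads = 4
results = [None] * nThreads

def the_model_main(myThid):
    # A real thread would use myThid to select the tiles it owns;
    # here we just record which thread ran.
    results[myThid - 1] = myThid     # myThid is 1-based, as in Fortran

threads = [threading.Thread(target=the_model_main, args=(t,))
           for t in range(1, nThreads + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [1, 2, 3, 4]
```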
\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
% ...
models varies between systems.

Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration a decomposition
specification (see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
% ...
critical communication. However, in order to simplify the task
of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. Currently MPI libraries are
invoked by specifying the appropriate options file with the
{\tt -of} flag when running the {\em genmake2}
script, which generates the Makefile for compiling and linking MITgcm.
(Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
{\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.)  More
detailed information about the use of {\em genmake2} for specifying
local compiler flags is located in section \ref{sect:genmake}.\\

\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\

\paragraph{\bf Execution} The mechanics of starting a program in
% ...
using a command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of processes
that will be created. The numeric value {\em 64} must be equal to the
product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
called ``mf'' will be read to get a list of processor names on
which the sixty-four processes will execute. The syntax of this file
is specified by the MPI distribution.
\\

1018  \fbox{  \fbox{
# Line 1063  to processor identification are are used Line 1063  to processor identification are are used
1063  Allocation of processes to tiles in controlled by the routine  Allocation of processes to tiles in controlled by the routine
1064  {\em INI\_PROCS()}. For each process this routine sets  {\em INI\_PROCS()}. For each process this routine sets
1065  the variables {\em myXGlobalLo} and {\em myYGlobalLo}.  the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
1066  These variables specify (in index space) the coordinate  These variables specify in index space the coordinates
1067  of the southern most and western most corner of the  of the southernmost and westernmost corner of the
1068  southern most and western most tile owned by this process.  southernmost and westernmost tile owned by this process.
1069  The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}  The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
1070  are also set in this routine. These are used to identify  are also set in this routine. These are used to identify
1071  processes holding tiles to the west, east, south and north  processes holding tiles to the west, east, south and north
1072  of this process. These values are stored in global storage  of this process. These values are stored in global storage
1073  in the header file {\em EESUPPORT.h} for use by  in the header file {\em EESUPPORT.h} for use by
1074  communication routines.  communication routines.  The above does not hold when the
1075    exch2 package is used -- exch2 sets its own parameters to
1076    specify the global indices of tiles and their relationships
1077    to each other.  See the documentation on the exch2 package
1078    (\ref{sec:exch2})  for
1079    details.
1080  \\  \\

\fbox{
describes the information that is held and used.

\begin{enumerate}
\item {\bf Tile-tile connectivity information}
For each tile the WRAPPER
sets a flag that records the tile number to the north,
south, east and
west of that tile. This number is unique over all tiles in a
configuration. Except when using the cubed sphere and the exch2 package,
the number is held in the variables {\em tileNo}
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
that specifies the type of communication that is used between tiles.
A value of {\em COMM\_NONE} is used to indicate that a tile has no
neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
memory communication is required to communicate between
these tile faces (see section \ref{sect:distributed_memory_communication}).
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
forms of shared memory communication (see section
\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
that a CPU should communicate by writing to data structures owned by another
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
from data structures owned by another CPU. These flags affect the behavior
of the WRAPPER exchange primitive
(see figure \ref{fig:communication_primitives}). The routine
{\em ini\_communication\_patterns()} is responsible for setting the
communication mode values for each tile.

When using the cubed sphere configuration with the exch2 package, the
relationships between tiles and their communication methods are set
by the package in other variables. See the exch2 package documentation
(section \ref{sec:exch2}) for details.

\fbox{
\begin{minipage}{4.75in}
are read from the file {\em eedata}. If the value of {\em nThreads}
is inconsistent with the number of threads requested from the
operating system (for example by using an environment
variable as described in section \ref{sect:multi_threaded_execution})
then usually an error will be reported by the routine
{\em CHECK\_THREADS}.\\

}

\item {\bf memsync flags}
As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
The routine {\em MEMSYNC()} is used for this purpose. This routine should
not need modifying and the information below is only provided for
completeness. On an x86 system, for example, the memory fence can be
expressed with the in-line assembly
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory")
\end{verbatim}

\item {\bf Cache line size}
As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
coherence traffic on an SMP system. To do this the shared memory data structures
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
are padded. The variables that control the padding are set in the
header file {\em EEPARAMS.h}. These variables are called
{\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
{\em lShare8}. The default values should not normally need changing.
\item {\bf \_BARRIER}
This is a CPP macro that is expanded to a call to a routine
which synchronizes all the logical processors running under the
WRAPPER. Using a macro here preserves flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em BARRIER()}. The default
setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.

\item {\bf \_GSUM}
This is a CPP macro that is expanded to a call to a routine
which sums a floating point number
over all the logical processors running under the
WRAPPER. Using a macro here provides extra flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} (for
64-bit floating point operands)
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
The \_GSUM macro is a performance critical operation, especially for
large processor count, small tile size configurations.
The custom communication example discussed in section \ref{sect:jam_example}
shows how the macro is used to invoke a custom global sum routine
for a specific set of hardware.

in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
\_EXCH operation plays a crucial role in scaling to small tile,
large logical and physical processor count configurations.
The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.

The \_EXCH operation is also central to supporting grids such as
the cube-sphere grid. In this class of grid a rotation may be required
between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
be embedded within a custom form of the \_EXCH primitive. In these
cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
package documentation (section \ref{sec:exch2}).

\item {\bf Reverse Mode}
The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
These reverse mode forms can be found in the
source code directory {\em pkg/autodiff}.
For the global sum primitive the reverse mode form
calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
prefixed {\em ADEXCH}. The exchange routines make calls to
the same low-level communication primitives as the forward mode
operations. However, the routine argument {\em simulationMode}
is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
appropriate communication operation should be performed.

\item {\bf MAX\_NO\_THREADS}
The variable {\em MAX\_NO\_THREADS} is used to indicate the
maximum number of OS threads that a code will use. This
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.

For certain physical parametrization schemes it is necessary to have
a separate copy of certain work arrays for each thread.
This can be achieved using a Fortran 90 module construct, but
if this might be unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} (as described in section
\ref{sect:specifying_a_decomposition}). However, if the configuration
being specified involves many more tiles than OS threads then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
will be used and to declare the physical parameterization
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
An example of this is given in the verification experiment
{\em aim.5l\_cs}. Here the default setting of
{\em MAX\_NO\_THREADS} is altered to
the actual number of threads used, and the work arrays are
created with declarations of the form.
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent rather than temporary storage of state information.
In the case of permanent state information this approach cannot be used
because there has to be enough storage allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterizations.
\end{enumerate}

\begin{figure}

The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms a big performance boost can be obtained by
binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example the
shmem library on CRAY T3E systems).
Developing specialized code for other libraries follows a similar
pattern.

\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from
a series of template files, for example {\em exch\_rx.template}.
This is done to allow a large number of variations on the exchange
process to be maintained. One set of variations supports the
cube sphere grid. Support for a cube sphere grid in MITgcm is based
on having each face of the cube as a separate tile or tiles.
The exchange routines are then able to absorb much of the
detailed rotation and reorientation required when moving around the
cube grid. The set of {\em \_EXCH} routines that contain the

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sect:calling_sequence}

WRAPPER layer.

{\footnotesize
\begin{verbatim}

       MAIN

       |--THE_MODEL_MAIN   :: Numerical code top-level driver routine

\end{verbatim}
}

Core equations plus packages.

{\footnotesize
\begin{verbatim}
C
C
C    |-COMM_STATS     :: Summarise inter-process communication
C                     :: events.
C
\end{verbatim}
}

\subsection{Measuring and Characterizing Performance}

