% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections \ref{sect:tutorials}
and \ref{sect:tutorialIII}) contain more succinct, step-by-step instructions
on running basic numerical experiments, of various types, both sequentially
and in parallel. For many projects simply starting from an example code and
adapting it to suit a particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating
system features that are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

\begin{itemize}
\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to
a wide range of platforms.
\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.
\end{itemize}

These points are summarized in figure \ref{fig:mitgcm_architecture_goals}.
They lead to a software architecture which at the highest level consists
of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in detail in
section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
specialized physical content onto the core numerical code. An overview of
the {\bf package} scheme is given at the start of part \ref{part:packages}.
\item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
Programming Environment Resource), within which the core numerics and pluggable
packages operate.
\end{enumerate}

\begin{figure}
\begin{center}
 \resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
\end{center}
\caption{
The MITgcm architecture is designed to allow simulation of a wide
range of physical problems on a wide range of hardware. The computational
resource requirements of the applications targeted range from around
$10^7$ bytes ($\approx 10$ megabytes) of memory to $10^{11}$ bytes
($\approx 100$ gigabytes). Arithmetic operation counts for the applications of
interest range from $10^{9}$ floating point operations to more than $10^{17}$
floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in
MITgcm is a software superstructure and substructure collectively
called WRAPPER (Wrappable Application Parallel Programming
Environment Resource). All numerical and support code in MITgcm is written
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.

\begin{figure}
\begin{center}
 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad as possible a range of computer
systems. The original development of the WRAPPER took place on a
multi-processor, CRAY Y-MP system. On that system, numerical code performance
and scaling under the WRAPPER was in excess of that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
micro-tasking approach. Later developments have been carried out on
uniprocessor and multi-processor Sun systems with both uniform memory access
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also routinely used on
large scale MPP systems (for example T3E systems and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain
native optimizations for a particular system.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMPs, PVPs) and distributed memory
parallel systems (for example x86 clusters and large MPP systems). This is one

class of machines (for example Parallel Vector Processor Systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.

\subsection{Machine model parallelism}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is assumed to
consist of one or more logical processors that can compute concurrently.
Computational work is divided among the logical
processors by allocating ``ownership'' to
each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific

space allocated to a particular logical processor, there will be data
structures (arrays, scalar variables, etc.) that hold the simulated state of
that region. We refer to these data structures as being {\bf owned} by the
processor to which their
associated region of physical space has been allocated. Individual
regions that are allocated to processors are called {\bf tiles}. A
processor can own more

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/domain_decomp.eps}
 }
\end{center}
\caption{ The WRAPPER provides support for one and two dimensional

whenever it requires values that lie outside the domain it owns. Periodically
processors will make calls to WRAPPER functions to communicate data between
tiles, in order to keep the overlap regions up to date (see section
\ref{sect:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.

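As an illustration, application code typically refreshes the overlap
region of a tiled array by calling a WRAPPER exchange routine before it
reads values that originate on neighboring tiles. The sketch below assumes
a two-dimensional ``RL'' field and a routine named {\em EXCH\_XY\_RL()};
the exact set of exchange routines and their argument lists are defined by
the WRAPPER source in {\em eesupp/src}, so treat both the routine and the
field name as illustrative rather than definitive.
\begin{verbatim}
C     Illustrative sketch: bring the overlap (halo) points of a tiled
C     2-D field up to date before computations that read values from
C     neighboring tiles.  "fld" is a hypothetical field.
      CALL EXCH_XY_RL( fld, myThid )
\end{verbatim}
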
\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/tiled-world.eps}
 }
\end{center}
\caption{ A global grid subdivided into tiles.
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication independent CPUs operate
on the exact same global address space at the application level.

communication very efficient provided it is used appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between
multiple processors the WRAPPER level shields user applications from

ensure memory consistency for a particular platform.

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have memory caches local to each processor
which contain mirrored copies of main memory. Automatic cache-coherence

threads operating within a single process is the standard mechanism for
supporting shared memory that the WRAPPER utilizes. Configuring and launching
code to run in multi-threaded mode on specific platforms is discussed in
section \ref{sect:running_with_threads}. However, on many systems, potentially
very efficient mechanisms for using shared memory communication between
multiple processes (in contrast to multiple threads within a single
process) also exist. In most cases this works by making a limited region of

nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example cluster systems consist of individual computers

highly optimized library.

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/comm-primm.eps}
 }
\end{center}
\caption{Three performance critical parallel primitives are provided
by the WRAPPER. These primitives are always used to communicate data
between tiles. The figure shows four tiles. The curved arrows indicate
exchange primitives which transfer data between the overlap regions at tile
edges and interior regions for nearest-neighbor tiles.

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/tiling_detail.eps}
 }
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles

last 50 years.

\section{Using the WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:using_the_wrapper: -->
\end{rawhtml}

In order to support maximum portability the WRAPPER is implemented primarily
in sequential Fortran 77. At a practical level the key steps provided by the
WRAPPER are
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
\item controlling communication between tiles and between concurrently
computing CPU's.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in which
a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_the_code} describes practical details of running codes
in various different parallel modes on contemporary computer systems.
Section \ref{sect:controlling_communication} explains the internal information
that the WRAPPER uses to control how information is communicated between
tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application


\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/size_h.eps}
 }
\end{center}
\caption{ The three level domain decomposition hierarchy employed by the

dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
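
As a concrete illustration, such a decomposition is expressed through a
parameter block in {\em SIZE.h}, using {\em sNx}, {\em sNy} for the tile
extents, {\em OLx}, {\em OLy} for the overlap widths, {\em nSx}, {\em nSy}
for the number of tiles per process and {\em nPx}, {\em nPy} for the number
of processes. The numerical values in the sketch below are purely
illustrative and do not correspond to any particular verification
experiment.
\begin{verbatim}
C     Illustrative SIZE.h-style decomposition: a 90x40 global grid
C     split into two 45x40 tiles held by a single process.
      PARAMETER (
     &           sNx =  45,
     &           sNy =  40,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   2,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1 )
\end{verbatim}
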
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

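The following fragment sketches the {\em bi}, {\em bj} looping idiom within
which per-tile computation is embedded. The loop-range variables
{\em myBxLo}, {\em myBxHi}, {\em myByLo} and {\em myByHi} follow the WRAPPER
convention of giving each thread the range of tiles it owns; the field name
{\em fld} and the scale factor are hypothetical, so read this as a schematic
rather than as an extract from the model.
\begin{verbatim}
C     Illustrative fragment: loop over the tiles assigned to this
C     thread and perform some per-tile work on a field "fld".
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j = 1, sNy
         DO i = 1, sNx
          fld(i,j,bi,bj) = scaleFac*fld(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
\end{verbatim}
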
An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs over
{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract

The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two domains in y may be computed concurrently.
\item
\begin{verbatim}
      PARAMETER (

There are six tiles allocated to six separate logical processors ({\em nSx=6}).
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}

\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
once the WRAPPER has initialized correctly and has created the necessary
variables to support subsequent calls to communication routines
by the application code. The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN

\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
} \label{fig:wrapper_startup}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the
following steps.\\

\end{enumerate}

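Schematically, once these steps have been taken the WRAPPER main program
spawns the threads by invoking {\em THE\_MODEL\_MAIN()} once per thread from
a simple loop; the compiler-specific parallelization directives that precede
the loop (kept in {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} and
{\em MAIN\_PDIRECTIVES2.h}) are omitted from this sketch.
\begin{verbatim}
C     Invoke nThreads instances of the numerical model, one per thread.
C     Parallel compiler directives (not shown) cause the iterations of
C     this loop to execute concurrently.
      DO I = 1, nThreads
        myThid = I
        CALL THE_MODEL_MAIN( myThid )
      ENDDO
\end{verbatim}
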
An example of valid settings for the {\em eedata} file for a
domain with two subdomains in y and running with two threads is shown
below

File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
} \\

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this

Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration a decomposition
specification (see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
appropriate compile time and run time steps must be taken.

of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. Currently MPI libraries are
invoked by specifying the appropriate options file with the
{\tt -of} flag when running the {\em genmake2}
script, which generates the Makefile for compiling and linking MITgcm.
(Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
{\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More
detailed information about the use of {\em genmake2} for specifying
local compiler flags is located in section \ref{sect:genmake}.\\
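
For example, a hypothetical MPI-enabled build might be configured and
compiled as sketched below. The options file name is illustrative only,
and the exact {\em genmake2} switches accepted ({\tt -of}, {\tt -mpi},
{\tt -mods}) should be checked against {\tt genmake2 -help} for the
version in use.
\begin{verbatim}
  cd build
  ../tools/genmake2 -mods=../code -mpi \
        -of=../tools/build_options/linux_ia32_g77
  make depend
  make
\end{verbatim}
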
\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
\paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI is not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution. For the
open-source MPICH system the MITgcm program can be started
using a command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of processes
that will be created. The numeric value {\em 64} must be equal to the
product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
called ``mf'' will be read to get a list of processor names on
which the sixty-four processes will execute. The syntax of this file
is specified by the MPI distribution.
\\
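
A minimal machine file simply lists the hosts on which processes may be
placed, one per line. The host names below are invented, and the list would
be extended (or hosts repeated) until all sixty-four processes can be
placed; consult the documentation of the installed MPI distribution for the
exact syntax it accepts.
\begin{verbatim}
node001
node001
node002
node002
\end{verbatim}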

\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\

\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate
the number of threads to be used in the x and y directions.
The variables {\em nTx} and {\em nTy} in this file are used to
specify the information required. The product of {\em nTx} and
{\em nTy} must be equal to the number of threads spawned, i.e.
the setting of the environment variable PARALLEL.
The value of {\em nTx} must subdivide the number of sub-domains
in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
number of sub-domains in y ({\em nSy}) exactly.
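
As an illustration, a run using two threads, with the domain split into two
sub-domains in y, could be configured by setting the environment variable
(here in C-shell syntax)
\begin{verbatim}
  setenv PARALLEL 2
\end{verbatim}
and by placing entries along the following lines in {\em eedata}. The
namelist group name and exact layout shown here are a sketch; check the
{\em eedata} files distributed with the example experiments for the form
used by your version of the code.
\begin{verbatim}
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}
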
The multiprocess startup of the MITgcm executable {\em mitgcmuv}
is controlled by the routines {\em EEBOOT\_MINIMAL()} and
{\em INI\_PROCS()}. The first routine performs basic steps required

output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
are used for reporting status and configuration information and
for reporting error conditions on a process by process basis.
The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
{\em myProcId} and {\em MPI\_COMM\_MODEL}.
These variables are related
to processor identification and are used later in the routine
{\em INI\_PROCS()}.

Allocation of processes to tiles is controlled by the routine
{\em INI\_PROCS()}. For each process this routine sets
the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
These variables specify in index space the coordinates
of the southernmost and westernmost corner of the
southernmost and westernmost tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
are also set in this routine. These are used to identify
processes holding tiles to the west, east, south and north
of this process. These values are stored in global storage
in the header file {\em EESUPPORT.h} for use by
communication routines. The above does not hold when the
exch2 package is used -- exch2 sets its own parameters to
specify the global indices of tiles and their relationships
to each other. See the documentation on the exch2 package
(\ref{sec:exch2}) for details.
\\

\fbox{

The WRAPPER maintains internal information that is used for communication
operations and that can be customized for different platforms. This section
describes the information that is held and used.

\begin{enumerate}
\item {\bf Tile-tile connectivity information}
For each tile the WRAPPER
sets a flag that records the tile number to the north,
south, east and
west of that tile. This number is unique over all tiles in a
configuration. Except when using the cubed sphere and the exch2 package,
the number is held in the variables {\em tileNo}
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
that specifies the type of communication that is used between tiles.
This information is held in the variables {\em tileCommModeN},
{\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.
This latter set of variables can take one of the following values:
{\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
A value of {\em COMM\_NONE} is used to indicate that a tile has no
neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
memory communication is required to communicate between
these tile faces (see section \ref{sect:distributed_memory_communication}).
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
forms of shared memory communication (see section
\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
that a CPU should communicate by writing to data structures owned by another
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
from data structures owned by another CPU. These flags affect the behavior
of the WRAPPER exchange primitive
(see figure \ref{fig:communication_primitives}). The routine
{\em ini\_communication\_patterns()} is responsible for setting the
communication mode values for each tile.

When using the cubed sphere configuration with the exch2 package, the
relationships between tiles and their communication methods are set
by the package in other variables. See the exch2 package documentation
(\ref{sec:exch2}) for details.

\fbox{
\begin{minipage}{4.75in}

are read from the file {\em eedata}. If the value of {\em nThreads}
is inconsistent with the number of threads requested from the
operating system (for example by using an environment
variable as described in section \ref{sect:multi-threaded-execution})
then usually an error will be reported by the routine
{\em CHECK\_THREADS}.\\

\end{minipage}
}

\item {\bf memsync flags}
As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
The routine {\em MEMSYNC()} is used for this purpose. This routine should
not need modifying and the information below is only provided for

\begin{verbatim}
asm("membar #LoadStore|#StoreStore");
\end{verbatim}
for an Alpha based system the equivalent code reads
\begin{verbatim}
asm("mb");
\end{verbatim}
and for an x86 based system the following is used
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory");
\end{verbatim}

\item {\bf Cache line size}
As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
coherence traffic on an SMP system. To do this the shared memory data structures
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
are padded. The variables that control the padding are set in the
header file {\em EEPARAMS.h}. These variables are called
{\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
{\em lShare8}. The default values should not normally need changing.
\item {\bf \_BARRIER}
This is a CPP macro that is expanded to a call to a routine
which synchronizes all the logical processors running under the
WRAPPER. Using a macro here preserves flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em BARRIER()}. The default
setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.

\item {\bf \_GSUM}
This is a CPP macro that is expanded to a call to a routine
which sums up a floating point number
over all the logical processors running under the
WRAPPER. Using a macro here provides extra flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} (for
64-bit floating point operands)
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
The \_GSUM macro is a performance critical operation, especially for
large processor count, small tile size configurations.
The custom communication example discussed in section \ref{sect:jam_example}
shows how the macro is used to invoke a custom global sum routine
for a specific set of hardware.

# Line 1280  physical fields and whether fields are 3 Line 1272  physical fields and whether fields are 3
1272  in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the  in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
1273  \_EXCH operation plays a crucial role in scaling to small tile,  \_EXCH operation plays a crucial role in scaling to small tile,
1274  large logical and physical processor count configurations.  large logical and physical processor count configurations.
The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.

The \_EXCH operation is also central to supporting grids such as
the cube-sphere grid. In this class of grid a rotation may be required
between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
be embedded within a custom form of the \_EXCH primitive. In these
cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
package documentation (section \ref{sec:exch2}).

\item {\bf Reverse Mode}
The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
These reverse mode forms can be found in the
source code directory {\em pkg/autodiff}.
For the global sum primitive the reverse mode form
calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
prefixed {\em ADEXCH}. The exchange routines make calls to
the same low-level communication primitives as the forward mode
operations. However, the routine argument {\em simulationMode}
is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
appropriate communication operations should be performed.

\item {\bf MAX\_NO\_THREADS}
The variable {\em MAX\_NO\_THREADS} is used to indicate the
maximum number of OS threads that a code will use. This
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
For single-threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.

For certain physical parametrization schemes it is necessary to have
a substantial number of work arrays. Where these arrays are allocated
in heap storage (for example COMMON blocks) multi-threaded
execution will require multiple instances of the COMMON block data.
This can be achieved using a Fortran 90 module construct. However,
if this might be unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} (as described in section
\ref{sect:specifying_a_decomposition}). However, if the configuration
being specified involves many more tiles than OS threads then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
will be used and to declare the physical parameterization
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
An example of this is given in the verification experiment
{\em aim.5l\_cs}. Here the default setting of
{\em MAX\_NO\_THREADS} is altered to
\begin{verbatim}
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS =    6 )
\end{verbatim}
and several work arrays for storing intermediate calculations are
created with declarations of the form.
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent, not temporary, storage of state information.
In the case of permanent state information this approach cannot be used
because there has to be enough storage allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterizations.

\end{enumerate}
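
The \_BARRIER and \_GSUM entries above describe macros whose default
bindings live in {\em CPP\_EEMACROS.h}. The fragment below is a
schematic of how such bindings are typically written; it is a sketch
only and does not reproduce the literal contents of
{\em CPP\_EEMACROS.h} (the actual macro argument lists may differ),
and the tuned routine named in the last line is purely hypothetical.
\begin{verbatim}
C     Schematic macro bindings (sketch, not the actual header).
#define _BARRIER      CALL BARRIER( myThid )
#define _GSUM(a,b)    CALL GLOBAL_SUM_R8( a, b )

C     A platform-specific build can re-point the same macro at a
C     tuned routine without touching the numerical code, e.g.
C     (MY_FAST_GLOBAL_SUM is a hypothetical name)
C#define _GSUM(a,b)   CALL MY_FAST_GLOBAL_SUM( a, b )
\end{verbatim}
Because application code only ever refers to the macro names, such a
substitution is made entirely at compile time.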

\begin{figure}
\begin{verbatim}
C--
C--  Parallel directives for MIPS Pro Fortran compiler
C--
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
\caption{Prior to transferring control to
the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
MP directives to spawn multiple threads.
} \label{fig:mp_directives}
\end{figure}
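
Figure \ref{fig:mp_directives} uses SGI-specific MP directives. On
compilers that lack these directives a similar effect can be obtained
with OpenMP; the fragment below is a sketch of the same
thread-spawning loop and is not taken verbatim from the WRAPPER
source.
\begin{verbatim}
C--   Sketch: equivalent thread spawning with OpenMP directives.
C--   myThid is declared private so that each thread calls
C--   THE_MODEL_MAIN with its own ID; the loop index is private
C--   by default under OpenMP rules.
C$OMP PARALLEL DO PRIVATE(myThid) SHARED(nThreads)
      DO I=1,nThreads
        myThid = I
        CALL THE_MODEL_MAIN(myThid)
      ENDDO
C$OMP END PARALLEL DO
\end{verbatim}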

\subsubsection{Specializing the Communication Code}

The isolation of performance-critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms a big performance boost can be obtained by
binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example, the
shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
is used as an illustration of a specialized communication configuration
that substitutes for standard, portable forms of {\em \_EXCH} and
{\em \_GSUM}. It affects three source files: {\em eeboot.F},
{\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
it has the following effects.
\begin{itemize}
\item An extra phase is included at boot time to initialize the custom
communications library (see {\em ini\_jam.F}).
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines (see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
for overlap regions of width one) is substituted into the elliptic
solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern; a schematic of such a macro rebinding is sketched below.
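
As a concrete illustration of the rebinding pattern, the fragment
below sketches how a CPP flag can redirect the \_GSUM macro from the
portable global sum to a custom routine. The subroutine name
{\em GSUM\_JAM} and the macro argument list are illustrative
assumptions; the actual definitions live in {\em CPP\_EEMACROS.h} and
{\em gsum\_jam.F}.
\begin{verbatim}
C     Sketch of a conditional macro binding (names and argument
C     lists are illustrative, not copied from CPP_EEMACROS.h).
#ifdef LETS_MAKE_JAM
#define _GSUM(a,b)  CALL GSUM_JAM( a, b )
#else
#define _GSUM(a,b)  CALL GLOBAL_SUM_R8( a, b )
#endif
\end{verbatim}
Because the numerical code only ever refers to \_GSUM, the
substitution requires no changes to application source.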

\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from
a series of template files, for example {\em exch\_rx.template}.
This is done to allow a large number of variations on the exchange
process to be maintained. One set of variations supports the
cube sphere grid. Support for a cube sphere grid in MITgcm is based
on having each face of the cube as a separate tile or tiles.
The exchange routines are then able to absorb much of the
detailed rotation and reorientation required when moving around the
cube grid. The set of {\em \_EXCH} routines that contain the
word cube in their name perform these transformations.
They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for
grid-centered and for grid-face and corner quantities.
Three sets of exchange routines are defined. Routines
with names of the form {\em exch\_rx} are used to exchange
cell-centered scalar quantities. Routines with names of the form
{\em exch\_uv\_rx} are used to exchange vector quantities located at
the C-grid velocity points. The vector quantities exchanged by the
{\em exch\_uv\_rx} routines can either be signed (for example velocity
components) or un-signed (for example grid-cell separations).
Routines with names of the form {\em exch\_z\_rx} are used to exchange
quantities at the C-grid vorticity point locations. A sketch of how
these exchange routines are invoked from model code is given below.
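
The fragment below sketches typical invocations of the exchange
routines. The routine names and argument lists shown are indicative
only (they are generated from the templates described above, and the
exact interfaces should be checked in the generated sources); the
field names are placeholders.
\begin{verbatim}
C     Sketch: update overlap regions of a cell-centered scalar field.
      CALL EXCH_XY_RL( theta, myThid )

C     Sketch: update a C-grid vector field pair. On the cubed sphere
C     the two components may be rotated/swapped at face boundaries,
C     and signed quantities (e.g. velocities) are treated differently
C     from unsigned ones, hence the withSigns argument.
      withSigns = .TRUE.
      CALL EXCH_UV_XY_RL( uVel, vVel, withSigns, myThid )
\end{verbatim}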


\section{MITgcm execution under WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:mitgcm_wrapper: -->
\end{rawhtml}

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sect:calling_sequence}

WRAPPER layer.

{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT               :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL   :: Minimal startup. Just enough to
       |  |                       allow basic I/O.
       |  |-- EEINTRO_MSG      :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS      :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV    :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS        :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT   :: Associate threads with grid regions.
       |      |
       |      |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                    :: communication data structures
       |
       |--CHECK_THREADS    :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN   :: Numerical code top-level driver routine

\end{verbatim}
}

Core equations plus packages.

{\footnotesize
\begin{verbatim}
C
C
C  Invocation from WRAPPER level....
C  :
C  :
C  |
C  |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
C    |              :: Called from WRAPPER level numerical
C    |              :: code invocation routine. On entry
C    |              :: to THE_MODEL_MAIN separate threads and
C    |              :: separate processes will have been established.
C    |              :: Each thread and process will have a unique ID
C    | |
C    | |-INI_PARMS :: Routine to set kernel model parameters.
C    | |           :: By default kernel parameters are read from file
C    | |           :: "data" in directory in which code executes.
C    | |
C    | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
C    | |
C    | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
C    | | |        :: Grid arrays are held and described in GRID.h.
C    | | |
C    | | |-INI_VERTICAL_GRID        :: Initialize vertical grid arrays.
C    | | |
C    | | |-INI_CARTESIAN_GRID       :: Cartesian horiz. grid initialization
C    | | |                          :: (calculate grid from kernel parameters).
C    | | |
C    | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
C    | | |                          :: initialization (calculate grid from
C    | | |                          :: kernel parameters).
C    | | |
C    | | |-INI_CURVILINEAR_GRID     :: General orthogonal, structured horiz.
C    | |                            :: grid initializations. ( input from raw
C    | |                            :: grid files, LONC.bin, DXF.bin etc... )
C    | |
C    | |-INI_DEPTHS    :: Read (from "bathyFile") or set bathymetry/orography.
C    | |
C    | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
C    | |
C    | |-INI_CORI          :: Set coriolis term. zero, f-plane, beta-plane,
C    | |                   :: sphere options are coded.
C    | |
C    | |-PACKAGES_BOOT      :: Start up the optional package environment.
C    | |                    :: Runtime selection of active packages.
C    | |
C    | |-PACKAGES_CHECK
C    | | |
C    | | |-KPP_CHECK           :: KPP Package. pkg/kpp
C    | | |-OBCS_CHECK          :: Open bndy Package. pkg/obcs
C    | | |-GMREDI_CHECK        :: GM Package. pkg/gmredi
C    | |
C    | |-PACKAGES_INIT_FIXED
C    |
C    |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
C    |
C    |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
C    |                 :: Automatically generated by TAMC/TAF.
C    |
C    |-CTRL_PACK   :: Control vector support package. see pkg/ctrl
C    |
C    :
C    :
C    | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
C    | | |
C    | | |-INI_CORI     :: Set coriolis term. zero, f-plane, beta-plane,
C    | | |              :: sphere options are coded.
C    | | |
C    | | |-INI_CG2D     :: 2d con. grad solver initialisation.
C    | | |-INI_CG3D     :: 3d con. grad solver initialisation.
C    | | |-INI_MIXING   :: Initialise diapycnal diffusivity.
C    | | |
C    | | |-INI_DYNVARS  :: Initialise to zero all DYNVARS.h arrays (dynamical
C    | | |              :: fields).
C    | | |
C    | | |-INI_FIELDS   :: Control initializing model fields to non-zero
C    | | | |-INI_VEL    :: Initialize 3D flow field.
C    | | | |-INI_THETA  :: Set model initial temperature field.
C    | | | |-INI_SALT   :: Set model initial salinity field.
C    :
C    :
C/\  | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
C/\  | | |-CALC_SURF_DR   :: Calculate the new surface level thickness.
C/\  | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
C/\  | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
C/\  | | | |                    :: Simple interpolation between end-points
C/\  | | | |                    :: for forcing datasets.
C/\  | | | |
C/\  | | | |-EXCH :: Sync forcing. in overlap regions.
C    :
C    :
C    |-COMM_STATS     :: Summarise inter-proc and inter-thread communication
C                     :: events.
C
\end{verbatim}
}

\subsection{Measuring and Characterizing Performance}
