% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment
within which both the core numerics and the pluggable packages
operate. The description presented here is intended to be a detailed
exposition and contains significant background material, as well as
advanced details on working with the WRAPPER.  The tutorial sections
of this manual (see sections \ref{sect:tutorials} and
\ref{sect:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
both sequentially and in parallel. For many projects simply starting
from an example code and adapting it to suit a particular situation
will be all that is required.  The first part of this chapter
discusses the MITgcm architecture at an abstract level. In the second
part of the chapter we describe practical details of the MITgcm
implementation and of current tools and operating system features
that are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

\begin{itemize}
\item We wish to be able to study a very broad range of interesting
  and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to a wide range of
  platforms.
\item On any given platform we would like to be able to achieve
  performance comparable to an implementation developed and
  specialized specifically for that platform.
\end{itemize}

These points are summarized in figure
\ref{fig:mitgcm_architecture_goals} which conveys the goals of the
MITgcm design. The goals lead to a software architecture which at a
high level can be viewed as consisting of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in
  detail in section \ref{chap:discretization}.

\item A scheme for supporting optional ``pluggable'' {\bf packages}
  (containing for example mixed-layer schemes, biogeochemical schemes,
  atmospheric physics).  These packages are used both to overlay
  alternate dynamics and to introduce specialized physical content
  onto the core numerical code. An overview of the {\bf package}
  scheme is given at the start of part \ref{chap:packagesI}.

\item A support framework called {\bf WRAPPER} (Wrappable Application
  Parallel Programming Environment Resource), within which the core
  numerics and pluggable packages operate.
\end{enumerate}

This chapter focuses on describing the {\bf WRAPPER} environment under
which both the core numerics and the pluggable packages function. The
description presented here is intended to be a detailed exposition and
contains significant background material, as well as advanced details
on working with the WRAPPER.  The examples section of this manual
(part \ref{chap:getting_started}) contains more succinct, step-by-step
instructions on running basic numerical experiments both sequentially
and in parallel. For many projects simply starting from an example
code and adapting it to suit a particular situation will be all that
is required.

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
\end{center}
\caption{The MITgcm architecture is designed to allow simulation of a
  wide range of physical problems on a wide range of hardware. The
  computational resource requirements of the applications targeted
  range from around $10^7$ bytes ($\approx 10$ megabytes) of memory to
  $10^{11}$ bytes ($\approx 100$ gigabytes). Arithmetic operation
  counts for the applications of interest range from $10^{9}$ floating
  point operations to more than $10^{17}$ floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in MITgcm
is a software superstructure and substructure collectively called the
WRAPPER (Wrappable Application Parallel Programming Environment
Resource). All numerical and support code in MITgcm is written to
``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
within the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to
insulate code that fits within it from architectural differences
between hardware platforms and operating systems. This allows
numerical code to be easily retargeted.

\begin{figure}
%% (body of figure \ref{fig:fit_in_wrapper} elided from this extract;
%% its caption ends ``... optimized for that platform.'')
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad as possible a range of
computer systems.  The original development of the WRAPPER took place
on a multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system's proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multi-processor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs.  Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems.  The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and
IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \cite{hoe-hill:99}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
large MPP systems). This is one example of a difference between
compute platforms that can impact an application. Another common
distinction is between vector processing systems with highly
specialized CPUs and memory subsystems and commodity microprocessor
based systems. There are numerous other differences, especially in
relation to how parallel execution is supported. To capture the
essential differences between different platforms the WRAPPER uses a
{\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
can easily be specialized to fit, in a computationally efficient
manner, any computer architecture currently available to the
scientific computing community.

\subsection{Machine model parallelism}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently.  Computational work is divided among the logical
processors by allocating ``ownership'' to each processor of a certain
set (or sets) of calculations. Each set of calculations owned by a
particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors.  It is perfectly
possible to execute a configuration decomposed for multiple logical
processors on a single physical processor.  This helps ensure that
numerical code that is written to fit within the WRAPPER will
parallelize with no additional effort.  It is also useful for
debugging purposes.  Generally, however, the computational domain will
be subdivided over multiple logical processors in order to then bind
those logical processors to physical processor resources that can
compute in parallel.

\subsubsection{Tiles}

Computationally, the data structures (\textit{e.g.} arrays, scalar
variables, etc.) that hold the simulated state are associated with
each region of physical space and are allocated to a particular
logical processor.  We refer to these data structures as being {\bf
  owned} by the processor to which their associated region of physical
space has been allocated.  Individual regions that are allocated to
processors are called {\bf tiles}.  A processor can own more than one
tile.  Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
single region of the domain (a single tile).  Except for periods of
communication and coordination, each processor computes autonomously,
working only with data from the tile (or tiles) that the processor
owns.  When multiple tiles are allotted to a single processor, each
tile is computed on independently of the other tiles, in a sequential
fashion.
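
As a concrete illustration, the sketch below shows the style of
declaration this implies.  It is a hedged example modeled on the
conventions of MITgcm's {\em SIZE.h} (the parameter names follow that
file; the numerical values are illustrative only): each field is
dimensioned over a tile interior of {\em sNx} $\times$ {\em sNy}
points, surrounded by overlap regions of width {\em OLx} and {\em
OLy}, with one copy per tile for the {\em nSx} $\times$ {\em nSy}
tiles owned by a process.

\begin{verbatim}
C     Hedged sketch, modeled on MITgcm SIZE.h conventions.
C     30x30 interior points per tile, 3-point overlaps, and a
C     2x2 set of tiles owned by this process (values illustrative).
      INTEGER sNx, sNy, OLx, OLy, nSx, nSy
      PARAMETER ( sNx = 30, sNy = 30 )
      PARAMETER ( OLx =  3, OLy =  3 )
      PARAMETER ( nSx =  2, nSy =  2 )
C     A per-tile field: interior plus overlap in x and y, and one
C     copy of the field for each tile this process owns.
      REAL*8 theta( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, nSx, nSy )
\end{verbatim}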

\begin{figure}
\begin{center}
  \includegraphics{part4/domain_decomp.eps}
\end{center}
\caption{The WRAPPER provides support for one and two dimensional
  decompositions of grid-point domains. The figure shows a
  hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
  hypothetical domain is decomposed in two-dimensions along the
  $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
    owned} by different processors. The {\bf owning} processors
  perform the arithmetic operations associated with a {\bf tile}.
  Although not illustrated here, a single processor can {\bf own}
  several {\bf tiles}.  Whenever a processor wishes to transfer data
  between tiles or communicate with other processors it calls a
  WRAPPER supplied function.}
\label{fig:domaindecomp}
\end{figure}

\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region.  The
overlap region of a tile corresponds to the interior region of an
adjacent tile.  In figure \ref{fig:tiledworld} each tile would own the
region within the black square and hold duplicate information for
overlap regions extending into the tiles to the north, south, east and
west.  During computational phases a processor will reference data in
an overlap region whenever it requires values that lie outside the
domain it owns.  Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sect:communication_primitives}).  The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
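
To make the role of the overlap concrete, the sketch below (a hedged
illustration, not MITgcm source; the routine and array names are
invented) shows a five-point stencil update over one tile.  When $i$
or $j$ is at a tile edge, the references to $i\pm1$ or $j\pm1$ fall in
the overlap region, so the loop only gives correct results if an
exchange has brought the overlap up to date beforehand.

\begin{verbatim}
      SUBROUTINE STENCIL_TILE( phi, phiNew )
C     Hedged sketch: five-point average over one tile interior.
      IMPLICIT NONE
      INTEGER sNx, sNy, OLx, OLy
      PARAMETER ( sNx = 30, sNy = 30, OLx = 3, OLy = 3 )
      REAL*8 phi   ( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy )
      REAL*8 phiNew( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy )
      INTEGER i, j
C     References such as phi(i-1,j) fall in the overlap region when
C     i=1; they are only valid if an exchange has updated the overlap.
      DO j = 1, sNy
       DO i = 1, sNx
        phiNew(i,j) = 0.25D0*( phi(i-1,j) + phi(i+1,j)
     &                       + phi(i,j-1) + phi(i,j+1) )
       ENDDO
      ENDDO
      RETURN
      END
\end{verbatim}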

\begin{figure}
%% (body of figure \ref{fig:tiledworld} elided from this extract)
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two
possible mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}.  Under this mode of
  communication data transfers are assumed to be possible using direct
  addressing of regions of memory.  In this case a CPU is able to read
  (and write) directly to regions of memory ``owned'' by another CPU
  using simple programming language level assignment operations of the
  sort shown in figure \ref{fig:simple_assign}.  In this way one
  CPU (CPU1 in the figure) can communicate information to another CPU
  (CPU2 in the figure) by assigning a particular value to a particular
  memory location.

\item {\bf Distributed memory communication}.  Under this mode of
  communication there is no mechanism, at the application code level,
  for directly addressing regions of memory owned and visible to
  another CPU. Instead a communication library must be used as
  illustrated in figure \ref{fig:comm_msg}. In this case CPUs must
  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another CPU
  owns. By default the WRAPPER binds to the MPI communication library
  \cite{MPI-std-20} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication.  The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

          CPU1                    |        CPU2
          ====                    |        ====
                                  |
        a(3) = 8                  |        WHILE ( a(3) .NE. 8 )
                                  |         WAIT
                                  |        END WHILE
                                  |
\end{verbatim}
\caption{In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level.
So, for example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

          CPU1                    |        CPU2
          ====                    |        ====
                                  |
        a(3) = 8                  |        WHILE ( a(3) .NE. 8 )
        CALL SEND( CPU2, a(3) )   |         CALL RECV( CPU1, a(3) )
                                  |        END WHILE
                                  |
\end{verbatim}
\caption{In the WRAPPER distributed memory communication model
data can not be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and/or CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
} \label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication independent CPUs operate on the
exact same global address space at the application level.  This means
that CPU 1 can directly write into global data structures that CPU 2
``owns'' using a simple assignment at the application level.  This is
the model of memory access that is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP
systems, and on distributed shared memory systems (\textit{e.g.} SGI
Origin, SGI Altix, and some AMD Opteron systems).  On such systems the
WRAPPER will generally use simple read and write statements to access
directly application data structures when communicating between CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication.  In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules.  The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect.  This can make
this method of communication very efficient provided it is used
appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors.  In particular, one issue the WRAPPER layer must
deal with is a system's memory model.  In general the order of reads
and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor
performing the application.  The processor performing the application
instructions will always operate so that, for the application
instructions the processor is executing, any reordering is not
apparent.  However, in general machines are often designed so that
reordering of instructions is not hidden from other processors.  This
means that, in general, even on a shared memory system two processors
can observe inconsistent memory values.

The issue of memory consistency between multiple processors is
discussed at length in many computer science papers.  From a practical
point of view, in order to deal with this issue, shared memory
machines all provide some mechanism to enforce memory consistency when
it is needed.  The exact mechanism employed will vary between systems.
For communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.
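
The sketch below illustrates the kind of sequence this implies.  It is
a hedged example: the subroutine name {\em MEMSYNC} is used here as a
placeholder for whatever WRAPPER-level synchronization hook a given
platform provides, and the variables are invented for illustration.

\begin{verbatim}
C     Hedged sketch: CPU1 publishes a result and then invokes a
C     memory synchronization point before raising a ready flag, so
C     that a reader polling the flag cannot see it set before the
C     data write has become visible.
      a(3)  = 8
      CALL MEMSYNC
      ready = 1
\end{verbatim}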

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have local-to-processor memory caches
which contain mirrored copies of main memory.  Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors.  These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks).  The coherency protocols employed can be expensive relative to other
memory accesses and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.
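
A hedged sketch of the padding idea follows (the parameter names are
illustrative; MITgcm's own headers define a similar cache-line size
constant).  Each thread's synchronization word is placed a full
cache line apart from its neighbor's, so that updates by one thread do
not invalidate the line holding another thread's flag (``false
sharing'').

\begin{verbatim}
C     Hedged sketch: pad per-thread flags to cache-line boundaries.
C     Only element (1,iThr) is ever used; the remaining words exist
C     solely to keep different threads' flags on different lines
C     (32 INTEGERs = 128 bytes, matching the granularity above).
      INTEGER cacheLineSize, maxThreads
      PARAMETER ( cacheLineSize = 32, maxThreads = 4 )
      INTEGER syncFlag( cacheLineSize, maxThreads )
\end{verbatim}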

\subsubsection{Operating system support for shared memory.}

Applications running under multiple threads within a single process
can use shared memory communication.  In this case {\it all} the
memory locations in an application are potentially visible to all the
compute threads. Multiple threads operating within a single process is
the standard mechanism for supporting shared memory that the WRAPPER
utilizes. Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sect:multi-threaded-execution}.  However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP \cite{magicgarden} and IPC \cite{magicgarden} facilities in UNIX
systems provide this capability as do vendor specific tools like LAPI
\cite{IBMLAPI} and IMC \cite{Memorychannel}.  Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory for
communication. For example cluster systems consist of individual
computers connected by a fast network. On such systems there is no
notion of shared memory at the system level. For this sort of system
the WRAPPER provides support for communication based on a bespoke
communication library (see figure \ref{fig:comm_msg}).  The default
communication library used is MPI \cite{MPI-std-20}. However, it is
relatively straightforward to implement bindings to optimized platform
specific communication libraries. For example the work described in
\cite{hoe-hill:99} substituted standard MPI communication for a highly
optimized library.
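
To give a flavor of the underlying library calls, the following hedged
sketch (invented routine and variable names, not the WRAPPER's actual
exchange code) sends the easternmost interior column of a tile to the
process owning the neighboring tile using plain MPI point-to-point
communication.

\begin{verbatim}
      SUBROUTINE SEND_EAST_EDGE( phi, neighborRank )
C     Hedged sketch, not MITgcm source: pack one tile edge into a
C     buffer and hand it to MPI for delivery to a neighboring process.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER sNx, sNy
      PARAMETER ( sNx = 30, sNy = 30 )
      REAL*8  phi( sNx, sNy )
      INTEGER neighborRank
      REAL*8  eBuf( sNy )
      INTEGER j, tag, ierr
      tag = 1
C     Pack the easternmost interior column of the tile.
      DO j = 1, sNy
        eBuf(j) = phi( sNx, j )
      ENDDO
C     The receiving process posts a matching MPI_RECV and copies the
C     buffer into the western overlap region of its own tile.
      CALL MPI_SEND( eBuf, sNy, MPI_DOUBLE_PRECISION,
     &               neighborRank, tag, MPI_COMM_WORLD, ierr )
      RETURN
      END
\end{verbatim}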

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
%% (graphic for figure \ref{fig:communication_primitives} elided from
%% this extract)
\end{center}
\caption{Three performance critical parallel primitives are provided
  by the WRAPPER. These primitives are always used to communicate data
  between tiles. The figure shows four tiles. The curved arrows
  indicate exchange primitives which transfer data between the overlap
  regions at tile edges and interior regions for nearest-neighbor
  tiles.  The straight arrows symbolize global sum operations which
  connect all tiles.  The global sum operation provides both a key
  arithmetic primitive and can serve as a synchronization primitive. A
  third barrier primitive is also provided; it behaves much like the
  global sum primitive.}
\label{fig:communication_primitives}
\end{figure}

Optimized communication support is assumed to be potentially available
for a small number of communication operations.  It is also assumed
that communication performance optimizations can be achieved by
optimizing a small number of communication primitives.  Three
optimizable primitives are provided by the WRAPPER; representative
call forms are sketched after the following list.
\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between
  interior and overlap regions of neighboring tiles. A number of
  different forms of this operation are supported. These different
  forms handle
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Exchange
    primitives select between using shared memory or distributed
    memory communication.
  \item Transformation operations required when transporting data
    between different grid regions. Transferring data between faces of
    a cube-sphere grid, for example, involves a rotation of vector
    components.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.
  \end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
  operation for the pressure inversion phase of the MITgcm algorithm.
  For certain configurations scaling can be highly sensitive to the
  performance of the global sum primitive. This operation is a
  collective operation involving all tiles of the simulated domain.
  Different forms of the global sum primitive exist for handling
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Global sum
    primitives select between using shared memory or distributed
    memory communication.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the global sum
    primitives.
  \end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization
  function called barrier. This is used to synchronize computations
  over all tiles.  The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
  have much in common and in some cases use the same underlying code.
\end{itemize}
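
The sketch below gives representative call forms.  It is hedged: the
WRAPPER's actual routine names vary with the data type and the shape
of the field being communicated, so these names should be read as
typical forms rather than a definitive interface.

\begin{verbatim}
C     Hedged sketch: representative calls to the three primitives
C     from numerical code holding thread identifier myThid.
C     Update all overlap regions of a 2-D 64-bit field:
      CALL EXCH_XY_RL( phi, myThid )
C     Form a domain-wide sum from per-tile partial sums:
      CALL GLOBAL_SUM_R8( sumPhi, myThid )
C     Hold every thread/process here until all arrive:
      CALL BARRIER( myThid )
\end{verbatim}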

%% (intervening material elided from this extract)

Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics

\begin{itemize}
\item The machine consists of one or more logical processors.
\item Each processor operates on tiles that it owns.
\item A processor may own more than one tile.
\item Processors may compute concurrently.
\item Exchange of information between tiles is handled by the
  machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each
  other using shared memory.
\item An alternative communication mechanism based on a relatively
  simple inter-process communication API may be required.
\item Shared memory may not necessarily obey sequential consistency;
  however some mechanism will exist for enforcing memory consistency.
\item Memory consistency that is enforced at the hardware level
  may be expensive. Unnecessary triggering of consistency protocols
  should be avoided.
\item Memory access patterns may need to be either repetitive or highly
  pipelined for optimum hardware performance.
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:using_the_wrapper: -->
\end{rawhtml}

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operations
\item controlling communication between tiles and between concurrently
  computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in
which a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_a_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems.  Section \ref{sect:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

%% (the bulk of this subsection is elided from this extract)

Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.
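
The canonical form of such a loop is sketched below.  This is a hedged
illustration: the loop-bound functions {\em myBxLo}, {\em myBxHi},
{\em myByLo} and {\em myByHi} follow WRAPPER naming conventions, but
the loop body and field names are invented for the example.

\begin{verbatim}
C     Hedged sketch of a canonical bi,bj tile loop (fragment).
C     myThid identifies the thread; myBxLo etc. give its tile range.
      INTEGER bi, bj, i, j
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j = 1, sNy
         DO i = 1, sNx
          phi(i,j,bi,bj) = phi(i,j,bi,bj) + deltaT*gPhi(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
\end{verbatim}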

An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere.  In this case {\em bj} is generally set to 1 and the loop runs
over {\em bi} alone.  Within the loop {\em bi} is used to retrieve the
tile number, which is then used to reference exchange parameters.

The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract.

%% (intervening material elided from this extract)

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
%% (additional entries elided from this extract)
Parameter:  {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for
this is that many system libraries are still not ``thread-safe''. This
means that, for example, on some systems it is not safe to call system
routines to perform I/O when running in multi-threaded mode (except,
perhaps, in a limited set of circumstances).  Another reason is that
support for multi-threaded programming models varies between systems.

Multi-process execution is more ubiquitous.  In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sect:specifying_a_decomposition}) is given (in which at least one
of the parameters {\em nPx} or {\em nPy} will be greater than one) and
then, as for multi-threaded operation, appropriate compile time and
run time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that portable MPI libraries are available for controlling the
start-up of multiple processes. The MPI libraries are not required,
although they are usually used, for performance-critical
communication. However, in order to simplify the task of controlling
and coordinating the start-up of a large number (hundreds and possibly
even thousands) of copies of the same program, MPI is used. The calls
to the MPI multi-process startup routines must be activated at compile
time.  Currently MPI libraries are invoked by specifying the
appropriate options file with the {\tt -of} flag when running the {\em
  genmake2} script, which generates the Makefile for compiling and
linking MITgcm.  (Previously this was done by setting the {\em
  ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
  CPP\_EEOPTIONS.h} file.)  More detailed information about the use of
{\em genmake2} for specifying local compiler flags is located in
section \ref{sect:genmake}.\\

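For example, a hypothetical build sequence (the options file name here
is illustrative; the files actually available are listed in {\em
tools/build\_options}) might be:
\begin{verbatim}
% ../../../tools/genmake2 -mpi -of=../../../tools/build_options/linux_ia32_g77
% make depend
% make
\end{verbatim}
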

\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
\paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.  For the
open-source MPICH system, the MITgcm program can be started using a
command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of
processes that will be created. The numeric value {\em 64} must be
equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The parameter {\em mf} specifies
that a text file called ``mf'' will be read to get a list of processor
names on which the sixty-four processes will execute. The syntax of
this file is specified by the MPI distribution.
\\
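
For instance, a processor grid consistent with the {\em -np 64}
example above might be set in {\em SIZE.h} as follows (an illustrative
fragment only, not a complete {\em SIZE.h}):
\begin{verbatim}
      PARAMETER (
     &           nPx =   8,
     &           nPy =   8 )
C     8 x 8 = 64 processes, matching "mpirun -np 64"
\end{verbatim}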

\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx} \\
Parameter: {\em nPy}
\end{minipage}
} \\

\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting of
a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with
the multi-threaded compiler on a machine will explain how to set the
required environment variables.
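For example, under a C shell on a hypothetical machine that honors the
PARALLEL variable, four threads might be requested with:
\begin{verbatim}
setenv PARALLEL 4
\end{verbatim}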

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate the
number of threads to be used in the x and y directions.  The variables
{\em nTx} and {\em nTy} in this file are used to specify the
information required. The product of {\em nTx} and {\em nTy} must be
equal to the number of threads spawned, i.e.\ the setting of the
environment variable PARALLEL.  The value of {\em nTx} must subdivide
the number of sub-domains in x ({\em nSx}) exactly. The value of {\em
  nTy} must subdivide the number of sub-domains in y ({\em nSy})
exactly.
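For example, an {\em eedata} fragment requesting a $2 \times 2$ thread
grid (the values shown are illustrative and must satisfy the rules
above) might read:
\begin{verbatim}
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}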

The multiprocess startup of the MITgcm executable {\em mitgcmuv} is
controlled by the routines {\em EEBOOT\_MINIMAL()} and {\em
  INI\_PROCS()}. The first routine performs basic steps required to
make sure each process is started and has a textual output stream
associated with it. By default two output files are opened for each
process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.  The {\bf
  NNNN} part of the name is filled in with the process number so that
process number 0 will create output files {\bf STDOUT.0000} and {\bf
  STDERR.0000}, process number 1 will create output files {\bf
  STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process by process basis.  The {\em EEBOOT\_MINIMAL()}
procedure also sets the variables {\em myProcId} and {\em
  MPI\_COMM\_MODEL}.  These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()} to
allocate tiles to processes.

Allocation of processes to tiles is controlled by the routine {\em
  INI\_PROCS()}. For each process this routine sets the variables {\em
  myXGlobalLo} and {\em myYGlobalLo}.  These variables specify, in
index space, the coordinates of the southernmost and westernmost
corner of the southernmost and westernmost tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
also set in this routine. These are used to identify processes holding
tiles to the west, east, south and north of a given process. These
values are stored in global storage in the header file {\em
  EESUPPORT.h} for use by communication routines.  The above does not
hold when the exch2 package is used.  The exch2 package sets its own
parameters to specify the global indices of tiles and their
relationships to each other.  See the documentation on the exch2
package (section \ref{sec:exch2}) for details.
\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/EESUPPORT.h}
\end{minipage}
} \\
The list below describes the information that is held and used.

\begin{enumerate}
\item {\bf Tile-tile connectivity information}
  For each tile the WRAPPER sets a flag that sets the tile number to
  the north, south, east and west of that tile. This number is unique
  over all tiles in a configuration. Except when using the cubed
  sphere and the exch2 package, the number is held in the variables
  {\em tileNo} (this holds the tile's own number), {\em tileNoN}, {\em
    tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
  stored with each tile that specifies the type of communication that
  is used between tiles.  This information is held in the variables
  {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
  {\em tileCommModeW}.  This latter set of variables can take one of
  the following values: {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
    COMM\_PUT} and {\em COMM\_GET}.  A value of {\em COMM\_NONE} is
  used to indicate that a tile has no neighbor to communicate with on
  a particular face. A value of {\em COMM\_MSG} is used to indicate
  that some form of distributed memory communication is required to
  communicate between these tile faces (see section
  \ref{sect:distributed_memory_communication}).  A value of {\em
    COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
  memory communication (see section
  \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value
  indicates that a CPU should communicate by writing to data
  structures owned by another CPU. A {\em COMM\_GET} value indicates
  that a CPU should communicate by reading from data structures owned
  by another CPU. These flags affect the behavior of the WRAPPER
  exchange primitive (see figure \ref{fig:communication_primitives}).
  The routine {\em ini\_communication\_patterns()} is responsible for
  setting the communication mode values for each tile.

  When using the cubed sphere configuration with the exch2 package,
  the relationships between tiles and their communication methods are
  set by the exch2 package and stored in different variables.  See the
  exch2 package documentation (section \ref{sec:exch2}) for details.

\fbox{
\begin{minipage}{4.75in}
Parameter: {\em tileNo} \\
Parameter: {\em tileNoE} \\
Parameter: {\em tileNoW} \\
Parameter: {\em tileNoN} \\
Parameter: {\em tileNoS} \\
Parameter: {\em tileCommModeE} \\
Parameter: {\em tileCommModeW} \\
Parameter: {\em tileCommModeN} \\
Parameter: {\em tileCommModeS} \\
\end{minipage}
} \\

\item {\bf MP directives}
  The WRAPPER transfers control to numerical application code through
  the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
  that allows for it to be invoked by several threads. Support for
  this is based on either multi-processing (MP) compiler directives or
  specific calls to multi-threading libraries (e.g.\ POSIX threads).
  Most commercially available Fortran compilers support the
  generation of code to spawn multiple threads through some form of
  compiler directives.  Compiler directives are generally more
  convenient than writing code to explicitly spawn threads, and on
  some systems they may be the only method available.  The WRAPPER is
  distributed with template MP directives for a number of systems.

  These directives are inserted into the code just before and after
  the transfer of control to numerical algorithm code through the
  routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
  an example of the code that performs this process for a Silicon
  Graphics system.  This code is extracted from the files {\em main.F}
  and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
  specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
  will be created. The value of {\em nThreads} is set in the routine
  {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
  product of the parameters {\em nTx} and {\em nTy} that are read from
  the file {\em eedata}. If the value of {\em nThreads} is
  inconsistent with the number of threads requested from the operating
  system (for example by using an environment variable as described in
  section \ref{sect:multi_threaded_execution}) then usually an error
  will be reported by the routine {\em CHECK\_THREADS}.

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/src/MAIN.F}\\
Parameter: {\em nTx} \\
Parameter: {\em nTy} \\
\end{minipage}
}

\item {\bf memsync flags}
  As discussed in section \ref{sect:memory_consistency}, a low-level
  system function may be needed to force memory consistency on some
  shared memory systems.  The routine {\em MEMSYNC()} is used for this
  purpose. This routine should not need modifying and the information
  below is only provided for completeness. A logical parameter {\em
    exchNeedsMemSync} set in the routine {\em
    INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
    MEMSYNC()} primitive is called. In general this routine is only
  used for multi-threaded execution.  The code that goes into the {\em
    MEMSYNC()} routine is specific to the compiler and processor used.
  In some cases, it must be written using a short code snippet of
  assembly language.  For an Ultra Sparc system the following code
  snippet is used
\begin{verbatim}
asm("membar #LoadStore|#StoreStore");
\end{verbatim}
For an Intel-based system the following code snippet is used
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory");
\end{verbatim}

\item {\bf Cache line size}
  As discussed in section \ref{sect:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with
  excessive coherence traffic on an SMP system. To do this the shared
  memory data structures used by the {\em GLOBAL\_SUM}, {\em
    GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
  that control the padding are set in the header file {\em
    EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
    lShare1}, {\em lShare4} and {\em lShare8}. The default values
  should not normally need changing.

\item {\bf \_BARRIER}
  This is a CPP macro that is expanded to a call to a routine which
  synchronizes all the logical processors running under the WRAPPER.
  Using a macro here preserves flexibility to insert a specialized
  call in-line into application code. By default this resolves to
  calling the procedure {\em BARRIER()}. The default setting for the
  \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.

\item {\bf \_GSUM}
  This is a CPP macro that is expanded to a call to a routine which
  sums up a floating point number over all the logical processors
  running under the WRAPPER. Using a macro here provides extra
  flexibility to insert a specialized call in-line into application
  code. By default this resolves to calling the procedure {\em
    GLOBAL\_SUM\_R8()} (for 64-bit floating point operands) or {\em
    GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
  default setting for the \_GSUM macro is given in the file {\em
    CPP\_EEMACROS.h}.  The \_GSUM macro is a performance critical
  operation, especially for large processor count, small tile size
  configurations.  The custom communication example discussed in
  section \ref{sect:jam_example} shows how the macro is used to invoke
  a custom global sum routine for a specific set of hardware.

\item {\bf \_EXCH}
  The \_EXCH CPP macro is used to update tile overlap regions.  It is
  qualified by a suffix indicating whether overlap updates are for
  two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
  physical fields and whether fields are 32-bit floating point (
  \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point (
  \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined in
  the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
  operation plays a crucial role in scaling to small tile, large
  logical and physical processor count configurations.  The example in
  section \ref{sect:jam_example} discusses defining an optimized and
  specialized form of the \_EXCH operation.

  The \_EXCH operation is also central to supporting grids such as the
  cube-sphere grid. In this class of grid a rotation may be required
  between tiles. Aligning the coordinate requiring rotation with the
  tile decomposition allows the coordinate transformation to be
  embedded within a custom form of the \_EXCH primitive.  In these
  cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
  package documentation (section \ref{sec:exch2}).

\item {\bf Reverse Mode}
  The communication primitives \_EXCH and \_GSUM both employ
  hand-written adjoint (or reverse mode) forms.  These reverse mode
  forms can be found in the source code directory {\em
    pkg/autodiff}.  For the global sum primitive the reverse mode form
  calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
  The reverse mode forms of the exchange primitives are found in
  routines prefixed {\em ADEXCH}. The exchange routines make calls to
  the same low-level communication primitives as the forward mode
  operations. However, the routine argument {\em simulationMode} is
  set to the value {\em REVERSE\_SIMULATION}. This signifies to the
  low-level routines that the adjoint forms of the appropriate
  communication operation should be performed.

\item {\bf MAX\_NO\_THREADS}
  The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
  number of OS threads that a code will use. This value defaults to
  thirty-two and is set in the file {\em EEPARAMS.h}.  For single
  threaded execution it can be reduced to one if required.  The value
  is largely private to the WRAPPER and application code will not
  normally reference the value, except in the following scenario.

  For certain physical parametrization schemes it is necessary to have
  a substantial number of work arrays. Where these arrays are
  allocated in heap storage (for example COMMON blocks) multi-threaded
  execution will require multiple instances of the COMMON block data.
  This can be achieved using a Fortran 90 module construct.  However,
  if this mechanism is unavailable then the work arrays can be
  extended with dimensions using the tile dimensioning scheme of {\em
    nSx} and {\em nSy} (as described in section
  \ref{sect:specifying_a_decomposition}); a sketch of this alternative
  is given just after this list. However, if the configuration being
  specified involves many more tiles than OS threads then it can save
  memory resources to reduce the variable {\em MAX\_NO\_THREADS} to be
  equal to the actual number of threads that will be used and to
  declare the physical parameterization work arrays with a single {\em
    MAX\_NO\_THREADS} extra dimension.  An example of this is given in
  the verification experiment {\em aim.5l\_cs}. Here the default
  setting of {\em MAX\_NO\_THREADS} is altered to
\begin{verbatim}
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS =    6 )
\end{verbatim}
  and several work arrays for storing intermediate calculations are
  created with declarations of the form:
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
  This declaration scheme is not used widely, because most global data
  is used for permanent not temporary storage of state information.
  In the case of permanent state information this approach cannot be
  used because there has to be enough storage allocated for all tiles.
  However, the technique can sometimes be a useful scheme for reducing
  memory requirements in complex physical parameterizations.
\end{enumerate}
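
As a sketch of the tile-dimensioned alternative mentioned under {\bf
MAX\_NO\_THREADS} above (the array and COMMON block names here are
hypothetical, not taken from the model source):
\begin{verbatim}
C     Illustrative work array dimensioned per tile (nSx,nSy)
C     rather than per thread (MAX_NO_THREADS).
      REAL*8 phiWork(sNx,sNy,nSx,nSy)
      COMMON /PHI_WORK/ phiWork
\end{verbatim}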

\begin{figure}
{\footnotesize
\begin{verbatim}
C--
C--    Parallel directives for MIPS Pro Fortran compiler
C--
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
}
\caption{Prior to transferring control to the procedure {\em
    THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
  multiple threads.} \label{fig:mp_directives}
\end{figure}

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms a big performance boost can be obtained by binding
the communication routines {\em \_EXCH} and {\em \_GSUM} to
specialized native libraries (for example, the shmem library on CRAY
T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
illustration of a specialized communication configuration that
substitutes for standard, portable forms of {\em \_EXCH} and {\em
  \_GSUM}. It affects three source files: {\em eeboot.F}, {\em
  CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined it has
the following effects.
\begin{itemize}
\item An extra phase is included at boot time to initialize the custom
  communications library (see {\em ini\_jam.F}).
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
  with calls to custom routines (see {\em gsum\_jam.F} and {\em
    exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
  for overlap regions of width one) is substituted into the elliptic
  solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern.
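
To illustrate the mechanism only (the actual macro bodies live in {\em
CPP\_EEMACROS.h}; the routine name and argument list below are
hypothetical), the substitution can be sketched as:
\begin{verbatim}
#ifdef LETS_MAKE_JAM
C     Bind the global sum to the custom JAM routine
#define _GSUM(a) CALL GSUM_JAM(a)
#else
C     Default: portable global sum
#define _GSUM(a) CALL GLOBAL_SUM_R8(a)
#endif
\end{verbatim}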

\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from a
series of template files, for example {\em exch\_rx.template}.  This
is done to allow a large number of variations on the exchange process
to be maintained. One set of variations supports the cube sphere grid.
Support for a cube sphere grid in MITgcm is based on having each face
of the cube as a separate tile or tiles.  The exchange routines are
then able to absorb much of the detailed rotation and reorientation
required when moving around the cube grid. The set of {\em \_EXCH}
routines that contain the word cube in their name perform these
transformations.  They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for grid-centered
and for grid-face and grid-corner quantities.  Three sets of exchange
routines are defined. Routines with names of the form {\em exch\_rx}
are used to exchange cell centered scalar quantities. Routines with
names of the form {\em exch\_uv\_rx} are used to exchange vector
quantities located at the C-grid velocity points. The vector
quantities exchanged by the {\em exch\_uv\_rx} routines can either be
signed (for example velocity components) or un-signed (for example
grid-cell separations).  Routines with names of the form {\em
  exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
point locations.
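
As a purely illustrative sketch (the argument list is an assumption,
not a definitive interface), updating the overlap regions of a
cell-centered, 64-bit, three-dimensional field might be written using
the macro form described above:
\begin{verbatim}
C     Sketch only: exchange overlap regions of a cell-centered
C     scalar field; the argument list shown is illustrative.
      _EXCH_XYZ_R8( theta, myThid )
\end{verbatim}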




\section{MITgcm execution under WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:mitgcm_wrapper: -->
\end{rawhtml}

Fitting together the WRAPPER elements, package elements and MITgcm
core equation elements of the source code produces the calling
sequence shown below (core equations plus packages).
{\footnotesize
\begin{verbatim}
C
C Invocation from WRAPPER level...
C  :
C  :
C  :
C    | | |-CTRL_INIT            :: Control vector package. see pkg/ctrl
C    | | |-OPTIM_READPARMS      :: Optimisation support package. see pkg/ctrl
C    | | |-GRDCHK_READPARMS     :: Gradient check package. see pkg/grdchk
C    | | |-ECCO_READPARMS       :: ECCO Support Package. see pkg/ecco
C    | | |-PTRACERS_READPARMS   :: multiple tracer package, see pkg/ptracers
C    | | |-GCHEM_READPARMS      :: tracer interface package, see pkg/gchem
C    | |
C    | |-PACKAGES_CHECK
C    | | |
C    | | |-KPP_CHECK            :: KPP Package. pkg/kpp
C    | | |-OBCS_CHECK           :: Open bndy Package. pkg/obcs
C    | | |-GMREDI_CHECK         :: GM Package. pkg/gmredi
C    | |
C    | |-PACKAGES_INIT_FIXED
C    | | |-OBCS_INIT_FIXED      :: Open bndy Package. see pkg/obcs
C    | | |-FLT_INIT             :: Floats Package. see pkg/flt
C    | | |-GCHEM_INIT_FIXED     :: tracer interface package, see pkg/gchem
C    | |
C    | |-ZONAL_FILT_INIT        :: FFT filter Package. see pkg/zonal_filt
C    | |
C    | |-INI_CG2D               :: 2d con. grad solver initialization.
C    | |
C    | |-INI_CG3D               :: 3d con. grad solver initialization.
C    | |
C    | |-CONFIG_SUMMARY         :: Provide synopsis of kernel setup.
C    |                          :: Includes annotated table of kernel
C  :
C    | | |-INI_CORI     :: Set coriolis term. zero, f-plane, beta-plane,
C    | | |              :: sphere options are coded.
C    | | |
C    | | |-INI_CG2D     :: 2d con. grad solver initialization.
C    | | |-INI_CG3D     :: 3d con. grad solver initialization.
C    | | |-INI_MIXING   :: Initialize diapycnal diffusivity.
C    | | |-INI_DYNVARS  :: Initialize to zero all DYNVARS.h arrays (dynamical
C    | | |              :: fields).
C    | | |
C    | | |-INI_FIELDS   :: Control initializing model fields to non-zero
C  :
C    | | | |-INI_VEL    :: Initialize 3D velocity field.
C    | | | |-INI_THETA  :: Set model initial temperature field.
C    | | | |-INI_SALT   :: Set model initial salinity field.
C    | | | |-INI_PSURF  :: Set model initial free-surface height/pressure.
C    | | | |-INI_PRESSURE :: Compute model initial hydrostatic pressure
C    | | | |-READ_CHECKPOINT :: Read the checkpoint
C    | | |
C    | | |-THE_CORRECTION_STEP :: Step forward to next time step.
C    | | | |                   :: Here applied to move restart conditions
C  :
C    | | | |-FIND_RHO  :: Find adjacent densities.
C    | | | |-CONVECT   :: Mix static instability.
C    | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
C    | | |
C    | | |-PACKAGES_INIT_VARIABLES :: Does initialization of time evolving
C    | | | |                       :: package data.
C    | | | |
C    | | | |-GMREDI_INIT          :: GM package. ( see pkg/gmredi )
C    | | | |-KPP_INIT             :: KPP package. ( see pkg/kpp )
C    | | | |-KPP_OPEN_DIAGS
C    | | | |-OBCS_INIT_VARIABLES  :: Open bndy. package. ( see pkg/obcs )
C    | | | |-PTRACERS_INIT        :: multi. tracer package. ( see pkg/ptracers )
C    | | | |-GCHEM_INIT           :: tracer interface pkg. ( see pkg/gchem )
C    | | | |-AIM_INIT             :: Interm. atmos package. ( see pkg/aim )
C    | | | |-CTRL_MAP_INI         :: Control vector package. ( see pkg/ctrl )
C    | | | |-COST_INIT            :: Cost function package. ( see pkg/cost )
C  :
C/\  | | | |                    :: Simple interpolation in time
C/\  | | | |                    :: for forcing datasets.
C/\  | | | |
C/\  | | | |-EXCH :: Sync forcing. in overlap regions.
C/\  | | |-SEAICE_MODEL   :: Compute sea-ice terms. ( pkg/seaice )
C/\  | | |-FREEZE         :: Limit surface temperature.
C/\  | | |-GCHEM_FIELD_LOAD :: load tracer forcing fields (pkg/gchem)
C/\  | | |
C/\  | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
C/\  | | | |
C/\  | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
C/\  | | | |-OBCS_APPLY_W    :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |-FIND_RHO        :: Calculates [rho(S,T,z)-RhoConst] of a slice
C/\  | | | |-GRAD_SIGMA      :: Calculate isoneutral gradients
C/\  | | | |-CALC_IVDC       :: Set Implicit Vertical Diffusivity for Convection
C/\  | | | |
C/\  | | | |-OBCS_CALC            :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
C/\  | | | | |                    :: forcing terms.
C/\  | | | | |-PTRACERS_FORCING_SURF :: Tracer package ( see pkg/ptracers ).
C/\  | | | |
C/\  | | | |-GMREDI_CALC_TENSOR   :: GM package ( see pkg/gmredi ).
C/\  | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
C  :
C/\  | | | |-CALC_GT              :: Calculate the temperature tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_T  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
C  :
C/\  | | | |-CALC_GS              :: Calculate the salinity tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_S  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | |
C/\  | | | |-TIMESTEP_TRACER      :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-PTRACERS_INTEGRATE   :: Integrate other tracer(s) (see pkg/ptracers).
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_PTR:: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-PTRACERS_FORCING   :: Problem specific forcing for tracer.
C/\  | | | | |-GCHEM_FORCING_INT  :: tracer forcing for gchem pkg (if all
C/\  | | | | |                       tendency terms calculated together)
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | | |-TIMESTEP_TRACER    :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package (see pkg/obcs ).
C/\  | | | |
C/\  | | | |-IMPLDIFF             :: Solve vertical implicit diffusion equation.
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package (see pkg/obcs ).
C  :
C/\  | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
C/\  | | | |-EXCH
C/\  | | |
C/\  | | |-GCHEM_FORCING_SEP :: tracer forcing for gchem pkg (if
C/\  | | |                      tracer dependent tendencies calculated
C/\  | | |                      separately)
C/\  | | |
C/\  | | |-FLT_MAIN         :: Float package ( pkg/flt ).
C/\  | | |
C/\  | | |-MONITOR          :: Monitor package ( pkg/monitor ).
C  :
C/\  | | | |-TIMEAVE_STATV_WRITE :: Time averaging package ( pkg/timeave ).
C/\  | | | |-AIM_WRITE_DIAGS     :: Intermed. atmos diags. see pkg/aim
C/\  | | | |-GMREDI_DIAGS        :: GM diags. see pkg/gmredi
C/\  | | | |-KPP_DO_DIAGS        :: KPP diags. see pkg/kpp
C/\  | | | |-SBO_CALC            :: SBO diags. see pkg/sbo
C/\  | | | |-SBO_DIAGS           :: SBO diags. see pkg/sbo
C/\  | | | |-SEAICE_DO_DIAGS     :: SEAICE diags. see pkg/seaice
C/\  | | | |-GCHEM_DIAGS         :: gchem diags. see pkg/gchem
C/\  | | |
C/\  | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
C/\  | |
C  :
C    |-TIMER_PRINTALL :: Computational timing summary
C    |
C    |-COMM_STATS     :: Summarise inter-proc and inter-thread communication
C                     :: events.
C
\end{verbatim}
}

