% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment
within which both the core numerics and the pluggable packages
operate. The description presented here is intended to be a detailed
exposition and contains significant background material, as well as
advanced details on working with the WRAPPER.  The tutorial sections
of this manual (see sections \ref{sect:tutorials} and
\ref{sect:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
both sequentially and in parallel. For many projects simply starting
from an example code and adapting it to suit a particular situation
will be all that is required.  The first part of this chapter
discusses the MITgcm architecture at an abstract level. In the second
part of the chapter we describe practical details of the MITgcm
implementation and of current tools and operating system features that
are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

\begin{itemize}
\item We wish to be able to study a very broad range of interesting
  and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to a wide range of
  platforms.
\item On any given platform we would like to be able to achieve
  performance comparable to an implementation developed and
  specialized specifically for that platform.
\end{itemize}

These points are summarized in figure
\ref{fig:mitgcm_architecture_goals}, which conveys the goals of the
MITgcm design. The goals lead to a software architecture which at a
high level can be viewed as consisting of:

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in
  detail in section \ref{chap:discretization}.

\item A scheme for supporting optional ``pluggable'' {\bf packages}
  (containing for example mixed-layer schemes, biogeochemical schemes,
  atmospheric physics).  These packages are used both to overlay
  alternate dynamics and to introduce specialized physical content
  onto the core numerical code. An overview of the {\bf package}
  scheme is given at the start of part \ref{chap:packagesI}.

\item A support framework called {\bf WRAPPER} (Wrappable Application
  Parallel Programming Environment Resource), within which the core
  numerics and pluggable packages operate.
\end{enumerate}

This chapter focuses on describing the {\bf WRAPPER} environment under
which both the core numerics and the pluggable packages function. The
description presented here is intended to be a detailed exposition and
contains significant background material, as well as advanced details
on working with the WRAPPER.  The examples section of this manual
(part \ref{chap:getting_started}) contains more succinct, step-by-step
instructions on running basic numerical experiments both sequentially
and in parallel. For many projects simply starting from an example
code and adapting it to suit a particular situation will be all that
is required.

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
\end{center}
\caption{The MITgcm architecture is designed to allow simulation of a
  wide range of physical problems on a wide range of hardware. The
  computational resource requirements of the applications targeted
  range from around $10^7$ bytes ($\approx 10$ megabytes) of memory to
  $10^{11}$ bytes ($\approx 100$ gigabytes). Arithmetic operation
  counts for the applications of interest range from $10^{9}$ floating
  point operations to more than $10^{17}$ floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in MITgcm
is a software superstructure and substructure collectively called the
WRAPPER (Wrappable Application Parallel Programming Environment
Resource). All numerical and support code in MITgcm is written to
``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
within the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to
insulate code that fits within it from architectural differences
between hardware platforms and operating systems. This allows
numerical code to be easily retargeted.

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad as possible a range of
computer systems.  The original development of the WRAPPER took place
on a multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system's proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multi-processor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs.  Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems.  The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and
IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \ref{ref hoe and hill, ecmwf}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
large MPP systems). This is one example of a difference between
compute platforms that can impact an application. Another common
distinction is between vector processing systems with highly
specialized CPUs and memory subsystems and commodity microprocessor
based systems. There are numerous other differences, especially in
relation to how parallel execution is supported. To capture the
essential differences between different platforms the WRAPPER uses a
{\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
can easily be specialized to fit, in a computationally efficient
manner, any computer architecture currently available to the
scientific computing community.

\subsection{Machine model parallelism}
\label{sect:domain_decomposition}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently.  Computational work is divided among the logical
processors by allocating ``ownership'' to each processor of a certain
set (or sets) of calculations. Each set of calculations owned by a
particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors.  It is perfectly
possible to execute a configuration decomposed for multiple logical
processors on a single physical processor.  This helps ensure that
numerical code that is written to fit within the WRAPPER will
parallelize with no additional effort.  It is also useful for
debugging purposes.  Generally, however, the computational domain will
be subdivided over multiple logical processors in order to then bind
those logical processors to physical processor resources that can
compute in parallel.

\subsubsection{Tiles}

Computationally, the data structures (\textit{e.g.} arrays, scalar
variables, etc.) that hold the simulated state are associated with
each region of physical space and are allocated to a particular
logical processor.  We refer to these data structures as being {\bf
owned} by the processor to which their associated region of physical
space has been allocated.  Individual regions that are allocated to
processors are called {\bf tiles}.  A processor can own more than one
tile.  Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
single region of the domain (a single tile).  Except for periods of
communication and coordination, each processor computes autonomously,
working only with data from the tile (or tiles) that the processor
owns.  When multiple tiles are allotted to a single processor, each
tile is computed on independently of the other tiles, in a sequential
fashion.

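To make the notion of ownership concrete, the declaration below
sketches how a tiled field is typically dimensioned in code written
for the WRAPPER. The names follow the conventions used later in this
chapter ({\em sNx}, {\em sNy} for the tile interior extents, {\em
OLx}, {\em OLy} for the overlap widths, {\em Nr} for the number of
vertical levels, {\em nSx}, {\em nSy} for the number of tiles per
process, and {\em \_RL} for the WRAPPER floating point type); treat it
as a schematic rather than a definitive declaration.

\begin{verbatim}
C     One copy of the field is held for every tile (bi,bj) that this
C     process owns.  Each copy covers the tile interior (sNx by sNy
C     points) plus overlap rims of width OLx and OLy.
      _RL theta( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, Nr, nSx, nSy )
\end{verbatim}
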
\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/domain_decomp.eps}
 }
\end{center}
\caption{The WRAPPER provides support for one and two dimensional
  decompositions of grid-point domains. The figure shows a
  hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
  hypothetical domain is decomposed in two dimensions along the
  $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
  owned} by different processors. The {\bf owning} processors
  perform the arithmetic operations associated with a {\bf tile}.
  Although not illustrated here, a single processor can {\bf own}
  several {\bf tiles}.  Whenever a processor wishes to transfer data
  between tiles or communicate with other processors it calls a
  WRAPPER supplied function.}
\label{fig:domaindecomp}
\end{figure}

\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region.  The
overlap region of a tile corresponds to the interior region of an
adjacent tile.  In figure \ref{fig:tiledworld} each tile would own the
region within the black square and hold duplicate information for
overlap regions extending into the tiles to the north, south, east and
west.  During computational phases a processor will reference data in
an overlap region whenever it requires values that lie outside the
domain it owns.  Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sect:communication_primitives}).  The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.

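A minimal sketch of a computational phase illustrates this: the loop
below runs only over the interior points a tile owns, but at $i=1$ the
reference to the $i-1$ neighbor falls in the overlap region, whose
contents must have been brought up to date beforehand. The field names
{\em phi} and {\em dPhiDx} are illustrative.

\begin{verbatim}
C     Interior-only loop that nevertheless reads overlap data: when
C     i=1, phi(i-1,j,bi,bj) lies in the halo filled from the
C     neighboring tile by a WRAPPER exchange.
      DO j = 1, sNy
       DO i = 1, sNx
        dPhiDx(i,j,bi,bj) = phi(i,j,bi,bj) - phi(i-1,j,bi,bj)
       ENDDO
      ENDDO
\end{verbatim}
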
\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/tiled-world.eps}
 }
\end{center}
\caption{A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
Overlap regions are periodically updated from neighboring tiles.
} \label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two
possible mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}.  Under this mode of
  communication data transfers are assumed to be possible using direct
  addressing of regions of memory.  In this case a CPU is able to read
  (and write) directly to regions of memory ``owned'' by another CPU
  using simple programming language level assignment operations of the
  sort shown in figure \ref{fig:simple_assign}.  In this way one CPU
  (CPU1 in the figure) can communicate information to another CPU
  (CPU2 in the figure) by assigning a particular value to a particular
  memory location.

\item {\bf Distributed memory communication}.  Under this mode of
  communication there is no mechanism, at the application code level,
  for directly addressing regions of memory owned and visible to
  another CPU. Instead a communication library must be used, as
  illustrated in figure \ref{fig:comm_msg}. In this case CPUs must
  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another CPU
  owns. By default the WRAPPER binds to the MPI communication library
  \ref{MPI} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication.  The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

             CPU1                          |        CPU2
             ====                          |        ====
                                           |
        1.  a(3) = 8                       |
        2.                                 |        WHILE ( a(3) .NE. 8 )
                                           |         WAIT
                                           |        END WHILE
                                           |
\end{verbatim}
\caption{In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level.
So that, for example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

             CPU1                          |        CPU2
             ====                          |        ====
                                           |
        1.  a(3) = 8                       |
        2.  CALL SEND( CPU2, a(3) )        |
        3.                                 |        CALL RECV( CPU1, b )
                                           |
\end{verbatim}
\caption{In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and/or CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
} \label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication independent CPUs operate on the
exact same global address space at the application level.  This means
that CPU 1 can directly write into global data structures that CPU 2
``owns'' using a simple assignment at the application level.  This is
the model of memory access that is supported at the basic system design
level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (\textit{e.g.} SGI Origin, SGI
Altix, and some AMD Opteron systems).  On such systems the WRAPPER
will generally use simple read and write statements to access
application data structures directly when communicating between CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication.  In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules.  The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect.  This can make this
method of communication very efficient provided it is used
appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors.  In particular, one issue the WRAPPER layer must
deal with is a system's memory model.  In general the order of reads
and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor
performing the application.  The processor performing the application
instructions will always operate so that, for the application
instructions the processor is executing, any reordering is not
apparent.  However, in general machines are often designed so that
reordering of instructions is not hidden from other processors.  This
means that, in general, even on a shared memory system two processors
can observe inconsistent memory values.

The issue of memory consistency between multiple processors is
discussed at length in many computer science papers.  From a practical
point of view, in order to deal with this issue, shared memory
machines all provide some mechanism to enforce memory consistency when
it is needed.  The exact mechanism employed will vary between systems.
For communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have local to processor memory caches
which contain mirrored copies of main memory.  Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors.  These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks).  The coherency protocols employed can be expensive relative to other
memory accesses and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.

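The sketch below illustrates the padding idea under the assumption of
a 128 byte coherence granularity; the array and parameter names are
hypothetical. Giving each thread's entry its own cache line means an
update by one thread does not invalidate the line holding another
thread's entry.

\begin{verbatim}
C     Hypothetical padded synchronization structure.  With 4 byte
C     integers and an assumed 128 byte cache line, the leading
C     dimension of 32 places each thread's flag in a separate line,
C     avoiding false sharing between threads.
      INTEGER lineWords
      PARAMETER ( lineWords = 128/4 )
      INTEGER syncFlags( lineWords, MAX_NO_THREADS )
C     Thread myThid only ever reads and writes syncFlags(1,myThid).
\end{verbatim}
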
\subsubsection{Operating system support for shared memory}

Applications running under multiple threads within a single process
can use shared memory communication.  In this case {\it all} the
memory locations in an application are potentially visible to all the
compute threads. Multiple threads operating within a single process is
the standard mechanism for supporting shared memory that the WRAPPER
utilizes. Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sect:multi-threaded-execution}.  However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP \ref{magicgarden} and IPC \ref{magicgarden} facilities in UNIX
systems provide this capability, as do vendor specific tools like LAPI
\ref{IBMLAPI} and IMC \ref{Memorychannel}.  Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}

Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory for
communication. For example, cluster systems consist of individual
computers connected by a fast network. On such systems there is no
notion of shared memory at the system level. For this sort of system
the WRAPPER provides support for communication based on a bespoke
communication library (see figure \ref{fig:comm_msg}).  The default
communication library used is MPI \cite{MPI-std-20}. However, it is
relatively straightforward to implement bindings to optimized platform
specific communication libraries. For example the work described in
\ref{hoe-hill:99} substituted standard MPI communication for a highly
optimized library.

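As a rough illustration of what such a library binding reduces to, the
fragment below uses two standard MPI point-to-point calls to move a
buffer of overlap-region data between two processes. The buffer,
length and rank variables are hypothetical; only the MPI routine names
and argument orders are standard.

\begin{verbatim}
C     Illustrative point-to-point transfer with standard MPI calls.
C     theBuf, bufLen, nborRank, theTag and iAmSender are hypothetical.
      INCLUDE 'mpif.h'
      INTEGER ierr, stat(MPI_STATUS_SIZE)
      INTEGER bufLen, nborRank, theTag
      LOGICAL iAmSender
      REAL*8  theBuf( 1000 )

      IF ( iAmSender ) THEN
C       Ship overlap-region data to the neighboring process.
        CALL MPI_SEND( theBuf, bufLen, MPI_DOUBLE_PRECISION,
     &                 nborRank, theTag, MPI_COMM_WORLD, ierr )
      ELSE
C       Receive the data into this process's copy of the overlap.
        CALL MPI_RECV( theBuf, bufLen, MPI_DOUBLE_PRECISION,
     &                 nborRank, theTag, MPI_COMM_WORLD, stat, ierr )
      ENDIF
\end{verbatim}
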
\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/comm-primm.eps}
 }
\end{center}
\caption{Three performance critical parallel primitives are provided
  by the WRAPPER. These primitives are always used to communicate data
  between tiles. The figure shows four tiles. The curved arrows
  indicate exchange primitives which transfer data between the overlap
  regions at tile edges and interior regions for nearest-neighbor
  tiles.  The straight arrows symbolize global sum operations which
  connect all tiles.  The global sum operation provides both a key
  arithmetic primitive and can serve as a synchronization primitive. A
  third barrier primitive is also provided; it behaves much like the
  global sum primitive.}
\label{fig:communication_primitives}
\end{figure}

Optimized communication support is assumed to be potentially available
for a small number of communication operations.  It is also assumed
that communication performance optimizations can be achieved by
optimizing a small number of communication primitives.  Three
optimizable primitives are provided by the WRAPPER; a schematic of
their use follows the list below.

\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between
  interior and overlap regions of neighboring tiles. A number of
  different forms of this operation are supported. These different
  forms handle
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Exchange
    primitives select between using shared memory or distributed
    memory communication.
  \item Transformation operations required when transporting data
    between different grid regions. Transferring data between faces of
    a cube-sphere grid, for example, involves a rotation of vector
    components.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.
  \end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
  operation for the pressure inversion phase of the MITgcm algorithm.
  For certain configurations scaling can be highly sensitive to the
  performance of the global sum primitive. This operation is a
  collective operation involving all tiles of the simulated domain.
  Different forms of the global sum primitive exist for handling
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Global sum
    primitives select between using shared memory or distributed
    memory communication.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the global sum
    primitives.
  \end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization
  function called barrier. This is used to synchronize computations
  over all tiles.  The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
  have much in common and in some cases use the same underlying code.
\end{itemize}
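
The schematic below indicates the shape of the calls application code
makes to the three primitives. The routine names follow the style of
the WRAPPER support code and the argument lists are abbreviated; the
exact interfaces should be checked against the {\em eesupp} sources.

\begin{verbatim}
C     Schematic use of the three primitives from numerical code.
C     phi is a tiled field, sumPhi a scalar partial result, and
C     myThid identifies the calling thread.

C     EXCHANGE: bring the overlap regions of phi up to date.
      CALL EXCH_XY_RL( phi, myThid )

C     GLOBAL SUM: collective sum of sumPhi over all tiles.
      CALL GLOBAL_SUM_R8( sumPhi, myThid )

C     BARRIER: synchronize all threads and processes.
      CALL BARRIER( myThid )
\end{verbatim}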

Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:

\begin{itemize}
\item The machine consists of one or more logical processors.
\item Each processor operates on tiles that it owns.
\item A processor may own more than one tile.
\item Processors may compute concurrently.
\item Exchange of information between tiles is handled by the
  machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each
  other using shared memory.
\item An alternative communication mechanism based on a relatively
  simple inter-process communication API may be required.
\item Shared memory may not necessarily obey sequential consistency;
  however some mechanism will exist for enforcing memory consistency.
\item Memory consistency that is enforced at the hardware level
  may be expensive. Unnecessary triggering of consistency protocols
  should be avoided.
\item Memory access patterns may need to be either repetitive or highly
  pipelined for optimum hardware performance.
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:using_the_wrapper: -->
\end{rawhtml}

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operations
\item controlling communication between tiles and between concurrently
  computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in
which a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_a_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems.  Section \ref{sect:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}
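
The decomposition is expressed through a small set of integer
parameters held in the {\em SIZE.h} header: the tile interior extents
{\em sNx} and {\em sNy}, the overlap widths {\em OLx} and {\em OLy},
the number of tiles per process {\em nSx} and {\em nSy}, and the
process counts {\em nPx} and {\em nPy}. The block below is a purely
illustrative sketch; it describes a ninety point by forty point global
domain split in x across two processes, each owning a single $45
\times 40$ tile. Each tile owned by a process is referenced in the
code by a pair of indices ({\em bi},{\em bj}).

\begin{verbatim}
      PARAMETER (
     &           sNx =  45,
     &           sNy =  40,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   1)
\end{verbatim}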

Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

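A minimal sketch of this loop structure is shown below. The bounds
{\em myBxLo}, {\em myBxHi}, {\em myByLo} and {\em myByHi} delimit the
tiles assigned to thread {\em myThid}; the field names and the update
itself are illustrative.

\begin{verbatim}
C     Each thread sweeps the tiles it owns; the inner loops cover the
C     interior points of one tile.
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j = 1, sNy
         DO i = 1, sNx
          phi(i,j,bi,bj) = phi(i,j,bi,bj) + deltaT*gPhi(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
\end{verbatim}
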
An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere.  In this case {\em bj} is generally set to 1 and the loop runs from
1,{\em bi}.  Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.
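
Schematically this looks as follows; the tile-list array name is taken
from the exch2 package sources and should be treated as indicative
rather than definitive.

\begin{verbatim}
C     exch2 convention: bj is held at 1 and bi runs over this thread's
C     tiles.  The tile number is recovered from the tile list and used
C     to look up the cubed-sphere exchange parameters.
      bj = 1
      DO bi = myBxLo(myThid), myBxHi(myThid)
       tileNo = W2_myTileList(bi)
C      ... exchange parameters for tile tileNo are referenced here ...
      ENDDO
\end{verbatim}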

The amount of computation that can be embedded within a single loop
over {\em bi} and {\em bj} varies for different parts of the MITgcm
algorithm. Figure \ref{fig:bibj_extract} shows a code extract
illustrating such a loop.

The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be
computed sequentially if they are given to a single thread within a
single process.  Alternatively, if the code is invoked with multiple
threads per process the two domains in y may be computed concurrently.
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up six tiles, each with an interior of thirty-two by
thirty-two grid points, all owned by a single process. A decomposition
of this kind can be used for a cube sphere calculation, with each
$32 \times 32$ tile representing one face of the cube.

\subsection{Starting the code}
\label{sect:starting_a_code}
When code is started under the WRAPPER, execution begins in a main
routine that is owned by the WRAPPER. Control is transferred to the
application through a routine called {\em THE\_MODEL\_MAIN()} once the
WRAPPER has initialized correctly and has created the necessary
variables to support subsequent calls to communication routines by the
application code. The startup call sequence used by the WRAPPER is
shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT               :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL   :: Minimal startup. Just enough to
       |  |                       allow basic I/O.
       |  |-- EEINTRO_MSG      :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS      :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV    :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS        :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |       |
       |       |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                     :: communication data structures
       |
       |
       |--CHECK_THREADS    :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN   :: Numerical code top-level driver routine

\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
} \label{fig:wrapper_startup}
\end{figure}

File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
Parameter:  {\em nTx}\\
Parameter:  {\em nTy}

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for
this is that many system libraries are still not ``thread-safe''. This
means that, for example, on some systems it is not safe to call system
routines to perform I/O when running in multi-threaded mode (except,
perhaps, in a limited set of circumstances).  Another reason is that
support for multi-threaded programming models varies between systems.

Multi-process execution is more ubiquitous.  In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sect:specifying_a_decomposition}) is given (in which at least
one of the parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation, appropriate compile time
and run time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable, MPI libraries are available for controlling
the start-up of multiple processes. The MPI libraries are not
required, although they are usually used, for performance critical
communication. However, in order to simplify the task of controlling
and coordinating the start up of a large number (hundreds and possibly
even thousands) of copies of the same program, MPI is used. The calls
to the MPI multi-process startup routines must be activated at compile
time.  Currently MPI libraries are invoked by specifying the
appropriate options file with the {\tt -of} flag when running the {\em
genmake2} script, which generates the Makefile for compiling and
linking MITgcm.  (Previously this was done by setting the {\em
ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
CPP\_EEOPTIONS.h} file.)  More detailed information about the use of
{\em genmake2} for specifying local compiler flags is located in
section \ref{sect:genmake}.\\
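For example, a build using MPI might be configured with a command
sequence of the following form.  The options file and the relative
paths named here are purely illustrative; choose an options file
appropriate to your compiler and platform from the
{\em tools/build\_options} directory.
\begin{verbatim}
% ../../../tools/genmake2 -mods=../code -mpi \
      -of=../../../tools/build_options/linux_ia32_g77
% make depend
% make
\end{verbatim}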

\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\

\paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.  For the
open-source MPICH system, the MITgcm program can be started using a
command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of
processes that will be created. The numeric value {\em 64} must be
equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The parameter {\em mf} specifies
that a text file called ``mf'' will be read to get a list of processor
names on which the sixty-four processes will execute. The syntax of
this file is specified by the MPI distribution.
\\
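The machine file itself is typically just a list of host names, one
per line; most MPICH versions will cycle through this list when
placing the processes. For example (the host names here are
illustrative):
\begin{verbatim}
node001
node002
node003
node004
\end{verbatim}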

\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\

\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting of
a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with
the multi-threaded compiler on a machine will explain how to set the
required environment variables.

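For example, under a C shell, requesting four threads would typically
be done as follows (the variable name and syntax vary from system to
system):
\begin{verbatim}
% setenv PARALLEL 4
\end{verbatim}
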
\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate the
number of threads to be used in the x and y directions.  The variables
{\em nTx} and {\em nTy} in this file are used to specify the
information required. The product of {\em nTx} and {\em nTy} must be
equal to the number of threads spawned, i.e. the setting of the
environment variable PARALLEL.  The value of {\em nTx} must subdivide
the number of sub-domains in x ({\em nSx}) exactly. The value of
{\em nTy} must subdivide the number of sub-domains in y ({\em nSy})
exactly.
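A sketch of the corresponding portion of an {\em eedata} file, here
requesting one thread in x and two in y (so that the environment
variable PARALLEL would be set to two), is:
\begin{verbatim}
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}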

The multiprocess startup of the MITgcm executable {\em mitgcmuv} is
controlled by the routines {\em EEBOOT\_MINIMAL()} and {\em
INI\_PROCS()}. The first routine performs basic steps required to
make sure each process is started and has a textual output stream
associated with it. By default two output files are opened for each
process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.  The {\bf
NNNN} part of the name is filled in with the process number so that
process number 0 will create output files {\bf STDOUT.0000} and {\bf
STDERR.0000}, process number 1 will create output files {\bf
STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process by process basis.  The {\em EEBOOT\_MINIMAL()}
procedure also sets the variables {\em myProcId} and {\em
MPI\_COMM\_MODEL}.  These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()}
to allocate tiles to processes.
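As an illustration of this naming scheme, after a four process run the
working directory would typically contain:
\begin{verbatim}
% ls STDOUT.* STDERR.*
STDERR.0000  STDERR.0001  STDERR.0002  STDERR.0003
STDOUT.0000  STDOUT.0001  STDOUT.0002  STDOUT.0003
\end{verbatim}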

Allocation of processes to tiles is controlled by the routine {\em
INI\_PROCS()}. For each process this routine sets the variables {\em
myXGlobalLo} and {\em myYGlobalLo}.  These variables specify, in
index space, the coordinates of the southernmost and westernmost
corner of the southernmost and westernmost tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
also set in this routine. These are used to identify processes holding
tiles to the west, east, south and north of a given process. These
values are stored in global storage in the header file {\em
EESUPPORT.h} for use by communication routines.  The above does not
hold when the exch2 package is used.  The exch2 package sets its own
parameters to specify the global indices of tiles and their
relationships to each other.  See the documentation on the exch2
package (\ref{sec:exch2}) for details.
\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/src/eeboot\_minimal.F}\\
File: {\em eesupp/src/ini\_procs.F}\\
File: {\em eesupp/inc/EESUPPORT.h}
\end{minipage}
} \\

The WRAPPER maintains internal information that is used for
communication operations and that can be customized for different
platforms. The list below describes the information that is held and
used.

\begin{enumerate}
\item {\bf Tile-tile connectivity information}
  For each tile the WRAPPER records the tile number to
  the north, south, east and west of that tile. This number is unique
  over all tiles in a configuration. Except when using the cubed
  sphere and the exch2 package, the number is held in the variables
  {\em tileNo} (this holds the tile's own number), {\em tileNoN}, {\em
  tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
  stored with each tile that specifies the type of communication that
  is used between tiles.  This information is held in the variables
  {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
  {\em tileCommModeW}.  This latter set of variables can take one of
  the following values: {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
  COMM\_PUT} and {\em COMM\_GET}.  A value of {\em COMM\_NONE} is
  used to indicate that a tile has no neighbor to communicate with on
  a particular face. A value of {\em COMM\_MSG} is used to indicate
  that some form of distributed memory communication is required to
  communicate between these tile faces (see section
  \ref{sect:distributed_memory_communication}).  A value of {\em
  COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
  memory communication (see section
  \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value
  indicates that a CPU should communicate by writing to data
  structures owned by another CPU. A {\em COMM\_GET} value indicates
  that a CPU should communicate by reading from data structures owned
  by another CPU. These flags affect the behavior of the WRAPPER
  exchange primitive (see figure \ref{fig:communication_primitives}).
  The routine {\em ini\_communication\_patterns()} is responsible for
  setting the communication mode values for each tile.

  When using the cubed sphere configuration with the exch2 package,
  the relationships between tiles and their communication methods are
  set by the exch2 package and stored in different variables.  See the
  exch2 package documentation (\ref{sec:exch2}) for details.

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/src/ini\_communication\_patterns.F}\\
File: {\em eesupp/inc/EESUPPORT.h} \\
Parameter: {\em tileNo} \\
Parameter: {\em tileNoE} \\
Parameter: {\em tileNoW} \\
Parameter: {\em tileNoN} \\
Parameter: {\em tileNoS} \\
Parameter: {\em tileCommModeE} \\
Parameter: {\em tileCommModeW} \\
Parameter: {\em tileCommModeN} \\
Parameter: {\em tileCommModeS} \\
\end{minipage}
} \\
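
The way these variables steer an exchange can be pictured with the
following sketch.  This is illustrative pseudo-code, not the actual
WRAPPER source; the real exchange routines apply this pattern to each
of the four faces of every tile.
{\footnotesize
\begin{verbatim}
C     Dispatch on the communication mode of the western face of
C     tile (bi,bj).
      IF     ( tileCommModeW(bi,bj) .EQ. COMM_NONE ) THEN
C      No neighbor on this face: nothing to communicate.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_MSG  ) THEN
C      Distributed memory: exchange overlap data by message passing.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT  ) THEN
C      Shared memory: write overlap data into the neighbor's buffers.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_GET  ) THEN
C      Shared memory: read overlap data from the neighbor's buffers.
      ENDIF
\end{verbatim}
}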

\item {\bf MP directives}
  The WRAPPER transfers control to numerical application code through
  the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
  that allows for it to be invoked by several threads. Support for
  this is based on either multi-processing (MP) compiler directives or
  specific calls to multi-threading libraries (\textit{e.g.} POSIX
  threads).  Most commercially available Fortran compilers support the
  generation of code to spawn multiple threads through some form of
  compiler directives.  Compiler directives are generally more
  convenient than writing code to explicitly spawn threads, and on
  some systems compiler directives may be the only method available.
  The WRAPPER is distributed with template MP directives for a number
  of systems.

  These directives are inserted into the code just before and after
  the transfer of control to numerical algorithm code through the
  routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
  an example of the code that performs this process for a Silicon
  Graphics system.  This code is extracted from the files {\em main.F}
  and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
  specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
  will be created. The value of {\em nThreads} is set in the routine
  {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
  product of the parameters {\em nTx} and {\em nTy} that are read from
  the file {\em eedata}. If the value of {\em nThreads} is
  inconsistent with the number of threads requested from the operating
  system (for example by using an environment variable as described in
  section \ref{sect:multi_threaded_execution}) then usually an error
  will be reported by the routine {\em CHECK\_THREADS}.

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
Parameter: {\em nTx} \\
Parameter: {\em nTy} \\
\end{minipage}
}

\item {\bf memsync flags}
  As discussed in section \ref{sect:memory_consistency}, a low-level
  system function may be needed to force memory consistency on some
  shared memory systems.  The routine {\em MEMSYNC()} is used for this
  purpose. This routine should not need modifying and the information
  below is only provided for completeness. A logical parameter {\em
  exchNeedsMemSync} set in the routine {\em
  INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
  MEMSYNC()} primitive is called. In general this routine is only
  used for multi-threaded execution.  The code that goes into the {\em
  MEMSYNC()} routine is specific to the compiler and processor used.
  In some cases, it must be written using a short code snippet of
  assembly language.  For an Ultra Sparc system the following code
  snippet is used
\begin{verbatim}
asm("membar #LoadStore|#StoreStore");
\end{verbatim}
  while on an x86 system a snippet such as the following is used
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory")
\end{verbatim}

\item {\bf Cache line size}
  As discussed in section \ref{sect:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with
  excessive coherence traffic on an SMP system. To do this the shared
  memory data structures used by the {\em GLOBAL\_SUM}, {\em
  GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
  that control the padding are set in the header file {\em
  EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
  lShare1}, {\em lShare4} and {\em lShare8}. The default values
  should not normally need changing.

\item {\bf \_BARRIER}
  This is a CPP macro that is expanded to a call to a routine which
  synchronizes all the logical processors running under the WRAPPER.
  Using a macro here preserves flexibility to insert a specialized
  call in-line into application code. By default this resolves to
  calling the procedure {\em BARRIER()}. The default setting for the
  \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.

\item {\bf \_GSUM}
  This is a CPP macro that is expanded to a call to a routine which
  sums up a floating point number over all the logical processors
  running under the WRAPPER. Using a macro here provides extra
  flexibility to insert a specialized call in-line into application
  code. By default this resolves to calling the procedure {\em
  GLOBAL\_SUM\_R8()} (for 64-bit floating point operands) or {\em
  GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
  default setting for the \_GSUM macro is given in the file {\em
  CPP\_EEMACROS.h}.  The \_GSUM macro is a performance critical
  operation, especially for large processor count, small tile size
  configurations.  The custom communication example discussed in
  section \ref{sect:jam_example} shows how the macro is used to invoke
  a custom global sum routine for a specific set of hardware.

\item {\bf \_EXCH}
  The \_EXCH CPP macro is used to update tile overlap regions.  It is
  qualified by a suffix indicating whether overlap updates are for
  two-dimensional (\_EXCH\_XY) or three dimensional (\_EXCH\_XYZ)
  physical fields and whether fields are 32-bit floating point
  (\_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4) or 64-bit floating point
  (\_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8). The macro mappings are defined in
  the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
  operation plays a crucial role in scaling to small tile, large
  logical and physical processor count configurations.  The example in
  section \ref{sect:jam_example} discusses defining an optimized and
  specialized form of the \_EXCH operation.

  The \_EXCH operation is also central to supporting grids such as the
  cube-sphere grid. In this class of grid a rotation may be required
  between tiles. Aligning the coordinate requiring rotation with the
  tile decomposition allows the coordinate transformation to be
  embedded within a custom form of the \_EXCH primitive.  In these
  cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
  package documentation (\ref{sec:exch2}).

\item {\bf Reverse Mode}
  The communication primitives \_EXCH and \_GSUM both employ
  hand-written adjoint (or reverse mode) forms.  These reverse
  mode forms can be found in the source code directory {\em
  pkg/autodiff}.  For the global sum primitive the reverse mode form
  calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
  The reverse mode forms of the exchange primitives are found in
  routines prefixed {\em ADEXCH}. The exchange routines make calls to
  the same low-level communication primitives as the forward mode
  operations. However, the routine argument {\em simulationMode} is
  set to the value {\em REVERSE\_SIMULATION}. This signifies to the
  low-level routines that the adjoint forms of the appropriate
  communication operation should be performed.

\item {\bf MAX\_NO\_THREADS}
  The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
  number of OS threads that a code will use. This value defaults to
  thirty-two and is set in the file {\em EEPARAMS.h}.  For single
  threaded execution it can be reduced to one if required.  The value
  is largely private to the WRAPPER and application code will not
  normally reference it, except in the following scenario.

  For certain physical parametrization schemes it is necessary to have
  a substantial number of work arrays. Where these arrays are
  allocated in heap storage (for example COMMON blocks) multi-threaded
  execution will require multiple instances of the COMMON block data.
  This can be achieved using a Fortran 90 module construct.  However,
  if this mechanism is unavailable then the work arrays can be extended
  with dimensions using the tile dimensioning scheme of {\em nSx} and
  {\em nSy} (as described in section
  \ref{sect:specifying_a_decomposition}). However, if the
  configuration being specified involves many more tiles than OS
  threads then it can save memory resources to reduce the variable
  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
  that will be used and to declare the physical parameterization work
  arrays with a single {\em MAX\_NO\_THREADS} extra dimension.  An
  example of this is given in the verification experiment {\em
  aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
  altered to
\begin{verbatim}
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS =    6 )
\end{verbatim}
  and several work arrays for storing intermediate calculations are
  created with declarations of the form
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
  This declaration scheme is not used widely, because most global data
  is used for permanent rather than temporary storage of state
  information.  In the case of permanent state information this
  approach cannot be used because there has to be enough storage
  allocated for all tiles.  However, the technique can sometimes be a
  useful scheme for reducing memory requirements in complex physical
  parameterizations.
\end{enumerate}

\begin{figure}
\begin{verbatim}
C--     Invoke nThreads instances of the numerical model
        ...

      ENDDO
\end{verbatim}
\caption{Prior to transferring control to the procedure {\em
THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
multiple threads.} \label{fig:mp_directives}
\end{figure}

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms a big performance boost can be obtained by binding
the communication routines {\em \_EXCH} and {\em \_GSUM} to
specialized native libraries (for example, the shmem library on CRAY
T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
illustration of a specialized communication configuration that
substitutes for standard, portable forms of {\em \_EXCH} and {\em
\_GSUM}. It affects three source files: {\em eeboot.F}, {\em
CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined it has
the following effects.
\begin{itemize}
\item An extra phase is included at boot time to initialize the custom
  communications library (see {\em ini\_jam.F}).
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
  with calls to custom routines (see {\em gsum\_jam.F} and {\em
  exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
  for overlap regions of width one) is substituted into the elliptic
  solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern.

\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from a
series of template files, for example {\em exch\_rx.template}.  This
is done to allow a large number of variations on the exchange process
to be maintained. One set of variations supports the cube sphere grid.
Support for a cube sphere grid in MITgcm is based on having each face
of the cube as a separate tile or tiles.  The exchange routines are
then able to absorb much of the detailed rotation and reorientation
required when moving around the cube grid. The set of {\em \_EXCH}
routines that contain the word cube in their name perform these
transformations.  They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for grid-centered
and for grid-face and grid-corner quantities.  Three sets of exchange
routines are defined. Routines with names of the form {\em exch\_rx}
are used to exchange cell centered scalar quantities. Routines with
names of the form {\em exch\_uv\_rx} are used to exchange vector
quantities located at the C-grid velocity points. The vector
quantities exchanged by the {\em exch\_uv\_rx} routines can either be
signed (for example velocity components) or un-signed (for example
grid-cell separations).  Routines with names of the form {\em
exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
point locations.
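
As a usage sketch, a kernel that has just updated a two-dimensional,
cell-centered, 64-bit field would refresh that field's overlap regions
through the generic exchange macro; when {\em useCubedSphereExchange}
is true the macro resolves to the cube-aware routines described above.
The field name {\em phi} is illustrative.
\begin{verbatim}
C     Update overlap regions after computing the interior of phi.
      _EXCH_XY_R8( phi, myThid )
\end{verbatim}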

\section{MITgcm execution under WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:mitgcm_wrapper: -->
\end{rawhtml}

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the
calling sequence shown below.

WRAPPER layer.

{\footnotesize
\begin{verbatim}

       MAIN
        ...
       |--THE_MODEL_MAIN   :: Numerical code top-level driver routine

\end{verbatim}
}

1480  Core equations plus packages.  Core equations plus packages.
1481    
1482    {\footnotesize
1483  \begin{verbatim}  \begin{verbatim}
1484  C  C
 C  
1485  C Invocation from WRAPPER level...  C Invocation from WRAPPER level...
1486  C  :  C  :
1487  C  :  C  :
# Line 1510  C    | | |-CTRL_INIT           :: Contro Line 1545  C    | | |-CTRL_INIT           :: Contro
1545  C    | | |-OPTIM_READPARMS     :: Optimisation support package. see pkg/ctrl  C    | | |-OPTIM_READPARMS     :: Optimisation support package. see pkg/ctrl
1546  C    | | |-GRDCHK_READPARMS    :: Gradient check package. see pkg/grdchk  C    | | |-GRDCHK_READPARMS    :: Gradient check package. see pkg/grdchk
1547  C    | | |-ECCO_READPARMS      :: ECCO Support Package. see pkg/ecco  C    | | |-ECCO_READPARMS      :: ECCO Support Package. see pkg/ecco
1548    C    | | |-PTRACERS_READPARMS  :: multiple tracer package, see pkg/ptracers
1549    C    | | |-GCHEM_READPARMS     :: tracer interface package, see pkg/gchem
1550  C    | |  C    | |
1551  C    | |-PACKAGES_CHECK  C    | |-PACKAGES_CHECK
1552  C    | | |  C    | | |
1553  C    | | |-KPP_CHECK           :: KPP Package. pkg/kpp  C    | | |-KPP_CHECK           :: KPP Package. pkg/kpp
1554  C    | | |-OBCS_CHECK          :: Open bndy Package. pkg/obcs  C    | | |-OBCS_CHECK          :: Open bndy Pacakge. pkg/obcs
1555  C    | | |-GMREDI_CHECK        :: GM Package. pkg/gmredi  C    | | |-GMREDI_CHECK        :: GM Package. pkg/gmredi
1556  C    | |  C    | |
1557  C    | |-PACKAGES_INIT_FIXED  C    | |-PACKAGES_INIT_FIXED
1558  C    | | |-OBCS_INIT_FIXED     :: Open bndy Package. see pkg/obcs  C    | | |-OBCS_INIT_FIXED     :: Open bndy Package. see pkg/obcs
1559  C    | | |-FLT_INIT            :: Floats Package. see pkg/flt  C    | | |-FLT_INIT            :: Floats Package. see pkg/flt
1560    C    | | |-GCHEM_INIT_FIXED    :: tracer interface pachage, see pkg/gchem
1561  C    | |  C    | |
1562  C    | |-ZONAL_FILT_INIT       :: FFT filter Package. see pkg/zonal_filt  C    | |-ZONAL_FILT_INIT       :: FFT filter Package. see pkg/zonal_filt
1563  C    | |  C    | |
1564  C    | |-INI_CG2D              :: 2d con. grad solver initialisation.  C    | |-INI_CG2D              :: 2d con. grad solver initialization.
1565  C    | |  C    | |
1566  C    | |-INI_CG3D              :: 3d con. grad solver initialisation.  C    | |-INI_CG3D              :: 3d con. grad solver initialization.
1567  C    | |  C    | |
1568  C    | |-CONFIG_SUMMARY        :: Provide synopsis of kernel setup.  C    | |-CONFIG_SUMMARY        :: Provide synopsis of kernel setup.
1569  C    |                         :: Includes annotated table of kernel  C    |                         :: Includes annotated table of kernel
# Line 1550  C    | | | Line 1588  C    | | |
1588  C    | | |-INI_CORI     :: Set coriolis term. zero, f-plane, beta-plane,  C    | | |-INI_CORI     :: Set coriolis term. zero, f-plane, beta-plane,
1589  C    | | |              :: sphere options are coded.  C    | | |              :: sphere options are coded.
1590  C    | | |  C    | | |
1591  C    | | |-INI_CG2D     :: 2d con. grad solver initialisation.  C    | | |-INI_CG2D     :: 2d con. grad solver initialization.
1592  C    | | |-INI_CG3D     :: 3d con. grad solver initialisation.  C    | | |-INI_CG3D     :: 3d con. grad solver initialization.
1593  C    | | |-INI_MIXING   :: Initialise diapycnal diffusivity.  C    | | |-INI_MIXING   :: Initialize diapycnal diffusivity.
1594  C    | | |-INI_DYNVARS  :: Initialise to zero all DYNVARS.h arrays (dynamical  C    | | |-INI_DYNVARS  :: Initialize to zero all DYNVARS.h arrays (dynamical
1595  C    | | |              :: fields).  C    | | |              :: fields).
1596  C    | | |  C    | | |
1597  C    | | |-INI_FIELDS   :: Control initializing model fields to non-zero  C    | | |-INI_FIELDS   :: Control initializing model fields to non-zero
# Line 1561  C    | | | |-INI_VEL    :: Initialize 3D Line 1599  C    | | | |-INI_VEL    :: Initialize 3D
1599  C    | | | |-INI_THETA  :: Set model initial temperature field.  C    | | | |-INI_THETA  :: Set model initial temperature field.
1600  C    | | | |-INI_SALT   :: Set model initial salinity field.  C    | | | |-INI_SALT   :: Set model initial salinity field.
1601  C    | | | |-INI_PSURF  :: Set model initial free-surface height/pressure.  C    | | | |-INI_PSURF  :: Set model initial free-surface height/pressure.
1602  C    | | |  C    | | | |-INI_PRESSURE :: Compute model initial hydrostatic pressure
1603  C    | | |-INI_TR1      :: Set initial tracer 1 distribution.  C    | | | |-READ_CHECKPOINT :: Read the checkpoint
1604  C    | | |  C    | | |
1605  C    | | |-THE_CORRECTION_STEP :: Step forward to next time step.  C    | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1606  C    | | | |                   :: Here applied to move restart conditions  C    | | | |                   :: Here applied to move restart conditions
# Line 1589  C    | | | |-FIND_RHO  :: Find adjacent Line 1627  C    | | | |-FIND_RHO  :: Find adjacent
1627  C    | | | |-CONVECT   :: Mix static instability.  C    | | | |-CONVECT   :: Mix static instability.
1628  C    | | | |-TIMEAVE_CUMULATE :: Update convection statistics.  C    | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1629  C    | | |  C    | | |
1630  C    | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving  C    | | |-PACKAGES_INIT_VARIABLES :: Does initialization of time evolving
1631  C    | | | |                       :: package data.  C    | | | |                       :: package data.
1632  C    | | | |  C    | | | |
1633  C    | | | |-GMREDI_INIT          :: GM package. ( see pkg/gmredi )  C    | | | |-GMREDI_INIT          :: GM package. ( see pkg/gmredi )
1634  C    | | | |-KPP_INIT             :: KPP package. ( see pkg/kpp )  C    | | | |-KPP_INIT             :: KPP package. ( see pkg/kpp )
1635  C    | | | |-KPP_OPEN_DIAGS      C    | | | |-KPP_OPEN_DIAGS    
1636  C    | | | |-OBCS_INIT_VARIABLES  :: Open bndy. package. ( see pkg/obcs )  C    | | | |-OBCS_INIT_VARIABLES  :: Open bndy. package. ( see pkg/obcs )
1637    C    | | | |-PTRACERS_INIT        :: multi. tracer package,(see pkg/ptracers)
1638    C    | | | |-GCHEM_INIT           :: tracer interface pkg (see pkh/gchem)
1639  C    | | | |-AIM_INIT             :: Interm. atmos package. ( see pkg/aim )  C    | | | |-AIM_INIT             :: Interm. atmos package. ( see pkg/aim )
1640  C    | | | |-CTRL_MAP_INI         :: Control vector package.( see pkg/ctrl )  C    | | | |-CTRL_MAP_INI         :: Control vector package.( see pkg/ctrl )
1641  C    | | | |-COST_INIT            :: Cost function package. ( see pkg/cost )  C    | | | |-COST_INIT            :: Cost function package. ( see pkg/cost )
# Line 1638  C/\  | | | |                    :: Simpl Line 1678  C/\  | | | |                    :: Simpl
1678  C/\  | | | |                    :: for forcing datasets.  C/\  | | | |                    :: for forcing datasets.
1679  C/\  | | | |                    C/\  | | | |                  
1680  C/\  | | | |-EXCH :: Sync forcing. in overlap regions.  C/\  | | | |-EXCH :: Sync forcing. in overlap regions.
1681    C/\  | | |-SEAICE_MODEL   :: Compute sea-ice terms. ( pkg/seaice )
1682    C/\  | | |-FREEZE         :: Limit surface temperature.
1683    C/\  | | |-GCHEM_FIELD_LOAD :: load tracer forcing fields (pkg/gchem)
1684  C/\  | | |  C/\  | | |
1685  C/\  | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.  C/\  | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1686  C/\  | | | |  C/\  | | | |
1687  C/\  | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.  C/\  | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1688  C/\  | | | |-OBCS_APPLY_W    :: Open bndy. package ( see pkg/obcs ).  C/\  | | | |-OBCS_APPLY_W    :: Open bndy. package ( see pkg/obcs ).
1689  C/\  | | | |-FIND_RHO        :: Calculates [rho(S,T,z)-Rhonil] of a slice  C/\  | | | |-FIND_RHO        :: Calculates [rho(S,T,z)-RhoConst] of a slice
1690  C/\  | | | |-GRAD_SIGMA      :: Calculate isoneutral gradients  C/\  | | | |-GRAD_SIGMA      :: Calculate isoneutral gradients
1691  C/\  | | | |-CALC_IVDC       :: Set Implicit Vertical Diffusivity for Convection  C/\  | | | |-CALC_IVDC       :: Set Implicit Vertical Diffusivity for Convection
1692  C/\  | | | |  C/\  | | | |
1693  C/\  | | | |-OBCS_CALC            :: Open bndy. package ( see pkg/obcs ).  C/\  | | | |-OBCS_CALC            :: Open bndy. package ( see pkg/obcs ).
1694  C/\  | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned  C/\  | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1695  C/\  | | | |                      :: forcing terms.  C/\  | | | | |                    :: forcing terms.
1696    C/\  | | | | |-PTRACERS_FORCING_SURF :: Tracer package ( see pkg/ptracers ).
1697  C/\  | | | |  C/\  | | | |
C/\  | | | |-GMREDI_CALC_TENSOR   :: GM package ( see pkg/gmredi ).
C/\  | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
C/\  | | | |
C/\  | | | |-CALC_GT              :: Calculate the temperature tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_T  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
C/\  | | | |
C/\  | | | |-CALC_GS              :: Calculate the salinity tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_S  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | |
C/\  | | | |-TIMESTEP_TRACER      :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-PTRACERS_INTEGRATE   :: Integrate other tracer(s) (see pkg/ptracers).
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_PTR:: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-PTRACERS_FORCING   :: Problem specific forcing for tracer.
C/\  | | | | |-GCHEM_FORCING_INT  :: tracer forcing for gchem pkg (if all
C/\  | | | | |                       tendency terms calculated together)
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | | |-TIMESTEP_TRACER    :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package (see pkg/obcs ).
C/\  | | | |
C/\  | | | |-IMPLDIFF             :: Solve vertical implicit diffusion equation.
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package (see pkg/obcs ).
C/\  | | |
C/\  | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
C/\  | | | |-EXCH
C/\  | | |
C/\  | | |-GCHEM_FORCING_SEP :: tracer forcing for gchem pkg (if
C/\  | | |                      tracer dependent tendencies calculated
C/\  | | |                      separately)
C/\  | | |
C/\  | | |-FLT_MAIN         :: Float package ( pkg/flt ).
C/\  | | |
C/\  | | |-MONITOR          :: Monitor package ( pkg/monitor ).
C/\  | | |
C/\  | | | |-TIMEAVE_STATV_WRITE :: Time averaging package ( see pkg/timeave ).
C/\  | | | |-AIM_WRITE_DIAGS     :: Intermed. atmos diags. see pkg/aim
C/\  | | | |-GMREDI_DIAGS        :: GM diags. see pkg/gmredi
C/\  | | | |-KPP_DO_DIAGS        :: KPP diags. see pkg/kpp
C/\  | | | |-SBO_CALC            :: SBO diags. see pkg/sbo
C/\  | | | |-SBO_DIAGS           :: SBO diags. see pkg/sbo
C/\  | | | |-SEAICE_DO_DIAGS     :: SEAICE diags. see pkg/seaice
C/\  | | | |-GCHEM_DIAGS         :: gchem diags. see pkg/gchem
C/\  | | |
C/\  | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
C/\  | |
C    |
C    |-TIMER_PRINTALL :: Computational timing summary
C    |
C    |-COMM_STATS     :: Summarise inter-proc and inter-thread communication
C                     :: events.
C
\end{verbatim}
}
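
The forcing entry points that appear in the tree above ({\tt
EXTERNAL\_FORCING\_T}, {\tt EXTERNAL\_FORCING\_S} and {\tt
PTRACERS\_FORCING}) are where problem specific sources and sinks are
added to the tendency fields before the Adams-Bashforth
extrapolation. The fragment below is a minimal sketch of such a
routine and not the actual MITgcm source: the routine name, argument
list and array names are illustrative assumptions, whereas the real
routines operate on globally declared, tiled arrays selected by the
{\tt bi,bj} tile indices.
{\footnotesize
\begin{verbatim}
C     Schematic forcing routine in the style of EXTERNAL_FORCING_T.
C     Assumed, illustrative interface: gT is the temperature tendency
C     for one tile and level, theta the current temperature, tStar a
C     prescribed relaxation target, tauRelax a relaxation timescale.
      SUBROUTINE MY_FORCING_T( gT, theta, tStar, tauRelax,
     &                         nx, ny, iMin, iMax, jMin, jMax )
      IMPLICIT NONE
      INTEGER nx, ny, iMin, iMax, jMin, jMax
      REAL*8  gT(nx,ny), theta(nx,ny), tStar(nx,ny), tauRelax
      INTEGER i, j
C     Add a Newtonian relaxation term to the tendency over the
C     computational interior of the tile (iMin:iMax,jMin:jMax),
C     leaving the overlap regions to be filled by exchanges.
      DO j = jMin, jMax
       DO i = iMin, iMax
        gT(i,j) = gT(i,j) - ( theta(i,j) - tStar(i,j) )/tauRelax
       ENDDO
      ENDDO
      RETURN
      END
\end{verbatim}
}
Because the loop touches only the computational interior of a single
tile, a routine written this way remains correct under both the
multi-threaded and the multi-process WRAPPER decompositions.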
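
The {\tt ADAMS\_BASHFORTH2} step that follows each forcing call
extrapolates the newly computed tendency forward to the half time
level using the quasi-second order Adams-Bashforth formula
\begin{displaymath}
G^{(n+1/2)} = \left( \frac{3}{2} + \epsilon_{AB} \right) G^{(n)}
            - \left( \frac{1}{2} + \epsilon_{AB} \right) G^{(n-1)}
\end{displaymath}
where the small parameter $\epsilon_{AB}$ stabilizes the scheme.
{\tt TIMESTEP\_TRACER} then advances the tracer field itself,
$\tau^{(n+1)} = \tau^{(n)} + \Delta t \, G^{(n+1/2)}$.
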
\subsection{Measuring and Characterizing Performance}
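
The timing summary printed by {\tt TIMER\_PRINTALL} at the end of the
call tree above is assembled from paired timer calls placed around
sections of code. The fragment below is a minimal sketch of
instrumenting one such section; it assumes the {\tt
TIMER\_START}/{\tt TIMER\_STOP} entry points provided by the
execution environment support code, with a free-format label string
used to aggregate the measurements.
{\footnotesize
\begin{verbatim}
C     Bracket a code section with matching timer calls. TIMER_PRINTALL
C     later reports the time accumulated under this label, per thread.
      CALL TIMER_START('MY_SECTION [FORWARD_STEP]', myThid )
C     ... code section to be timed ...
      CALL TIMER_STOP ('MY_SECTION [FORWARD_STEP]', myThid )
\end{verbatim}
}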
