% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment
within which both the core numerics and the pluggable packages
operate. The description presented here is intended to be a detailed
exposition and contains significant background material, as well as
advanced details on working with the WRAPPER.  The tutorial sections
of this manual (see sections \ref{sec:modelExamples} and
\ref{sec:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
both sequentially and in parallel. For many projects simply starting
from an example code and adapting it to suit a particular situation
will be all that is required.  The first part of this chapter
discusses the MITgcm architecture at an abstract level. In the second
part of the chapter we describe practical details of the MITgcm
implementation and of current tools and operating system features that
are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold

\begin{itemize}
\item We wish to be able to study a very broad range of interesting
  and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to a wide range of
  platforms.
\item On any given platform we would like to be able to achieve
  performance comparable to an implementation developed and
  specialized specifically for that platform.
\end{itemize}

These points are summarized in figure
\ref{fig:mitgcm_architecture_goals} which conveys the goals of the
MITgcm design. The goals lead to a software architecture which, at a
high level, can be viewed as consisting of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in
  detail in section \ref{chap:discretization}.

\item A scheme for supporting optional ``pluggable'' {\bf packages}
  (containing for example mixed-layer schemes, biogeochemical schemes,
  atmospheric physics).  These packages are used both to overlay
  alternate dynamics and to introduce specialized physical content
  onto the core numerical code. An overview of the {\bf package}
  scheme is given at the start of part \ref{chap:packagesI}.

\item A support framework called {\bf WRAPPER} (Wrappable Application
  Parallel Programming Environment Resource), within which the core
  numerics and pluggable packages operate.
\end{enumerate}
\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{s_software/figs/mitgcm_goals.eps}}
\end{center}
\caption{The MITgcm architecture is designed to allow simulation of a
  wide range of physical problems on a wide range of hardware. The
  computational resource requirements of the applications targeted
  range from around $10^7$ bytes ($\approx 10$ megabytes) of memory to
  $10^{11}$ bytes ($\approx 100$ gigabytes). Arithmetic operation
  counts for the applications of interest range from $10^{9}$ floating
  point operations to more than $10^{17}$ floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in MITgcm
is a software superstructure and substructure collectively called the
WRAPPER (Wrappable Application Parallel Programming Environment
Resource). All numerical and support code in MITgcm is written to
``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
within the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sec:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to
insulate code that fits within it from architectural differences
between hardware platforms and operating systems. This allows
numerical code to be easily retargeted.

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{s_software/figs/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sec:target_hardware}

The WRAPPER is designed to target as broad as possible a range of
computer systems.  The original development of the WRAPPER took place
on a multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system's proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multi-processor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs.  Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems.  The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and
IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \cite{hoe-hill:99}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sec:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
large MPP systems). This is one example of a difference between
compute platforms that can impact an application. Another common
distinction is between vector processing systems with highly
specialized CPUs and memory subsystems and commodity microprocessor
based systems. There are numerous other differences, especially in
relation to how parallel execution is supported. To capture the
essential differences between different platforms the WRAPPER uses a
{\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
can easily be specialized to fit, in a computationally efficient
manner, any computer architecture currently available to the
scientific computing community.

\subsection{Machine model parallelism}
\label{sec:domain_decomposition}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently.  Computational work is divided among the logical
processors by allocating ``ownership'' to each processor of a certain
set (or sets) of calculations. Each set of calculations owned by a
particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors.  It is perfectly
possible to execute a configuration decomposed for multiple logical
processors on a single physical processor.  This helps ensure that
numerical code that is written to fit within the WRAPPER will
parallelize with no additional effort.  It is also useful for
debugging purposes.  Generally, however, the computational domain will
be subdivided over multiple logical processors in order to then bind
those logical processors to physical processor resources that can
compute in parallel.

\subsubsection{Tiles}

Computationally, the data structures (\textit{e.g.}, arrays, scalar
variables, etc.) that hold the simulated state are associated with
each region of physical space and are allocated to a particular
logical processor.  We refer to these data structures as being {\bf
owned} by the processor to which their associated region of physical
space has been allocated.  Individual regions that are allocated to
processors are called {\bf tiles}.  A processor can own more than one
tile.  Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
single region of the domain (a single tile).  Except for periods of
communication and coordination, each processor computes autonomously,
working only with data from the tile (or tiles) that the processor
owns.  When multiple tiles are allotted to a single processor, each
tile is computed independently of the other tiles, in a sequential
fashion.

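As a concrete illustration, the sketch below shows how a per-tile
model field might be declared. The parameter names follow the
conventions of the MITgcm {\em SIZE.h} header ({\em sNx}, {\em sNy}
for the tile interior extents, {\em OLx}, {\em OLy} for the overlap
widths, {\em nSx}, {\em nSy} for the number of tiles per process, and
{\em Nr} for the number of vertical levels); the field name {\em phi}
is purely illustrative.

\begin{verbatim}
C     Illustrative per-tile field declaration. Each tile carries an
C     sNx x sNy interior plus OLx- and OLy-wide overlap regions, and
C     a single process holds nSx x nSy such tiles.
      _RL phi( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, Nr, nSx, nSy )
\end{verbatim}
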
\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/domain_decomp.eps}
 }
\end{center}
\caption{The WRAPPER provides support for one and two dimensional
  decompositions of grid-point domains. The figure shows a
  hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
  hypothetical domain is decomposed in two dimensions along the
  $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
  owned} by different processors. The {\bf owning} processors
  perform the arithmetic operations associated with a {\bf tile}.
  Although not illustrated here, a single processor can {\bf own}
  several {\bf tiles}.  Whenever a processor wishes to transfer data
  between tiles or communicate with other processors it calls a
  WRAPPER supplied function.}
\label{fig:domaindecomp}
\end{figure}

\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region.  The
overlap region of a tile corresponds to the interior region of an
adjacent tile.  In figure \ref{fig:tiledworld} each tile would own the
region within the black square and hold duplicate information for
overlap regions extending into the tiles to the north, south, east and
west.  During computational phases a processor will reference data in
an overlap region whenever it requires values that lie outside the
domain it owns.  Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sec:communication_primitives}).  The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
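
The shape of this compute/communicate cycle is sketched below. The
exchange routine name is illustrative of the WRAPPER-supplied
functions discussed in section \ref{sec:communication_primitives};
{\em phi}, {\em gPhi} and {\em deltaT} are hypothetical names.

\begin{verbatim}
C     Sketch of a compute/communicate cycle on one tile.
C     1. Compute on the interior points that this tile owns.
      DO j=1,sNy
       DO i=1,sNx
        phi(i,j) = phi(i,j) + deltaT*gPhi(i,j)
       ENDDO
      ENDDO
C     2. Ask the WRAPPER to refresh the overlap regions with
C        values from the neighboring tiles.
      CALL EXCH_XY_RL( phi, myThid )
\end{verbatim}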

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiled-world.eps}
 }
\end{center}
\caption{A global grid subdivided into tiles. Tiles contain an
  interior region and an overlap region. Overlap regions are
  periodically updated from neighboring tiles.}
\label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two
possible mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}.  Under this mode of
  communication data transfers are assumed to be possible using direct
  addressing of regions of memory.  In this case a CPU is able to read
  (and write) directly to regions of memory ``owned'' by another CPU
  using simple programming language level assignment operations of
  the sort shown in figure \ref{fig:simple_assign}.  In this way one
  CPU (CPU1 in the figure) can communicate information to another CPU
  (CPU2 in the figure) by assigning a particular value to a particular
  memory location.

\item {\bf Distributed memory communication}.  Under this mode of
  communication there is no mechanism, at the application code level,
  for directly addressing regions of memory owned by and visible to
  another CPU. Instead a communication library must be used as
  illustrated in figure \ref{fig:comm_msg}. In this case CPUs must
  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another CPU
  owns. By default the WRAPPER binds to the MPI communication library
  \cite{MPI-std-20} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication.  The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

             CPU1                    |        CPU2
             ====                    |        ====
                                     |
            a(3) = 8                 |        WHILE ( a(3) .NE. 8 )
                                     |         WAIT
                                     |        END WHILE
                                     |
\end{verbatim}
\caption{In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level,
so that, for example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

             CPU1                    |        CPU2
             ====                    |        ====
                                     |
            a(3) = 8                 |
            CALL SEND( CPU2, a(3) )  |        CALL RECV( CPU1, a(3) )
                                     |
\end{verbatim}
\caption{In the WRAPPER distributed memory communication model
data can not be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and/or CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.}
\label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sec:shared_memory_communication}

Under shared memory communication independent CPUs are operating on
the exact same global address space at the application level.  This
means that CPU 1 can directly write into global data structures that
CPU 2 ``owns'' using a simple assignment at the application level.
This is the model of memory access that is supported at the basic
system design level in ``shared-memory'' systems such as PVP systems,
SMP systems, and on distributed shared memory systems (\textit{e.g.},
SGI Origin, SGI Altix, and some AMD Opteron systems).  On such systems
the WRAPPER will generally use simple read and write statements to
access application data structures directly when communicating between
CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication.  In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules.  The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect.  This can make
this method of communication very efficient provided it is used
appropriately.

\subsubsection{Memory consistency}
\label{sec:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors.  In particular, one issue the WRAPPER layer must
deal with is a system's memory model.  In general the order of reads
and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor
performing the application.  The processor performing the application
instructions will always operate so that, for the application
instructions the processor is executing, any reordering is not
apparent.  However, in general machines are often designed so that
reordering of instructions is not hidden from other processors.  This
means that, in general, even on a shared memory system two processors
can observe inconsistent memory values.

The issue of memory consistency between multiple processors is
discussed at length in many computer science papers.  From a practical
point of view, in order to deal with this issue, shared memory
machines all provide some mechanism to enforce memory consistency when
it is needed.  The exact mechanism employed will vary between systems.
For communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.
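
For example, a writer of the sort shown in figure
\ref{fig:simple_assign} would, within the WRAPPER, bracket its update
with a call to such a mechanism. In the sketch below {\em MEMSYNC} is
a hypothetical stand-in for whatever platform-specific operation (a
memory fence or cache flush, for instance) the WRAPPER invokes.

\begin{verbatim}
C     Hypothetical sketch: publishing a value through shared memory.
C     MEMSYNC stands in for the platform-specific memory-consistency
C     mechanism invoked by the WRAPPER.
      a(3) = 8
      CALL MEMSYNC
C     Only after the consistency call is a(3) guaranteed to be
C     observed as 8 by other CPUs that honor the same protocol.
\end{verbatim}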

\subsubsection{Cache effects and false sharing}
\label{sec:cache_effects_and_false_sharing}

Shared-memory machines often have local to processor memory caches
which contain mirrored copies of main memory.  Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors.  These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks).  The coherency protocols employed can be expensive relative to other
memory accesses and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.
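
The padding idea can be sketched as follows: each thread's
synchronization word is given its own coherence-granularity chunk, so
that updates by one thread do not invalidate cache lines used by
others. The names and the assumed 128-byte granularity below are
illustrative, not the WRAPPER's actual declarations.

\begin{verbatim}
C     Illustrative padding to avoid false sharing. With a 128-byte
C     coherence granularity and 4-byte INTEGERs, 32 integers span one
C     chunk, so each thread's counter sits in its own chunk.
      INTEGER chunkInts, MAX_NO_THREADS
      PARAMETER ( chunkInts = 32, MAX_NO_THREADS = 64 )
      INTEGER syncCount( chunkInts, MAX_NO_THREADS )
      COMMON / SYNC_VARS / syncCount
C     Threads touch only syncCount(1,myThid); elements 2..chunkInts
C     are never referenced and exist purely as padding.
\end{verbatim}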

\subsubsection{Operating system support for shared memory}

Applications running under multiple threads within a single process
can use shared memory communication.  In this case {\it all} the
memory locations in an application are potentially visible to all the
compute threads. Multiple threads operating within a single process is
the standard mechanism for supporting shared memory that the WRAPPER
utilizes. Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sec:multi_threaded_execution}.  However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP and IPC facilities in UNIX systems provide this capability, as do
vendor specific tools like LAPI and IMC.  Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sec:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory for
communication. For example, cluster systems consist of individual
computers connected by a fast network. On such systems there is no
notion of shared memory at the system level. For this sort of system
the WRAPPER provides support for communication based on a bespoke
communication library (see figure \ref{fig:comm_msg}).  The default
communication library used is MPI \cite{MPI-std-20}. However, it is
relatively straightforward to implement bindings to optimized platform
specific communication libraries. For example the work described in
\cite{hoe-hill:99} substituted standard MPI communication for a highly
optimized library.
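
To make the distributed memory mode concrete, the sketch below updates
one edge of a tile's overlap region with a single MPI call. This is
purely illustrative of what happens beneath the WRAPPER exchange
primitives; the buffer and neighbor names are hypothetical, and
application code never issues such calls directly.

\begin{verbatim}
C     Illustrative overlap-region update using MPI.
      INCLUDE 'mpif.h'
      INTEGER mpiRC, mpiStatus(MPI_STATUS_SIZE)
      REAL*8  eastEdge(sNy), westHalo(sNy)
C     Send my east interior edge to the east neighbor while
C     receiving my west overlap region from the west neighbor.
      CALL MPI_SENDRECV(
     &     eastEdge, sNy, MPI_DOUBLE_PRECISION, eastProc, 1,
     &     westHalo, sNy, MPI_DOUBLE_PRECISION, westProc, 1,
     &     MPI_COMM_WORLD, mpiStatus, mpiRC )
\end{verbatim}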

\subsection{Communication primitives}
\label{sec:communication_primitives}

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/comm-primm.eps}
 }
\end{center}
\caption{Three performance critical parallel primitives are provided
  by the WRAPPER. These primitives are always used to communicate data
  between tiles. The figure shows four tiles. The curved arrows
  indicate exchange primitives which transfer data between the overlap
  regions at tile edges and interior regions for nearest-neighbor
  tiles.  The straight arrows symbolize global sum operations which
  connect all tiles.  The global sum operation provides both a key
  arithmetic primitive and can serve as a synchronization primitive. A
  third barrier primitive is also provided; it behaves much like the
  global sum primitive.}
\label{fig:communication_primitives}
\end{figure}

Optimized communication support is assumed to be potentially available
for a small number of communication operations.  It is also assumed
that communication performance optimizations can be achieved by
optimizing a small number of communication primitives.  Three
optimizable primitives are provided by the WRAPPER.

\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between
  interior and overlap regions of neighboring tiles. A number of
  different forms of this operation are supported. These different
  forms handle
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Exchange
    primitives select between using shared memory or distributed
    memory communication.
  \item Transformation operations required when transporting data
    between different grid regions. Transferring data between faces of
    a cube-sphere grid, for example, involves a rotation of vector
    components.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.
  \end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
  operation for the pressure inversion phase of the MITgcm algorithm.
  For certain configurations scaling can be highly sensitive to the
  performance of the global sum primitive. This operation is a
  collective operation involving all tiles of the simulated domain.
  Different forms of the global sum primitive exist for handling
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Global sum
    primitives select between using shared memory or distributed
    memory communication.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the global sum
    primitives.
  \end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization
  function called barrier. This is used to synchronize computations
  over all tiles.  The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
  have much in common and in some cases use the same underlying code.
  A sketch of how these primitives appear in application-level code
  follows this list.
\end{itemize}
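
The sketch below shows how the global sum and barrier primitives might
appear in application-level code. The routine and variable names are
schematic (modeled on MITgcm conventions) rather than exact
interfaces.

\begin{verbatim}
C     Each thread accumulates a partial sum over the tile points
C     it owns ...
      sumPhi = 0.0D0
      DO j=1,sNy
       DO i=1,sNx
        sumPhi = sumPhi + phi(i,j)
       ENDDO
      ENDDO
C     ... the WRAPPER combines the partial sums over all tiles ...
      CALL GLOBAL_SUM_RL( sumPhi, myThid )
C     ... and all threads synchronize before the next phase.
      CALL BARRIER( myThid )
\end{verbatim}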

% [...]

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiling_detail.eps}
 }
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
  % [...]
}
\end{figure}

% [...]

Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics

\begin{itemize}
\item The machine consists of one or more logical processors.
\item Each processor operates on tiles that it owns.
\item A processor may own more than one tile.
\item Processors may compute concurrently.
\item Exchange of information between tiles is handled by the
  machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each
  other using shared memory.
\item An alternative communication mechanism based on a relatively
  simple inter-process communication API may be required.
\item Shared memory may not necessarily obey sequential consistency;
  however some mechanism will exist for enforcing memory consistency.
\item Memory consistency that is enforced at the hardware level
  may be expensive. Unnecessary triggering of consistency protocols
  should be avoided.
\item Memory access patterns may need to be either repetitive or highly
  pipelined for optimum hardware performance.
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:using_the_wrapper: -->
\end{rawhtml}

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
\item controlling communication between tiles and between concurrently
  computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sec:specifying_a_decomposition} explains how to express
the way in which a domain is decomposed (or composed). Section
\ref{sec:starting_the_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems.  Section \ref{sec:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sec:specifying_a_decomposition}

% [...]

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/size_h.eps}
 }
\end{center}
\caption{The three level domain decomposition hierarchy employed by the
  WRAPPER.
  % [...]
}
\end{figure}

% [...]

Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.
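
The generic shape of such a loop is sketched below; {\em myBxLo},
{\em myBxHi}, {\em myByLo} and {\em myByHi} delimit the range of tiles
assigned to thread {\em myThid}, while {\em phi}, {\em gPhi} and
{\em deltaT} are hypothetical names.

\begin{verbatim}
C     Generic shape of a tile loop. Each thread loops over the tiles
C     it has been assigned; different threads (and processes) work
C     on different (bi,bj) ranges concurrently.
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j=1,sNy
         DO i=1,sNx
          phi(i,j,bi,bj) = phi(i,j,bi,bj) + deltaT*gPhi(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
\end{verbatim}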
697    
698    An exception to the the use of {\em bi} and {\em bj} in loops arises in the
699    exchange routines used when the exch2 package is used with the cubed
700    sphere.  In this case {\em bj} is generally set to 1 and the loop runs from
701    1,{\em bi}.  Within the loop {\em bi} is used to retrieve the tile number,
702    which is then used to reference exchange parameters.
703    
The amount of computation that can be embedded within a single loop
over {\em bi} and {\em bj} varies for different parts of the MITgcm
algorithm. Figure \ref{fig:bibj_extract} shows a code extract

% [...]

The global domain size is again ninety grid points in x and
821  forty grid points in y. The two sub-domains in each process will be computed  forty grid points in y. The two sub-domains in each process will be computed
822  sequentially if they are given to a single thread within a single process.  sequentially if they are given to a single thread within a single process.
823  Alternatively if the code is invoked with multiple threads per process  Alternatively if the code is invoked with multiple threads per process
824  the two domains in y may be computed on concurrently.  the two domains in y may be computed concurrently.
The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT_MINIMAL    :: Minimal startup. Just enough to
       |                    :: allow basic I/O.
       |
       |--INI_PROCS         :: Associate processes with grid regions.
       |
       |--INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |       |
       |       |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                     :: communication data structures
       |
       |--CHECK_THREADS     :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN    :: Numerical code top-level driver routine

\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
\label{fig:wrapper_startup}}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sec:multi_threaded_execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the variable
{\em myThid}.

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
Parameter:  {\em nTx}\\
Parameter:  {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}
\label{sec:multi_process_execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for
this is that many system libraries are still not ``thread-safe''. This
means that, for example, on some systems it is not safe to call system
routines to perform I/O when running in multi-threaded mode (except,
perhaps, in a limited set of circumstances).  Another reason is that
support for multi-threaded programming models varies between systems.

Multi-process execution is more ubiquitous.  In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sec:specifying_a_decomposition}) is given (in which at least
one of the parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation, appropriate compile time
and run time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable, MPI libraries are available for controlling
the start-up of multiple processes. The MPI libraries are not
required, although they are usually used, for performance critical
communication. However, in order to simplify the task of controlling
and coordinating the start up of a large number (hundreds and possibly
even thousands) of copies of the same program, MPI is used. The calls
to the MPI multi-process startup routines must be activated at compile
time.  Currently MPI libraries are invoked by specifying the
appropriate options file with the {\tt -of} flag when running the {\em
genmake2} script, which generates the Makefile for compiling and
linking MITgcm.  (Previously this was done by setting the {\em
ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
CPP\_EEOPTIONS.h} file.)  More detailed information about the use of
{\em genmake2} for specifying local compiler flags is located in
section \ref{sec:genmake}.\\
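As an illustration, a hypothetical build from within an experiment's
build directory on a Linux/gfortran system might look like the
following (the options file name and directory layout here are
illustrative; choose the file in {\em tools/build\_options} that
matches your platform):
\begin{verbatim}
% ../../../tools/genmake2 -mods=../code -mpi \
     -of=../../../tools/build_options/linux_amd64_gfortran
% make depend
% make
\end{verbatim}
The {\tt -mpi} flag activates the MPI start-up calls in the generated
Makefile.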
\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
\paragraph{Execution} The mechanics of starting a program in
multi-process mode under MPI is not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.  For the
open-source MPICH system, the MITgcm program can be started using a
command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of
processes that will be created. The numeric value {\em 64} must be
equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The parameter {\em mf} specifies
that a text file called ``mf'' will be read to get a list of processor
names on which the sixty-four processes will execute. The syntax of
this file is specified by the MPI distribution.
\\

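For reference, a decomposition consistent with the {\tt mpirun}
command above could use an eight by eight process grid. The tile
sizes and overlap widths below are placeholders, not a recommendation;
only the {\em nPx} and {\em nPy} settings matter for this example:
\begin{verbatim}
      PARAMETER (
     &           sNx =  30,
     &           sNy =  30,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   8,
     &           nPy =   8)
\end{verbatim}
Here {\em nPx} times {\em nPy} is sixty-four, matching the
{\tt -np 64} argument.
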
\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\


\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting of
a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with
the multi-threaded compiler on a machine will explain how to set the
required environment variables.
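For example, under a csh-style shell one might request four threads
with a command of the following form (the variable name is
system-dependent; consult the compiler documentation):
\begin{verbatim}
setenv PARALLEL 4
\end{verbatim}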
\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate the
number of threads to be used in the x and y directions.  The variables
{\em nTx} and {\em nTy} in this file are used to specify the
information required. The product of {\em nTx} and {\em nTy} must be
equal to the number of threads spawned i.e. the setting of the
environment variable PARALLEL.  The value of {\em nTx} must subdivide
the number of sub-domains in x ({\em nSx}) exactly. The value of {\em
nTy} must subdivide the number of sub-domains in y ({\em nSy})
exactly.
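A short sketch of an {\em eedata} file requesting four threads,
arranged two in x by two in y (this assumes {\em nSx} and {\em nSy}
are both divisible by two):
\begin{verbatim}
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}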
The multiprocess startup of the MITgcm executable {\em mitgcmuv} is
controlled by the routines {\em EEBOOT\_MINIMAL()} and {\em
INI\_PROCS()}. The first routine performs basic steps required to
make sure each process is started and has a textual output stream
associated with it. By default two output files are opened for each
process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.  The {\bf
NNNN} part of the name is filled in with the process number so that
process number 0 will create output files {\bf STDOUT.0000} and {\bf
STDERR.0000}, process number 1 will create output files {\bf
STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process by process basis.  The {\em EEBOOT\_MINIMAL()}
procedure also sets the variables {\em myProcId} and {\em
MPI\_COMM\_MODEL}.  These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()}
to allocate tiles to processes.

Allocation of processes to tiles is controlled by the routine {\em
INI\_PROCS()}. For each process this routine sets the variables {\em
myXGlobalLo} and {\em myYGlobalLo}.  These variables specify, in
index space, the coordinates of the southernmost and westernmost
corner of the southernmost and westernmost tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
also set in this routine. These are used to identify processes holding
tiles to the west, east, south and north of a given process. These
values are stored in global storage in the header file {\em
EESUPPORT.h} for use by communication routines.  The above does not
hold when the exch2 package is used.  The exch2 package sets its own
parameters to specify the global indices of tiles and their
relationships to each other.  See the documentation on the exch2
package (section \ref{sec:exch2}) for details.
\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/src/eeboot\_minimal.F}\\
File: {\em eesupp/src/ini\_procs.F}\\
File: {\em eesupp/inc/EESUPPORT.h}\\
Parameter: {\em myProcId}\\
Parameter: {\em MPI\_COMM\_MODEL}\\
Parameter: {\em myXGlobalLo}\\
Parameter: {\em myYGlobalLo}\\
Parameter: {\em pidW}\\
Parameter: {\em pidE}\\
Parameter: {\em pidS}\\
Parameter: {\em pidN}
\end{minipage}
} \\

\subsection{Controlling communication}
\label{sec:controlling_communication}
The WRAPPER maintains internal information that is used for communication
operations and that can be customized for different platforms. This section
describes the information that is held and used.

\begin{enumerate}
\item {\bf Tile-tile connectivity information}
  For each tile the WRAPPER sets a flag that records the tile number to
  the north, south, east and west of that tile. This number is unique
  over all tiles in a configuration. Except when using the cubed
  sphere and the exch2 package, the number is held in the variables
  {\em tileNo} (this holds the tile's own number), {\em tileNoN}, {\em
    tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
  stored with each tile that specifies the type of communication that
  is used between tiles.  This information is held in the variables
  {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
  {\em tileCommModeW}.  This latter set of variables can take one of
  the following values: {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
    COMM\_PUT} and {\em COMM\_GET}.  A value of {\em COMM\_NONE} is
  used to indicate that a tile has no neighbor to communicate with on
  a particular face. A value of {\em COMM\_MSG} is used to indicate
  that some form of distributed memory communication is required to
  communicate between these tile faces (see section
  \ref{sec:distributed_memory_communication}).  A value of {\em
    COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
  memory communication (see section
  \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value
  indicates that a CPU should communicate by writing to data
  structures owned by another CPU. A {\em COMM\_GET} value indicates
  that a CPU should communicate by reading from data structures owned
  by another CPU. These flags affect the behavior of the WRAPPER
  exchange primitive (see figure \ref{fig:communication_primitives}).
  The routine {\em ini\_communication\_patterns()} is responsible for
  setting the communication mode values for each tile.

  When using the cubed sphere configuration with the exch2 package,
  the relationships between tiles and their communication methods are
  set by the exch2 package and stored in different variables.  See the
  exch2 package documentation (section \ref{sec:exch2}) for details.

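  As an illustration (a sketch only, not actual WRAPPER source), an
  exchange routine could use these flags to select the transport
  mechanism for the west face of tile ({\em bi},{\em bj}) as follows:
\begin{verbatim}
C     Select the communication mechanism for the west face of a tile.
      IF     ( tileCommModeW(bi,bj) .EQ. COMM_NONE ) THEN
C       No neighbor on this face: nothing to exchange.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_MSG  ) THEN
C       Distributed memory: post message-passing send/receives.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT  ) THEN
C       Shared memory: write into data owned by the neighboring CPU.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_GET  ) THEN
C       Shared memory: read from data owned by the neighboring CPU.
      ENDIF
\end{verbatim}
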
\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/src/ini\_communication\_patterns.F}\\
File: {\em eesupp/inc/EESUPPORT.h}\\
Parameter: {\em tileNo}\\
Parameter: {\em tileNoE}\\
Parameter: {\em tileNoW}\\
Parameter: {\em tileNoN}\\
Parameter: {\em tileNoS}\\
Parameter: {\em tileCommModeE}\\
Parameter: {\em tileCommModeW}\\
Parameter: {\em tileCommModeN}\\
Parameter: {\em tileCommModeS}
\end{minipage}
} \\

\item {\bf MP directives}
  The WRAPPER transfers control to numerical application code through
  the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
  that allows for it to be invoked by several threads. Support for
  this is based on either multi-processing (MP) compiler directives or
  specific calls to multi-threading libraries (\textit{e.g.}, POSIX
  threads).  Most commercially available Fortran compilers support the
  generation of code to spawn multiple threads through some form of
  compiler directives.  Compiler directives are generally more
  convenient than writing code that explicitly spawns threads and,
  on some systems, they may be the only method available.  The WRAPPER
  is distributed with template MP directives for a number of systems.

  These directives are inserted into the code just before and after
  the transfer of control to numerical algorithm code through the
  routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
  an example of the code that performs this process for a Silicon
  Graphics system.  This code is extracted from the files {\em main.F}
  and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
  specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
  will be created. The value of {\em nThreads} is set in the routine
  {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
  product of the parameters {\em nTx} and {\em nTy} that are read from
  the file {\em eedata}. If the value of {\em nThreads} is
  inconsistent with the number of threads requested from the operating
  system (for example by using an environment variable as described in
  section \ref{sec:multi_threaded_execution}) then usually an error
  will be reported by the routine {\em CHECK\_THREADS}.

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em eedata}\\
Parameter: {\em nThreads}\\
Parameter: {\em nTx}\\
Parameter: {\em nTy}
\end{minipage}
}

\item {\bf memsync flags}
  As discussed in section \ref{sec:memory_consistency}, a low-level
  system function may be needed to force memory consistency on some
  shared memory systems.  The routine {\em MEMSYNC()} is used for this
  purpose. This routine should not need modifying and the information
  below is only provided for completeness. A logical parameter {\em
    exchNeedsMemSync} set in the routine {\em
    INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
    MEMSYNC()} primitive is called. In general this routine is only
  used for multi-threaded execution.  The code that goes into the {\em
    MEMSYNC()} routine is specific to the compiler and processor used.
  In some cases, it must be written using a short code snippet of
  assembly language.  For an Ultra Sparc system the following code
  snippet is used
\begin{verbatim}
asm("membar #LoadStore|#StoreStore");
\end{verbatim}
while on an x86 system the following code snippet is used
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory")
\end{verbatim}

\item {\bf Cache line size}
  As discussed in section \ref{sec:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with
  excessive coherence traffic on an SMP system. To do this the shared
  memory data structures used by the {\em GLOBAL\_SUM}, {\em
    GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
  that control the padding are set in the header file {\em
    EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
    lShare1}, {\em lShare4} and {\em lShare8}. The default values
  should not normally need changing.

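  As a sketch of the padding idea (the array and COMMON block names
  here are illustrative, not the actual WRAPPER declarations), a
  shared per-thread accumulator can be dimensioned so that each
  thread's slot occupies its own cache line:
\begin{verbatim}
C     lShare8 64-bit words span one cache line, so thread myThid
C     writing sumPhi(1,myThid) never shares a line with another
C     thread and no false sharing occurs.
      Real*8 sumPhi( lShare8, MAX_NO_THREADS )
      COMMON /SUM_COMMON/ sumPhi
\end{verbatim}
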
\item {\bf \_BARRIER}
  This is a CPP macro that is expanded to a call to a routine which
  synchronizes all the logical processors running under the WRAPPER.
  Using a macro here preserves flexibility to insert a specialized
  call in-line into application code. By default this resolves to
  calling the procedure {\em BARRIER()}. The default setting for the
  \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.

1265  \item {\bf \_GSUM}  \item {\bf \_GSUM}
1266  This is a CPP macro that is expanded to a call to a routine    This is a CPP macro that is expanded to a call to a routine which
1267  which sums up a floating point numner    sums up a floating point number over all the logical processors
1268  over all the logical processors running under the    running under the WRAPPER. Using a macro here provides extra
1269  WRAPPER. Using a macro here provides extra flexibility to insert    flexibility to insert a specialized call in-line into application
1270  a specialized call in-line into application code. By default this    code. By default this resolves to calling the procedure {\em
1271  resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for      GLOBAL\_SUM\_R8()} ( for 64-bit floating point operands) or {\em
1272  84=bit floating point operands)      GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
1273  or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default    default setting for the \_GSUM macro is given in the file {\em
1274  setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.      CPP\_EEMACROS.h}.  The \_GSUM macro is a performance critical
1275  The \_GSUM macro is a performance critical operation, especially for    operation, especially for large processor count, small tile size
1276  large processor count, small tile size configurations.    configurations.  The custom communication example discussed in
1277  The custom communication example discussed in section \ref{sec:jam_example}    section \ref{sec:jam_example} shows how the macro is used to invoke
1278  shows how the macro is used to invoke a custom global sum routine    a custom global sum routine for a specific set of hardware.
 for a specific set of hardware.  
1279    
\item {\bf \_EXCH}
  The \_EXCH CPP macro is used to update tile overlap regions.  It is
  qualified by a suffix indicating whether overlap updates are for
  two-dimensional (\_EXCH\_XY) or three dimensional (\_EXCH\_XYZ)
  physical fields and whether fields are 32-bit floating point
  (\_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4) or 64-bit floating point
  (\_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8). The macro mappings are defined in
  the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
  operation plays a crucial role in scaling to small tile, large
  logical and physical processor count configurations.  The example in
  section \ref{sec:jam_example} discusses defining an optimized and
  specialized form of the \_EXCH operation.

  The \_EXCH operation is also central to supporting grids such as the
  cube-sphere grid. In this class of grid a rotation may be required
  between tiles. Aligning the coordinate requiring rotation with the
  tile decomposition allows the coordinate transformation to be
  embedded within a custom form of the \_EXCH primitive.  In these
  cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
  package documentation (section \ref{sec:exch2}).
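  For example, after updating the interior points of a
  two-dimensional, 64-bit field one might refresh its overlap regions
  with a call of the following form (the field name {\em phi} is
  illustrative):
\begin{verbatim}
C     Update overlap regions of a 2d, 64-bit field.
      _EXCH_XY_R8( phi, myThid )
\end{verbatim}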

\item {\bf Reverse Mode}
  The communication primitives \_EXCH and \_GSUM both employ
  hand-written adjoint (or reverse mode) forms.  These reverse
  mode forms can be found in the source code directory {\em
    pkg/autodiff}.  For the global sum primitive the reverse mode form
  calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
  The reverse mode forms of the exchange primitives are found in
  routines prefixed {\em ADEXCH}. The exchange routines make calls to
  the same low-level communication primitives as the forward mode
  operations. However, the routine argument {\em simulationMode} is
  set to the value {\em REVERSE\_SIMULATION}. This signifies to the
  low-level routines that the adjoint forms of the appropriate
  communication operation should be performed.

\item {\bf MAX\_NO\_THREADS}
  The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
  number of OS threads that a code will use. This value defaults to
  thirty-two and is set in the file {\em EEPARAMS.h}.  For single
  threaded execution it can be reduced to one if required.  The value
  is largely private to the WRAPPER and application code will not
  normally reference the value, except in the following scenario.

  For certain physical parameterization schemes it is necessary to have
  a substantial number of work arrays. Where these arrays are
  allocated in heap storage (for example COMMON blocks) multi-threaded
  execution will require multiple instances of the COMMON block data.
  This can be achieved using a Fortran 90 module construct.  However,
  if this mechanism is unavailable then the work arrays can be extended
  with dimensions using the tile dimensioning scheme of {\em nSx} and
  {\em nSy} (as described in section
  \ref{sec:specifying_a_decomposition}). However, if the
  configuration being specified involves many more tiles than OS
  threads then it can save memory resources to reduce the variable
  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
  that will be used and to declare the physical parameterization work
  arrays with a single {\em MAX\_NO\_THREADS} extra dimension.  An
  example of this is given in the verification experiment {\em
    aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
  altered to
\begin{verbatim}
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS =    6 )
\end{verbatim}
  and several work arrays for storing intermediate calculations are
  created with declarations of the form:
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
  This declaration scheme is not used widely, because most global data
  is used for permanent not temporary storage of state information.
  In the case of permanent state information this approach cannot be
  used because there has to be enough storage allocated for all tiles.
  However, the technique can sometimes be a useful scheme for reducing
  memory requirements in complex physical parameterizations.
\end{enumerate}

\begin{figure}
{\footnotesize
\begin{verbatim}
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
}
\caption{Prior to transferring control to the procedure {\em
THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
multiple threads.} \label{fig:mp_directives}
\end{figure}


\subsection{Specializing the communication code}
The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sec:jam_example}
On some platforms a big performance boost can be obtained by binding
the communication routines {\em \_EXCH} and {\em \_GSUM} to
specialized native libraries (for example, the shmem library on CRAY
T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
illustration of a specialized communication configuration that
substitutes for standard, portable forms of {\em \_EXCH} and {\em
  \_GSUM}. It affects three source files: {\em eeboot.F}, {\em
  CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined it has
the following effects.
\begin{itemize}
\item An extra phase is included at boot time to initialize the custom
  communications library (see {\em ini\_jam.F}).
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
  with calls to custom routines (see {\em gsum\_jam.F} and {\em
    exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
  for overlap regions of width one) is substituted into the elliptic
  solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern.
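A sketch of the kind of macro remapping involved (the custom routine
name below is purely illustrative):
\begin{verbatim}
#ifdef LETS_MAKE_JAM
#define _EXCH_XY_R8(a,b) CALL EXCH_XY_R8_JAM ( a, b )
#endif
\end{verbatim}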

\subsubsection{Cube sphere communication}
\label{sec:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from a
series of template files, for example {\em exch\_rx.template}.  This
is done to allow a large number of variations on the exchange process
to be maintained. One set of variations supports the cube sphere grid.
Support for a cube sphere grid in MITgcm is based on having each face
of the cube as a separate tile or tiles.  The exchange routines are
then able to absorb much of the detailed rotation and reorientation
required when moving around the cube grid. The set of {\em \_EXCH}
routines that contain the word cube in their name perform these
transformations.  They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for grid-centered
and for grid-face and grid-corner quantities.  Three sets of exchange
routines are defined. Routines with names of the form {\em exch\_rx}
are used to exchange cell centered scalar quantities. Routines with
names of the form {\em exch\_uv\_rx} are used to exchange vector
quantities located at the C-grid velocity points. The vector
quantities exchanged by the {\em exch\_uv\_rx} routines can either be
signed (for example velocity components) or un-signed (for example
grid-cell separations).  Routines with names of the form {\em
  exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
point locations.
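The cubed sphere exchanges are selected at run time; a minimal {\em
eedata} sketch enabling them would contain:
\begin{verbatim}
 &EEPARMS
 useCubedSphereExchange=.TRUE.,
 &
\end{verbatim}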


\section{MITgcm execution under WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:mitgcm_wrapper: -->
\end{rawhtml}

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sec:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sec:calling_sequence}

WRAPPER layer.

{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT_MINIMAL    :: Minimal startup. Just enough to
       |                    :: allow basic I/O.
       |
       |--INI_PROCS         :: Associate processes with grid regions.
       |
       |--INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |       |
       |       |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                     :: communication data structures
       |
       |--CHECK_THREADS     :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN    :: Numerical code top-level driver routine

\end{verbatim}
}

Core equations plus packages.

{\footnotesize
\begin{verbatim}
C
C Invocation from WRAPPER level...
C  :
C  :
C    | | |-CTRL_INIT            :: Control vector support package. see pkg/ctrl
C    | | |-OPTIM_READPARMS      :: Optimisation support package. see pkg/ctrl
C    | | |-GRDCHK_READPARMS     :: Gradient check package. see pkg/grdchk
C    | | |-ECCO_READPARMS       :: ECCO Support Package. see pkg/ecco
C    | | |-PTRACERS_READPARMS   :: multiple tracer package, see pkg/ptracers
C    | | |-GCHEM_READPARMS      :: tracer interface package, see pkg/gchem
C    | |
C    | |-PACKAGES_CHECK
C    | | |
C    | | |-KPP_CHECK            :: KPP Package. pkg/kpp
C    | | |-OBCS_CHECK           :: Open bndy Package. pkg/obcs
C    | | |-GMREDI_CHECK         :: GM Package. pkg/gmredi
C    | |
C    | |-PACKAGES_INIT_FIXED
C    | | |-OBCS_INIT_FIXED      :: Open bndy Package. see pkg/obcs
C    | | |-FLT_INIT             :: Floats Package. see pkg/flt
C    | | |-GCHEM_INIT_FIXED     :: tracer interface package, see pkg/gchem
C    | |
C    | |-ZONAL_FILT_INIT        :: FFT filter Package. see pkg/zonal_filt
C    | |
C    | |-INI_CG2D               :: 2d con. grad solver initialization.
C    | |
C    | |-INI_CG3D               :: 3d con. grad solver initialization.
C    | |
C    | |-CONFIG_SUMMARY         :: Provide synopsis of kernel setup.
C    |                          :: Includes annotated table of kernel
C    | | |
C    | | |-INI_CORI     :: Set coriolis term. zero, f-plane, beta-plane,
C    | | |              :: sphere options are coded.
C    | | |
C    | | |-INI_CG2D     :: 2d con. grad solver initialization.
C    | | |-INI_CG3D     :: 3d con. grad solver initialization.
C    | | |-INI_MIXING   :: Initialize diapycnal diffusivity.
C    | | |-INI_DYNVARS  :: Initialize to zero all DYNVARS.h arrays (dynamical
C    | | |              :: fields).
C    | | |
C    | | |-INI_FIELDS   :: Control initializing model fields to non-zero
C    | | | |-INI_VEL    :: Initialize 3D flow field.
C    | | | |-INI_THETA  :: Set model initial temperature field.
C    | | | |-INI_SALT   :: Set model initial salinity field.
C    | | | |-INI_PSURF  :: Set model initial free-surface height/pressure.
C    | | | |-INI_PRESSURE :: Compute model initial hydrostatic pressure
C    | | | |-READ_CHECKPOINT :: Read the checkpoint
C    | | |
C    | | |-THE_CORRECTION_STEP :: Step forward to next time step.
C    | | | |                   :: Here applied to move restart conditions
C    | | | |-FIND_RHO  :: Find adjacent densities.
C    | | | |-CONVECT   :: Mix static instability.
C    | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
C    | | |
C    | | |-PACKAGES_INIT_VARIABLES :: Does initialization of time evolving
C    | | | |                       :: package data.
C    | | | |
C    | | | |-GMREDI_INIT          :: GM package. ( see pkg/gmredi )
C    | | | |-KPP_INIT             :: KPP package. ( see pkg/kpp )
C    | | | |-KPP_OPEN_DIAGS
C    | | | |-OBCS_INIT_VARIABLES  :: Open bndy. package. ( see pkg/obcs )
C    | | | |-PTRACERS_INIT        :: multi. tracer package. ( see pkg/ptracers )
C    | | | |-GCHEM_INIT           :: tracer interface pkg. ( see pkg/gchem )
C    | | | |-AIM_INIT             :: Interm. atmos package. ( see pkg/aim )
C    | | | |-CTRL_MAP_INI         :: Control vector package. ( see pkg/ctrl )
C    | | | |-COST_INIT            :: Cost function package. ( see pkg/cost )
C/\  | | | |                    :: Simple interpolation procedure
C/\  | | | |                    :: for forcing datasets.
C/\  | | | |
C/\  | | | |-EXCH :: Sync forcing. in overlap regions.
C/\  | | |-SEAICE_MODEL   :: Compute sea-ice terms. ( pkg/seaice )
C/\  | | |-FREEZE         :: Limit surface temperature.
C/\  | | |-GCHEM_FIELD_LOAD :: load tracer forcing fields ( pkg/gchem )
C/\  | | |
C/\  | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
C/\  | | | |
C/\  | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
C/\  | | | |-OBCS_APPLY_W    :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |-FIND_RHO        :: Calculates [rho(S,T,z)-RhoConst] of a slice
C/\  | | | |-GRAD_SIGMA      :: Calculate isoneutral gradients
C/\  | | | |-CALC_IVDC       :: Set Implicit Vertical Diffusivity for Convection
C/\  | | | |
C/\  | | | |-OBCS_CALC            :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
C/\  | | | | |                    :: forcing terms.
C/\  | | | | |-PTRACERS_FORCING_SURF :: Tracer package ( see pkg/ptracers ).
C/\  | | | |
C/\  | | | |-GMREDI_CALC_TENSOR   :: GM package ( see pkg/gmredi ).
C/\  | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
C/\  | | | |
C/\  | | | |-CALC_GT              :: Calculate the temperature tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_T  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
C/\  | | | |
C/\  | | | |-TIMESTEP_TRACER      :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-CALC_GS              :: Calculate the salinity tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_S  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | |
C/\  | | | |-TIMESTEP_TRACER      :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-PTRACERS_INTEGRATE   :: Integrate other tracer(s) ( see pkg/ptracers ).
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_PTR:: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-PTRACERS_FORCING   :: Problem specific forcing for tracer.
C/\  | | | | |-GCHEM_FORCING_INT  :: tracer forcing for gchem pkg (if all
C/\  | | | | |                       tendency terms calculated together)
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | | |-TIMESTEP_TRACER    :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |
C/\  | | | |-IMPLDIFF             :: Solve vertical implicit diffusion equation.
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package ( see pkg/obcs ).
C/\  | | |
C/\  | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
C/\  | | | |-EXCH
C/\  | | |
C/\  | | |-GCHEM_FORCING_SEP :: tracer forcing for gchem pkg (if
C/\  | | |                      tracer dependent tendencies calculated
C/\  | | |                      separately)
C/\  | | |
C/\  | | |-FLT_MAIN         :: Float package ( pkg/flt ).
C/\  | | |
C/\  | | |-MONITOR          :: Monitor package ( pkg/monitor ).
C/\  | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
C/\  | | | |-AIM_WRITE_DIAGS     :: Intermed. atmos diags. see pkg/aim
C/\  | | | |-GMREDI_DIAGS        :: GM diags. see pkg/gmredi
C/\  | | | |-KPP_DO_DIAGS        :: KPP diags. see pkg/kpp
C/\  | | | |-SBO_CALC            :: SBO diags. see pkg/sbo
C/\  | | | |-SBO_DIAGS           :: SBO diags. see pkg/sbo
C/\  | | | |-SEAICE_DO_DIAGS     :: SEAICE diags. see pkg/seaice
C/\  | | | |-GCHEM_DIAGS         :: gchem diags. see pkg/gchem
C/\  | | |
C/\  | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
C/\  | |
C    |-TIMER_PRINTALL :: Computational timing summary
C    |
C    |-COMM_STATS     :: Summarise inter-proc and inter-thread communication
C                     :: events.
C
\end{verbatim}
}
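
The \texttt{ADAMS\_BASHFORTH2} entries that appear under each of the
tendency routines above (\texttt{CALC\_GT}, \texttt{CALC\_GS},
\texttt{PTRACERS\_INTEGRATE}) all perform the same operation: before a
field is stepped forward by \texttt{TIMESTEP\_TRACER}, its tendency $G$
is extrapolated in time with the quasi-second-order Adams-Bashforth
scheme,
\[
G^{(n+1/2)} = \left( \frac{3}{2}+\epsilon_{AB} \right) G^{(n)}
            - \left( \frac{1}{2}+\epsilon_{AB} \right) G^{(n-1)} ,
\]
where the small offset $\epsilon_{AB}$ (the run-time parameter
\texttt{abEps}) is included to stabilize the scheme.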
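
Most of the routines flagged ``( see pkg/\ldots )'' in the tree are
optional and are reached through a two-level guard: the call is
compiled in only when the package's CPP flag is defined, and executed
only when the corresponding run-time flag is set in \texttt{data.pkg}.
The sketch below illustrates the pattern for the \texttt{gchem} hook
shown above; the argument list is indicative rather than definitive.

{\footnotesize
\begin{verbatim}
C     Guarded call to an optional package routine, as it would
C     appear in FORWARD_STEP. ALLOW_GCHEM is the compile-time CPP
C     flag for pkg/gchem; useGCHEM is its run-time switch.
#ifdef ALLOW_GCHEM
      IF ( useGCHEM ) THEN
        CALL GCHEM_FORCING_SEP( myTime, myIter, myThid )
      ENDIF
#endif
\end{verbatim}
}
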
\subsection{Measuring and Characterizing Performance}
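
The \texttt{TIMER\_PRINTALL} summary at the end of the call tree above
is assembled from timer calls placed around the major code sections.
A minimal sketch of how a section is instrumented follows; the label
string is arbitrary (by convention it names both the timed section and
the enclosing routine), and the routine timed here is purely
illustrative.

{\footnotesize
\begin{verbatim}
C     Accumulate elapsed time for one code section under a named
C     counter; totals for every label are reported by TIMER_PRINTALL
C     when the run completes.
      CALL TIMER_START( 'THERMODYNAMICS     [FORWARD_STEP]', myThid )
      CALL THERMODYNAMICS( myTime, myIter, myThid )
      CALL TIMER_STOP ( 'THERMODYNAMICS     [FORWARD_STEP]', myThid )
\end{verbatim}
}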
