% $Header: /u/gcmpack/manual/s_software/text/sarch.tex,v 1.25 2010/08/27 13:17:59 jmc Exp $

This chapter focuses on describing the {\bf WRAPPER} environment
within which both the core numerics and the pluggable packages
operate. The description presented here is intended to be a detailed
exposition and contains significant background material, as well as
advanced details on working with the WRAPPER. The tutorial sections
of this manual (see sections \ref{sec:modelExamples} and
\ref{sec:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various
types, both sequentially and in parallel. For many projects simply
starting from an example code and adapting it to suit a particular
situation will be all that is required. The first part of this
chapter discusses the MITgcm architecture at an abstract level. In
the second part of the chapter we describe practical details of the
MITgcm implementation and of current tools and operating system
features that are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm
are three-fold:

\begin{itemize}
\item We wish to be able to study a very broad range of interesting
  and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to a wide range
  of platforms.
\item On any given platform we would like to be able to achieve
  performance comparable to an implementation developed and
  specialized specifically for that platform.
\end{itemize}

These points are summarized in figure
\ref{fig:mitgcm_architecture_goals}, which conveys the goals of the
MITgcm design. The goals lead to a software architecture which at a
high level can be viewed as consisting of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in
  detail in section \ref{chap:discretization}.

\item A scheme for supporting optional ``pluggable'' {\bf packages}
  (containing for example mixed-layer schemes, biogeochemical
  schemes, atmospheric physics). These packages are used both to
  overlay alternate dynamics and to introduce specialized physical
  content onto the core numerical code. An overview of the {\bf
  package} scheme is given at the start of part \ref{chap:packagesI}.

\item A support framework called {\bf WRAPPER} (Wrappable Application
  Parallel Programming Environment Resource), within which the core
  numerics and pluggable packages operate.
\end{enumerate}

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{s_software/figs/mitgcm_goals.eps}}
\end{center}
\caption{The MITgcm architecture is designed to allow simulation of a
  wide range of physical problems on a wide range of hardware. The
  computational resource requirements of the applications targeted
  range from around $10^7$ bytes ($\approx 10$ megabytes) of memory
  to $10^{11}$ bytes ($\approx 100$ gigabytes). Arithmetic operation
  counts for the applications of interest range from $10^{9}$
  floating point operations to more than $10^{17}$ floating point
  operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in MITgcm
is a software superstructure and substructure collectively called the
WRAPPER (Wrappable Application Parallel Programming Environment
Resource). All numerical and support code in MITgcm is written to
``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
within the WRAPPER means that coding has to follow certain,
relatively straightforward, rules and conventions (these are
discussed further in section \ref{sec:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to
insulate code that fits within it from architectural differences
between hardware platforms and operating systems. This allows
numerical code to be easily retargeted.

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{s_software/figs/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sec:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took
place on a multi-processor CRAY Y-MP system. On that system,
numerical code performance and scaling under the WRAPPER was in
excess of that of an implementation that was tightly bound to the
CRAY system's proprietary multi-tasking and micro-tasking approach.
Later developments have been carried out on uniprocessor and
multi-processor Sun systems with both uniform memory access (UMA) and
non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based
clustered SMP systems, and on cache-coherent NUMA (CC-NUMA) systems
such as Silicon Graphics Altix systems. The MITgcm code, operating
within the WRAPPER, is also routinely used on large scale MPP systems
(for example, Cray T3E and IBM SP systems). In all cases numerical
code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified
to contain native optimizations for a particular system
\cite{hoe-hill:99}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sec:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
large MPP systems). This is one example of a difference between
compute platforms that can impact an application. Another common
distinction is between vector processing systems with highly
specialized CPUs and memory subsystems and commodity microprocessor
based systems. There are numerous other differences, especially in
relation to how parallel execution is supported. To capture the
essential differences between different platforms the WRAPPER uses a
{\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
can easily be specialized to fit, in a computationally efficient
manner, any computer architecture currently available to the
scientific computing community.

\subsection{Machine model parallelism}
\label{sec:domain_decomposition}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently. Computational work is divided among the logical
processors by allocating ``ownership'' to each processor of a certain
set (or sets) of calculations. Each set of calculations owned by a
particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided
do not need to correspond to physical processors. It is perfectly
possible to execute a configuration decomposed for multiple logical
processors on a single physical processor. This helps ensure that
numerical code that is written to fit within the WRAPPER will
parallelize with no additional effort. It is also useful for
debugging purposes. Generally, however, the computational domain will
be subdivided over multiple logical processors in order to then bind
those logical processors to physical processor resources that can
compute in parallel.

\subsubsection{Tiles}

Computationally, the data structures (\textit{eg.} arrays, scalar
variables, etc.) that hold the simulated state are associated with
each region of physical space and are allocated to a particular
logical processor. We refer to these data structures as being {\bf
owned} by the processor to which their associated region of physical
space has been allocated. Individual regions that are allocated to
processors are called {\bf tiles}. A processor can own more than one
tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
single region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously,
working only with data from the tile (or tiles) that the processor
owns. When multiple tiles are allotted to a single processor, each
tile is computed independently of the other tiles, in a sequential
fashion.

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/domain_decomp.eps}
 }
\end{center}
\caption{The WRAPPER provides support for one and two dimensional
  decompositions of grid-point domains. The figure shows a
  hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
  hypothetical domain is decomposed in two dimensions along the
  $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
  owned} by different processors. The {\bf owning} processors perform
  the arithmetic operations associated with a {\bf tile}. Although
  not illustrated here, a single processor can {\bf own} several {\bf
  tiles}. Whenever a processor wishes to transfer data between tiles
  or communicate with other processors it calls a WRAPPER supplied
  function. } \label{fig:domaindecomp}
\end{figure}


\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region. The
overlap region of a tile corresponds to the interior region of an
adjacent tile. In figure \ref{fig:tiledworld} each tile would own the
region within the black square and hold duplicate information for
overlap regions extending into the tiles to the north, south, east
and west. During computational phases a processor will reference data
in an overlap region whenever it requires values that lie outside the
domain it owns. Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sec:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
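
As a concrete, schematic illustration (the array names {\em phi} and
{\em lap} below are hypothetical and not taken from the MITgcm
sources), a five-point Laplacian evaluated over a tile interior reads
one point into the overlap region in each direction; once such a
computation has consumed the overlap values, an exchange is needed
before the overlaps can be trusted again:
\begin{verbatim}
C     Interior computation: I-1, I+1, J-1 and J+1 reach into the
C     overlap region when I or J lie on the tile edge.
      DO J=1,sNy
       DO I=1,sNx
        lap(I,J) = phi(I+1,J) + phi(I-1,J)
     &           + phi(I,J+1) + phi(I,J-1)
     &           - 4.*phi(I,J)
       ENDDO
      ENDDO
C     A WRAPPER exchange call would follow here to refresh the
C     overlap region of phi (and/or lap) from neighboring tiles.
\end{verbatim}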

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiled-world.eps}
 }
\end{center}
\caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
Overlap regions are periodically updated from neighboring tiles.
} \label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two
possible mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}. Under this mode of
  communication data transfers are assumed to be possible using
  direct addressing of regions of memory. In this case a CPU is able
  to read (and write) directly to regions of memory ``owned'' by
  another CPU using simple programming language level assignment
  operations of the sort shown in figure \ref{fig:simple_assign}. In
  this way one CPU (CPU1 in the figure) can communicate information
  to another CPU (CPU2 in the figure) by assigning a particular value
  to a particular memory location.

\item {\bf Distributed memory communication}. Under this mode of
  communication there is no mechanism, at the application code level,
  for directly addressing regions of memory owned by and visible to
  another CPU. Instead a communication library must be used, as
  illustrated in figure \ref{fig:comm_msg}. In this case CPUs must
  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another
  CPU owns. By default the WRAPPER binds to the MPI communication
  library \cite{MPI-std-20} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two
styles of communication. The underlying hardware and operating system
support for the style used is not specified and can vary from system
to system.

\begin{figure}
\begin{verbatim}

       CPU1                        |        CPU2
       ====                        |        ====
                                   |
       a(3) = 8                    |   WHILE ( a(3) .NE. 8 )
                                   |    WAIT
                                   |   END WHILE
                                   |
\end{verbatim}
\caption{In the WRAPPER shared memory communication model, simple
writes to an array can be made to be visible to other CPUs at the
application code level. So, for example, if one CPU (CPU1 in the
figure above) writes the value $8$ to element $3$ of array $a$, then
other CPUs (for example CPU2 in the figure above) will be able to see
the value $8$ when they read from $a(3)$. This provides a very low
latency and high bandwidth communication mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

       CPU1                        |        CPU2
       ====                        |        ====
                                   |
       a(3) = 8                    |   WHILE ( a(3) .NE. 8 )
       CALL SEND( CPU2,a(3) )      |    CALL RECV( CPU1, a(3) )
                                   |   END WHILE
                                   |
\end{verbatim}
\caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
} \label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sec:shared_memory_communication}

Under shared memory communication independent CPUs operate on the
exact same global address space at the application level. This means
that CPU 1 can directly write into global data structures that CPU 2
``owns'' using a simple assignment at the application level. This
model of memory access is supported at the basic system design level
in ``shared-memory'' systems such as PVP systems, SMP systems, and on
distributed shared memory systems (\textit{eg.} SGI Origin, SGI
Altix, and some AMD Opteron systems). On such systems the WRAPPER
will generally use simple read and write statements to directly
access application data structures when communicating between CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication. In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules. The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect. This can make
this method of communication very efficient provided it is used
appropriately.

\subsubsection{Memory consistency}
\label{sec:memory_consistency}

When using shared memory communication between multiple processors
the WRAPPER level shields user applications from certain
counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general
the order of reads and writes expressed by the textual order of an
application code may not be the ordering of instructions executed by
the processor performing the application. The processor performing
the application instructions will always operate so that, for the
application instructions the processor is executing, any reordering
is not apparent. However, machines are often designed so that
reordering of instructions is not hidden from other processors. This
means that, in general, even on a shared memory system two processors
can observe inconsistent memory values.

The issue of memory consistency between multiple processors is
discussed at length in many computer science papers. From a practical
point of view, in order to deal with this issue, shared memory
machines all provide some mechanism to enforce memory consistency
when it is needed. The exact mechanism employed will vary between
systems. For communication using shared memory, the WRAPPER provides
a place to invoke the appropriate mechanism to ensure memory
consistency for a particular platform.
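
As a schematic sketch only (the routine name below stands in for
whatever memory-synchronization hook a given platform provides; it is
illustrative rather than a documented WRAPPER interface), the pattern
that must be supported looks like:
\begin{verbatim}
C     Thread 1 fills a buffer and then raises a flag for thread 2.
C     The synchronization call ensures the buffer write is visible
C     to other processors before the flag update is.
      a(3) = 8
      CALL MEMORY_SYNC
      readyFlag = 1
\end{verbatim}
Without the intervening synchronization step, a processor with a
relaxed memory model could legitimately make the flag update visible
before the buffer update, which is exactly the inconsistency
described above.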

\subsubsection{Cache effects and false sharing}
\label{sec:cache_effects_and_false_sharing}

Shared-memory machines often have memory caches local to each
processor which contain mirrored copies of main memory. Automatic
cache-coherence protocols are used to maintain consistency between
caches on different processors. These cache-coherence protocols
typically enforce consistency between regions of memory with large
granularity (typically 128 or 256 byte chunks). The coherency
protocols employed can be expensive relative to other memory accesses
and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.
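
A minimal sketch of the padding idea (the names and sizes here are
hypothetical, chosen only to illustrate the technique) is to give
each thread its own synchronization word and to pad the words apart
so that no two of them share a cache line:
\begin{verbatim}
C     One synchronization word per thread. With a 256-byte cache
C     line and 4-byte integers, padding each entry to 64 integers
C     keeps different threads' words on different cache lines, so
C     an update by one thread does not invalidate the cached copies
C     held by the others.
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS = 32 )
      INTEGER lineWords
      PARAMETER ( lineWords = 64 )
      INTEGER syncFlag( lineWords, MAX_NO_THREADS )
      COMMON /HYPOTHETICAL_SYNC/ syncFlag
\end{verbatim}
Only element {\tt syncFlag(1,myThid)} would ever be used; the
remaining elements exist purely as padding.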

\subsubsection{Operating system support for shared memory.}

Applications running under multiple threads within a single process
can use shared memory communication. In this case {\it all} the
memory locations in an application are potentially visible to all the
compute threads. Multiple threads operating within a single process
is the standard mechanism for supporting shared memory that the
WRAPPER utilizes. Configuring and launching code to run in
multi-threaded mode on specific platforms is discussed in section
\ref{sec:multi_threaded_execution}. However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP %\ref{magicgarden}
and IPC %\ref{magicgarden}
facilities in UNIX
systems provide this capability as do vendor specific tools like LAPI
%\ref{IBMLAPI}
and IMC. %\ref{Memorychannel}.
Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sec:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory for
communication. For example cluster systems consist of individual
computers connected by a fast network. On such systems there is no
notion of shared memory at the system level. For this sort of system
the WRAPPER provides support for communication based on a bespoke
communication library (see figure \ref{fig:comm_msg}). The default
communication library used is MPI \cite{MPI-std-20}. However, it is
relatively straightforward to implement bindings to optimized
platform specific communication libraries. For example the work
described in \cite{hoe-hill:99} replaced standard MPI communication
with a highly optimized library.

\subsection{Communication primitives}
\label{sec:communication_primitives}

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/comm-primm.eps}
 }
\end{center}
\caption{Three performance critical parallel primitives are provided
  by the WRAPPER. These primitives are always used to communicate
  data between tiles. The figure shows four tiles. The curved arrows
  indicate exchange primitives which transfer data between the
  overlap regions at tile edges and interior regions for
  nearest-neighbor tiles. The straight arrows symbolize global sum
  operations which connect all tiles. The global sum operation
  provides both a key arithmetic primitive and can serve as a
  synchronization primitive. A third barrier primitive is also
  provided; it behaves much like the global sum primitive.
} \label{fig:communication_primitives}
\end{figure}

Optimized communication support is assumed to be potentially
available for a small number of communication operations. It is also
assumed that communication performance optimizations can be achieved
by optimizing a small number of communication primitives. Three
optimizable primitives are provided by the WRAPPER (a schematic
sketch of their use follows the list below).
\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between
  interior and overlap regions of neighboring tiles. A number of
  different forms of this operation are supported. These different
  forms handle
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods. Exchange
    primitives select between using shared memory or distributed
    memory communication.
  \item Transformation operations required when transporting data
    between different grid regions. Transferring data between faces
    of a cube-sphere grid, for example, involves a rotation of vector
    components.
  \item Forward and reverse mode computations. Derivative
    calculations require tangent linear and adjoint forms of the
    exchange primitives.
  \end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central
  arithmetic operation for the pressure inversion phase of the MITgcm
  algorithm. For certain configurations scaling can be highly
  sensitive to the performance of the global sum primitive. This
  operation is a collective operation involving all tiles of the
  simulated domain. Different forms of the global sum primitive exist
  for handling
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods. Global sum
    primitives select between using shared memory or distributed
    memory communication.
  \item Forward and reverse mode computations. Derivative
    calculations require tangent linear and adjoint forms of the
    global sum primitives.
  \end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization
  function called barrier. This is used to synchronize computations
  over all tiles. The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
  have much in common and in some cases use the same underlying code.
\end{itemize}
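
Schematically, these primitives appear in numerical code as calls of
the following form (the argument lists are abbreviated and the
routine names shown are examples of WRAPPER entry points; consult the
{\em eesupp} sources for the exact interfaces):
\begin{verbatim}
C     Refresh the overlap regions of array phi from neighboring tiles.
      CALL EXCH_XY_RL( phi, myThid )
C     Collective sum of err over all tiles, threads and processes.
      CALL GLOBAL_SUM_R8( err, myThid )
C     Synchronize all threads before proceeding.
      CALL BARRIER( myThid )
\end{verbatim}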

\subsection{Memory architecture}

The WRAPPER machine model is aimed at efficiently targeting systems
with highly pipelined memory architectures and systems with deep
memory hierarchies that favor memory reuse. This is achieved by
supporting a flexible tiling strategy as shown in figure
\ref{fig:tiling-strategy}. Within a CPU computations are carried out
sequentially on each tile in turn. By reshaping tiles according to
the target platform it is possible to automatically tune code to
improve memory performance. On a vector machine a given domain might
be sub-divided into a few long, thin regions. On a commodity
microprocessor based system, however, the same region could be
simulated using many more, smaller sub-domains.


\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiling_detail.eps}
 }
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
to be shaped to suit the underlying system memory architecture.
Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies;
long tiles with large inner loops can be used to exploit vector
systems having highly pipelined memory systems.
} \label{fig:tiling-strategy}
\end{figure}

\newpage
\subsection{Summary}
Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics

\begin{itemize}
\item The machine consists of one or more logical processors.
\item Each processor operates on tiles that it owns.
\item A processor may own more than one tile.
\item Processors may compute concurrently.
\item Exchange of information between tiles is handled by the
  machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with
  each other using shared memory.
\item An alternative communication mechanism based on a relatively
  simple inter-process communication API may be required.
\item Shared memory may not necessarily obey sequential consistency,
  however some mechanism will exist for enforcing memory consistency.
\item Memory consistency that is enforced at the hardware level
  may be expensive. Unnecessary triggering of consistency protocols
  should be avoided.
\item Memory access patterns may need to be either repetitive or
  highly pipelined for optimum hardware performance.
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

570     \section{Using the WRAPPER}
571 edhill 1.20 \begin{rawhtml}
572     <!-- CMIREDIR:using_the_wrapper: -->
573     \end{rawhtml}
574 cnh 1.1
575 edhill 1.21 In order to support maximum portability the WRAPPER is implemented
576     primarily in sequential Fortran 77. At a practical level the key steps
577     provided by the WRAPPER are
578 cnh 1.1 \begin{enumerate}
579     \item specifying how a domain will be decomposed
580     \item starting a code in either sequential or parallel modes of operations
581     \item controlling communication between tiles and between concurrently
582 edhill 1.21 computing CPUs.
583 cnh 1.1 \end{enumerate}
584     This section describes the details of each of these operations.
585 jmc 1.26 Section \ref{sec:specifying_a_decomposition} explains how the way in
586 edhill 1.21 which a domain is decomposed (or composed) is expressed. Section
587 jmc 1.26 \ref{sec:starting_the_code} describes practical details of running
588 edhill 1.21 codes in various different parallel modes on contemporary computer
589 jmc 1.26 systems. Section \ref{sec:controlling_communication} explains the
590 edhill 1.21 internal information that the WRAPPER uses to control how information
591     is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sec:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection
of tiles which are interconnected to each other. This is also true of
application code operating within the WRAPPER. Application code is
written as a series of compute operations, each of which operates on
a single tile. If application code needs to perform operations
involving data associated with another tile, it uses a WRAPPER
function to obtain that data. The specification of how a global
domain is constructed from tiles, or alternatively how a global
domain is decomposed into tiles, is made in the file {\em SIZE.h}.
This file defines the following parameters \\

\fbox{
\begin{minipage}{4.75in}
Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
File: {\em model/inc/SIZE.h}
\end{minipage}
} \\

Together these parameters define a tiling decomposition of the style
shown in figure \ref{fig:labelled_tile}. The parameters {\em sNx} and
{\em sNy} define the size of an individual tile. The parameters {\em
OLx} and {\em OLy} define the maximum size of the overlap extent.
This must be set to the maximum width of the computation stencil that
the numerical code's finite-difference operations require between
overlap region updates. The maximum overlap required by any of the
operations in the MITgcm code distributed with this release is three
grid points. This is set by the requirements of the $\nabla^4$
dissipation and diffusion operator. Code modifications and
enhancements that involve adding wide finite-difference stencils may
require increasing {\em OLx} and {\em OLy}. Setting {\em OLx} and
{\em OLy} to too large a value will decrease code performance
(because redundant computations will be performed), however it will
not cause any other problems.

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/size_h.eps}
 }
\end{center}
\caption{The three level domain decomposition hierarchy employed by
the WRAPPER. A domain is composed of tiles. Multiple tiles can be
allocated to a single process. Multiple processes can exist, each
with multiple tiles. Tiles within a process can be spread over
multiple compute threads.
} \label{fig:labelled_tile}
\end{figure}

The parameters {\em nSx} and {\em nSy} specify the number of tiles
that will be created within a single process. Each of these tiles
will have internal dimensions of {\em sNx} and {\em sNy}. If, when
the code is executed, these tiles are allocated to different threads
of a process that are then bound to different physical processors
(see the multi-threaded execution discussion in section
\ref{sec:starting_the_code}) then computation will be performed
concurrently on each tile. However, it is also possible to run the
same decomposition within a process running a single thread on a
single processor. In this case the tiles will be computed
sequentially. If the decomposition is run in a single process running
multiple threads but attached to a single physical processor, then,
in general, the computation for different tiles will be interleaved
by system level software. This too is a valid mode of operation.

The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used
extensively by numerical code. The settings of {\em sNx, sNy, OLx}
and {\em OLy} are used to form the loop ranges for many numerical
calculations and to provide dimensions for arrays holding numerical
state. The parameters {\em nSx} and {\em nSy} are used in conjunction
with the thread number parameter {\em myThid}. Much of the numerical
code operating within the WRAPPER takes the form
\begin{verbatim}
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          a block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO

      communication code to sum a number or maybe update
      tile overlap regions

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          another block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO
\end{verbatim}
The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)}
and {\em myByHi(myThid)} set the bounds of the loops in {\em bi} and
{\em bj} in this schematic. These variables specify the subset of the
tiles in the range 1,{\em nSx} and 1,{\em nSy} that the logical
processor bound to thread number {\em myThid} owns. The thread number
variable {\em myThid} ranges from 1 to the total number of threads
requested at execution time. For each value of {\em myThid} the loop
scheme above will step sequentially through the tiles owned by that
thread. However, different threads will have different ranges of
tiles assigned to them, so that separate threads can compute
iterations of the {\em bi}, {\em bj} loop concurrently. Within a {\em
bi}, {\em bj} loop computation is performed concurrently over as many
processes and threads as there are physical processors available to
compute.

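As a small worked example of this bookkeeping (the exact assignment
is made in {\em INI\_THREADING\_ENVIRONMENT()}; the numbers below
simply illustrate the idea), consider a process with {\em nSx}=4 and
{\em nSy}=2 running four threads, two in each direction. Each thread
then owns a block of two tiles, so thread 1 might have {\em
myBxLo}=1, {\em myBxHi}=2, {\em myByLo}=1, {\em myByHi}=1, while
thread 4 has {\em myBxLo}=3, {\em myBxHi}=4, {\em myByLo}=2, {\em
myByHi}=2.
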
An exception to the use of {\em bi} and {\em bj} in loops arises in
the exchange routines used when the exch2 package is used with the
cubed sphere. In this case {\em bj} is generally set to 1 and the
loop in {\em bi} runs from 1 to the number of tiles per process.
Within the loop {\em bi} is used to retrieve the tile number, which
is then used to reference exchange parameters.

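Schematically (the parameter array name below is illustrative; see
the exch2 package source for the actual topology arrays), such a loop
takes the form:
\begin{verbatim}
C     Loop over tiles with bj fixed at 1; the global tile number is
C     looked up from the exch2 topology and used to index exchange
C     parameters.
      bj = 1
      DO bi=1,nSx
       myTile = W2_myTileList(bi)
       ...
      ENDDO
\end{verbatim}
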
The amount of computation that can be embedded within a single loop
over {\em bi} and {\em bj} varies for different parts of the MITgcm
algorithm. Figure \ref{fig:bibj_extract} shows a code extract from
the two-dimensional implicit elliptic solver. This portion of the
code computes the l2 norm of a vector whose elements are held in the
array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2 norm requires a global
reduction, the {\em bi},{\em bj} loop only contains one statement.
This computation phase is then followed by a communication phase in
which all threads and processes must participate. However, in other
areas of the MITgcm code entire subsections of code are within a
single {\em bi},{\em bj} loop. For example the evaluation of all the
momentum equation prognostic terms (see {\em S/R DYNAMICS()}) is
within a single {\em bi},{\em bj} loop.

\begin{figure}
\begin{verbatim}
      REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  err
         :
         :
       other computations
         :
         :
      err = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO J=1,sNy
         DO I=1,sNx
          err = err +
     &     cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO

      CALL GLOBAL_SUM_R8( err , myThid )
      err = SQRT(err)

\end{verbatim}
\caption{Example of numerical code for calculating the l2 norm of a
vector within the WRAPPER. Notice that under the WRAPPER arrays such
as {\em cg2d\_r} have two extra trailing dimensions. These rightmost
indices are tile indexes. Different threads within a single process
operate on different ranges of tile index, as controlled by the
settings of {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
} \label{fig:bibj_extract}
\end{figure}

The final decomposition parameters are {\em nPx} and {\em nPy}. These
parameters are used to indicate to the WRAPPER level how many
processes (each with {\em nSx}$\times${\em nSy} tiles) will be used
for this simulation. This information is needed during initialization
and during I/O phases. However, unlike the variables {\em sNx, sNy,
OLx, OLy, nSx} and {\em nSy}, the values of {\em nPx} and {\em nPy}
are absent from the core numerical and support code.

\subsubsection{Examples of {\em SIZE.h} specifications}

The following {\em SIZE.h} parameter settings illustrate how to
interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx} and
{\em nPy}.
\begin{enumerate}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  40,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up a single tile with an x-dimension of ninety grid points,
a y-dimension of forty grid points, and x and y overlaps of three
grid points each.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with an x-dimension of forty-five grid points, a
y-dimension of twenty grid points, and x and y overlaps of three grid
points each. There are four tiles allocated to four separate
processes ({\em nPx=2,nPy=2}) and arranged so that the global domain
size is again ninety grid points in x and forty grid points in y. In
general the formula for the global grid size (held in model variables
{\em Nx} and {\em Ny}) is
\begin{verbatim}
     Nx  = sNx*nSx*nPx
     Ny  = sNy*nSy*nPy
\end{verbatim}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  10,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   2,
     &           nPx =   1,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with an x-dimension of ninety grid points, a
y-dimension of ten grid points, and x and y overlaps of three grid
points each. There are four tiles allocated to two separate processes
({\em nPy=2}), each of which has two separate sub-domains ({\em
nSy=2}). The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be
computed sequentially if they are given to a single thread within a
single process. Alternatively if the code is invoked with multiple
threads per process the two domains in y may be computed
concurrently.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up tiles with an x-dimension of thirty-two grid points, a
y-dimension of thirty-two grid points, and x and y overlaps of three
grid points each. There are six tiles allocated to six separate
logical processors ({\em nSx=6}). This set of values can be used for
a cube sphere calculation. Each tile of size $32 \times 32$
represents a face of the cube. Initializing the tile connectivity
correctly (see section \ref{sec:cube_sphere_communication}) allows
the rotations associated with moving between the six cube faces to be
embedded within the tile-tile communication code.
\end{enumerate}


\subsection{Starting the code}
\label{sec:starting_the_code}
When code is started under the WRAPPER, execution begins in a main
routine {\em eesupp/src/main.F} that is owned by the WRAPPER. Control
is transferred to the application through a routine called {\em
THE\_MODEL\_MAIN()} once the WRAPPER has initialized correctly and
has created the necessary variables to support subsequent calls to
communication routines by the application code. The startup calling
sequence followed by the WRAPPER is shown in figure
\ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
       |  |                  :: allow basic I/O.
       |  |-- EEINTRO_MSG    :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS    :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV  :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS      :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |  |
       |  |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |  |                             :: communication data structures
       |
       |
       |--CHECK_THREADS  :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN :: Numerical code top-level driver routine


\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
} \label{fig:wrapper_startup}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sec:multi_threaded_execution}
Prior to transferring control to the procedure {\em
THE\_MODEL\_MAIN()} the WRAPPER may cause several coarse grain
threads to be initialized. The routine {\em THE\_MODEL\_MAIN()} is
called once for each thread and is passed a single stack argument
which is the thread number, stored in the variable {\em myThid}. In
addition to specifying a decomposition with multiple tiles per
process (see section \ref{sec:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads
requires the following steps.\\

\paragraph{Compilation}
First the code must be compiled with appropriate multi-threading
directives active in the file {\em main.F} and with appropriate
compiler flags to request multi-threading support. The header files
{\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
contain directives compatible with compilers for Sun, Compaq, SGI,
Hewlett-Packard SMP systems and CRAY PVP systems. These directives
can be activated by using the compile time flags {\em -DTARGET\_SUN},
{\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
for invoking multi-threaded compilation vary from system to system
and from compiler to compiler. The options will be described
in the individual compiler documentation. For the Fortran compiler
from Sun the following options are needed to correctly compile
multi-threaded code
\begin{verbatim}
     -stackvar -explicitpar -vpara -noautopar
\end{verbatim}
These options are specific to the Sun compiler. Other compilers
will use different syntax that will be described in their
documentation. The effect of these options is as follows
\begin{enumerate}
\item {\bf -stackvar} Causes all local variables to be allocated in
  stack storage. This is necessary for local variables to ensure that
  they are private to their thread. Note, when using this option it
  may be necessary to override the default limit on stack-size that
  the operating system assigns to a process. This can normally be
  done by changing the settings of the command shell's {\em
  stack-size} limit variable. However, on some systems changing this
  limit will require privileged administrator access to modify system
  parameters.

\item {\bf -explicitpar} Requests that multiple threads be spawned in
  response to explicit directives in the application code. These
  directives are inserted with syntax appropriate to the particular
  target platform when, for example, the {\em -DTARGET\_SUN} flag is
  selected.

\item {\bf -vpara} This causes the compiler to describe the
  multi-threaded configuration it is creating. This is not required
  but it can be useful when troubleshooting.

\item {\bf -noautopar} This inhibits any automatic multi-threaded
  parallelization the compiler may otherwise generate.

\end{enumerate}


An example of valid settings for the {\em eedata} file for a
domain with two subdomains in y and running with two threads is shown
below
\begin{verbatim}
 nTx=1,nTy=2
\end{verbatim}
This set of values will cause computations to stay within a single
thread when moving across the {\em nSx} sub-domains. In the
y-direction, however, sub-domains will be split equally between two
threads.

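In practice these two values are entries in the namelist contained in
the {\em eedata} file; a minimal sketch of such a file (assuming the
standard {\em EEPARMS} namelist layout used by the verification
experiments, with all other entries left at their defaults) might
look like:
\begin{verbatim}
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}
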
\paragraph{Multi-threading files and parameters} The following files
and variables are used in setting up multi-threaded execution.\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
CPP:  {\em TARGET\_HP }\\
CPP:  {\em TARGET\_SGI}\\
CPP:  {\em TARGET\_CRAY\_VECTOR}\\
Parameter:  {\em nTx}\\
Parameter:  {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}
\label{sec:multi_process_execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason
for this is that many system libraries are still not
``thread-safe''. This means that, for example, on some systems it is
not safe to call system routines to perform I/O when running in
multi-threaded mode (except, perhaps, in a limited set of
circumstances). Another reason is that support for multi-threaded
programming models varies between systems.

Multi-process execution is more ubiquitous. In order to run code in a
multi-process configuration a decomposition specification (see
section \ref{sec:specifying_a_decomposition}) is given, in which at
least one of the parameters {\em nPx} or {\em nPy} will be greater
than one. Then, as for multi-threaded operation, appropriate compile
time and run time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable MPI libraries are available for controlling
the start-up of multiple processes. The MPI libraries are not
required, although they are usually used, for performance critical
communication. However, in order to simplify the task of controlling
and coordinating the start up of a large number (hundreds and
possibly even thousands) of copies of the same program, MPI is used.
The calls to the MPI multi-process startup routines must be activated
at compile time. Currently MPI libraries are invoked by specifying
the appropriate options file with the {\tt -of} flag when running the
{\em genmake2} script, which generates the Makefile for compiling and
linking MITgcm. (Previously this was done by setting the {\em
ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
CPP\_EEOPTIONS.h} file.) More detailed information about the use of
{\em genmake2} for specifying local compiler flags is located in
section \ref{sec:genmake}.\\

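As an illustrative build sequence only (here {\tt ROOTDIR} stands for
the path to the MITgcm top-level directory, and the optfile named is
just an example; substitute one appropriate to your platform from
{\em tools/build\_options}), an MPI-enabled executable might be
generated with:
\begin{verbatim}
 ${ROOTDIR}/tools/genmake2 -mpi \
    -of ${ROOTDIR}/tools/build_options/linux_amd64_gfortran
 make depend
 make
\end{verbatim}
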

\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
\paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution. For the
open-source MPICH system, the MITgcm program can be started using a
command such as
\begin{verbatim}
 mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of
processes that will be created. The numeric value {\em 64} must be
equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The parameter {\em mf} specifies
that a text file called ``mf'' will be read to get a list of
processor names on which the sixty-four processes will execute. The
syntax of this file is specified by the MPI distribution.
\\

\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\


\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting of
a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with
the multi-threaded compiler on a machine will explain how to set the
required environment variables.

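For example, on a C-shell based system the variable might be set with
(the variable name and the required value are system dependent; check
the local compiler documentation):
\begin{verbatim}
 setenv PARALLEL 2
\end{verbatim}
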
\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate the
number of threads to be used in the x and y directions. The variables
{\em nTx} and {\em nTy} in this file are used to specify this
information. The product of {\em nTx} and {\em nTy} must be equal to
the number of threads spawned, i.e.\ the setting of the environment
variable PARALLEL. The value of {\em nTx} must subdivide the number of
sub-domains in x ({\em nSx}) exactly, and the value of {\em nTy} must
subdivide the number of sub-domains in y ({\em nSy}) exactly.
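For example, to run with four threads arranged two in x and two in y
(which requires both {\em nSx} and {\em nSy} to be multiples of two),
the {\em eedata} namelist might contain the fragment below; the
remaining entries of the file are omitted here.
\begin{verbatim}
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}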

The multiprocess startup of the MITgcm executable {\em mitgcmuv} is
controlled by the routines {\em EEBOOT\_MINIMAL()} and
1072     {\em INI\_PROCS()}. The first routine performs basic steps required to
1073     make sure each process is started and has a textual output stream
1074     associated with it. By default two output files are opened for each
1075     process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}. The {\bf
1076     NNNNN} part of the name is filled in with the process number so that
1077     process number 0 will create output files {\bf STDOUT.0000} and {\bf
1078     STDERR.0000}, process number 1 will create output files {\bf
1079     STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
1080     reporting status and configuration information and for reporting error
1081     conditions on a process by process basis. The {\em EEBOOT\_MINIMAL()}
1082     procedure also sets the variables {\em myProcId} and {\em
MPI\_COMM\_MODEL}. These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()} to
1085     allocate tiles to processes.
1086    
1087     Allocation of processes to tiles is controlled by the routine {\em
1088     INI\_PROCS()}. For each process this routine sets the variables {\em
1089     myXGlobalLo} and {\em myYGlobalLo}. These variables specify, in
1090     index space, the coordinates of the southernmost and westernmost
1091     corner of the southernmost and westernmost tile owned by this process.
1092     The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
1093     also set in this routine. These are used to identify processes holding
1094     tiles to the west, east, south and north of a given process. These
1095     values are stored in global storage in the header file {\em
1096     EESUPPORT.h} for use by communication routines. The above does not
hold when the exch2 package is used. The exch2 package sets its own
1098     parameters to specify the global indices of tiles and their
1099     relationships to each other. See the documentation on the exch2
1100     package (\ref{sec:exch2}) for details.
1101 cnh 1.1 \\
1102    
1103     \fbox{
1104     \begin{minipage}{4.75in}
1105     File: {\em eesupp/src/eeboot\_minimal.F}\\
1106     File: {\em eesupp/src/ini\_procs.F} \\
1107     File: {\em eesupp/inc/EESUPPORT.h} \\
1108     Parameter: {\em myProcId} \\
1109     Parameter: {\em MPI\_COMM\_MODEL} \\
1110     Parameter: {\em myXGlobalLo} \\
1111     Parameter: {\em myYGlobalLo} \\
1112     Parameter: {\em pidW } \\
1113     Parameter: {\em pidE } \\
1114     Parameter: {\em pidS } \\
1115     Parameter: {\em pidN }
1116     \end{minipage}
1117     } \\
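As an illustration only, consider a hypothetical configuration with
{\em nPx}=2, {\em nPy}=2, one tile per process ({\em nSx}={\em nSy}=1)
and tiles of size {\em sNx}=30, {\em sNy}=20. Assuming processes are
assigned to the process grid in x-major order, {\em INI\_PROCS()}
would then set
\begin{verbatim}
 process 0 : myXGlobalLo =  1 , myYGlobalLo =  1
 process 1 : myXGlobalLo = 31 , myYGlobalLo =  1
 process 2 : myXGlobalLo =  1 , myYGlobalLo = 21
 process 3 : myXGlobalLo = 31 , myYGlobalLo = 21
\end{verbatim}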
1118    
1119    
1120     \subsection{Controlling communication}
1121 jmc 1.26 \label{sec:controlling_communication}
1122 cnh 1.1 The WRAPPER maintains internal information that is used for communication
1123     operations and that can be customized for different platforms. This section
1124     describes the information that is held and used.
1125 adcroft 1.2
1126 cnh 1.1 \begin{enumerate}
1127 edhill 1.21 \item {\bf Tile-tile connectivity information}
For each tile the WRAPPER records the tile numbers of the neighboring
tiles to the north, south, east and west of that tile. Each tile number is unique
1130     over all tiles in a configuration. Except when using the cubed
1131     sphere and the exch2 package, the number is held in the variables
{\em tileNo} (this holds the tile's own number), {\em tileNoN}, {\em
1133     tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
1134     stored with each tile that specifies the type of communication that
1135     is used between tiles. This information is held in the variables
1136     {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
1137     {\em tileCommModeW}. This latter set of variables can take one of
the following values: {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
1139     COMM\_PUT} and {\em COMM\_GET}. A value of {\em COMM\_NONE} is
1140     used to indicate that a tile has no neighbor to communicate with on
1141     a particular face. A value of {\em COMM\_MSG} is used to indicate
1142     that some form of distributed memory communication is required to
1143     communicate between these tile faces (see section
1144 jmc 1.26 \ref{sec:distributed_memory_communication}). A value of {\em
1145 edhill 1.21 COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
1146     memory communication (see section
1147 jmc 1.26 \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value
1148 edhill 1.21 indicates that a CPU should communicate by writing to data
1149     structures owned by another CPU. A {\em COMM\_GET} value indicates
1150     that a CPU should communicate by reading from data structures owned
1151     by another CPU. These flags affect the behavior of the WRAPPER
1152     exchange primitive (see figure \ref{fig:communication_primitives}).
1153     The routine {\em ini\_communication\_patterns()} is responsible for
1154     setting the communication mode values for each tile.
1155    
1156     When using the cubed sphere configuration with the exch2 package,
1157     the relationships between tiles and their communication methods are
1158     set by the exch2 package and stored in different variables. See the
exch2 package documentation (section \ref{sec:exch2}) for details.
1160 cnh 1.1
1161     \fbox{
1162     \begin{minipage}{4.75in}
1163     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1164     File: {\em eesupp/inc/EESUPPORT.h} \\
1165     Parameter: {\em tileNo} \\
1166     Parameter: {\em tileNoE} \\
1167     Parameter: {\em tileNoW} \\
1168     Parameter: {\em tileNoN} \\
1169     Parameter: {\em tileNoS} \\
1170     Parameter: {\em tileCommModeE} \\
1171     Parameter: {\em tileCommModeW} \\
1172     Parameter: {\em tileCommModeN} \\
1173     Parameter: {\em tileCommModeS} \\
1174     \end{minipage}
1175     } \\
1176    
1177     \item {\bf MP directives}
1178 edhill 1.21 The WRAPPER transfers control to numerical application code through
1179     the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
that allows it to be invoked by several threads. Support for
1181     this is based on either multi-processing (MP) compiler directives or
specific calls to multi-threading libraries (\textit{e.g.} POSIX
1183     threads). Most commercially available Fortran compilers support the
1184     generation of code to spawn multiple threads through some form of
1185     compiler directives. Compiler directives are generally more
1186     convenient than writing code to explicitly spawning threads. And,
1187     on some systems, compiler directives may be the only method
1188     available. The WRAPPER is distributed with template MP directives
1189     for a number of systems.
1190    
1191     These directives are inserted into the code just before and after
1192     the transfer of control to numerical algorithm code through the
1193     routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
1194     an example of the code that performs this process for a Silicon
1195     Graphics system. This code is extracted from the files {\em main.F}
1196     and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
1197     specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
1198     will be created. The value of {\em nThreads} is set in the routine
{\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
1200     product of the parameters {\em nTx} and {\em nTy} that are read from
1201     the file {\em eedata}. If the value of {\em nThreads} is
1202     inconsistent with the number of threads requested from the operating
1203     system (for example by using an environment variable as described in
1204 jmc 1.26 section \ref{sec:multi_threaded_execution}) then usually an error
1205 edhill 1.21 will be reported by the routine {\em CHECK\_THREADS}.
1206 cnh 1.1
1207     \fbox{
1208     \begin{minipage}{4.75in}
1209     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1210     File: {\em eesupp/src/check\_threads.F} \\
1211     File: {\em eesupp/src/main.F} \\
1212     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1213     File: {\em eedata } \\
1214     Parameter: {\em nThreads} \\
1215     Parameter: {\em nTx} \\
1216     Parameter: {\em nTy} \\
1217     \end{minipage}
1218     }
1219    
1220     \item {\bf memsync flags}
1221 jmc 1.26 As discussed in section \ref{sec:memory_consistency}, a low-level
system function may be needed to force memory consistency on some
1223     shared memory systems. The routine {\em MEMSYNC()} is used for this
1224     purpose. This routine should not need modifying and the information
1225     below is only provided for completeness. A logical parameter {\em
1226     exchNeedsMemSync} set in the routine {\em
1227     INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
1228     MEMSYNC()} primitive is called. In general this routine is only
1229     used for multi-threaded execution. The code that goes into the {\em
1230     MEMSYNC()} routine is specific to the compiler and processor used.
1231     In some cases, it must be written using a short code snippet of
1232     assembly language. For an Ultra Sparc system the following code
1233     snippet is used
1234 cnh 1.1 \begin{verbatim}
1235     asm("membar #LoadStore|#StoreStore");
1236     \end{verbatim}
1237 cnh 1.4 for an Alpha based system the equivalent code reads
1238 cnh 1.1 \begin{verbatim}
1239     asm("mb");
1240     \end{verbatim}
1241     while on an x86 system the following code is required
1242     \begin{verbatim}
1243     asm("lock; addl $0,0(%%esp)": : :"memory")
1244     \end{verbatim}
1245    
1246     \item {\bf Cache line size}
1247 jmc 1.26 As discussed in section \ref{sec:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with
1249     excessive coherence traffic on an SMP system. To do this the shared
1250     memory data structures used by the {\em GLOBAL\_SUM}, {\em
1251     GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
1252     that control the padding are set in the header file {\em
1253     EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
lShare1}, {\em lShare4} and {\em lShare8}. The default values
should not normally need changing. A sketch of how this padding
enters a shared data structure is given just after this list.
1256    
1257 cnh 1.1 \item {\bf \_BARRIER}
1258 edhill 1.21 This is a CPP macro that is expanded to a call to a routine which
1259     synchronizes all the logical processors running under the WRAPPER.
1260     Using a macro here preserves flexibility to insert a specialized
1261     call in-line into application code. By default this resolves to
1262     calling the procedure {\em BARRIER()}. The default setting for the
1263     \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1264 cnh 1.1
1265     \item {\bf \_GSUM}
1266 edhill 1.21 This is a CPP macro that is expanded to a call to a routine which
1267     sums up a floating point number over all the logical processors
1268     running under the WRAPPER. Using a macro here provides extra
1269     flexibility to insert a specialized call in-line into application
1270     code. By default this resolves to calling the procedure {\em
1271     GLOBAL\_SUM\_R8()} ( for 64-bit floating point operands) or {\em
1272     GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
1273     default setting for the \_GSUM macro is given in the file {\em
1274     CPP\_EEMACROS.h}. The \_GSUM macro is a performance critical
1275     operation, especially for large processor count, small tile size
1276     configurations. The custom communication example discussed in
1277 jmc 1.26 section \ref{sec:jam_example} shows how the macro is used to invoke
1278 edhill 1.21 a custom global sum routine for a specific set of hardware.
1279 cnh 1.1
1280     \item {\bf \_EXCH}
1281 edhill 1.21 The \_EXCH CPP macro is used to update tile overlap regions. It is
1282     qualified by a suffix indicating whether overlap updates are for
1283     two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1284     physical fields and whether fields are 32-bit floating point (
1285     \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point (
1286     \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined in
1287     the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
1288     operation plays a crucial role in scaling to small tile, large
1289     logical and physical processor count configurations. The example in
1290 jmc 1.26 section \ref{sec:jam_example} discusses defining an optimized and
specialized form of the \_EXCH operation.
1292    
1293     The \_EXCH operation is also central to supporting grids such as the
1294     cube-sphere grid. In this class of grid a rotation may be required
1295     between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to be
1297     embedded within a custom form of the \_EXCH primitive. In these
1298     cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
1299     package documentation \ref{sec:exch2}.
1300 cnh 1.1
1301     \item {\bf Reverse Mode}
1302 edhill 1.21 The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms. These reverse
1304     mode forms can be found in the source code directory {\em
1305     pkg/autodiff}. For the global sum primitive the reverse mode form
1306     calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
The reverse mode forms of the exchange primitives are found in
1308     routines prefixed {\em ADEXCH}. The exchange routines make calls to
1309     the same low-level communication primitives as the forward mode
1310     operations. However, the routine argument {\em simulationMode} is
1311     set to the value {\em REVERSE\_SIMULATION}. This signifies to the
1312     low-level routines that the adjoint forms of the appropriate
1313     communication operation should be performed.
1314 afe 1.16
1315 cnh 1.1 \item {\bf MAX\_NO\_THREADS}
1316 edhill 1.21 The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
1317     number of OS threads that a code will use. This value defaults to
1318     thirty-two and is set in the file {\em EEPARAMS.h}. For single
1319     threaded execution it can be reduced to one if required. The value
1320     is largely private to the WRAPPER and application code will not
1321     normally reference the value, except in the following scenario.
1322    
1323     For certain physical parametrization schemes it is necessary to have
1324     a substantial number of work arrays. Where these arrays are
allocated in global storage (for example COMMON blocks), multi-threaded
1326     execution will require multiple instances of the COMMON block data.
1327     This can be achieved using a Fortran 90 module construct. However,
1328     if this mechanism is unavailable then the work arrays can be extended
1329     with dimensions using the tile dimensioning scheme of {\em nSx} and
1330     {\em nSy} (as described in section
1331 jmc 1.26 \ref{sec:specifying_a_decomposition}). However, if the
1332 edhill 1.21 configuration being specified involves many more tiles than OS
1333     threads then it can save memory resources to reduce the variable
1334     {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
1335     that will be used and to declare the physical parameterization work
1336     arrays with a single {\em MAX\_NO\_THREADS} extra dimension. An
1337     example of this is given in the verification experiment {\em
1338     aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
1339     altered to
1340 cnh 1.1 \begin{verbatim}
1341     INTEGER MAX_NO_THREADS
1342     PARAMETER ( MAX_NO_THREADS = 6 )
1343     \end{verbatim}
1344 edhill 1.21 and several work arrays for storing intermediate calculations are
created with declarations of the form:
1346 cnh 1.1 \begin{verbatim}
1347     common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1348     \end{verbatim}
1349 edhill 1.21 This declaration scheme is not used widely, because most global data
1350     is used for permanent not temporary storage of state information.
1351     In the case of permanent state information this approach cannot be
1352     used because there has to be enough storage allocated for all tiles.
1353     However, the technique can sometimes be a useful scheme for reducing
1354     memory requirements in complex physical parameterizations.
1355 adcroft 1.2 \end{enumerate}
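As an illustration of the cache-line padding mentioned in the list
above, a shared accumulator used by a global sum operation might be
declared along the following lines. This is only a sketch (the common
block name is invented for the example); the actual declarations used
by the WRAPPER are held in the {\em eesupp} headers and sources.
\begin{verbatim}
C     Sketch only: one padded slot per thread so that threads writing
C     their partial sums do not contend for the same cache line.
      Real*8 phiGSR8( lShare8, MAX_NO_THREADS )
      COMMON / SKETCH_GSUM_R8 / phiGSR8
\end{verbatim}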
1356    
1357     \begin{figure}
1358     \begin{verbatim}
1359     C--
1360     C-- Parallel directives for MIPS Pro Fortran compiler
1361     C--
1362     C Parallel compiler directives for SGI with IRIX
1363     C$PAR PARALLEL DO
1364     C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1365     C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1366     C
1367     DO I=1,nThreads
1368     myThid = I
1369    
1370     C-- Invoke nThreads instances of the numerical model
1371     CALL THE_MODEL_MAIN(myThid)
1372    
1373     ENDDO
1374     \end{verbatim}
1375 edhill 1.21 \caption{Prior to transferring control to the procedure {\em
1376     THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
1377     multiple threads. } \label{fig:mp_directives}
1378 adcroft 1.2 \end{figure}
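The directives shown in figure \ref{fig:mp_directives} are specific to
the MIPS Pro Fortran compiler. On systems whose compilers support
OpenMP the same thread-spawning loop can be expressed with OpenMP
directives; the sketch below is illustrative only, and the directive
sets actually distributed with the WRAPPER are collected in {\em
MAIN\_PDIRECTIVES1.h}.
\begin{verbatim}
C$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(I,myThid)
      DO I=1,nThreads
        myThid = I
C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)
      ENDDO
\end{verbatim}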
1379 cnh 1.1
1380    
1381     \subsubsection{Specializing the Communication Code}
1382    
The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles are powerful tools.
Here we show how they can be used to improve application performance
and to adapt to new gridding approaches.
1387 cnh 1.1
1388     \subsubsection{JAM example}
1389 jmc 1.26 \label{sec:jam_example}
1390 edhill 1.21 On some platforms a big performance boost can be obtained by binding
1391     the communication routines {\em \_EXCH} and {\em \_GSUM} to
1392     specialized native libraries (for example, the shmem library on CRAY
1393     T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
1394     illustration of a specialized communication configuration that
1395     substitutes for standard, portable forms of {\em \_EXCH} and {\em
1396     \_GSUM}. It affects three source files {\em eeboot.F}, {\em
1397     CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined is has
1398     the following effects.
1399 cnh 1.1 \begin{itemize}
1400 edhill 1.21 \item An extra phase is included at boot time to initialize the custom
1401     communications library ( see {\em ini\_jam.F}).
1402 cnh 1.1 \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
1403 edhill 1.21 with calls to custom routines (see {\em gsum\_jam.F} and {\em
exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
1406 edhill 1.21 for overlap regions of width one) is substituted into the elliptic
1407     solver routine {\em cg2d.F}.
1408 cnh 1.1 \end{itemize}
1409     Developing specialized code for other libraries follows a similar
1410     pattern.
1411    
1412     \subsubsection{Cube sphere communication}
1413 jmc 1.26 \label{sec:cube_sphere_communication}
1414 edhill 1.21 Actual {\em \_EXCH} routine code is generated automatically from a
1415     series of template files, for example {\em exch\_rx.template}. This
1416     is done to allow a large number of variations on the exchange process
1417     to be maintained. One set of variations supports the cube sphere grid.
1418     Support for a cube sphere grid in MITgcm is based on having each face
1419     of the cube as a separate tile or tiles. The exchange routines are
1420     then able to absorb much of the detailed rotation and reorientation
1421     required when moving around the cube grid. The set of {\em \_EXCH}
1422     routines that contain the word cube in their name perform these
1423     transformations. They are invoked when the run-time logical parameter
1424 cnh 1.1 {\em useCubedSphereExchange} is set true. To facilitate the
1425 edhill 1.21 transformations on a staggered C-grid, exchange operations are defined
1426     separately for both vector and scalar quantities and for grid-centered
1427     and for grid-face and grid-corner quantities. Three sets of exchange
1428     routines are defined. Routines with names of the form {\em exch\_rx}
1429     are used to exchange cell centered scalar quantities. Routines with
1430     names of the form {\em exch\_uv\_rx} are used to exchange vector
1431     quantities located at the C-grid velocity points. The vector
1432     quantities exchanged by the {\em exch\_uv\_rx} routines can either be
1433     signed (for example velocity components) or un-signed (for example
1434     grid-cell separations). Routines with names of the form {\em
1435     exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
1436     point locations.
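In configurations that use the cubed sphere (for example the
verification experiment {\em aim.5l\_cs}) this parameter is typically
enabled in the execution environment parameter file {\em eedata} with
a line of the form shown below; check the {\em eedata} file of the
relevant verification experiment for the exact setting used.
\begin{verbatim}
 useCubedSphereExchange=.TRUE.,
\end{verbatim}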
1437 cnh 1.1
1438    
1439    
1440    
1441     \section{MITgcm execution under WRAPPER}
1442 edhill 1.19 \begin{rawhtml}
1443     <!-- CMIREDIR:mitgcm_wrapper: -->
1444     \end{rawhtml}
1445 cnh 1.1
1446     Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sec:calling_sequence}.
1449 cnh 1.1
1450     \subsection{Annotated call tree for MITgcm and WRAPPER}
1451 jmc 1.26 \label{sec:calling_sequence}
1452 cnh 1.1
The annotated call tree for the WRAPPER layer is as follows:
1454    
1455 heimbach 1.8 {\footnotesize
1456 cnh 1.1 \begin{verbatim}
1457    
1458     MAIN
1459     |
1460     |--EEBOOT :: WRAPPER initialization
1461     | |
1462     | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1463     | | allow basic I/O.
1464     | |-- EEINTRO_MSG :: Write startup greeting.
1465     | |
1466     | |-- EESET_PARMS :: Set WRAPPER parameters
1467     | |
1468     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1469     | |
1470     | |-- INI_PROCS :: Associate processes with grid regions.
1471     | |
1472     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1473     | |
1474     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1475     | :: communication data structures
1476     |
1477     |
1478     |--CHECK_THREADS :: Validate multiple thread start up.
1479     |
1480     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1481    
1482     \end{verbatim}
1483 heimbach 1.8 }
1484 cnh 1.1
The call tree for the core equations plus packages is as follows:
1486    
1487 heimbach 1.8 {\footnotesize
1488 cnh 1.1 \begin{verbatim}
1489     C
1490     C Invocation from WRAPPER level...
1491     C :
1492     C :
1493     C |
1494     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1495     C | :: Called from WRAPPER level numerical
1496 cnh 1.4 C | :: code invocation routine. On entry
1497 cnh 1.1 C | :: to THE_MODEL_MAIN separate thread and
1498     C | :: separate processes will have been established.
1499     C | :: Each thread and process will have a unique ID
1500     C | :: but as yet it will not be associated with a
1501     C | :: specific region in decomposed discrete space.
1502     C |
1503     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1504     C | | :: grid, solver matrices etc..
1505     C | |
1506     C | |-INI_PARMS :: Routine to set kernel model parameters.
1507     C | | :: By default kernel parameters are read from file
1508     C | | :: "data" in directory in which code executes.
1509     C | |
1510 cnh 1.4 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1511 cnh 1.1 C | |
1512 cnh 1.4 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1513 cnh 1.1 C | | | :: Grid arrays are held and described in GRID.h.
1514     C | | |
1515 cnh 1.4 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1516 cnh 1.1 C | | |
1517 cnh 1.4 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1518 cnh 1.1 C | | | :: (calculate grid from kernel parameters).
1519     C | | |
1520     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1521 cnh 1.4 C | | | :: initialization (calculate grid from
1522 cnh 1.1 C | | | :: kernel parameters).
1523     C | | |
1524     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1525 cnh 1.4 C | | :: grid initializations. ( input from raw
1526 cnh 1.1 C | | :: grid files, LONC.bin, DXF.bin etc... )
1527     C | |
1528     C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orgography.
1529     C | |
1530     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1531     C | | :: land masking for solid-fluid boundaries.
1532     C | |
1533     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1534     C | |
1535     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1536 cnh 1.4 C | | :: sphere options are coded.
1537 cnh 1.1 C | |
1538     C | |-PACKAGES_BOOT :: Start up the optional package environment.
1539     C | | :: Runtime selection of active packages.
1540     C | |
1541     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1542     C | | |
1543     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1544     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1545     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1546     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1547     C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1548     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1549     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1550     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1551     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1552     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1553 edhill 1.23 C | | |-PTRACERS_READPARMS :: multiple tracer package, see pkg/ptracers
1554     C | | |-GCHEM_READPARMS :: tracer interface package, see pkg/gchem
1555 cnh 1.1 C | |
1556     C | |-PACKAGES_CHECK
1557     C | | |
1558     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1559 edhill 1.23 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1560 cnh 1.1 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1561     C | |
1562     C | |-PACKAGES_INIT_FIXED
1563     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1564     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1565 edhill 1.23 C | | |-GCHEM_INIT_FIXED :: tracer interface package, see pkg/gchem
1566 cnh 1.1 C | |
1567     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1568     C | |
1569 edhill 1.23 C | |-INI_CG2D :: 2d con. grad solver initialization.
1570 cnh 1.1 C | |
1571 edhill 1.23 C | |-INI_CG3D :: 3d con. grad solver initialization.
1572 cnh 1.1 C | |
1573     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1574     C | :: Includes annotated table of kernel
1575     C | :: parameter settings.
1576     C |
1577     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1578     C |
1579     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1580 cnh 1.4 C ! :: Automatically generated by TAMC/TAF.
1581 cnh 1.1 C |
1582     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1583     C |
1584     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1585     C |
1586     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1587     C | |
1588     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1589     C | | | :: variables
1590     C | | |
1591     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1592     C | | |
1593     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1594 cnh 1.4 C | | | :: sphere options are coded.
1595 cnh 1.1 C | | |
1596 edhill 1.23 C | | |-INI_CG2D :: 2d con. grad solver initialization.
1597     C | | |-INI_CG3D :: 3d con. grad solver initialization.
1598     C | | |-INI_MIXING :: Initialize diapycnal diffusivity.
1599     C | | |-INI_DYNVARS :: Initialize to zero all DYNVARS.h arrays (dynamical
1600 cnh 1.1 C | | | :: fields).
1601     C | | |
1602 cnh 1.4 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1603 cnh 1.1 C | | | |-INI_VEL :: Initialize 3D flow field.
1604     C | | | |-INI_THETA :: Set model initial temperature field.
1605     C | | | |-INI_SALT :: Set model initial salinity field.
1606     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1607 edhill 1.23 C | | | |-INI_PRESSURE :: Compute model initial hydrostatic pressure
1608     C | | | |-READ_CHECKPOINT :: Read the checkpoint
1609 cnh 1.1 C | | |
1610     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1611     C | | | | :: Here applied to move restart conditions
1612     C | | | | :: (saved in mid timestep) to correct level in
1613     C | | | | :: time (only used for pre-c35).
1614     C | | | |
1615     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1616     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1617     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1618     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1619     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1620     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1621     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1622     C | | | | |-FIND_RHO :: Find adjacent densities.
1623     C | | | | |-CONVECT :: Mix static instability.
1624     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1625     C | | | |
1626     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1627     C | | |
1628     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1629     C | | | | :: Extra time history interactions.
1630     C | | | |
1631     C | | | |-FIND_RHO :: Find adjacent densities.
1632     C | | | |-CONVECT :: Mix static instability.
1633     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1634     C | | |
1635 edhill 1.23 C | | |-PACKAGES_INIT_VARIABLES :: Does initialization of time evolving
1636 cnh 1.1 C | | | | :: package data.
1637     C | | | |
1638     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1639     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1640     C | | | |-KPP_OPEN_DIAGS
1641     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1642 edhill 1.23 C | | | |-PTRACERS_INIT :: multi. tracer package,(see pkg/ptracers)
1643     C | | | |-GCHEM_INIT :: tracer interface pkg (see pkh/gchem)
1644 cnh 1.1 C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1645     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1646     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1647     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1648     C | | | |-INI_FORCING :: Set model initial forcing fields.
1649     C | | | | :: Either set in-line or from file as shown.
1650     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1651     C | | | |-READ_FLD_XY_RS(meridWindFile)
1652     C | | | |-READ_FLD_XY_RS(surfQFile)
1653     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1654     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1655     C | | | |-READ_FLD_XY_RS(saltClimFile)
1656     C | | | |-READ_FLD_XY_RS(surfQswFile)
1657     C | | |
1658     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1659     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1660     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1661     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1662     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1663     C | |
1664     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1665     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1666     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1667     C | |
1668     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1669     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1670     C====|>|
1671     C====|>| ****************************
1672     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1673     C====|>| ****************************
1674     C====|>|
1675     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1676     C/\ | | |
1677     C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1678     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1679     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1680     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1681     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1682 cnh 1.4 C/\ | | | | :: Simple interpolation between end-points
1683 cnh 1.1 C/\ | | | | :: for forcing datasets.
1684     C/\ | | | |
1685     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1686 edhill 1.23 C/\ | | |-SEAICE_MODEL :: Compute sea-ice terms. ( pkg/seaice )
1687     C/\ | | |-FREEZE :: Limit surface temperature.
1688     C/\ | | |-GCHEM_FIELD_LOAD :: load tracer forcing fields (pkg/gchem)
1689 cnh 1.1 C/\ | | |
1690     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1691     C/\ | | | |
1692     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1693     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1694 edhill 1.23 C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-RhoConst] of a slice
1695 cnh 1.1 C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1696     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1697     C/\ | | | |
1698     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1699     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1700 edhill 1.23 C/\ | | | | | :: forcing terms.
1701     C/\ | | | | |-PTRACERS_FORCING_SURF :: Tracer package ( see pkg/ptracers ).
1702 cnh 1.1 C/\ | | | |
1703     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1704     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1705     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1706     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1707     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1708     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1709     C/\ | | | | advection case) (see pkg/gad).
1710     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1711     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1712     C/\ | | | | |
1713     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1714     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1715     C/\ | | | |
1716     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1717     C/\ | | | | |
1718     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1719 edhill 1.23 C/\ | | | | | | :: ( see pkg/gad )
1720     C/\ | | | | | |-KPP_TRANSPORT_T :: KPP non-local transport ( see pkg/kpp ).
1721     C/\ | | | | |
1722 cnh 1.1 C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1723     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1724     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1725     C/\ | | | |
1726     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1727     C/\ | | | |
1728     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1729     C/\ | | | | |
1730     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1731 edhill 1.23 C/\ | | | | | | :: ( see pkg/gad )
1732     C/\ | | | | | |-KPP_TRANSPORT_S :: KPP non-local transport ( see pkg/kpp ).
1733     C/\ | | | | |
1734 cnh 1.1 C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1735     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1736     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1737     C/\ | | | |
1738     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1739     C/\ | | | |
1740 edhill 1.23 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1741     C/\ | | | |
1742     C/\ | | | |-PTRACERS_INTEGRATE :: Integrate other tracer(s) (see pkg/ptracers).
1743 cnh 1.1 C/\ | | | | |
1744     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1745 edhill 1.23 C/\ | | | | | | :: ( see pkg/gad )
1746     C/\ | | | | | |-KPP_TRANSPORT_PTR:: KPP non-local transport ( see pkg/kpp ).
1747     C/\ | | | | |
1748     C/\ | | | | |-PTRACERS_FORCING :: Problem specific forcing for tracer.
1749     C/\ | | | | |-GCHEM_FORCING_INT :: tracer forcing for gchem pkg (if all
1750     C/\ | | | | | tendency terms calculated together)
1751 cnh 1.1 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1752     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1753 edhill 1.23 C/\ | | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1754 cnh 1.1 C/\ | | | |
1755     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1756     C/\ | | | |
1757     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1758     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1759     C/\ | | | |
1760     C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1761     C/\ | | | |-EXCH :: Update overlaps
1762     C/\ | | |
1763     C/\ | | |-DYNAMICS :: Momentum equations driver.
1764     C/\ | | | |
1765     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1766     C/\ | | | | Potential anomaly.
1767     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1768     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1769     C/\ | | | |
1770     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1771     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1772     C/\ | | | | pkg/mom_fluxform ).
1773     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1774     C/\ | | | | pkg/mom_vecinv ).
1775     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1776     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1777     C/\ | | | |
1778     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1779     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1780     C/\ | | | |
1781     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1782     C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1783     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1784     C/\ | | |
1785     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1786     C/\ | | |
1787     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1788     C/\ | | |
1789     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1790     C/\ | | |
1791     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1792     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1793     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1794     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1795     C/\ | | |
1796     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1797     C/\ | | | |
1798     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1799     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1800     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1801     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1802     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1803     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1804     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1805     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1806     C/\ | | | | |-CONVECT :: Mix static instability.
1807     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1808     C/\ | | | |
1809     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1810     C/\ | | |
1811     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1812     C/\ | | | |-EXCH
1813     C/\ | | |
1814 edhill 1.23 C/\ | | |-GCHEM_FORCING_SEP :: tracer forcing for gchem pkg (if
1815     C/\ | | | tracer dependent tendencies calculated
1816     C/\ | | | separately)
1817     C/\ | | |
1818 cnh 1.1 C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1819     C/\ | | |
1820     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1821     C/\ | | |
1822     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1823     C/\ | | | |-WRITE_STATE :: Core state I/O
1824     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1825     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1826     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1827     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1828 edhill 1.23 C/\ | | | |-SBO_CALC :: SBO diags. see pkg/sbo
1829     C/\ | | | |-SBO_DIAGS :: SBO diags. see pkg/sbo
1830     C/\ | | | |-SEAICE_DO_DIAGS :: SEAICE diags. see pkg/seaice
1831     C/\ | | | |-GCHEM_DIAGS :: gchem diags. see pkg/gchem
1832 cnh 1.1 C/\ | | |
1833     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1834     C/\ | |
1835     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1836     C<===|=|
1837     C<===|=| **************************
1838     C<===|=| END MAIN TIMESTEPPING LOOP
1839     C<===|=| **************************
1840     C<===|=|
1841     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1842     C |
1843     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1844     C |
1845     C |-TIMER_PRINTALL :: Computational timing summary
1846     C |
1847     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1848     C :: events.
1849 edhill 1.23 C
1850 cnh 1.1 \end{verbatim}
1851 heimbach 1.8 }
1852 cnh 1.1
1853     \subsection{Measuring and Characterizing Performance}
1854    
1855     TO BE DONE (CNH)
1856    
1857     \subsection{Estimating Resource Requirements}
1858    
1859     TO BE DONE (CNH)
1860    
1861     \subsubsection{Atlantic 1/6 degree example}
1862     \subsubsection{Dry Run testing}
1863     \subsubsection{Adjoint Resource Requirements}
1864     \subsubsection{State Estimation Environment Resources}
1865    
