% $Header: /u/gcmpack/manual/part4/sarch.tex,v 1.23 2006/04/04 19:14:14 edhill Exp $
2 cnh 1.1
3 edhill 1.22 This chapter focuses on describing the {\bf WRAPPER} environment
4     within which both the core numerics and the pluggable packages
5     operate. The description presented here is intended to be a detailed
6     exposition and contains significant background material, as well as
7     advanced details on working with the WRAPPER. The tutorial sections
8     of this manual (see sections \ref{sect:tutorials} and
9     \ref{sect:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
11     both sequentially and in parallel. For many projects simply starting
12     from an example code and adapting it to suit a particular situation
13     will be all that is required. The first part of this chapter
14     discusses the MITgcm architecture at an abstract level. In the second
part of the chapter we describe practical details of the MITgcm
16     implementation and of current tools and operating system features that
17     are employed.
18 cnh 1.1
19     \section{Overall architectural goals}
20 edhill 1.20 \begin{rawhtml}
21     <!-- CMIREDIR:overall_architectural_goals: -->
22     \end{rawhtml}
23 cnh 1.1
24     Broadly, the goals of the software architecture employed in MITgcm are
25     three-fold
26    
27     \begin{itemize}
28 edhill 1.22 \item We wish to be able to study a very broad range of interesting
29     and challenging rotating fluids problems.
30     \item We wish the model code to be readily targeted to a wide range of
platforms.
32     \item On any given platform we would like to be able to achieve
33     performance comparable to an implementation developed and
34     specialized specifically for that platform.
35 cnh 1.1 \end{itemize}
36    
37 edhill 1.22 These points are summarized in figure
38     \ref{fig:mitgcm_architecture_goals} which conveys the goals of the
39     MITgcm design. The goals lead to a software architecture which at the
40     high-level can be viewed as consisting of
41 cnh 1.1
42     \begin{enumerate}
43 edhill 1.21 \item A core set of numerical and support code. This is discussed in
44     detail in section \ref{chap:discretization}.
45    
46     \item A scheme for supporting optional ``pluggable'' {\bf packages}
47     (containing for example mixed-layer schemes, biogeochemical schemes,
48     atmospheric physics). These packages are used both to overlay
49     alternate dynamics and to introduce specialized physical content
50     onto the core numerical code. An overview of the {\bf package}
51     scheme is given at the start of part \ref{chap:packagesI}.
52    
53     \item A support framework called {\bf WRAPPER} (Wrappable Application
54     Parallel Programming Environment Resource), within which the core
55     numerics and pluggable packages operate.
56 cnh 1.1 \end{enumerate}
57    
58 edhill 1.21 This chapter focuses on describing the {\bf WRAPPER} environment under
59     which both the core numerics and the pluggable packages function. The
60     description presented here is intended to be a detailed exposition and
61     contains significant background material, as well as advanced details
62     on working with the WRAPPER. The examples section of this manual
63     (part \ref{chap:getting_started}) contains more succinct, step-by-step
64     instructions on running basic numerical experiments both sequentially
65     and in parallel. For many projects simply starting from an example
66     code and adapting it to suit a particular situation will be all that
67     is required.
68 cnh 1.1
69 adcroft 1.2
70 cnh 1.1 \begin{figure}
71     \begin{center}
72 adcroft 1.2 \resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
73 cnh 1.1 \end{center}
74 edhill 1.22 \caption{ The MITgcm architecture is designed to allow simulation of a
75     wide range of physical problems on a wide range of hardware. The
76     computational resource requirements of the applications targeted
77     range from around $10^7$ bytes ($\approx 10$ megabytes) of memory to
78     $10^{11}$ bytes ($\approx 100$ gigabytes). Arithmetic operation
79     counts for the applications of interest range from $10^{9}$ floating
80     point operations to more than $10^{17}$ floating point operations.}
81 adcroft 1.2 \label{fig:mitgcm_architecture_goals}
82 cnh 1.1 \end{figure}
83    
84     \section{WRAPPER}
85 afe 1.17 \begin{rawhtml}
86 afe 1.18 <!-- CMIREDIR:wrapper: -->
87 afe 1.17 \end{rawhtml}
88 cnh 1.1
89 edhill 1.22 A significant element of the software architecture utilized in MITgcm
90     is a software superstructure and substructure collectively called the
91     WRAPPER (Wrappable Application Parallel Programming Environment
92     Resource). All numerical and support code in MITgcm is written to
93     ``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
94     within the WRAPPER means that coding has to follow certain, relatively
95     straightforward, rules and conventions (these are discussed further in
96 edhill 1.21 section \ref{sect:specifying_a_decomposition}).
97 cnh 1.1
98 edhill 1.22 The approach taken by the WRAPPER is illustrated in figure
99     \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to
100     insulate code that fits within it from architectural differences
101     between hardware platforms and operating systems. This allows
numerical code to be easily retargeted.
103 adcroft 1.2
104    
105 cnh 1.1 \begin{figure}
106     \begin{center}
107 adcroft 1.2 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
108 cnh 1.1 \end{center}
109 adcroft 1.2 \caption{
110 edhill 1.10 Numerical code is written to fit within a software support
111 cnh 1.1 infrastructure called WRAPPER. The WRAPPER is portable and
112 cnh 1.4 can be specialized for a wide range of specific target hardware and
113 cnh 1.1 programming environments, without impacting numerical code that fits
114     within the WRAPPER. Codes that fit within the WRAPPER can generally be
115     made to run as fast on a particular platform as codes specially
116 adcroft 1.2 optimized for that platform.}
117     \label{fig:fit_in_wrapper}
118 cnh 1.1 \end{figure}
119    
120     \subsection{Target hardware}
121 adcroft 1.6 \label{sect:target_hardware}
122 cnh 1.1
123 edhill 1.21 The WRAPPER is designed to target as broad as possible a range of
124     computer systems. The original development of the WRAPPER took place
125     on a multi-processor, CRAY Y-MP system. On that system, numerical code
126     performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system's proprietary
128     multi-tasking and micro-tasking approach. Later developments have been
129     carried out on uniprocessor and multi-processor Sun systems with both
130     uniform memory access (UMA) and non-uniform memory access (NUMA)
131     designs. Significant work has also been undertaken on x86 cluster
132     systems, Alpha processor based clustered SMP systems, and on
133     cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
134     systems. The MITgcm code, operating within the WRAPPER, is also
135     routinely used on large scale MPP systems (for example, Cray T3E and
136     IBM SP systems). In all cases numerical code, operating within the
137     WRAPPER, performs and scales very competitively with equivalent
138     numerical code that has been modified to contain native optimizations
for a particular system \cite{hoe-hill:99}.
140 cnh 1.1
141     \subsection{Supporting hardware neutrality}
142    
143 edhill 1.21 The different systems listed in section \ref{sect:target_hardware} can
144     be categorized in many different ways. For example, one common
145     distinction is between shared-memory parallel systems (SMP and PVP)
146     and distributed memory parallel systems (for example x86 clusters and
147     large MPP systems). This is one example of a difference between
148     compute platforms that can impact an application. Another common
149     distinction is between vector processing systems with highly
150     specialized CPUs and memory subsystems and commodity microprocessor
151     based systems. There are numerous other differences, especially in
152     relation to how parallel execution is supported. To capture the
153     essential differences between different platforms the WRAPPER uses a
154     {\it machine model}.
155 cnh 1.1
156     \subsection{WRAPPER machine model}
157    
158 edhill 1.21 Applications using the WRAPPER are not written to target just one
159     particular machine (for example an IBM SP2) or just one particular
160     family or class of machines (for example Parallel Vector Processor
161     Systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
163     can easily be specialized to fit, in a computationally efficient
164     manner, any computer architecture currently available to the
165     scientific computing community.
166 cnh 1.1
167     \subsection{Machine model parallelism}
168 jmc 1.24 \label{sect:domain_decomposition}
169 afe 1.17 \begin{rawhtml}
170 afe 1.18 <!-- CMIREDIR:domain_decomp: -->
171 afe 1.17 \end{rawhtml}
172 cnh 1.1
173 edhill 1.21 Codes operating under the WRAPPER target an abstract machine that is
174     assumed to consist of one or more logical processors that can compute
175     concurrently. Computational work is divided among the logical
176     processors by allocating ``ownership'' to each processor of a certain
177     set (or sets) of calculations. Each set of calculations owned by a
178     particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (this is known as domain decomposition).
181    
182     In a strict sense the logical processors over which work is divided do
183     not need to correspond to physical processors. It is perfectly
184     possible to execute a configuration decomposed for multiple logical
185     processors on a single physical processor. This helps ensure that
186     numerical code that is written to fit within the WRAPPER will
187     parallelize with no additional effort. It is also useful for
188     debugging purposes. Generally, however, the computational domain will
189     be subdivided over multiple logical processors in order to then bind
190     those logical processors to physical processor resources that can
191     compute in parallel.
192 cnh 1.1
193     \subsubsection{Tiles}
194    
Computationally, the data structures (\textit{e.g.} arrays, scalar
196     variables, etc.) that hold the simulated state are associated with
197     each region of physical space and are allocated to a particular
198     logical processor. We refer to these data structures as being {\bf
199     owned} by the processor to which their associated region of physical
200     space has been allocated. Individual regions that are allocated to
201     processors are called {\bf tiles}. A processor can own more than one
202     tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
204     single region of the domain (a single tile). Except for periods of
205     communication and coordination, each processor computes autonomously,
206     working only with data from the tile (or tiles) that the processor
owns. When multiple tiles are allotted to a single processor, each
tile is computed independently of the other tiles, in a sequential
fashion.
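
In the code, this ownership is reflected in how model state is
declared: arrays carry two extra, rightmost tile indices, as in the
illustrative declaration below (the field name is made up; the same
pattern appears in the solver extract of figure \ref{fig:bibj_extract}
later in this chapter).
\begin{verbatim}
C     Illustrative declaration of a tiled two-dimensional field: the
C     first two index ranges span one tile's interior plus overlap
C     points, and the two rightmost indices select the tile.
      REAL*8 myField(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
\end{verbatim}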
210 cnh 1.1
211     \begin{figure}
212     \begin{center}
213 adcroft 1.2 \resizebox{5in}{!}{
214     \includegraphics{part4/domain_decomp.eps}
215 cnh 1.1 }
216     \end{center}
217 edhill 1.21 \caption{ The WRAPPER provides support for one and two dimensional
218     decompositions of grid-point domains. The figure shows a
219     hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
220     hypothetical domain is decomposed in two-dimensions along the
221     $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
222     owned} by different processors. The {\bf owning} processors
223     perform the arithmetic operations associated with a {\bf tile}.
224     Although not illustrated here, a single processor can {\bf own}
225     several {\bf tiles}. Whenever a processor wishes to transfer data
226     between tiles or communicate with other processors it calls a
227     WRAPPER supplied function. } \label{fig:domaindecomp}
228 cnh 1.1 \end{figure}
229    
230    
231     \subsubsection{Tile layout}
232    
233 edhill 1.21 Tiles consist of an interior region and an overlap region. The
234     overlap region of a tile corresponds to the interior region of an
235     adjacent tile. In figure \ref{fig:tiledworld} each tile would own the
236     region within the black square and hold duplicate information for
237     overlap regions extending into the tiles to the north, south, east and
238     west. During computational phases a processor will reference data in
239     an overlap region whenever it requires values that lie outside the
240     domain it owns. Periodically processors will make calls to WRAPPER
241     functions to communicate data between tiles, in order to keep the
242     overlap regions up to date (see section
243     \ref{sect:communication_primitives}). The WRAPPER functions can use a
244 cnh 1.1 variety of different mechanisms to communicate data between tiles.
245    
246     \begin{figure}
247     \begin{center}
248 adcroft 1.2 \resizebox{5in}{!}{
249     \includegraphics{part4/tiled-world.eps}
250 cnh 1.1 }
251     \end{center}
252     \caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
254     Overlap regions are periodically updated from neighboring tiles.
255     } \label{fig:tiledworld}
256     \end{figure}
257    
258     \subsection{Communication mechanisms}
259    
260 edhill 1.21 Logical processors are assumed to be able to exchange information
261     between tiles and between each other using at least one of two
262     possible mechanisms.
263 cnh 1.1
264     \begin{itemize}
265 edhill 1.21 \item {\bf Shared memory communication}. Under this mode of
266     communication data transfers are assumed to be possible using direct
267     addressing of regions of memory. In this case a CPU is able to read
268     (and write) directly to regions of memory ``owned'' by another CPU
using simple programming language level assignment operations of the
sort shown in figure \ref{fig:simple_assign}. In this way one
271     CPU (CPU1 in the figure) can communicate information to another CPU
272     (CPU2 in the figure) by assigning a particular value to a particular
273     memory location.
274    
275     \item {\bf Distributed memory communication}. Under this mode of
276     communication there is no mechanism, at the application code level,
277     for directly addressing regions of memory owned and visible to
278     another CPU. Instead a communication library must be used as
279     illustrated in figure \ref{fig:comm_msg}. In this case CPUs must
280     call a function in the API of the communication library to
281     communicate data from a tile that it owns to a tile that another CPU
282     owns. By default the WRAPPER binds to the MPI communication library
\cite{MPI-std-20} for this style of communication.
284 cnh 1.1 \end{itemize}
285    
The WRAPPER assumes that communication will use one of these two
styles. The underlying hardware and operating system support
288 cnh 1.1 for the style used is not specified and can vary from system to system.
289    
290     \begin{figure}
291     \begin{verbatim}
292    
293     CPU1 | CPU2
294     ==== | ====
295     |
296     a(3) = 8 | WHILE ( a(3) .NE. 8 )
297     | WAIT
298     | END WHILE
299     |
300     \end{verbatim}
301 edhill 1.21 \caption{In the WRAPPER shared memory communication model, simple writes to an
302     array can be made to be visible to other CPUs at the application code level.
For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
304 edhill 1.21 element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
305 cnh 1.1 will be able to see the value $8$ when they read from $a(3)$.
306     This provides a very low latency and high bandwidth communication
307     mechanism.
308     } \label{fig:simple_assign}
309     \end{figure}
310    
311     \begin{figure}
312     \begin{verbatim}
313    
314     CPU1 | CPU2
315     ==== | ====
316     |
317     a(3) = 8 | WHILE ( a(3) .NE. 8 )
318     CALL SEND( CPU2,a(3) ) | CALL RECV( CPU1, a(3) )
319     | END WHILE
320     |
321     \end{verbatim}
322     \caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
328 cnh 1.1 } \label{fig:comm_msg}
329     \end{figure}
330    
331     \subsection{Shared memory communication}
332 adcroft 1.6 \label{sect:shared_memory_communication}
333 cnh 1.1
Under shared memory communication, independent CPUs operate on the
same global address space at the application level. This means
that CPU 1 can directly write into global data structures that CPU 2
``owns'' using a simple assignment at the application level. This
model of memory access is supported at the basic system design
level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (\textit{e.g.} SGI Origin, SGI
Altix, and some AMD Opteron systems). On such systems the WRAPPER
will generally use simple read and write statements to directly access
application data structures when communicating between CPUs.
344    
In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication. In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules. The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect. This can make this
355     method of communication very efficient provided it is used
356     appropriately.
357 cnh 1.1
358     \subsubsection{Memory consistency}
359 adcroft 1.6 \label{sect:memory_consistency}
360 cnh 1.1
When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors. In particular, one issue the WRAPPER layer must
deal with is a system's memory model. In general the order of reads
and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor
performing the application. The processor performing the application
instructions will always operate so that, for the application
instructions the processor is executing, any reordering is not
apparent. However, machines are often designed so that
reordering of instructions is not hidden from other processors.
This means that, in general, even on a shared memory system two
processors can observe inconsistent memory values.
374    
375     The issue of memory consistency between multiple processors is
376     discussed at length in many computer science papers. From a practical
377     point of view, in order to deal with this issue, shared memory
378     machines all provide some mechanism to enforce memory consistency when
379     it is needed. The exact mechanism employed will vary between systems.
380     For communication using shared memory, the WRAPPER provides a place to
381     invoke the appropriate mechanism to ensure memory consistency for a
382     particular platform.
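
As a purely schematic illustration of where such a mechanism sits, the
shared memory example of figure \ref{fig:simple_assign} could be
written as below. The consistency routine named here is hypothetical;
the real WRAPPER hook, and where it must be invoked, are platform
dependent.
\begin{verbatim}
      a(3) = 8
C     Hypothetical consistency call: make the write above visible to
C     other CPUs before they test a(3). The actual routine used by
C     the WRAPPER, and its placement, depend on the platform.
      CALL ENSURE_MEMORY_CONSISTENCY
\end{verbatim}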
383 cnh 1.1
384     \subsubsection{Cache effects and false sharing}
385 adcroft 1.6 \label{sect:cache_effects_and_false_sharing}
386 cnh 1.1
Shared-memory machines often have processor-local memory caches
which contain mirrored copies of main memory. Automatic cache-coherence
389 cnh 1.1 protocols are used to maintain consistency between caches on different
390 edhill 1.21 processors. These cache-coherence protocols typically enforce consistency
391 cnh 1.1 between regions of memory with large granularity (typically 128 or 256 byte
392 edhill 1.21 chunks). The coherency protocols employed can be expensive relative to other
393 cnh 1.1 memory accesses and so care is taken in the WRAPPER (by padding synchronization
394     structures appropriately) to avoid unnecessary coherence traffic.
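
The padding idea can be illustrated schematically. In the sketch below
(illustrative only, not the actual WRAPPER data layout) each thread's
accumulator is padded out to its own cache-line-sized slot, so that an
update by one thread does not invalidate the cache line holding
another thread's slot.
\begin{verbatim}
C     Illustrative only: give each thread's accumulator its own
C     cache-line-sized slot (here assuming 128-byte lines, i.e.
C     sixteen 8-byte REAL*8 words per line).
      INTEGER cacheLineWords, maxThreads
      PARAMETER ( cacheLineWords = 16, maxThreads = 32 )
      REAL*8 perThreadSum( cacheLineWords, maxThreads )
      COMMON / PADDED_SUM / perThreadSum
C     Thread myThid only ever reads and writes perThreadSum(1,myThid).
\end{verbatim}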
395    
396     \subsubsection{Operating system support for shared memory.}
397    
398 edhill 1.21 Applications running under multiple threads within a single process
399     can use shared memory communication. In this case {\it all} the
400     memory locations in an application are potentially visible to all the
401     compute threads. Multiple threads operating within a single process is
402     the standard mechanism for supporting shared memory that the WRAPPER
403     utilizes. Configuring and launching code to run in multi-threaded mode
404     on specific platforms is discussed in section
405     \ref{sect:multi-threaded-execution}. However, on many systems,
406     potentially very efficient mechanisms for using shared memory
407     communication between multiple processes (in contrast to multiple
408     threads within a single process) also exist. In most cases this works
409     by making a limited region of memory shared between processes. The
MMAP \cite{magicgarden} and IPC \cite{magicgarden} facilities in UNIX
systems provide this capability as do vendor specific tools like LAPI
\cite{IBMLAPI} and IMC \cite{Memorychannel}. Extensions exist for the
413     WRAPPER that allow these mechanisms to be used for shared memory
414     communication. However, these mechanisms are not distributed with the
415     default WRAPPER sources, because of their proprietary nature.
416 cnh 1.1
417     \subsection{Distributed memory communication}
418 adcroft 1.6 \label{sect:distributed_memory_communication}
419 cnh 1.1 Many parallel systems are not constructed in a way where it is
420 edhill 1.22 possible or practical for an application to use shared memory for
421     communication. For example cluster systems consist of individual
422     computers connected by a fast network. On such systems there is no
423     notion of shared memory at the system level. For this sort of system
424     the WRAPPER provides support for communication based on a bespoke
425     communication library (see figure \ref{fig:comm_msg}). The default
426     communication library used is MPI \cite{MPI-std-20}. However, it is
427     relatively straightforward to implement bindings to optimized platform
428     specific communication libraries. For example the work described in
\cite{hoe-hill:99} substituted standard MPI communication for a highly
430     optimized library.
431 cnh 1.1
432     \subsection{Communication primitives}
433 adcroft 1.6 \label{sect:communication_primitives}
434 cnh 1.1
435     \begin{figure}
436     \begin{center}
437 adcroft 1.2 \resizebox{5in}{!}{
438     \includegraphics{part4/comm-primm.eps}
439 cnh 1.1 }
440     \end{center}
441 cnh 1.5 \caption{Three performance critical parallel primitives are provided
442 edhill 1.21 by the WRAPPER. These primitives are always used to communicate data
443     between tiles. The figure shows four tiles. The curved arrows
444     indicate exchange primitives which transfer data between the overlap
445     regions at tile edges and interior regions for nearest-neighbor
446     tiles. The straight arrows symbolize global sum operations which
447     connect all tiles. The global sum operation provides both a key
448     arithmetic primitive and can serve as a synchronization primitive. A
third primitive, a barrier, is also provided; it behaves much like the
450     global sum primitive. } \label{fig:communication_primitives}
451 cnh 1.1 \end{figure}
452    
453    
454 edhill 1.21 Optimized communication support is assumed to be potentially available
455     for a small number of communication operations. It is also assumed
456     that communication performance optimizations can be achieved by
457     optimizing a small number of communication primitives. Three
458     optimizable primitives are provided by the WRAPPER
459 cnh 1.1 \begin{itemize}
460 edhill 1.21 \item{\bf EXCHANGE} This operation is used to transfer data between
461     interior and overlap regions of neighboring tiles. A number of
462     different forms of this operation are supported. These different
463     forms handle
464     \begin{itemize}
465     \item Data type differences. Sixty-four bit and thirty-two bit
466     fields may be handled separately.
467     \item Bindings to different communication methods. Exchange
468     primitives select between using shared memory or distributed
469     memory communication.
470     \item Transformation operations required when transporting data
471     between different grid regions. Transferring data between faces of
472     a cube-sphere grid, for example, involves a rotation of vector
473     components.
474     \item Forward and reverse mode computations. Derivative calculations
475     require tangent linear and adjoint forms of the exchange
476     primitives.
477     \end{itemize}
478 cnh 1.1
479     \item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
480 edhill 1.21 operation for the pressure inversion phase of the MITgcm algorithm.
481     For certain configurations scaling can be highly sensitive to the
482     performance of the global sum primitive. This operation is a
483     collective operation involving all tiles of the simulated domain.
484     Different forms of the global sum primitive exist for handling
485     \begin{itemize}
486     \item Data type differences. Sixty-four bit and thirty-two bit
487     fields may be handled separately.
\item Bindings to different communication methods. Global sum
  primitives select between using shared memory or distributed
  memory communication.
\item Forward and reverse mode computations. Derivative calculations
  require tangent linear and adjoint forms of the global sum
  primitive.
494     \end{itemize}
495    
496     \item{\bf BARRIER} The WRAPPER provides a global synchronization
497     function called barrier. This is used to synchronize computations
498     over all tiles. The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
499     have much in common and in some cases use the same underlying code.
500 cnh 1.1 \end{itemize}
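
A schematic of how these primitives appear in application code is
shown below. The global sum call follows the form used in figure
\ref{fig:bibj_extract}; the exchange and barrier entry point names are
illustrative only, since the exact routine names depend on the data
type and configuration in use.
\begin{verbatim}
C     Illustrative calling sequence only; routine names other than
C     GLOBAL_SUM_R8 are schematic placeholders.
C     1. Bring the overlap regions of a tiled field up to date.
      CALL EXCHANGE_R8( myField, myThid )
C     2. Form a sum over all tiles, threads and processes.
      CALL GLOBAL_SUM_R8( err, myThid )
C     3. Synchronize all threads and processes before proceeding.
      CALL BARRIER( myThid )
\end{verbatim}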
501    
502    
503     \subsection{Memory architecture}
504    
The WRAPPER machine model aims to efficiently target systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.
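
As a purely illustrative example (the numbers are not taken from any
distributed configuration), the same $1024 \times 512$ grid-point
domain could be tiled in the two contrasting ways sketched below,
expressed with the {\em SIZE.h} parameters described in section
\ref{sect:specifying_a_decomposition}.
\begin{verbatim}
C     Illustrative only: two tilings of the same 1024 x 512 domain.
C     Long, thin tiles suited to a vector system:
C       sNx = 1024, sNy =  8, nSx =  1, nSy = 64
C     Compact tiles suited to a cache-based microprocessor system:
C       sNx =   32, sNy = 32, nSx = 32, nSy = 16
\end{verbatim}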
516    
517    
518     \begin{figure}
519     \begin{center}
520 adcroft 1.2 \resizebox{5in}{!}{
521     \includegraphics{part4/tiling_detail.eps}
522 cnh 1.1 }
523     \end{center}
524     \caption{The tiling strategy that the WRAPPER supports allows tiles
525     to be shaped to suit the underlying system memory architecture.
526     Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies; long tiles
528     with large inner loops can be used to exploit vector systems having
529     highly pipelined memory systems.
530     } \label{fig:tiling-strategy}
531     \end{figure}
532    
533     \newpage
534     \subsection{Summary}
535     Following the discussion above, the machine model that the WRAPPER
536     presents to an application has the following characteristics
537    
538     \begin{itemize}
539 edhill 1.21 \item The machine consists of one or more logical processors.
540     \item Each processor operates on tiles that it owns.
541     \item A processor may own more than one tile.
542     \item Processors may compute concurrently.
543 cnh 1.1 \item Exchange of information between tiles is handled by the
544 edhill 1.21 machine (WRAPPER) not by the application.
545 cnh 1.1 \end{itemize}
546     Behind the scenes this allows the WRAPPER to adapt the machine model
547     functions to exploit hardware on which
548     \begin{itemize}
549 edhill 1.21 \item Processors may be able to communicate very efficiently with each
550     other using shared memory.
551 cnh 1.1 \item An alternative communication mechanism based on a relatively
552 edhill 1.21 simple inter-process communication API may be required.
553 cnh 1.1 \item Shared memory may not necessarily obey sequential consistency,
554 edhill 1.21 however some mechanism will exist for enforcing memory consistency.
555 cnh 1.1 \item Memory consistency that is enforced at the hardware level
556 edhill 1.21 may be expensive. Unnecessary triggering of consistency protocols
557     should be avoided.
\item Memory access patterns may need to be either repetitive or highly
559 edhill 1.21 pipelined for optimum hardware performance.
560 cnh 1.1 \end{itemize}
561    
562     This generic model captures the essential hardware ingredients
563     of almost all successful scientific computer systems designed in the
564     last 50 years.
565    
566     \section{Using the WRAPPER}
567 edhill 1.20 \begin{rawhtml}
568     <!-- CMIREDIR:using_the_wrapper: -->
569     \end{rawhtml}
570 cnh 1.1
571 edhill 1.21 In order to support maximum portability the WRAPPER is implemented
572     primarily in sequential Fortran 77. At a practical level the key steps
573     provided by the WRAPPER are
574 cnh 1.1 \begin{enumerate}
575     \item specifying how a domain will be decomposed
576     \item starting a code in either sequential or parallel modes of operations
577     \item controlling communication between tiles and between concurrently
578 edhill 1.21 computing CPUs.
579 cnh 1.1 \end{enumerate}
580     This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the
decomposition (or composition) of a domain is expressed. Section
583     \ref{sect:starting_a_code} describes practical details of running
584     codes in various different parallel modes on contemporary computer
585     systems. Section \ref{sect:controlling_communication} explains the
586     internal information that the WRAPPER uses to control how information
587     is communicated between tiles.
588 cnh 1.1
589     \subsection{Specifying a domain decomposition}
590 adcroft 1.6 \label{sect:specifying_a_decomposition}
591 cnh 1.1
592     At its heart much of the WRAPPER works only in terms of a collection of tiles
593     which are interconnected to each other. This is also true of application
594     code operating within the WRAPPER. Application code is written as a series
595     of compute operations, each of which operates on a single tile. If
596     application code needs to perform operations involving data
597     associated with another tile, it uses a WRAPPER function to obtain
598     that data.
599     The specification of how a global domain is constructed from tiles or alternatively
600     how a global domain is decomposed into tiles is made in the file {\em SIZE.h}.
601     This file defines the following parameters \\
602    
603     \fbox{
604     \begin{minipage}{4.75in}
605     Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
606     File: {\em model/inc/SIZE.h}
607     \end{minipage}
608     } \\
609    
Together these parameters define a tiling decomposition of the style shown in
figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computational stencil that the numerical code's finite-difference
operations require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
not cause any other problems.
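
The relationship between stencil width and overlap can be illustrated
with a schematic fragment (illustrative only; the array names are made
up, and the {\em SIZE.h} parameters are assumed to be available). A
simple one-dimensional fourth-difference stencil reaches two points to
either side of an interior point, so it references at most two overlap
points on each side; the full MITgcm $\nabla^4$ operators require the
three overlap points quoted above.
\begin{verbatim}
      REAL*8 f    (1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      REAL*8 d4fdx(1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      INTEGER I, J
C     Schematic fourth-difference in x: point I references I-2 ... I+2,
C     so interior points read at most two overlap points on each side.
      DO J=1,sNy
       DO I=1,sNx
        d4fdx(I,J) =    f(I-2,J) - 4.*f(I-1,J) + 6.*f(I,J)
     &               - 4.*f(I+1,J) +    f(I+2,J)
       ENDDO
      ENDDO
\end{verbatim}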
623    
624     \begin{figure}
625     \begin{center}
626 adcroft 1.2 \resizebox{5in}{!}{
627     \includegraphics{part4/size_h.eps}
628 cnh 1.1 }
629     \end{center}
630     \caption{ The three level domain decomposition hierarchy employed by the
631     WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
632     to a single process. Multiple processes can exist, each with multiple tiles.
633     Tiles within a process can be spread over multiple compute threads.
634     } \label{fig:labelled_tile}
635     \end{figure}
636    
637     The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
638     be created within a single process. Each of these tiles will have internal
639     dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
640     allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
643 cnh 1.1 computation will be performed concurrently on each tile. However, it is also
644     possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
646     If the decomposition is run in a single process running multiple threads
647     but attached to a single physical processor, then, in general, the computation
648     for different tiles will be interleaved by system level software.
649     This too is a valid mode of operation.
650    
The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
652     numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
653     are used to form the loop ranges for many numerical calculations and
654     to provide dimensions for arrays holding numerical state.
The parameters {\em nSx} and {\em nSy} are used in conjunction with the thread number
656     parameter {\em myThid}. Much of the numerical code operating within the
657     WRAPPER takes the form
658     \begin{verbatim}
659     DO bj=myByLo(myThid),myByHi(myThid)
660     DO bi=myBxLo(myThid),myBxHi(myThid)
661     :
662     a block of computations ranging
663     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
664     :
665     ENDDO
666     ENDDO
667    
668     communication code to sum a number or maybe update
669     tile overlap regions
670    
671     DO bj=myByLo(myThid),myByHi(myThid)
672     DO bi=myBxLo(myThid),myBxHi(myThid)
673     :
674     another block of computations ranging
675     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
676     :
677     ENDDO
678     ENDDO
679     \end{verbatim}
680     The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
681     myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
682     schematic. These variables specify the subset of the tiles in
683     the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
684     thread number {\em myThid} owns. The thread number variable {\em myThid}
685     ranges from 1 to the total number of threads requested at execution time.
686     For each value of {\em myThid} the loop scheme above will step sequentially
687     through the tiles owned by that thread. However, different threads will
688     have different ranges of tiles assigned to them, so that separate threads can
689     compute iterations of the {\em bi}, {\em bj} loop concurrently.
690     Within a {\em bi}, {\em bj} loop
691     computation is performed concurrently over as many processes and threads
692     as there are physical processors available to compute.
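
As a concrete illustration (the values are schematic, not taken from a
particular run), with {\em nSx}=1, {\em nSy}=2 and two threads (one in
x, two in y), one tile is assigned to each thread and the per-thread
loop bounds would be:
\begin{verbatim}
C     Illustrative thread-to-tile assignment for nSx=1, nSy=2,
C     two threads:
C       thread 1:  myBxLo(1)=1, myBxHi(1)=1, myByLo(1)=1, myByHi(1)=1
C       thread 2:  myBxLo(2)=1, myBxHi(2)=1, myByLo(2)=2, myByHi(2)=2
C     Each thread's bi,bj loops then cover exactly the tile it owns.
\end{verbatim}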
693    
An exception to the use of {\em bi} and {\em bj} in loops arises in the
695     exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs
over {\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
698     which is then used to reference exchange parameters.
699    
The amount of computation that can be embedded within a single loop
over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example the evaluation of all
the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.
714    
715     \begin{figure}
716     \begin{verbatim}
717     REAL*8 cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
718     REAL*8 err
719     :
720     :
721     other computations
722     :
723     :
724     err = 0.
725     DO bj=myByLo(myThid),myByHi(myThid)
726     DO bi=myBxLo(myThid),myBxHi(myThid)
727     DO J=1,sNy
728     DO I=1,sNx
729     err = err +
730     & cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
731     ENDDO
732     ENDDO
733     ENDDO
734     ENDDO
735    
736     CALL GLOBAL_SUM_R8( err , myThid )
737     err = SQRT(err)
738    
739     \end{verbatim}
740     \caption{Example of numerical code for calculating
741     the l2-norm of a vector within the WRAPPER. Notice that
742     under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These rightmost indices are tile indexes. Different
threads within a single process operate on different ranges of tile
745     index, as controlled by the settings of
746     {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
747     } \label{fig:bibj_extract}
748     \end{figure}
749    
750     The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
751     are used to indicate to the WRAPPER level how many processes (each with
752     {\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
753     This information is needed during initialization and during I/O phases.
754     However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy}
755     the values of {\em nPx} and {\em nPy} are absent
756     from the core numerical and support code.
757    
758     \subsubsection{Examples of {\em SIZE.h} specifications}
759    
The following {\em SIZE.h} parameter settings illustrate how to
761     interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
762     and {\em nPy}.
763     \begin{enumerate}
764     \item
765     \begin{verbatim}
766     PARAMETER (
767     & sNx = 90,
768     & sNy = 40,
769     & OLx = 3,
770     & OLy = 3,
771     & nSx = 1,
772     & nSy = 1,
773     & nPx = 1,
774     & nPy = 1)
775     \end{verbatim}
776     This sets up a single tile with x-dimension of ninety grid points, y-dimension of
777     forty grid points, and x and y overlaps of three grid points each.
778     \item
779     \begin{verbatim}
780     PARAMETER (
781     & sNx = 45,
782     & sNy = 20,
783     & OLx = 3,
784     & OLy = 3,
785     & nSx = 1,
786     & nSy = 1,
787     & nPx = 2,
788     & nPy = 2)
789     \end{verbatim}
790     This sets up tiles with x-dimension of forty-five grid points, y-dimension of
791     twenty grid points, and x and y overlaps of three grid points each. There are
792     four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
793     arranged so that the global domain size is again ninety grid points in x and
794     forty grid points in y. In general the formula for global grid size (held in
795     model variables {\em Nx} and {\em Ny}) is
796     \begin{verbatim}
797     Nx = sNx*nSx*nPx
798     Ny = sNy*nSy*nPy
799     \end{verbatim}
800     \item
801     \begin{verbatim}
802     PARAMETER (
803     & sNx = 90,
804     & sNy = 10,
805     & OLx = 3,
806     & OLy = 3,
807     & nSx = 1,
808     & nSy = 2,
809     & nPx = 1,
810     & nPy = 2)
811     \end{verbatim}
812     This sets up tiles with x-dimension of ninety grid points, y-dimension of
813     ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
817     forty grid points in y. The two sub-domains in each process will be computed
818     sequentially if they are given to a single thread within a single process.
819     Alternatively if the code is invoked with multiple threads per process
820 edhill 1.10 the two domains in y may be computed concurrently.
821 cnh 1.1 \item
822     \begin{verbatim}
823     PARAMETER (
824     & sNx = 32,
825     & sNy = 32,
826     & OLx = 3,
827     & OLy = 3,
828     & nSx = 6,
829     & nSy = 1,
830     & nPx = 1,
831     & nPy = 1)
832     \end{verbatim}
833     This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
834     thirty-two grid points, and x and y overlaps of three grid points each.
835     There are six tiles allocated to six separate logical processors ({\em nSx=6}).
836     This set of values can be used for a cube sphere calculation.
837     Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
840 cnh 1.1 moving between the six cube faces to be embedded within the
841     tile-tile communication code.
842     \end{enumerate}
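
Applying the global grid size formula given in the second example to
this last configuration gives:
\begin{verbatim}
      Nx = sNx*nSx*nPx = 32*6*1 = 192
      Ny = sNy*nSy*nPy = 32*1*1 =  32
\end{verbatim}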
843    
844    
845     \subsection{Starting the code}
846 adcroft 1.6 \label{sect:starting_the_code}
847 cnh 1.1 When code is started under the WRAPPER, execution begins in a main routine {\em
848     eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
849     to the application through a routine called {\em THE\_MODEL\_MAIN()}
850     once the WRAPPER has initialized correctly and has created the necessary variables
851     to support subsequent calls to communication routines
852     by the application code. The startup calling sequence followed by the
853     WRAPPER is shown in figure \ref{fig:wrapper_startup}.
854    
855     \begin{figure}
856 heimbach 1.8 {\footnotesize
857 cnh 1.1 \begin{verbatim}
858    
859     MAIN
860     |
861     |--EEBOOT :: WRAPPER initialization
862     | |
| |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
864     | | allow basic I/O.
865     | |-- EEINTRO_MSG :: Write startup greeting.
866     | |
867     | |-- EESET_PARMS :: Set WRAPPER parameters
868     | |
869     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
870     | |
871     | |-- INI_PROCS :: Associate processes with grid regions.
872     | |
873     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
874     | |
875     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
876     | :: communication data structures
877     |
878     |
879     |--CHECK_THREADS :: Validate multiple thread start up.
880     |
881     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
882    
883    
884     \end{verbatim}
885 heimbach 1.8 }
886 cnh 1.1 \caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
888     occurs through the procedure {\em THE\_MODEL\_MAIN()}.
889     } \label{fig:wrapper_startup}
890     \end{figure}
891    
892     \subsubsection{Multi-threaded execution}
893 adcroft 1.6 \label{sect:multi-threaded-execution}
894 cnh 1.1 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
895     WRAPPER may cause several coarse grain threads to be initialized. The routine
896     {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
897     stack argument which is the thread number, stored in the
898     variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the following
901     steps.\\
902    
903     \paragraph{Compilation}
904     First the code must be compiled with appropriate multi-threading directives
905     active in the file {\em main.F} and with appropriate compiler flags
906     to request multi-threading support. The header files
907     {\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
908     contain directives compatible with compilers for Sun, Compaq, SGI,
909     Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
910     activated by using compile time
911     directives {\em -DTARGET\_SUN},
912     {\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
913     or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
914     for invoking multi-threaded compilation vary from system to system
915     and from compiler to compiler. The options will be described
916     in the individual compiler documentation. For the Fortran compiler
917     from Sun the following options are needed to correctly compile
918     multi-threaded code
919     \begin{verbatim}
920     -stackvar -explicitpar -vpara -noautopar
921     \end{verbatim}
922     These options are specific to the Sun compiler. Other compilers
923     will use different syntax that will be described in their
924     documentation. The effect of these options is as follows
925     \begin{enumerate}
926     \item {\bf -stackvar} Causes all local variables to be allocated in stack
927     storage. This is necessary for local variables to ensure that they are private
928     to their thread. Note, when using this option it may be necessary to override
929     the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the settings of the command shell's
931     {\em stack-size} limit variable. However, on some systems changing this limit
932     will require privileged administrator access to modify system parameters.
933    
934     \item {\bf -explicitpar} Requests that multiple threads be spawned
935     in response to explicit directives in the application code. These
936     directives are inserted with syntax appropriate to the particular target
937     platform when, for example, the {\em -DTARGET\_SUN} flag is selected.
938    
939     \item {\bf -vpara} This causes the compiler to describe the multi-threaded
940     configuration it is creating. This is not required
but it can be useful when troubleshooting.
942    
943     \item {\bf -noautopar} This inhibits any automatic multi-threaded
944     parallelization the compiler may otherwise generate.
945    
946     \end{enumerate}
947    
948    
949     An example of valid settings for the {\em eedata} file for a
950     domain with two subdomains in y and running with two threads is shown
951     below
952     \begin{verbatim}
953     nTx=1,nTy=2
954     \end{verbatim}
955     This set of values will cause computations to stay within a single
956     thread when moving across the {\em nSx} sub-domains. In the y-direction,
957     however, sub-domains will be split equally between two threads.
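
For reference, a complete {\em eedata} file carrying these settings
might look like the sketch below. The namelist name {\em EEPARMS} and
the exact layout shown are assumed here; check the {\em eedata} file
supplied with the verification experiments for the precise syntax of
your release.
\begin{verbatim}
# Example "eedata" file (sketch): one thread in x, two threads in y
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}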
958    
959     \paragraph{Multi-threading files and parameters} The following
960     files and variables are used in setting up multi-threaded execution.\\
961    
962     \fbox{
963     \begin{minipage}{4.75in}
964     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
965     File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
966     File: {\em model/src/THE\_MODEL\_MAIN.F}\\
967     File: {\em eesupp/src/MAIN.F}\\
968 afe 1.11 File: {\em tools/genmake2}\\
969 cnh 1.1 File: {\em eedata}\\
970     CPP: {\em TARGET\_SUN}\\
971     CPP: {\em TARGET\_DEC}\\
972     CPP: {\em TARGET\_HP }\\
973     CPP: {\em TARGET\_SGI}\\
974     CPP: {\em TARGET\_CRAY\_VECTOR}\\
975     Parameter: {\em nTx}\\
976     Parameter: {\em nTy}
977     \end{minipage}
978     } \\
979    
980     \subsubsection{Multi-process execution}
981 adcroft 1.6 \label{sect:multi-process-execution}
982 cnh 1.1
983 edhill 1.21 Despite its appealing programming model, multi-threaded execution
984     remains less common than multi-process execution. One major reason for
985     this is that many system libraries are still not ``thread-safe''. This
986     means that, for example, on some systems it is not safe to call system
987     routines to perform I/O when running in multi-threaded mode (except,
988     perhaps, in a limited set of circumstances). Another reason is that
989     support for multi-threaded programming models varies between systems.
990    
991     Multi-process execution is more ubiquitous. In order to run code in a
992     multi-process configuration a decomposition specification (see section
\ref{sect:specifying_a_decomposition}) is given (in which at least
994     one of the parameters {\em nPx} or {\em nPy} will be greater than one)
995     and then, as for multi-threaded operation, appropriate compile time
996     and run time steps must be taken.
997    
998     \paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that portable MPI libraries are available for controlling
1000     the start-up of multiple processes. The MPI libraries are not
1001     required, although they are usually used, for performance critical
1002     communication. However, in order to simplify the task of controlling
1003     and coordinating the start up of a large number (hundreds and possibly
1004     even thousands) of copies of the same program, MPI is used. The calls
1005     to the MPI multi-process startup routines must be activated at compile
1006     time. Currently MPI libraries are invoked by specifying the
1007     appropriate options file with the {\tt-of} flag when running the {\em
1008     genmake2} script, which generates the Makefile for compiling and
1009     linking MITgcm. (Previously this was done by setting the {\em
1010     ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
1011     CPP\_EEOPTIONS.h} file.) More detailed information about the use of
1012     {\em genmake2} for specifying
1013     local compiler flags is located in section \ref{sect:genmake}.\\
1014 cnh 1.1
1015    
1016     \fbox{
1017     \begin{minipage}{4.75in}
1018 afe 1.15 Directory: {\em tools/build\_options}\\
1019 afe 1.11 File: {\em tools/genmake2}
1020 cnh 1.1 \end{minipage}
1021     } \\
1022 edhill 1.21 \paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
1024 cnh 1.1 associated with the distribution of MPI installed on a system will
1025 edhill 1.21 describe how to start a program using that distribution. For the
1026     open-source MPICH system, the MITgcm program can be started using a
1027     command such as
1028 cnh 1.1 \begin{verbatim}
1029     mpirun -np 64 -machinefile mf ./mitgcmuv
1030     \end{verbatim}
1031 edhill 1.21 In this example the text {\em -np 64} specifies the number of
1032     processes that will be created. The numeric value {\em 64} must be
1033     equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The option {\em -machinefile mf} specifies
1035     that a text file called ``mf'' will be read to get a list of processor
1036     names on which the sixty-four processes will execute. The syntax of
1037     this file is specified by the MPI distribution.
1038     \\
1039 cnh 1.1
1040     \fbox{
1041     \begin{minipage}{4.75in}
1042     File: {\em SIZE.h}\\
1043     Parameter: {\em nPx}\\
1044     Parameter: {\em nPy}
1045     \end{minipage}
1046     } \\
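
For the {\em mpirun -np 64} example above, the processor grid declared
in {\em SIZE.h} must therefore satisfy {\em nPx}$\times${\em nPy}$=64$.
One illustrative choice (shown as a fragment of the {\em SIZE.h}
PARAMETER statement, not a complete file) is:
\begin{verbatim}
     PARAMETER (
     &           nPx =  8,
     &           nPy =  8)
\end{verbatim}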
1047    
1048 adcroft 1.2
1049     \paragraph{Environment variables}
1050 edhill 1.21 On most systems multi-threaded execution also requires the setting of
1051     a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
1053     threads required. Generally the help or manual pages associated with
1054     the multi-threaded compiler on a machine will explain how to set the
1055     required environment variables.
1056 adcroft 1.2
1057     \paragraph{Runtime input parameters}
1058 edhill 1.21 Finally the file {\em eedata} needs to be configured to indicate the
1059     number of threads to be used in the x and y directions. The variables
1060     {\em nTx} and {\em nTy} in this file are used to specify the
1061     information required. The product of {\em nTx} and {\em nTy} must be
equal to the number of threads spawned, i.e., the setting of the
1063     environment variable PARALLEL. The value of {\em nTx} must subdivide
1064     the number of sub-domains in x ({\em nSx}) exactly. The value of {\em
1065     nTy} must subdivide the number of sub-domains in y ({\em nSy})
1066     exactly. The multiprocess startup of the MITgcm executable {\em

The multi-process startup of the MITgcm executable {\em mitgcmuv} is
controlled by the routines {\em EEBOOT\_MINIMAL()} and {\em
INI\_PROCS()}. The first routine performs the basic steps required to
make sure each process is started and has a textual output stream
associated with it. By default two output files are opened for each
process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}. The {\bf
NNNN} part of the name is filled in with the process number so that
process number 0 will create output files {\bf STDOUT.0000} and {\bf
STDERR.0000}, process number 1 will create output files {\bf
STDOUT.0001} and {\bf STDERR.0001}, and so on. These files are used
for reporting status and configuration information and for reporting
error conditions on a process-by-process basis. The {\em
EEBOOT\_MINIMAL()} procedure also sets the variables {\em myProcId}
and {\em MPI\_COMM\_MODEL}. These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()}
to allocate tiles to processes.
1082    
1083     Allocation of processes to tiles is controlled by the routine {\em
1084     INI\_PROCS()}. For each process this routine sets the variables {\em
1085     myXGlobalLo} and {\em myYGlobalLo}. These variables specify, in
1086     index space, the coordinates of the southernmost and westernmost
1087     corner of the southernmost and westernmost tile owned by this process.
1088     The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
1089     also set in this routine. These are used to identify processes holding
1090     tiles to the west, east, south and north of a given process. These
1091     values are stored in global storage in the header file {\em
EESUPPORT.h} for use by communication routines. The above does not
hold when the exch2 package is used. The exch2 package sets its own
parameters to specify the global indices of tiles and their
relationships to each other. See the exch2 package documentation
(section \ref{sec:exch2}) for details.
1097 cnh 1.1 \\
1098    
1099     \fbox{
1100     \begin{minipage}{4.75in}
1101     File: {\em eesupp/src/eeboot\_minimal.F}\\
1102     File: {\em eesupp/src/ini\_procs.F} \\
1103     File: {\em eesupp/inc/EESUPPORT.h} \\
1104     Parameter: {\em myProcId} \\
1105     Parameter: {\em MPI\_COMM\_MODEL} \\
1106     Parameter: {\em myXGlobalLo} \\
1107     Parameter: {\em myYGlobalLo} \\
1108     Parameter: {\em pidW } \\
1109     Parameter: {\em pidE } \\
1110     Parameter: {\em pidS } \\
1111     Parameter: {\em pidN }
1112     \end{minipage}
1113     } \\
1114    
1115    
1116     \subsection{Controlling communication}
1117     The WRAPPER maintains internal information that is used for communication
1118     operations and that can be customized for different platforms. This section
1119     describes the information that is held and used.
1120 adcroft 1.2
1121 cnh 1.1 \begin{enumerate}
\item {\bf Tile-tile connectivity information}
  For each tile the WRAPPER records the tile numbers of the
  neighboring tiles to the north, south, east and west of that tile.
  Each tile number is unique over all tiles in a configuration.
  Except when using the cubed sphere and the exch2 package, these
  numbers are held in the variables {\em tileNo} (this holds the
  tile's own number), {\em tileNoN}, {\em tileNoS}, {\em tileNoE} and
  {\em tileNoW}. A parameter is also stored with each tile that
  specifies the type of communication used between tiles. This
  information is held in the variables {\em tileCommModeN}, {\em
  tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}. This
  latter set of variables can take one of the following values: {\em
  COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
  A value of {\em COMM\_NONE} is
1135     used to indicate that a tile has no neighbor to communicate with on
1136     a particular face. A value of {\em COMM\_MSG} is used to indicate
1137     that some form of distributed memory communication is required to
1138     communicate between these tile faces (see section
1139     \ref{sect:distributed_memory_communication}). A value of {\em
1140     COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
1141     memory communication (see section
1142     \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value
1143     indicates that a CPU should communicate by writing to data
1144     structures owned by another CPU. A {\em COMM\_GET} value indicates
1145     that a CPU should communicate by reading from data structures owned
1146     by another CPU. These flags affect the behavior of the WRAPPER
1147     exchange primitive (see figure \ref{fig:communication_primitives}).
1148     The routine {\em ini\_communication\_patterns()} is responsible for
1149     setting the communication mode values for each tile.
1150    
1151     When using the cubed sphere configuration with the exch2 package,
1152     the relationships between tiles and their communication methods are
1153     set by the exch2 package and stored in different variables. See the
  exch2 package documentation (section \ref{sec:exch2}) for details.
1155 cnh 1.1
1156     \fbox{
1157     \begin{minipage}{4.75in}
1158     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1159     File: {\em eesupp/inc/EESUPPORT.h} \\
1160     Parameter: {\em tileNo} \\
1161     Parameter: {\em tileNoE} \\
1162     Parameter: {\em tileNoW} \\
1163     Parameter: {\em tileNoN} \\
1164     Parameter: {\em tileNoS} \\
1165     Parameter: {\em tileCommModeE} \\
1166     Parameter: {\em tileCommModeW} \\
1167     Parameter: {\em tileCommModeN} \\
1168     Parameter: {\em tileCommModeS} \\
1169     \end{minipage}
1170     } \\
1171    
1172     \item {\bf MP directives}
  The WRAPPER transfers control to numerical application code through
  the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
  that allows it to be invoked by several threads. Support for this
  is based on either multi-processing (MP) compiler directives or
  specific calls to multi-threading libraries (e.g.\ POSIX threads).
  Most commercially available Fortran compilers support the
  generation of code to spawn multiple threads through some form of
  compiler directives. Compiler directives are generally more
  convenient than writing code that explicitly spawns threads and, on
  some systems, they may be the only method available. The WRAPPER is
  distributed with template MP directives for a number of systems.
1185    
1186     These directives are inserted into the code just before and after
1187     the transfer of control to numerical algorithm code through the
1188     routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
1189     an example of the code that performs this process for a Silicon
1190     Graphics system. This code is extracted from the files {\em main.F}
1191     and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
1192     specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
1193     will be created. The value of {\em nThreads} is set in the routine
  {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
1195     product of the parameters {\em nTx} and {\em nTy} that are read from
1196     the file {\em eedata}. If the value of {\em nThreads} is
1197     inconsistent with the number of threads requested from the operating
1198     system (for example by using an environment variable as described in
1199     section \ref{sect:multi_threaded_execution}) then usually an error
1200     will be reported by the routine {\em CHECK\_THREADS}.
1201 cnh 1.1
1202     \fbox{
1203     \begin{minipage}{4.75in}
1204     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1205     File: {\em eesupp/src/check\_threads.F} \\
1206     File: {\em eesupp/src/main.F} \\
1207     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1208     File: {\em eedata } \\
1209     Parameter: {\em nThreads} \\
1210     Parameter: {\em nTx} \\
1211     Parameter: {\em nTy} \\
1212     \end{minipage}
1213     }
1214    
1215     \item {\bf memsync flags}
1216 edhill 1.21 As discussed in section \ref{sect:memory_consistency}, a low-level
  system function may be needed to force memory consistency on some
1218     shared memory systems. The routine {\em MEMSYNC()} is used for this
1219     purpose. This routine should not need modifying and the information
1220     below is only provided for completeness. A logical parameter {\em
1221     exchNeedsMemSync} set in the routine {\em
1222     INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
1223     MEMSYNC()} primitive is called. In general this routine is only
1224     used for multi-threaded execution. The code that goes into the {\em
1225     MEMSYNC()} routine is specific to the compiler and processor used.
1226     In some cases, it must be written using a short code snippet of
1227     assembly language. For an Ultra Sparc system the following code
  assembly language. For an UltraSPARC system the following code
1229 cnh 1.1 \begin{verbatim}
1230     asm("membar #LoadStore|#StoreStore");
1231     \end{verbatim}
for an Alpha-based system the equivalent code reads
1233 cnh 1.1 \begin{verbatim}
1234     asm("mb");
1235     \end{verbatim}
1236     while on an x86 system the following code is required
1237     \begin{verbatim}
1238     asm("lock; addl $0,0(%%esp)": : :"memory")
1239     \end{verbatim}
1240    
1241     \item {\bf Cache line size}
1242 edhill 1.21 As discussed in section \ref{sect:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with
1244     excessive coherence traffic on an SMP system. To do this the shared
1245     memory data structures used by the {\em GLOBAL\_SUM}, {\em
1246     GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
1247     that control the padding are set in the header file {\em
1248     EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
1249     lShare1}, {\em lShare4} and {\em lShare8}. The default values
1250     should not normally need changing.
1251    
1252 cnh 1.1 \item {\bf \_BARRIER}
1253 edhill 1.21 This is a CPP macro that is expanded to a call to a routine which
1254     synchronizes all the logical processors running under the WRAPPER.
1255     Using a macro here preserves flexibility to insert a specialized
1256     call in-line into application code. By default this resolves to
1257     calling the procedure {\em BARRIER()}. The default setting for the
1258     \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1259 cnh 1.1
1260     \item {\bf \_GSUM}
  This is a CPP macro that is expanded to a call to a routine which
  sums up a floating point number over all the logical processors
  running under the WRAPPER. Using a macro here provides extra
  flexibility to insert a specialized call in-line into application
  code. By default this resolves to calling the procedure {\em
  GLOBAL\_SUM\_R8()} (for 64-bit floating point operands) or {\em
  GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
  default setting for the \_GSUM macro is given in the file {\em
  CPP\_EEMACROS.h}; a schematic of this kind of macro mapping is
  sketched just after this list. The \_GSUM macro is a performance
  critical operation, especially for large processor count, small
  tile size configurations. The custom communication example
  discussed in section \ref{sect:jam_example} shows how the macro is
  used to invoke a custom global sum routine for a specific set of
  hardware.
1275     \item {\bf \_EXCH}
1276 edhill 1.21 The \_EXCH CPP macro is used to update tile overlap regions. It is
1277     qualified by a suffix indicating whether overlap updates are for
1278     two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1279     physical fields and whether fields are 32-bit floating point (
1280     \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point (
1281     \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined in
1282     the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
1283     operation plays a crucial role in scaling to small tile, large
1284     logical and physical processor count configurations. The example in
  section \ref{sect:jam_example} discusses defining an optimized and
  specialized form of the \_EXCH operation.

  The \_EXCH operation is also central to supporting grids such as
  the cube-sphere grid. In this class of grid a rotation may be
  required between tiles. Aligning the coordinate requiring rotation
  with the tile decomposition allows the coordinate transformation to
  be embedded within a custom form of the \_EXCH primitive. In these
  cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
  package documentation (section \ref{sec:exch2}).
1295 cnh 1.1
1296     \item {\bf Reverse Mode}
  The communication primitives \_EXCH and \_GSUM both employ
  hand-written adjoint (or reverse mode) forms. These reverse mode
  forms can be found in the source code directory {\em pkg/autodiff}.
  For the global sum primitive the reverse mode calls are to {\em
  GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}. The reverse mode
  forms of the exchange primitives are found in
1303     routines prefixed {\em ADEXCH}. The exchange routines make calls to
1304     the same low-level communication primitives as the forward mode
1305     operations. However, the routine argument {\em simulationMode} is
1306     set to the value {\em REVERSE\_SIMULATION}. This signifies to the
1307     low-level routines that the adjoint forms of the appropriate
1308     communication operation should be performed.
1309 afe 1.16
1310 cnh 1.1 \item {\bf MAX\_NO\_THREADS}
1311 edhill 1.21 The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
1312     number of OS threads that a code will use. This value defaults to
1313     thirty-two and is set in the file {\em EEPARAMS.h}. For single
1314     threaded execution it can be reduced to one if required. The value
1315     is largely private to the WRAPPER and application code will not
1316     normally reference the value, except in the following scenario.
1317    
  For certain physical parameterization schemes it is necessary to
  have a substantial number of work arrays. Where these arrays are
  held in statically allocated storage (for example COMMON blocks),
  multi-threaded execution will require multiple instances of the
  COMMON block data.
1322     This can be achieved using a Fortran 90 module construct. However,
1323     if this mechanism is unavailable then the work arrays can be extended
1324     with dimensions using the tile dimensioning scheme of {\em nSx} and
1325     {\em nSy} (as described in section
1326     \ref{sect:specifying_a_decomposition}). However, if the
1327     configuration being specified involves many more tiles than OS
1328     threads then it can save memory resources to reduce the variable
1329     {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
1330     that will be used and to declare the physical parameterization work
1331     arrays with a single {\em MAX\_NO\_THREADS} extra dimension. An
1332     example of this is given in the verification experiment {\em
1333     aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
1334     altered to
1335 cnh 1.1 \begin{verbatim}
1336     INTEGER MAX_NO_THREADS
1337     PARAMETER ( MAX_NO_THREADS = 6 )
1338     \end{verbatim}
1339 edhill 1.21 and several work arrays for storing intermediate calculations are
  created with declarations of the form:
1341 cnh 1.1 \begin{verbatim}
1342     common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1343     \end{verbatim}
  This declaration scheme is not used widely, because most global data
  is used for permanent, not temporary, storage of state information.
  In the case of permanent state information this approach cannot be
  used because there has to be enough storage allocated for all tiles.
  However, the technique can sometimes be a useful scheme for reducing
  memory requirements in complex physical parameterizations.
1350 adcroft 1.2 \end{enumerate}
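
The fragment below sketches, in the spirit of {\em CPP\_EEMACROS.h},
how macros of this kind map onto the default library routines; it is
schematic only, and the argument lists shown are assumptions rather
than the definitions actually distributed. Redefining such macros is
precisely the mechanism exploited by the JAM example described below.
\begin{verbatim}
C     Schematic default macro mappings (illustrative only; see
C     CPP_EEMACROS.h for the definitions used in the distribution).
#define _BARRIER     CALL BARRIER( myThid )
#define _GSUM(a,b)   CALL GLOBAL_SUM_R8( a, b )
\end{verbatim}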
1351    
1352     \begin{figure}
1353     \begin{verbatim}
1354     C--
1355     C-- Parallel directives for MIPS Pro Fortran compiler
1356     C--
1357     C Parallel compiler directives for SGI with IRIX
1358     C$PAR PARALLEL DO
1359     C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1360     C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1361     C
1362     DO I=1,nThreads
1363     myThid = I
1364    
1365     C-- Invoke nThreads instances of the numerical model
1366     CALL THE_MODEL_MAIN(myThid)
1367    
1368     ENDDO
1369     \end{verbatim}
1370 edhill 1.21 \caption{Prior to transferring control to the procedure {\em
1371     THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
1372     multiple threads. } \label{fig:mp_directives}
1373 adcroft 1.2 \end{figure}
1374 cnh 1.1
1375    
1376     \subsubsection{Specializing the Communication Code}
1377    
The isolation of performance critical communication primitives and
the sub-division of the simulation domain into tiles together form a
powerful tool. Here we show how this can be used to improve
application performance and how it can be used to adapt to new
gridding approaches.
1382 cnh 1.1
1383     \subsubsection{JAM example}
1384 adcroft 1.6 \label{sect:jam_example}
1385 edhill 1.21 On some platforms a big performance boost can be obtained by binding
1386     the communication routines {\em \_EXCH} and {\em \_GSUM} to
1387     specialized native libraries (for example, the shmem library on CRAY
1388     T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
1389     illustration of a specialized communication configuration that
1390     substitutes for standard, portable forms of {\em \_EXCH} and {\em
1391     \_GSUM}. It affects three source files {\em eeboot.F}, {\em
1392     CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined is has
1393     the following effects.
1394 cnh 1.1 \begin{itemize}
1395 edhill 1.21 \item An extra phase is included at boot time to initialize the custom
  communications library (see {\em ini\_jam.F}).
1397 cnh 1.1 \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
1398 edhill 1.21 with calls to custom routines (see {\em gsum\_jam.F} and {\em
1399     exch\_jam.F})
\item A highly specialized form of the exchange operator (optimized
1401 edhill 1.21 for overlap regions of width one) is substituted into the elliptic
1402     solver routine {\em cg2d.F}.
1403 cnh 1.1 \end{itemize}
1404     Developing specialized code for other libraries follows a similar
1405     pattern.
1406    
1407     \subsubsection{Cube sphere communication}
1408 adcroft 1.6 \label{sect:cube_sphere_communication}
1409 edhill 1.21 Actual {\em \_EXCH} routine code is generated automatically from a
1410     series of template files, for example {\em exch\_rx.template}. This
1411     is done to allow a large number of variations on the exchange process
1412     to be maintained. One set of variations supports the cube sphere grid.
1413     Support for a cube sphere grid in MITgcm is based on having each face
1414     of the cube as a separate tile or tiles. The exchange routines are
1415     then able to absorb much of the detailed rotation and reorientation
1416     required when moving around the cube grid. The set of {\em \_EXCH}
1417     routines that contain the word cube in their name perform these
1418     transformations. They are invoked when the run-time logical parameter
1419 cnh 1.1 {\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for vector and scalar quantities and for grid-center,
grid-face and grid-corner locations. Three sets of exchange
1423     routines are defined. Routines with names of the form {\em exch\_rx}
1424     are used to exchange cell centered scalar quantities. Routines with
1425     names of the form {\em exch\_uv\_rx} are used to exchange vector
1426     quantities located at the C-grid velocity points. The vector
1427     quantities exchanged by the {\em exch\_uv\_rx} routines can either be
signed (for example velocity components) or unsigned (for example
1429     grid-cell separations). Routines with names of the form {\em
1430     exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
1431     point locations.
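
As a purely illustrative sketch, scalar and vector exchanges are
invoked along the following lines; the routine names below follow the
{\em exch\_rx} / {\em exch\_uv\_rx} naming pattern, but both the
generated names and the argument lists are assumptions here and
should be checked against the routines actually produced from the
templates in a given release.
\begin{verbatim}
C     Illustrative sketch only -- names and argument lists are
C     assumed, not definitive.
C     Update the overlap regions of a cell-centered scalar field:
      CALL EXCH_XY_R8( theta, myThid )
C     Update a C-grid vector field; the logical argument flags
C     whether the components change sign under the tile rotations
C     used on the cubed sphere:
      CALL EXCH_UV_XY_R8( uVel, vVel, .TRUE., myThid )
\end{verbatim}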
1432 cnh 1.1
1433    
1434    
1435    
1436     \section{MITgcm execution under WRAPPER}
1437 edhill 1.19 \begin{rawhtml}
1438     <!-- CMIREDIR:mitgcm_wrapper: -->
1439     \end{rawhtml}
1440 cnh 1.1
1441     Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the
calling sequence shown in section \ref{sect:calling_sequence}.
1444 cnh 1.1
1445     \subsection{Annotated call tree for MITgcm and WRAPPER}
1446 adcroft 1.6 \label{sect:calling_sequence}
1447 cnh 1.1
1448     WRAPPER layer.
1449    
1450 heimbach 1.8 {\footnotesize
1451 cnh 1.1 \begin{verbatim}
1452    
1453     MAIN
1454     |
1455     |--EEBOOT :: WRAPPER initialization
1456     | |
1457     | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1458     | | allow basic I/O.
1459     | |-- EEINTRO_MSG :: Write startup greeting.
1460     | |
1461     | |-- EESET_PARMS :: Set WRAPPER parameters
1462     | |
1463     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1464     | |
1465     | |-- INI_PROCS :: Associate processes with grid regions.
1466     | |
1467     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1468     | |
1469     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1470     | :: communication data structures
1471     |
1472     |
1473     |--CHECK_THREADS :: Validate multiple thread start up.
1474     |
1475     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1476    
1477     \end{verbatim}
1478 heimbach 1.8 }
1479 cnh 1.1
1480     Core equations plus packages.
1481    
1482 heimbach 1.8 {\footnotesize
1483 cnh 1.1 \begin{verbatim}
1484     C
1485     C Invocation from WRAPPER level...
1486     C :
1487     C :
1488     C |
1489     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1490     C | :: Called from WRAPPER level numerical
1491 cnh 1.4 C | :: code invocation routine. On entry
1492 cnh 1.1 C | :: to THE_MODEL_MAIN separate thread and
1493     C | :: separate processes will have been established.
1494     C | :: Each thread and process will have a unique ID
1495     C | :: but as yet it will not be associated with a
1496     C | :: specific region in decomposed discrete space.
1497     C |
1498     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1499     C | | :: grid, solver matrices etc..
1500     C | |
1501     C | |-INI_PARMS :: Routine to set kernel model parameters.
1502     C | | :: By default kernel parameters are read from file
1503     C | | :: "data" in directory in which code executes.
1504     C | |
1505 cnh 1.4 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1506 cnh 1.1 C | |
1507 cnh 1.4 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1508 cnh 1.1 C | | | :: Grid arrays are held and described in GRID.h.
1509     C | | |
1510 cnh 1.4 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1511 cnh 1.1 C | | |
1512 cnh 1.4 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1513 cnh 1.1 C | | | :: (calculate grid from kernel parameters).
1514     C | | |
1515     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1516 cnh 1.4 C | | | :: initialization (calculate grid from
1517 cnh 1.1 C | | | :: kernel parameters).
1518     C | | |
1519     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1520 cnh 1.4 C | | :: grid initializations. ( input from raw
1521 cnh 1.1 C | | :: grid files, LONC.bin, DXF.bin etc... )
1522     C | |
1523     C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1524     C | |
1525     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1526     C | | :: land masking for solid-fluid boundaries.
1527     C | |
1528     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1529     C | |
1530     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1531 cnh 1.4 C | | :: sphere options are coded.
1532 cnh 1.1 C | |
1533     C | |-PACKAGES_BOOT :: Start up the optional package environment.
1534     C | | :: Runtime selection of active packages.
1535     C | |
1536     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1537     C | | |
1538     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1539     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1540     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1541     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1542     C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1543     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1544     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1545     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1546     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1547     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1548 edhill 1.23 C | | |-PTRACERS_READPARMS :: multiple tracer package, see pkg/ptracers
1549     C | | |-GCHEM_READPARMS :: tracer interface package, see pkg/gchem
1550 cnh 1.1 C | |
1551     C | |-PACKAGES_CHECK
1552     C | | |
1553     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1554 edhill 1.23 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1555 cnh 1.1 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1556     C | |
1557     C | |-PACKAGES_INIT_FIXED
1558     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1559     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1560 edhill 1.23 C | | |-GCHEM_INIT_FIXED :: tracer interface package, see pkg/gchem
1561 cnh 1.1 C | |
1562     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1563     C | |
1564 edhill 1.23 C | |-INI_CG2D :: 2d con. grad solver initialization.
1565 cnh 1.1 C | |
1566 edhill 1.23 C | |-INI_CG3D :: 3d con. grad solver initialization.
1567 cnh 1.1 C | |
1568     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1569     C | :: Includes annotated table of kernel
1570     C | :: parameter settings.
1571     C |
1572     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1573     C |
1574     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1575 cnh 1.4 C !                :: Automatically generated by TAMC/TAF.
1576 cnh 1.1 C |
1577     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1578     C |
1579     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1580     C |
1581     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1582     C | |
1583     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1584     C | | | :: variables
1585     C | | |
1586     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1587     C | | |
1588     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1589 cnh 1.4 C | | | :: sphere options are coded.
1590 cnh 1.1 C | | |
1591 edhill 1.23 C | | |-INI_CG2D :: 2d con. grad solver initialization.
1592     C | | |-INI_CG3D :: 3d con. grad solver initialization.
1593     C | | |-INI_MIXING :: Initialize diapycnal diffusivity.
1594     C | | |-INI_DYNVARS :: Initialize to zero all DYNVARS.h arrays (dynamical
1595 cnh 1.1 C | | | :: fields).
1596     C | | |
1597 cnh 1.4 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1598 cnh 1.1 C | | | |-INI_VEL :: Initialize 3D flow field.
1599     C | | | |-INI_THETA :: Set model initial temperature field.
1600     C | | | |-INI_SALT :: Set model initial salinity field.
1601     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1602 edhill 1.23 C | | | |-INI_PRESSURE :: Compute model initial hydrostatic pressure
1603     C | | | |-READ_CHECKPOINT :: Read the checkpoint
1604 cnh 1.1 C | | |
1605     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1606     C | | | | :: Here applied to move restart conditions
1607     C | | | | :: (saved in mid timestep) to correct level in
1608     C | | | | :: time (only used for pre-c35).
1609     C | | | |
1610     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1611     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1612     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1613     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1614     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1615     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1616     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1617     C | | | | |-FIND_RHO :: Find adjacent densities.
1618     C | | | | |-CONVECT :: Mix static instability.
1619     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1620     C | | | |
1621     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1622     C | | |
1623     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1624     C | | | | :: Extra time history interactions.
1625     C | | | |
1626     C | | | |-FIND_RHO :: Find adjacent densities.
1627     C | | | |-CONVECT :: Mix static instability.
1628     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1629     C | | |
1630 edhill 1.23 C | | |-PACKAGES_INIT_VARIABLES :: Does initialization of time evolving
1631 cnh 1.1 C | | | | :: package data.
1632     C | | | |
1633     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1634     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1635     C | | | |-KPP_OPEN_DIAGS
1636     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1637 edhill 1.23 C | | | |-PTRACERS_INIT :: multi. tracer package,(see pkg/ptracers)
1638     C | | | |-GCHEM_INIT :: tracer interface pkg (see pkh/gchem)
1639 cnh 1.1 C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1640     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1641     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1642     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1643     C | | | |-INI_FORCING :: Set model initial forcing fields.
1644     C | | | | :: Either set in-line or from file as shown.
1645     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1646     C | | | |-READ_FLD_XY_RS(meridWindFile)
1647     C | | | |-READ_FLD_XY_RS(surfQFile)
1648     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1649     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1650     C | | | |-READ_FLD_XY_RS(saltClimFile)
1651     C | | | |-READ_FLD_XY_RS(surfQswFile)
1652     C | | |
1653     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1654     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1655     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1656     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1657     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1658     C | |
1659     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1660     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1661     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1662     C | |
1663     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1664     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1665     C====|>|
1666     C====|>| ****************************
1667     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1668     C====|>| ****************************
1669     C====|>|
1670     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1671     C/\ | | |
1672     C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1673     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1674     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1675     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1676     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1677 cnh 1.4 C/\ | | | | :: Simple interpolation between end-points
1678 cnh 1.1 C/\ | | | | :: for forcing datasets.
1679     C/\ | | | |
1680     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1681 edhill 1.23 C/\ | | |-SEAICE_MODEL :: Compute sea-ice terms. ( pkg/seaice )
1682     C/\ | | |-FREEZE :: Limit surface temperature.
1683     C/\ | | |-GCHEM_FIELD_LOAD :: load tracer forcing fields (pkg/gchem)
1684 cnh 1.1 C/\ | | |
1685     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1686     C/\ | | | |
1687     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1688     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1689 edhill 1.23 C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-RhoConst] of a slice
1690 cnh 1.1 C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1691     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1692     C/\ | | | |
1693     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1694     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1695 edhill 1.23 C/\ | | | | | :: forcing terms.
1696     C/\ | | | | |-PTRACERS_FORCING_SURF :: Tracer package ( see pkg/ptracers ).
1697 cnh 1.1 C/\ | | | |
1698     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1699     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1700     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1701     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1702     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1703     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1704     C/\ | | | | advection case) (see pkg/gad).
1705     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1706     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1707     C/\ | | | | |
1708     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1709     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1710     C/\ | | | |
1711     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1712     C/\ | | | | |
1713     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1714 edhill 1.23 C/\ | | | | | | :: ( see pkg/gad )
1715     C/\ | | | | | |-KPP_TRANSPORT_T :: KPP non-local transport ( see pkg/kpp ).
1716     C/\ | | | | |
1717 cnh 1.1 C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1718     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1719     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1720     C/\ | | | |
1721     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1722     C/\ | | | |
1723     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1724     C/\ | | | | |
1725     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1726 edhill 1.23 C/\ | | | | | | :: ( see pkg/gad )
1727     C/\ | | | | | |-KPP_TRANSPORT_S :: KPP non-local transport ( see pkg/kpp ).
1728     C/\ | | | | |
1729 cnh 1.1 C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1730     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1731     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1732     C/\ | | | |
1733     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1734     C/\ | | | |
1735 edhill 1.23 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1736     C/\ | | | |
1737     C/\ | | | |-PTRACERS_INTEGRATE :: Integrate other tracer(s) (see pkg/ptracers).
1738 cnh 1.1 C/\ | | | | |
1739     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1740 edhill 1.23 C/\ | | | | | | :: ( see pkg/gad )
1741     C/\ | | | | | |-KPP_TRANSPORT_PTR:: KPP non-local transport ( see pkg/kpp ).
1742     C/\ | | | | |
1743     C/\ | | | | |-PTRACERS_FORCING :: Problem specific forcing for tracer.
1744     C/\ | | | | |-GCHEM_FORCING_INT :: tracer forcing for gchem pkg (if all
1745     C/\ | | | | |                      tendency terms calculated together)
1746 cnh 1.1 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1747     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1748 edhill 1.23 C/\ | | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1749 cnh 1.1 C/\ | | | |
1750     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1751     C/\ | | | |
1752     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1753     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1754     C/\ | | | |
1755     C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1756     C/\ | | | |-EXCH :: Update overlaps
1757     C/\ | | |
1758     C/\ | | |-DYNAMICS :: Momentum equations driver.
1759     C/\ | | | |
1760     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1761     C/\ | | | | Potential anomaly.
1762     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1763     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1764     C/\ | | | |
1765     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1766     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1767     C/\ | | | | pkg/mom_fluxform ).
1768     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1769     C/\ | | | | pkg/mom_vecinv ).
1770     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1771     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1772     C/\ | | | |
1773     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1774     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1775     C/\ | | | |
1776     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1777     C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1778     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1779     C/\ | | |
1780     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1781     C/\ | | |
1782     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1783     C/\ | | |
1784     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1785     C/\ | | |
1786     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1787     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1788     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1789     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1790     C/\ | | |
1791     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1792     C/\ | | | |
1793     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1794     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1795     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1796     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1797     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1798     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1799     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1800     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1801     C/\ | | | | |-CONVECT :: Mix static instability.
1802     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1803     C/\ | | | |
1804     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1805     C/\ | | |
1806     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1807     C/\ | | | |-EXCH
1808     C/\ | | |
1809 edhill 1.23 C/\ | | |-GCHEM_FORCING_SEP :: tracer forcing for gchem pkg (if
1810     C/\ | | | tracer dependent tendencies calculated
1811     C/\ | | | separatly)
1812     C/\ | | |                      separately)
1813 cnh 1.1 C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1814     C/\ | | |
1815     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1816     C/\ | | |
1817     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1818     C/\ | | | |-WRITE_STATE :: Core state I/O
1819     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1820     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1821     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1822     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1823 edhill 1.23 C/\ | | | |-SBO_CALC :: SBO diags. see pkg/sbo
1824     C/\ | | | |-SBO_DIAGS :: SBO diags. see pkg/sbo
1825     C/\ | | | |-SEAICE_DO_DIAGS :: SEAICE diags. see pkg/seaice
1826     C/\ | | | |-GCHEM_DIAGS :: gchem diags. see pkg/gchem
1827 cnh 1.1 C/\ | | |
1828     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1829     C/\ | |
1830     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1831     C<===|=|
1832     C<===|=| **************************
1833     C<===|=| END MAIN TIMESTEPPING LOOP
1834     C<===|=| **************************
1835     C<===|=|
1836     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1837     C |
1838     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1839     C |
1840     C |-TIMER_PRINTALL :: Computational timing summary
1841     C |
1842     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1843     C :: events.
1844 edhill 1.23 C
1845 cnh 1.1 \end{verbatim}
1846 heimbach 1.8 }
1847 cnh 1.1
1848     \subsection{Measuring and Characterizing Performance}
1849    
1850     TO BE DONE (CNH)
1851    
1852     \subsection{Estimating Resource Requirements}
1853    
1854     TO BE DONE (CNH)
1855    
1856     \subsubsection{Atlantic 1/6 degree example}
1857     \subsubsection{Dry Run testing}
1858     \subsubsection{Adjoint Resource Requirements}
1859     \subsubsection{State Estimation Environment Resources}
1860    
