% $Header: /u/gcmpack/manual/part4/sarch.tex,v 1.20 2004/10/16 03:40:16 edhill Exp $

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects simply starting from an example code and adapting it to suit a
particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating system
features that are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm are
threefold:

\begin{itemize}
\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to
a wide range of platforms.
\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.
\end{itemize}

These points are summarized in figure \ref{fig:mitgcm_architecture_goals},
which conveys the goals of the MITgcm design. The goals lead to
a software architecture which at a high level can be viewed as consisting
of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in
  detail in section \ref{chap:discretization}.

\item A scheme for supporting optional ``pluggable'' {\bf packages}
  (containing for example mixed-layer schemes, biogeochemical schemes,
  atmospheric physics). These packages are used both to overlay
  alternate dynamics and to introduce specialized physical content
  onto the core numerical code. An overview of the {\bf package}
  scheme is given at the start of part \ref{chap:packagesI}.

\item A support framework called {\bf WRAPPER} (Wrappable Application
  Parallel Programming Environment Resource), within which the core
  numerics and pluggable packages operate.
\end{enumerate}

This chapter focuses on describing the {\bf WRAPPER} environment under
which both the core numerics and the pluggable packages function. The
description presented here is intended to be a detailed exposition and
contains significant background material, as well as advanced details
on working with the WRAPPER. The examples section of this manual
(part \ref{chap:getting_started}) contains more succinct, step-by-step
instructions on running basic numerical experiments both sequentially
and in parallel. For many projects simply starting from an example
code and adapting it to suit a particular situation will be all that
is required.


\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
\end{center}
\caption{
The MITgcm architecture is designed to allow simulation of a wide
range of physical problems on a wide range of hardware. The computational
resource requirements of the applications targeted range from around
$10^7$ bytes ($\approx 10$ megabytes) of memory to $10^{11}$ bytes
($\approx 100$ gigabytes). Arithmetic operation counts for the applications of
interest range from $10^{9}$ floating point operations to more than $10^{17}$
floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in
MITgcm is a software superstructure and substructure collectively
called the WRAPPER (Wrappable Application Parallel Programming
Environment Resource). All numerical and support code in MITgcm is written
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to insulate code
that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.


\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place
on a multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER exceeded that of an
implementation that was tightly bound to the CRAY system's proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multi-processor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs. Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems. The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and
IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \ref{ref hoe and hill, ecmwf}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
large MPP systems). This is one example of a difference between
compute platforms that can impact an application. Another common
distinction is between vector processing systems with highly
specialized CPUs and memory subsystems and commodity microprocessor
based systems. There are numerous other differences, especially in
relation to how parallel execution is supported. To capture the
essential differences between different platforms the WRAPPER uses a
{\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
can easily be specialized to fit, in a computationally efficient
manner, any computer architecture currently available to the
scientific computing community.

\subsection{Machine model parallelism}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently. Computational work is divided among the logical
processors by allocating ``ownership'' to each processor of a certain
set (or sets) of calculations. Each set of calculations owned by a
particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors. It is perfectly
possible to execute a configuration decomposed for multiple logical
processors on a single physical processor. This helps ensure that
numerical code that is written to fit within the WRAPPER will
parallelize with no additional effort. It is also useful for
debugging purposes. Generally, however, the computational domain will
be subdivided over multiple logical processors in order to then bind
those logical processors to physical processor resources that can
compute in parallel.

\subsubsection{Tiles}

Computationally, the data structures (\textit{e.g.} arrays, scalar
variables, etc.) that hold the simulated state are associated with
each region of physical space and are allocated to a particular
logical processor. We refer to these data structures as being {\bf
  owned} by the processor to which their associated region of physical
space has been allocated. Individual regions that are allocated to
processors are called {\bf tiles}. A processor can own more than one
tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
single region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously,
working only with data from the tile (or tiles) that the processor
owns. When multiple tiles are allotted to a single processor, each
tile is computed independently of the other tiles, in a sequential
fashion.

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/domain_decomp.eps}
 }
\end{center}
\caption{ The WRAPPER provides support for one- and two-dimensional
  decompositions of grid-point domains. The figure shows a
  hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
  hypothetical domain is decomposed in two dimensions along the
  $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
    owned} by different processors. The {\bf owning} processors
  perform the arithmetic operations associated with a {\bf tile}.
  Although not illustrated here, a single processor can {\bf own}
  several {\bf tiles}. Whenever a processor wishes to transfer data
  between tiles or communicate with other processors it calls a
  WRAPPER supplied function. } \label{fig:domaindecomp}
\end{figure}


\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region. The
overlap region of a tile corresponds to the interior region of an
adjacent tile. In figure \ref{fig:tiledworld} each tile would own the
region within the black square and hold duplicate information for
overlap regions extending into the tiles to the north, south, east and
west. During computational phases a processor will reference data in
an overlap region whenever it requires values that lie outside the
domain it owns. Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sect:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.

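To make the role of the overlap concrete, the sketch below shows, in
schematic Fortran, a five-point stencil computed over the interior of a
single tile. The array names are purely illustrative; the tile-size and
overlap parameters {\em sNx}, {\em sNy}, {\em OLx} and {\em OLy} are those
described in section \ref{sect:specifying_a_decomposition}. Interior points
adjacent to a tile edge read values from the overlap region, which must
therefore be at least as wide as the stencil and is refreshed by a WRAPPER
exchange between compute phases.
\begin{verbatim}
C     Schematic only: phi and dphi are illustrative names.  A tiled
C     field is dimensioned to include its overlap of width OLx, OLy.
      REAL*8 phi (1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      REAL*8 dphi(1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      DO j=1,sNy
       DO i=1,sNx
C       For i=1, i=sNx, j=1 or j=sNy the neighboring points lie in
C       the overlap region of this tile.
        dphi(i,j) = phi(i+1,j) + phi(i-1,j)
     &            + phi(i,j+1) + phi(i,j-1) - 4.*phi(i,j)
       ENDDO
      ENDDO
C     A WRAPPER exchange call would refresh the overlap regions here
C     before the next compute phase.
\end{verbatim}
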
\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/tiled-world.eps}
 }
\end{center}
\caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
Overlap regions are periodically updated from neighboring tiles.
} \label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two
possible mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}. Under this mode of
  communication data transfers are assumed to be possible using direct
  addressing of regions of memory. In this case a CPU is able to read
  (and write) directly to regions of memory ``owned'' by another CPU
  using simple programming language level assignment operations of the
  sort shown in figure \ref{fig:simple_assign}. In this way one
  CPU (CPU1 in the figure) can communicate information to another CPU
  (CPU2 in the figure) by assigning a particular value to a particular
  memory location.

\item {\bf Distributed memory communication}. Under this mode of
  communication there is no mechanism, at the application code level,
  for directly addressing regions of memory owned and visible to
  another CPU. Instead a communication library must be used as
  illustrated in figure \ref{fig:comm_msg}. In this case CPUs must
  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another CPU
  owns. By default the WRAPPER binds to the MPI communication library
  \ref{MPI} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication. The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

           CPU1                     |          CPU2
           ====                     |          ====
                                    |
      a(3) = 8                      |     WHILE ( a(3) .NE. 8 )
                                    |      WAIT
                                    |     END WHILE
                                    |
\end{verbatim}
\caption{In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level.
For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

           CPU1                     |          CPU2
           ====                     |          ====
                                    |
      a(3) = 8                      |     WHILE ( a(3) .NE. 8 )
      CALL SEND( CPU2,a(3) )        |      CALL RECV( CPU1, a(3) )
                                    |     END WHILE
                                    |
\end{verbatim}
\caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
one or both of CPU1 and CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
} \label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication independent CPUs operate on the
exact same global address space at the application level. This means
that CPU 1 can directly write into global data structures that CPU 2
``owns'' using a simple assignment at the application level. This
model of memory access is supported at the basic system design
level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (\textit{e.g.} SGI Origin, SGI
Altix, and some AMD Opteron systems). On such systems the WRAPPER
will generally use simple read and write statements to directly access
application data structures when communicating between CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication. In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules. The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect. This can make this
method of communication very efficient provided it is used
appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors. In particular, one issue the WRAPPER layer must
deal with is a system's memory model. In general the order of reads
and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor
performing the application. The processor performing the application
instructions will always operate so that, for the application
instructions the processor is executing, any reordering is not
apparent. However, machines are often designed so that
reordering of instructions is not hidden from other processors.
This means that, in general, even on a shared memory system two
processors can observe inconsistent memory values.

The issue of memory consistency between multiple processors is
discussed at length in many computer science papers. From a practical
point of view, in order to deal with this issue, shared memory
machines all provide some mechanism to enforce memory consistency when
it is needed. The exact mechanism employed will vary between systems.
For communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.

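The hook the WRAPPER provides can be thought of, schematically, as a single
routine that application code treats as a memory fence. The routine name and
body below are purely illustrative and are not the actual WRAPPER source; the
real mechanism is selected per platform.
\begin{verbatim}
C     Illustrative sketch only -- the routine name is hypothetical.
      SUBROUTINE MEM_FENCE
C     On a machine with a weakly ordered memory model this routine
C     would invoke the platform-specific synchronization (for example
C     a vendor fence or cache-flush intrinsic).  On machines that
C     already enforce the required ordering it can be a no-op.
      RETURN
      END
\end{verbatim}
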
\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have processor-local memory caches
which contain mirrored copies of main memory. Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors. These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks). The coherency protocols employed can be expensive relative to other
memory accesses and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.

\subsubsection{Operating system support for shared memory}

Applications running under multiple threads within a single process
can use shared memory communication. In this case {\it all} the
memory locations in an application are potentially visible to all the
compute threads. Multiple threads operating within a single process is
the standard mechanism for supporting shared memory that the WRAPPER
utilizes. Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sect:multi-threaded-execution}. However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP \ref{magicgarden} and IPC \ref{magicgarden} facilities in UNIX
systems provide this capability, as do vendor-specific tools like LAPI
\ref{IBMLAPI} and IMC \ref{Memorychannel}. Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
at the system level. For this sort of system the WRAPPER provides support
for communication based on a bespoke communication library
(see figure \ref{fig:comm_msg}). The default communication library used is MPI
\ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform-specific communication libraries. For example the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/comm-primm.eps}
 }
\end{center}
\caption{Three performance critical parallel primitives are provided
  by the WRAPPER. These primitives are always used to communicate data
  between tiles. The figure shows four tiles. The curved arrows
  indicate exchange primitives which transfer data between the overlap
  regions at tile edges and interior regions for nearest-neighbor
  tiles. The straight arrows symbolize global sum operations which
  connect all tiles. The global sum operation provides a key
  arithmetic primitive and can also serve as a synchronization primitive. A
  third barrier primitive is also provided; it behaves much like the
  global sum primitive. } \label{fig:communication_primitives}
\end{figure}


Optimized communication support is assumed to be potentially available
for a small number of communication operations. It is also assumed
that communication performance optimizations can be achieved by
optimizing a small number of communication primitives. Three
optimizable primitives are provided by the WRAPPER; schematic calls to
each are shown after the following list.
\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between
  interior and overlap regions of neighboring tiles. A number of
  different forms of this operation are supported. These different
  forms handle
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods. Exchange
    primitives select between using shared memory or distributed
    memory communication.
  \item Transformation operations required when transporting data
    between different grid regions. Transferring data between faces of
    a cube-sphere grid, for example, involves a rotation of vector
    components.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.
  \end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
  operation for the pressure inversion phase of the MITgcm algorithm.
  For certain configurations scaling can be highly sensitive to the
  performance of the global sum primitive. This operation is a
  collective operation involving all tiles of the simulated domain.
  Different forms of the global sum primitive exist for handling
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods. The global sum
    primitives select between using shared memory or distributed
    memory communication.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the global sum
    primitive.
  \end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization
  function called barrier. This is used to synchronize computations
  over all tiles. The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
  have much in common and in some cases use the same underlying code.
\end{itemize}

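In application code the three primitives appear as ordinary subroutine
calls. The global sum call below follows the form used in the l2-norm
example of figure \ref{fig:bibj_extract} later in this chapter; the exchange
routine name is schematic, since the actual name depends on the data type
and grid of the field being exchanged.
\begin{verbatim}
C     Schematic usage of the three primitives.
C     Refresh the overlap regions of a tiled field (name illustrative):
      CALL EXCH_XY_R8( phi, myThid )
C     Collective sum of a scalar contribution over all tiles:
      CALL GLOBAL_SUM_R8( err, myThid )
C     Synchronize all threads and processes:
      CALL BARRIER( myThid )
\end{verbatim}
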

\subsection{Memory architecture}

The WRAPPER machine model aims to efficiently target systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.

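As an illustration (the numbers are hypothetical), the same $360 \times 80$
grid-point domain per process could be described either by a few long, thin
tiles suited to a vector system or by many compact tiles suited to a
cache-based system:
\begin{verbatim}
 Vector system (long tiles, long inner loops):
      sNx = 360, sNy = 10, nSx =  1, nSy = 8
 Cache-based system (compact tiles, more memory reuse):
      sNx =  30, sNy = 20, nSx = 12, nSy = 4
\end{verbatim}
The meaning of the parameters {\em sNx}, {\em sNy}, {\em nSx} and {\em nSy}
is given in section \ref{sect:specifying_a_decomposition}.
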

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/tiling_detail.eps}
 }
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
to be shaped to suit the underlying system memory architecture.
Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies, while long
tiles with large inner loops can be used to exploit vector systems having
highly pipelined memory systems.
} \label{fig:tiling-strategy}
\end{figure}

\newpage
\subsection{Summary}
Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:

\begin{itemize}
\item The machine consists of one or more logical processors.
\item Each processor operates on tiles that it owns.
\item A processor may own more than one tile.
\item Processors may compute concurrently.
\item Exchange of information between tiles is handled by the
  machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each
  other using shared memory.
\item An alternative communication mechanism based on a relatively
  simple inter-process communication API may be required.
\item Shared memory may not necessarily obey sequential consistency,
  however some mechanism will exist for enforcing memory consistency.
\item Memory consistency that is enforced at the hardware level
  may be expensive. Unnecessary triggering of consistency protocols
  should be avoided.
\item Memory access patterns may need to be either repetitive or highly
  pipelined for optimum hardware performance.
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:using_the_wrapper: -->
\end{rawhtml}

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are:
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
\item controlling communication between tiles and between concurrently
  computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how to specify the
way in which a domain is decomposed (or composed). Section
\ref{sect:starting_a_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems. Section \ref{sect:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
code operating within the WRAPPER. Application code is written as a series
of compute operations, each of which operates on a single tile. If
application code needs to perform operations involving data
associated with another tile, it uses a WRAPPER function to obtain
that data.
The specification of how a global domain is constructed from tiles or alternatively
how a global domain is decomposed into tiles is made in the file {\em SIZE.h}.
This file defines the following parameters \\

\fbox{
\begin{minipage}{4.75in}
Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
File: {\em model/inc/SIZE.h}
\end{minipage}
} \\

Together these parameters define a tiling decomposition of the style shown in
figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computation stencil that the finite-difference operations of the
numerical code require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
not cause any other problems.

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{part4/size_h.eps}
 }
\end{center}
\caption{ The three level domain decomposition hierarchy employed by the
WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
to a single process. Multiple processes can exist, each with multiple tiles.
Tiles within a process can be spread over multiple compute threads.
} \label{fig:labelled_tile}
\end{figure}

The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
be created within a single process. Each of these tiles will have internal
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
If the decomposition is run in a single process running multiple threads
but attached to a single physical processor, then, in general, the computation
for different tiles will be interleaved by system level software.
This too is a valid mode of operation.

The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
are used to form the loop ranges for many numerical calculations and
to provide dimensions for arrays holding numerical state.
The parameters {\em nSx} and {\em nSy} are used in conjunction with the thread number
parameter {\em myThid}. Much of the numerical code operating within the
WRAPPER takes the form
\begin{verbatim}
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          a block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO

      communication code to sum a number or maybe update
      tile overlap regions

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          another block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO
\end{verbatim}
The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
schematic. These variables specify the subset of the tiles in
the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
thread number {\em myThid} owns. The thread number variable {\em myThid}
ranges from 1 to the total number of threads requested at execution time.
For each value of {\em myThid} the loop scheme above will step sequentially
through the tiles owned by that thread. However, different threads will
have different ranges of tiles assigned to them, so that separate threads can
compute iterations of the {\em bi}, {\em bj} loop concurrently.
Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

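As an illustration (the numbers here are hypothetical and the actual
assignment is made by the WRAPPER at start-up), a decomposition with
{\em nSx}=2 and {\em nSy}=2 run with four threads could assign one tile to
each thread, giving loop bounds of the form:
\begin{verbatim}
 thread 1:  myBxLo=1, myBxHi=1, myByLo=1, myByHi=1
 thread 2:  myBxLo=2, myBxHi=2, myByLo=1, myByHi=1
 thread 3:  myBxLo=1, myBxHi=1, myByLo=2, myByHi=2
 thread 4:  myBxLo=2, myBxHi=2, myByLo=2, myByHi=2
\end{verbatim}
With a single thread the same decomposition would instead give
{\em myBxLo}=1, {\em myBxHi}=2, {\em myByLo}=1, {\em myByHi}=2, and the
four tiles would be computed sequentially.
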
An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs from
1,{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded within a single loop
over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example the evaluation of all
the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.

\begin{figure}
\begin{verbatim}
      REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  err
         :
         :
         other computations
         :
         :
      err = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO J=1,sNy
         DO I=1,sNx
          err = err +
     &     cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO

      CALL GLOBAL_SUM_R8( err , myThid )
      err = SQRT(err)

\end{verbatim}
\caption{Example of numerical code for calculating
the l2-norm of a vector within the WRAPPER. Notice that
under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These right-most indices are tile indices. Different
threads within a single process operate on different ranges of tile
index, as controlled by the settings of
{\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
} \label{fig:bibj_extract}
\end{figure}

The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
are used to indicate to the WRAPPER level how many processes (each with
{\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
This information is needed during initialization and during I/O phases.
However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy},
the values of {\em nPx} and {\em nPy} are absent
from the core numerical and support code.

\subsubsection{Examples of {\em SIZE.h} specifications}

The following different {\em SIZE.h} parameter settings illustrate how to
interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
and {\em nPy}.
\begin{enumerate}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  40,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up a single tile with x-dimension of ninety grid points, y-dimension of
forty grid points, and x and y overlaps of three grid points each.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with x-dimension of forty-five grid points, y-dimension of
twenty grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
arranged so that the global domain size is again ninety grid points in x and
forty grid points in y. In general the formula for global grid size (held in
model variables {\em Nx} and {\em Ny}) is
\begin{verbatim}
      Nx  = sNx*nSx*nPx
      Ny  = sNy*nSy*nPy
\end{verbatim}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  10,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   2,
     &           nPx =   1,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with x-dimension of ninety grid points, y-dimension of
ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two domains in y may be computed concurrently.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
thirty-two grid points, and x and y overlaps of three grid points each.
There are six tiles allocated to six separate logical processors ({\em nSx=6}).
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}


\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
once the WRAPPER has initialized correctly and has created the necessary variables
to support subsequent calls to communication routines
by the application code. The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT               :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL   :: Minimal startup. Just enough to
       |  |                      allow basic I/O.
       |  |-- EEINTRO_MSG      :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS      :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV    :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS        :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |  |
       |  |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                :: communication data structures
       |
       |
       |--CHECK_THREADS    :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN   :: Numerical code top-level driver routine


\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
} \label{fig:wrapper_startup}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse-grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the following
steps.\\

\paragraph{Compilation}
First the code must be compiled with appropriate multi-threading directives
active in the file {\em main.F} and with appropriate compiler flags
to request multi-threading support. The header files
{\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
contain directives compatible with compilers for Sun, Compaq, SGI,
Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
activated by using compile time
directives {\em -DTARGET\_SUN},
{\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
for invoking multi-threaded compilation vary from system to system
and from compiler to compiler. The options will be described
in the individual compiler documentation. For the Fortran compiler
from Sun the following options are needed to correctly compile
multi-threaded code
\begin{verbatim}
-stackvar -explicitpar -vpara -noautopar
\end{verbatim}
These options are specific to the Sun compiler. Other compilers
will use different syntax that will be described in their
documentation. The effect of these options is as follows; an
illustrative compile line is shown after the list.
\begin{enumerate}
\item {\bf -stackvar} Causes all local variables to be allocated in stack
storage. This is necessary for local variables to ensure that they are private
to their thread. Note, when using this option it may be necessary to override
the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the settings of the command shell's
{\em stack-size} limit variable. However, on some systems changing this limit
will require privileged administrator access to modify system parameters.

\item {\bf -explicitpar} Requests that multiple threads be spawned
in response to explicit directives in the application code. These
directives are inserted with syntax appropriate to the particular target
platform when, for example, the {\em -DTARGET\_SUN} flag is selected.

\item {\bf -vpara} This causes the compiler to describe the multi-threaded
configuration it is creating. This is not required
but it can be useful when troubleshooting.

\item {\bf -noautopar} This inhibits any automatic multi-threaded
parallelization the compiler may otherwise generate.

\end{enumerate}

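For example, an illustrative (not prescriptive) compile line for a single
file using the Sun compiler, with the {\em TARGET\_SUN} directives enabled,
might look like the following; in practice these flags are normally added to
the compiler options used by the {\em genmake2} build procedure rather than
typed by hand.
\begin{verbatim}
f77 -c -DTARGET_SUN -stackvar -explicitpar -vpara -noautopar main.F
\end{verbatim}
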

An example of valid settings for the {\em eedata} file for a
domain with two sub-domains in y and running with two threads is shown
below
\begin{verbatim}
nTx=1,nTy=2
\end{verbatim}
This set of values will cause computations to stay within a single
thread when moving across the {\em nSx} sub-domains. In the y-direction,
however, sub-domains will be split equally between two threads.

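In the MITgcm distribution {\em eedata} is a Fortran namelist file, so the
settings above would normally appear inside the {\em EEPARMS} namelist; a
minimal sketch (other entries omitted) is:
\begin{verbatim}
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}
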
\paragraph{Multi-threading files and parameters} The following
files and variables are used in setting up multi-threaded execution.\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
CPP:  {\em TARGET\_HP }\\
CPP:  {\em TARGET\_SGI}\\
CPP:  {\em TARGET\_CRAY\_VECTOR}\\
Parameter:  {\em nTx}\\
Parameter:  {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for
this is that many system libraries are still not ``thread-safe''. This
means that, for example, on some systems it is not safe to call system
routines to perform I/O when running in multi-threaded mode (except,
perhaps, in a limited set of circumstances). Another reason is that
support for multi-threaded programming models varies between systems.

Multi-process execution is more ubiquitous. In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sect:specifying_a_decomposition}) is given (in which at least
one of the parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation, appropriate compile-time
and run-time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that portable MPI libraries are available for controlling
the start-up of multiple processes. The MPI libraries are not
required, although they are usually used, for performance-critical
communication. However, in order to simplify the task of controlling
and coordinating the start-up of a large number (hundreds and possibly
even thousands) of copies of the same program, MPI is used. The calls
to the MPI multi-process startup routines must be activated at compile
time. Currently MPI libraries are invoked by specifying the
appropriate options file with the {\tt -of} flag when running the {\em
  genmake2} script, which generates the Makefile for compiling and
linking MITgcm. (Previously this was done by setting the {\em
  ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
  CPP\_EEOPTIONS.h} file.) More detailed information about the use of
{\em genmake2} for specifying
local compiler flags is located in section \ref{sect:genmake}.\\


\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
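A typical build sequence then takes the following form; the paths and the
options file name are placeholders to be filled in for the target machine
(see section \ref{sect:genmake} for the details of {\em genmake2}).
\begin{verbatim}
% cd <build directory>
% <path-to-MITgcm>/tools/genmake2 \
      -of=<path-to-MITgcm>/tools/build_options/<chosen options file>
% make depend
% make
\end{verbatim}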
\paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution. For the
open-source MPICH system, the MITgcm program can be started using a
command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of
processes that will be created. The numeric value {\em 64} must be
equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The parameter {\em mf} specifies
that a text file called ``mf'' will be read to get a list of processor
names on which the sixty-four processes will execute. The syntax of
this file is specified by the MPI distribution.
\\

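For MPICH the machine file is simply a list of host names, one per line, with
a host repeated when several processes are to run on the same node. An
illustrative ``mf'' for the example above might begin:
\begin{verbatim}
node001
node001
node002
node002
\end{verbatim}
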
\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\


\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting of
a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with
the multi-threaded compiler on a machine will explain how to set the
required environment variables.

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate the
number of threads to be used in the x and y directions. The variables
{\em nTx} and {\em nTy} in this file are used to specify the
information required. The product of {\em nTx} and {\em nTy} must be
equal to the number of threads spawned, i.e.\ the setting of the
environment variable PARALLEL. The value of {\em nTx} must subdivide
the number of sub-domains in x ({\em nSx}) exactly. The value of {\em
  nTy} must subdivide the number of sub-domains in y ({\em nSy})
exactly. The multi-process startup of the MITgcm executable {\em
  mitgcmuv} is controlled by the routines {\em EEBOOT\_MINIMAL()} and
{\em INI\_PROCS()}. The first routine performs basic steps required to
make sure each process is started and has a textual output stream
associated with it. By default two output files are opened for each
process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}. The {\bf
  NNNN} part of the name is filled in with the process number so that
process number 0 will create output files {\bf STDOUT.0000} and {\bf
  STDERR.0000}, process number 1 will create output files {\bf
  STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process-by-process basis. The {\em EEBOOT\_MINIMAL()}
procedure also sets the variables {\em myProcId} and {\em
  MPI\_COMM\_MODEL}. These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()} to
allocate tiles to processes.
1080    
1081     Allocation of processes to tiles is controlled by the routine {\em
1082     INI\_PROCS()}. For each process this routine sets the variables {\em
1083     myXGlobalLo} and {\em myYGlobalLo}. These variables specify, in
1084     index space, the coordinates of the southernmost and westernmost
1085     corner of the southernmost and westernmost tile owned by this process.
1086     The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
1087     also set in this routine. These are used to identify processes holding
1088     tiles to the west, east, south and north of a given process. These
1089     values are stored in global storage in the header file {\em
EESUPPORT.h} for use by communication routines. The above does not
hold when the exch2 package is used: the exch2 package sets its own
parameters to specify the global indices of tiles and their
relationships to each other. See the documentation on the exch2
package (\ref{sec:exch2}) for details.
1095 cnh 1.1 \\
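
For the default, regular decomposition (exch2 not used), the origin
assigned to each process can be sketched as follows; this is a
schematic of the logic in {\em ini\_procs.F}, with {\em myPx} and
{\em myPy} denoting the process position within the {\em nPx}
$\times$ {\em nPy} process grid:
\begin{verbatim}
C     Schematic only: index-space origin of the tiles owned by
C     process (myPx,myPy) in a regular nPx x nPy decomposition.
      myXGlobalLo = 1 + (myPx-1)*sNx*nSx
      myYGlobalLo = 1 + (myPy-1)*sNy*nSy
\end{verbatim}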
1096    
1097     \fbox{
1098     \begin{minipage}{4.75in}
1099     File: {\em eesupp/src/eeboot\_minimal.F}\\
1100     File: {\em eesupp/src/ini\_procs.F} \\
1101     File: {\em eesupp/inc/EESUPPORT.h} \\
1102     Parameter: {\em myProcId} \\
1103     Parameter: {\em MPI\_COMM\_MODEL} \\
1104     Parameter: {\em myXGlobalLo} \\
1105     Parameter: {\em myYGlobalLo} \\
1106     Parameter: {\em pidW } \\
1107     Parameter: {\em pidE } \\
1108     Parameter: {\em pidS } \\
1109     Parameter: {\em pidN }
1110     \end{minipage}
1111     } \\
1112    
1113    
1114     \subsection{Controlling communication}
1115     The WRAPPER maintains internal information that is used for communication
1116     operations and that can be customized for different platforms. This section
1117     describes the information that is held and used.
1118 adcroft 1.2
1119 cnh 1.1 \begin{enumerate}
1120 edhill 1.21 \item {\bf Tile-tile connectivity information}
For each tile the WRAPPER records the tile numbers of the tiles to
the north, south, east and west of that tile. Each tile number is
unique over all tiles in a configuration. Except when using the
cubed sphere and the exch2 package, the numbers are held in the
variables {\em tileNo} (this holds the tile's own number), {\em
tileNoN}, {\em tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
1127     stored with each tile that specifies the type of communication that
1128     is used between tiles. This information is held in the variables
1129     {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
1130     {\em tileCommModeW}. This latter set of variables can take one of
1131     the following values {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
1132     COMM\_PUT} and {\em COMM\_GET}. A value of {\em COMM\_NONE} is
1133     used to indicate that a tile has no neighbor to communicate with on
1134     a particular face. A value of {\em COMM\_MSG} is used to indicate
1135     that some form of distributed memory communication is required to
1136     communicate between these tile faces (see section
1137     \ref{sect:distributed_memory_communication}). A value of {\em
1138     COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
1139     memory communication (see section
1140     \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value
1141     indicates that a CPU should communicate by writing to data
1142     structures owned by another CPU. A {\em COMM\_GET} value indicates
1143     that a CPU should communicate by reading from data structures owned
1144     by another CPU. These flags affect the behavior of the WRAPPER
1145     exchange primitive (see figure \ref{fig:communication_primitives}).
1146     The routine {\em ini\_communication\_patterns()} is responsible for
1147     setting the communication mode values for each tile.
1148    
1149     When using the cubed sphere configuration with the exch2 package,
1150     the relationships between tiles and their communication methods are
1151     set by the exch2 package and stored in different variables. See the
exch2 package documentation (\ref{sec:exch2}) for details.
1153 cnh 1.1
1154     \fbox{
1155     \begin{minipage}{4.75in}
1156     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1157     File: {\em eesupp/inc/EESUPPORT.h} \\
1158     Parameter: {\em tileNo} \\
1159     Parameter: {\em tileNoE} \\
1160     Parameter: {\em tileNoW} \\
1161     Parameter: {\em tileNoN} \\
1162     Parameter: {\em tileNoS} \\
1163     Parameter: {\em tileCommModeE} \\
1164     Parameter: {\em tileCommModeW} \\
1165     Parameter: {\em tileCommModeN} \\
1166     Parameter: {\em tileCommModeS} \\
1167     \end{minipage}
1168     } \\
1169    
1170     \item {\bf MP directives}
1171 edhill 1.21 The WRAPPER transfers control to numerical application code through
1172     the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
that allows it to be invoked by several threads. Support for
1174     this is based on either multi-processing (MP) compiler directives or
1175     specific calls to multi-threading libraries (\textit{eg.} POSIX
1176     threads). Most commercially available Fortran compilers support the
1177     generation of code to spawn multiple threads through some form of
1178     compiler directives. Compiler directives are generally more
1179     convenient than writing code to explicitly spawning threads. And,
1180     on some systems, compiler directives may be the only method
1181     available. The WRAPPER is distributed with template MP directives
1182     for a number of systems.
1183    
1184     These directives are inserted into the code just before and after
1185     the transfer of control to numerical algorithm code through the
1186     routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
1187     an example of the code that performs this process for a Silicon
1188     Graphics system. This code is extracted from the files {\em main.F}
1189     and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
1190     specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
1191     will be created. The value of {\em nThreads} is set in the routine
{\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
1193     product of the parameters {\em nTx} and {\em nTy} that are read from
1194     the file {\em eedata}. If the value of {\em nThreads} is
1195     inconsistent with the number of threads requested from the operating
1196     system (for example by using an environment variable as described in
1197     section \ref{sect:multi_threaded_execution}) then usually an error
1198     will be reported by the routine {\em CHECK\_THREADS}.
1199 cnh 1.1
1200     \fbox{
1201     \begin{minipage}{4.75in}
1202     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1203     File: {\em eesupp/src/check\_threads.F} \\
1204     File: {\em eesupp/src/main.F} \\
1205     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1206     File: {\em eedata } \\
1207     Parameter: {\em nThreads} \\
1208     Parameter: {\em nTx} \\
1209     Parameter: {\em nTy} \\
1210     \end{minipage}
1211     }
1212    
1213     \item {\bf memsync flags}
1214 edhill 1.21 As discussed in section \ref{sect:memory_consistency}, a low-level
system function may be needed to force memory consistency on some
1216     shared memory systems. The routine {\em MEMSYNC()} is used for this
1217     purpose. This routine should not need modifying and the information
1218     below is only provided for completeness. A logical parameter {\em
1219     exchNeedsMemSync} set in the routine {\em
1220     INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
1221     MEMSYNC()} primitive is called. In general this routine is only
1222     used for multi-threaded execution. The code that goes into the {\em
1223     MEMSYNC()} routine is specific to the compiler and processor used.
1224     In some cases, it must be written using a short code snippet of
1225     assembly language. For an Ultra Sparc system the following code
1226     snippet is used
1227 cnh 1.1 \begin{verbatim}
1228     asm("membar #LoadStore|#StoreStore");
1229     \end{verbatim}
1230 cnh 1.4 for an Alpha based system the equivalent code reads
1231 cnh 1.1 \begin{verbatim}
1232     asm("mb");
1233     \end{verbatim}
1234     while on an x86 system the following code is required
1235     \begin{verbatim}
1236     asm("lock; addl $0,0(%%esp)": : :"memory")
1237     \end{verbatim}
1238    
1239     \item {\bf Cache line size}
1240 edhill 1.21 As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with
1242     excessive coherence traffic on an SMP system. To do this the shared
1243     memory data structures used by the {\em GLOBAL\_SUM}, {\em
1244     GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
1245     that control the padding are set in the header file {\em
1246     EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
1247     lShare1}, {\em lShare4} and {\em lShare8}. The default values
1248     should not normally need changing.
1249    
1250 cnh 1.1 \item {\bf \_BARRIER}
1251 edhill 1.21 This is a CPP macro that is expanded to a call to a routine which
1252     synchronizes all the logical processors running under the WRAPPER.
1253     Using a macro here preserves flexibility to insert a specialized
1254     call in-line into application code. By default this resolves to
1255     calling the procedure {\em BARRIER()}. The default setting for the
1256     \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1257 cnh 1.1
1258     \item {\bf \_GSUM}
1259 edhill 1.21 This is a CPP macro that is expanded to a call to a routine which
1260     sums up a floating point number over all the logical processors
1261     running under the WRAPPER. Using a macro here provides extra
1262     flexibility to insert a specialized call in-line into application
1263     code. By default this resolves to calling the procedure {\em
1264     GLOBAL\_SUM\_R8()} ( for 64-bit floating point operands) or {\em
1265     GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
1266     default setting for the \_GSUM macro is given in the file {\em
1267     CPP\_EEMACROS.h}. The \_GSUM macro is a performance critical
1268     operation, especially for large processor count, small tile size
1269     configurations. The custom communication example discussed in
1270     section \ref{sect:jam_example} shows how the macro is used to invoke
1271     a custom global sum routine for a specific set of hardware.
1272 cnh 1.1
1273     \item {\bf \_EXCH}
1274 edhill 1.21 The \_EXCH CPP macro is used to update tile overlap regions. It is
1275     qualified by a suffix indicating whether overlap updates are for
1276     two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1277     physical fields and whether fields are 32-bit floating point (
1278     \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point (
1279     \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined in
1280     the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
1281     operation plays a crucial role in scaling to small tile, large
1282     logical and physical processor count configurations. The example in
1283     section \ref{sect:jam_example} discusses defining an optimized and
specialized form of the \_EXCH operation.
1285    
1286     The \_EXCH operation is also central to supporting grids such as the
1287     cube-sphere grid. In this class of grid a rotation may be required
1288     between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to be
1290     embedded within a custom form of the \_EXCH primitive. In these
1291     cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
1292     package documentation \ref{sec:exch2}.
1293 cnh 1.1
1294     \item {\bf Reverse Mode}
1295 edhill 1.21 The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms. These reverse
1297     mode forms can be found in the source code directory {\em
1298     pkg/autodiff}. For the global sum primitive the reverse mode form
1299     calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
1300     The reverse mode form of the exchange primitives are found in
1301     routines prefixed {\em ADEXCH}. The exchange routines make calls to
1302     the same low-level communication primitives as the forward mode
1303     operations. However, the routine argument {\em simulationMode} is
1304     set to the value {\em REVERSE\_SIMULATION}. This signifies to the
1305     low-level routines that the adjoint forms of the appropriate
1306     communication operation should be performed.
1307 afe 1.16
1308 cnh 1.1 \item {\bf MAX\_NO\_THREADS}
1309 edhill 1.21 The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
1310     number of OS threads that a code will use. This value defaults to
1311     thirty-two and is set in the file {\em EEPARAMS.h}. For single
1312     threaded execution it can be reduced to one if required. The value
1313     is largely private to the WRAPPER and application code will not
1314     normally reference the value, except in the following scenario.
1315    
For certain physical parameterization schemes it is necessary to
have a substantial number of work arrays. Where these arrays are
held in statically allocated storage (for example COMMON blocks),
multi-threaded execution will require multiple instances of the
COMMON block data.
1320     This can be achieved using a Fortran 90 module construct. However,
1321     if this mechanism is unavailable then the work arrays can be extended
1322     with dimensions using the tile dimensioning scheme of {\em nSx} and
1323     {\em nSy} (as described in section
1324     \ref{sect:specifying_a_decomposition}). However, if the
1325     configuration being specified involves many more tiles than OS
1326     threads then it can save memory resources to reduce the variable
1327     {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
1328     that will be used and to declare the physical parameterization work
1329     arrays with a single {\em MAX\_NO\_THREADS} extra dimension. An
1330     example of this is given in the verification experiment {\em
1331     aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
1332     altered to
1333 cnh 1.1 \begin{verbatim}
1334     INTEGER MAX_NO_THREADS
1335     PARAMETER ( MAX_NO_THREADS = 6 )
1336     \end{verbatim}
1337 edhill 1.21 and several work arrays for storing intermediate calculations are
created with declarations of the form:
1339 cnh 1.1 \begin{verbatim}
1340     common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1341     \end{verbatim}
1342 edhill 1.21 This declaration scheme is not used widely, because most global data
1343     is used for permanent not temporary storage of state information.
1344     In the case of permanent state information this approach cannot be
1345     used because there has to be enough storage allocated for all tiles.
1346     However, the technique can sometimes be a useful scheme for reducing
1347     memory requirements in complex physical parameterizations.
1348 adcroft 1.2 \end{enumerate}
1349    
1350     \begin{figure}
1351     \begin{verbatim}
1352     C--
1353     C-- Parallel directives for MIPS Pro Fortran compiler
1354     C--
1355     C Parallel compiler directives for SGI with IRIX
1356     C$PAR PARALLEL DO
1357     C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1358     C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1359     C
1360     DO I=1,nThreads
1361     myThid = I
1362    
1363     C-- Invoke nThreads instances of the numerical model
1364     CALL THE_MODEL_MAIN(myThid)
1365    
1366     ENDDO
1367     \end{verbatim}
1368 edhill 1.21 \caption{Prior to transferring control to the procedure {\em
1369     THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
1370     multiple threads. } \label{fig:mp_directives}
1371 adcroft 1.2 \end{figure}
1372 cnh 1.1
1373    
1374     \subsubsection{Specializing the Communication Code}
1375    
1376     The isolation of performance critical communication primitives and the
1377     sub-division of the simulation domain into tiles is a powerful tool.
1378     Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.
1380 cnh 1.1
1381     \subsubsection{JAM example}
1382 adcroft 1.6 \label{sect:jam_example}
1383 edhill 1.21 On some platforms a big performance boost can be obtained by binding
1384     the communication routines {\em \_EXCH} and {\em \_GSUM} to
1385     specialized native libraries (for example, the shmem library on CRAY
1386     T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
1387     illustration of a specialized communication configuration that
1388     substitutes for standard, portable forms of {\em \_EXCH} and {\em
1389     \_GSUM}. It affects three source files {\em eeboot.F}, {\em
1390     CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined is has
1391     the following effects.
1392 cnh 1.1 \begin{itemize}
\item An extra phase is included at boot time to initialize the custom
  communications library (see {\em ini\_jam.F} and the sketch after
  this list).
1395 cnh 1.1 \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
1396 edhill 1.21 with calls to custom routines (see {\em gsum\_jam.F} and {\em
exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
1399 edhill 1.21 for overlap regions of width one) is substituted into the elliptic
1400     solver routine {\em cg2d.F}.
1401 cnh 1.1 \end{itemize}
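
Schematically, the boot-time hook referred to in the first item takes
the form of a CPP-guarded call (a sketch only, not the verbatim
source of {\em eeboot.F}):
\begin{verbatim}
#ifdef LETS_MAKE_JAM
C     Initialize the custom (JAM) communications library
      CALL INI_JAM
#endif
\end{verbatim}
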
1402     Developing specialized code for other libraries follows a similar
1403     pattern.
1404    
1405     \subsubsection{Cube sphere communication}
1406 adcroft 1.6 \label{sect:cube_sphere_communication}
1407 edhill 1.21 Actual {\em \_EXCH} routine code is generated automatically from a
1408     series of template files, for example {\em exch\_rx.template}. This
1409     is done to allow a large number of variations on the exchange process
1410     to be maintained. One set of variations supports the cube sphere grid.
1411     Support for a cube sphere grid in MITgcm is based on having each face
1412     of the cube as a separate tile or tiles. The exchange routines are
1413     then able to absorb much of the detailed rotation and reorientation
1414     required when moving around the cube grid. The set of {\em \_EXCH}
1415     routines that contain the word cube in their name perform these
1416     transformations. They are invoked when the run-time logical parameter
1417 cnh 1.1 {\em useCubedSphereExchange} is set true. To facilitate the
1418 edhill 1.21 transformations on a staggered C-grid, exchange operations are defined
1419     separately for both vector and scalar quantities and for grid-centered
1420     and for grid-face and grid-corner quantities. Three sets of exchange
1421     routines are defined. Routines with names of the form {\em exch\_rx}
1422     are used to exchange cell centered scalar quantities. Routines with
1423     names of the form {\em exch\_uv\_rx} are used to exchange vector
1424     quantities located at the C-grid velocity points. The vector
1425     quantities exchanged by the {\em exch\_uv\_rx} routines can either be
1426     signed (for example velocity components) or un-signed (for example
1427     grid-cell separations). Routines with names of the form {\em
1428     exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
1429     point locations.
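
As a minimal sketch, and assuming the flag is read from the execution
environment parameter file {\em eedata} as part of its namelist,
cubed-sphere exchanges are then selected at run time with an entry
such as:
\begin{verbatim}
 &EEPARMS
 useCubedSphereExchange=.TRUE.,
 &
\end{verbatim}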
1430 cnh 1.1
1431    
1432    
1433    
1434     \section{MITgcm execution under WRAPPER}
1435 edhill 1.19 \begin{rawhtml}
1436     <!-- CMIREDIR:mitgcm_wrapper: -->
1437     \end{rawhtml}
1438 cnh 1.1
Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.
1442 cnh 1.1
1443     \subsection{Annotated call tree for MITgcm and WRAPPER}
1444 adcroft 1.6 \label{sect:calling_sequence}
1445 cnh 1.1
1446     WRAPPER layer.
1447    
1448 heimbach 1.8 {\footnotesize
1449 cnh 1.1 \begin{verbatim}
1450    
1451     MAIN
1452     |
1453     |--EEBOOT :: WRAPPER initialization
1454     | |
1455     | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1456     | | allow basic I/O.
1457     | |-- EEINTRO_MSG :: Write startup greeting.
1458     | |
1459     | |-- EESET_PARMS :: Set WRAPPER parameters
1460     | |
1461     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1462     | |
1463     | |-- INI_PROCS :: Associate processes with grid regions.
1464     | |
1465     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1466     | |
1467     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1468     | :: communication data structures
1469     |
1470     |
1471     |--CHECK_THREADS :: Validate multiple thread start up.
1472     |
1473     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1474    
1475     \end{verbatim}
1476 heimbach 1.8 }
1477 cnh 1.1
1478     Core equations plus packages.
1479    
1480 heimbach 1.8 {\footnotesize
1481 cnh 1.1 \begin{verbatim}
1482     C
1483     C
1484     C Invocation from WRAPPER level...
1485     C :
1486     C :
1487     C |
1488     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1489     C | :: Called from WRAPPER level numerical
1490 cnh 1.4 C | :: code invocation routine. On entry
1491 cnh 1.1 C | :: to THE_MODEL_MAIN separate thread and
1492     C | :: separate processes will have been established.
1493     C | :: Each thread and process will have a unique ID
1494     C | :: but as yet it will not be associated with a
1495     C | :: specific region in decomposed discrete space.
1496     C |
1497     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1498     C | | :: grid, solver matrices etc..
1499     C | |
1500     C | |-INI_PARMS :: Routine to set kernel model parameters.
1501     C | | :: By default kernel parameters are read from file
1502     C | | :: "data" in directory in which code executes.
1503     C | |
1504 cnh 1.4 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1505 cnh 1.1 C | |
1506 cnh 1.4 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1507 cnh 1.1 C | | | :: Grid arrays are held and described in GRID.h.
1508     C | | |
1509 cnh 1.4 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1510 cnh 1.1 C | | |
1511 cnh 1.4 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1512 cnh 1.1 C | | | :: (calculate grid from kernel parameters).
1513     C | | |
1514     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1515 cnh 1.4 C | | | :: initialization (calculate grid from
1516 cnh 1.1 C | | | :: kernel parameters).
1517     C | | |
1518     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1519 cnh 1.4 C | | :: grid initializations. ( input from raw
1520 cnh 1.1 C | | :: grid files, LONC.bin, DXF.bin etc... )
1521     C | |
1522     C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1523     C | |
1524     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1525     C | | :: land masking for solid-fluid boundaries.
1526     C | |
1527     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1528     C | |
1529     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1530 cnh 1.4 C | | :: sphere options are coded.
1531 cnh 1.1 C | |
1532     C | |-PACKAGES_BOOT :: Start up the optional package environment.
1533     C | | :: Runtime selection of active packages.
1534     C | |
1535     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1536     C | | |
1537     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1538     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1539     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1540     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1541     C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1542     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1543     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1544     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1545     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1546     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1547     C | |
1548     C | |-PACKAGES_CHECK
1549     C | | |
1550     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1551 cnh 1.4 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1552 cnh 1.1 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1553     C | |
1554     C | |-PACKAGES_INIT_FIXED
1555     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1556     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1557     C | |
1558     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1559     C | |
1560     C | |-INI_CG2D :: 2d con. grad solver initialisation.
1561     C | |
1562     C | |-INI_CG3D :: 3d con. grad solver initialisation.
1563     C | |
1564     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1565     C | :: Includes annotated table of kernel
1566     C | :: parameter settings.
1567     C |
1568     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1569     C |
1570     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1571 cnh 1.4 C ! :: Automatically generated by TAMC/TAF.
1572 cnh 1.1 C |
1573     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1574     C |
1575     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1576     C |
1577     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1578     C | |
1579     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1580     C | | | :: variables
1581     C | | |
1582     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1583     C | | |
1584     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1585 cnh 1.4 C | | | :: sphere options are coded.
1586 cnh 1.1 C | | |
1587     C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1588     C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1589     C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1590     C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1591     C | | | :: fields).
1592     C | | |
1593 cnh 1.4 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1594 cnh 1.1 C | | | |-INI_VEL :: Initialize 3D flow field.
1595     C | | | |-INI_THETA :: Set model initial temperature field.
1596     C | | | |-INI_SALT :: Set model initial salinity field.
1597     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1598     C | | |
1599     C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1600     C | | |
1601     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1602     C | | | | :: Here applied to move restart conditions
1603     C | | | | :: (saved in mid timestep) to correct level in
1604     C | | | | :: time (only used for pre-c35).
1605     C | | | |
1606     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1607     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1608     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1609     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1610     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1611     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1612     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1613     C | | | | |-FIND_RHO :: Find adjacent densities.
1614     C | | | | |-CONVECT :: Mix static instability.
1615     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1616     C | | | |
1617     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1618     C | | |
1619     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1620     C | | | | :: Extra time history interactions.
1621     C | | | |
1622     C | | | |-FIND_RHO :: Find adjacent densities.
1623     C | | | |-CONVECT :: Mix static instability.
1624     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1625     C | | |
1626     C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1627     C | | | | :: package data.
1628     C | | | |
1629     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1630     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1631     C | | | |-KPP_OPEN_DIAGS
1632     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1633     C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1634     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1635     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1636     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1637     C | | | |-INI_FORCING :: Set model initial forcing fields.
1638     C | | | | :: Either set in-line or from file as shown.
1639     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1640     C | | | |-READ_FLD_XY_RS(meridWindFile)
1641     C | | | |-READ_FLD_XY_RS(surfQFile)
1642     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1643     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1644     C | | | |-READ_FLD_XY_RS(saltClimFile)
1645     C | | | |-READ_FLD_XY_RS(surfQswFile)
1646     C | | |
1647     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1648     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1649     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1650     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1651     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1652     C | |
1653     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1654     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1655     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1656     C | |
1657     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1658     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1659     C====|>|
1660     C====|>| ****************************
1661     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1662     C====|>| ****************************
1663     C====|>|
1664     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1665     C/\ | | |
1666     C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1667     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1668     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1669     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1670     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1671 cnh 1.4 C/\ | | | | :: Simple interpolation between end-points
1672 cnh 1.1 C/\ | | | | :: for forcing datasets.
1673     C/\ | | | |
1674     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1675     C/\ | | |
1676     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1677     C/\ | | | |
1678     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1679     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1680     C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1681     C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1682     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1683     C/\ | | | |
1684     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1685     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1686     C/\ | | | | :: forcing terms.
1687     C/\ | | | |
1688     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1689     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1690     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1691     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1692     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1693     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1694     C/\ | | | | advection case) (see pkg/gad).
1695     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1696     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1697     C/\ | | | | |
1698     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1699     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1700     C/\ | | | |
1701     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1702     C/\ | | | | |
1703     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1704     C/\ | | | | | :: ( see pkg/gad )
1705     C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1706     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1707     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1708     C/\ | | | |
1709     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1710     C/\ | | | |
1711     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1712     C/\ | | | | |
1713     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1714     C/\ | | | | | :: ( see pkg/gad )
1715     C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1716     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1717     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1718     C/\ | | | |
1719     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1720     C/\ | | | |
1721     C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1722     C/\ | | | | |
1723     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1724     C/\ | | | | | :: ( see pkg/gad )
1725     C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1726     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1727     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1728     C/\ | | | |
1729     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1730     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1731     C/\ | | | |-FREEZE :: Limit range of temperature.
1732     C/\ | | | |
1733     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1734     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1735     C/\ | | | |
1736     C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1737     C/\ | | | |-EXCH :: Update overlaps
1738     C/\ | | |
1739     C/\ | | |-DYNAMICS :: Momentum equations driver.
1740     C/\ | | | |
1741     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1742     C/\ | | | | Potential anomaly.
1743     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1744     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1745     C/\ | | | |
1746     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1747     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1748     C/\ | | | | pkg/mom_fluxform ).
1749     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1750     C/\ | | | | pkg/mom_vecinv ).
1751     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1752     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1753     C/\ | | | |
1754     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1755     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1756     C/\ | | | |
1757     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1758     C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1759     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1760     C/\ | | |
1761     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1762     C/\ | | |
1763     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1764     C/\ | | |
1765     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1766     C/\ | | |
1767     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1768     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1769     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1770     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1771     C/\ | | |
1772     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1773     C/\ | | | |
1774     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1775     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1776     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1777     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1778     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1779     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1780     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1781     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1782     C/\ | | | | |-CONVECT :: Mix static instability.
1783     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1784     C/\ | | | |
1785     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1786     C/\ | | |
1787     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1788     C/\ | | | |-EXCH
1789     C/\ | | |
1790     C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1791     C/\ | | |
1792     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1793     C/\ | | |
1794     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1795     C/\ | | | |-WRITE_STATE :: Core state I/O
1796     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1797     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1798     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1799     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1800     C/\ | | |
1801     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1802     C/\ | |
1803     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1804     C<===|=|
1805     C<===|=| **************************
1806     C<===|=| END MAIN TIMESTEPPING LOOP
1807     C<===|=| **************************
1808     C<===|=|
1809     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1810     C |
1811     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1812     C |
1813     C |-TIMER_PRINTALL :: Computational timing summary
1814     C |
1815     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1816     C :: events.
1817     C
1818     \end{verbatim}
1819 heimbach 1.8 }
1820 cnh 1.1
1821     \subsection{Measuring and Characterizing Performance}
1822    
1823     TO BE DONE (CNH)
1824    
1825     \subsection{Estimating Resource Requirements}
1826    
1827     TO BE DONE (CNH)
1828    
1829     \subsubsection{Atlantic 1/6 degree example}
1830     \subsubsection{Dry Run testing}
1831     \subsubsection{Adjoint Resource Requirements}
1832     \subsubsection{State Estimation Environment Resources}
1833    
