% $Header: /u/gcmpack/manual/part4/sarch.tex,v 1.17 2004/03/23 15:29:40 afe Exp $

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects, simply starting from an example code and adapting it to suit a
particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. The second part of the chapter describes practical
details of the MITgcm implementation and of the current tools and operating
system features that are employed.

\section{Overall architectural goals}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

\begin{itemize}
\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to
a wide range of platforms.
\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.
\end{itemize}

These points are summarized in figure \ref{fig:mitgcm_architecture_goals},
which conveys the goals of the MITgcm design. The goals lead to
a software architecture which at a high level can be viewed as consisting
of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in detail in
section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
specialized physical content onto the core numerical code. An overview of
the {\bf package} scheme is given at the start of part \ref{part:packages}.
\item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
Programming Environment Resource), within which the core numerics and pluggable
packages operate.
\end{enumerate}

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
\end{center}
\caption{
The MITgcm architecture is designed to allow simulation of a wide
range of physical problems on a wide range of hardware. The computational
resource requirements of the applications targeted range from around
$10^7$ bytes ($\approx 10$ megabytes) of memory to $10^{11}$ bytes
($\approx 100$ gigabytes). Arithmetic operation counts for the applications of
interest range from $10^{9}$ floating point operations to more than $10^{17}$
floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in
MITgcm is a software superstructure and substructure collectively
called the WRAPPER (Wrappable Application Parallel Programming
Environment Resource). All numerical and support code in MITgcm is written
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to insulate code
that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.


\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
multi-processor, CRAY Y-MP system. On that system, numerical code performance
and scaling under the WRAPPER exceeded that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
micro-tasking approach. Later developments have been carried out on
uniprocessor and multi-processor Sun systems with both uniform memory access
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also routinely used on
large scale MPP systems (for example T3E systems and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain
native optimizations for a particular system \ref{ref hoe and hill, ecmwf}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMP and PVP systems) and distributed
memory parallel systems (for example x86 clusters and large MPP systems). This
is one example of a difference between compute platforms that can impact an
application. Another common distinction is between vector processing systems
with highly specialized CPUs and memory subsystems and commodity
microprocessor based systems. There are numerous other differences, especially
in relation to how parallel execution is supported. To capture the essential
differences between different platforms the WRAPPER uses a {\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular family or
class of machines (for example Parallel Vector Processor Systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.

\subsection{Machine model parallelism}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is assumed to
consist of one or more logical processors that can compute concurrently.
Computational work is divided among the logical
processors by allocating ``ownership'' to
each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific
region of the physical space that is being simulated; only one processor will
be associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do not need
to correspond to physical processors. It is perfectly possible to execute a
configuration decomposed for multiple logical processors on a single physical
processor. This helps ensure that numerical code that is written to fit
within the WRAPPER will parallelize with no additional effort, and it is
also useful when debugging codes. Generally, however,
the computational domain will be subdivided over multiple logical
processors in order to then bind those logical processors to physical
processor resources that can compute in parallel.

\subsubsection{Tiles}

Computationally, associated with each region of physical
space allocated to a particular logical processor, there will be data
structures (arrays, scalar variables, etc.) that hold the simulated state of
that region. We refer to these data structures as being {\bf owned} by the
processor to which their
associated region of physical space has been allocated. Individual
regions that are allocated to processors are called {\bf tiles}. A
processor can own more
than one tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a single
region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously, working
only with data from the tile (or tiles) that the processor owns. When multiple
tiles are allotted to a single processor, each tile is computed
independently of the other tiles, in a sequential fashion.

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/domain_decomp.eps}
}
\end{center}
\caption{ The WRAPPER provides support for one and two dimensional
decompositions of grid-point domains. The figure shows a hypothetical domain of
total size $N_{x}N_{y}N_{z}$. This hypothetical domain is decomposed in
two dimensions along the $N_{x}$ and $N_{y}$ directions. The resulting {\bf
tiles} are {\bf owned} by different processors. The {\bf owning}
processors perform the
arithmetic operations associated with a {\bf tile}. Although not illustrated
here, a single processor can {\bf own} several {\bf tiles}.
Whenever a processor wishes to transfer data between tiles or
communicate with other processors it calls a WRAPPER supplied
function.
} \label{fig:domaindecomp}
\end{figure}


\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region. The overlap region
of a tile corresponds to the interior region of an adjacent tile.
In figure \ref{fig:tiledworld} each tile would own the region
within the black square and hold duplicate information for overlap
regions extending into the tiles to the north, south, east and west.
During
computational phases a processor will reference data in an overlap region
whenever it requires values that lie outside the domain it owns. Periodically
processors will make calls to WRAPPER functions to communicate data between
tiles, in order to keep the overlap regions up to date (see section
\ref{sect:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/tiled-world.eps}
}
\end{center}
\caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
Overlap regions are periodically updated from neighboring tiles.
} \label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two possible
mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}.
Under this mode of communication data transfers are assumed to be possible
using direct addressing of regions of memory. In this case a CPU is able to read
from (and write to) regions of memory ``owned'' by another CPU
using simple programming language level assignment operations of the
sort shown in figure \ref{fig:simple_assign}. In this way one CPU
(CPU1 in the figure) can communicate information to another CPU (CPU2 in the
figure) by assigning a particular value to a particular memory location.

\item {\bf Distributed memory communication}.
Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned by and visible to another CPU. Instead
a communication library must be used, as illustrated in figure
\ref{fig:comm_msg}. In this case a CPU must call a function in the API of the
communication library to communicate data from a tile that it owns to a tile
that another CPU owns. By default the WRAPPER binds to the MPI communication
library \ref{MPI} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication. The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

             CPU1                     |        CPU2
             ====                     |        ====
                                      |
       a(3) = 8                       |    WHILE ( a(3) .NE. 8 )
                                      |     WAIT
                                      |    END WHILE
                                      |
\end{verbatim}
\caption{ In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level.
For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

             CPU1                     |        CPU2
             ====                     |        ====
                                      |
       a(3) = 8                       |    WHILE ( a(3) .NE. 8 )
       CALL SEND( CPU2,a(3) )         |     CALL RECV( CPU1, a(3) )
                                      |    END WHILE
                                      |
\end{verbatim}
\caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
CPU1 and/or CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
} \label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication, independent CPUs operate
on the exact same global address space at the application level.
This means that CPU 1 can directly write into global
data structures that CPU 2 ``owns'' using a simple
assignment at the application level.
This model of memory access is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (the SGI Origin).
On such systems the WRAPPER will generally use simple read and write statements
to directly access application data structures when communicating between CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to
hardware instructions that transport data between CPU and memory banks, this
can be a very efficient mechanism for communication. In this case two CPUs,
CPU1 and CPU2, can communicate simply by reading and writing to an
agreed location and following a few basic rules. The latency of this sort
of communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPUs communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
communication very efficient provided it is used appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between
multiple processors the WRAPPER level shields user applications from
certain counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general the order
of reads and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will always
operate so that, for the application instructions the processor is executing,
any reordering is not apparent. However, machines are often
designed so that reordering of instructions is not hidden from other
processors. This means that, in general, even on a shared memory system two
processors can observe inconsistent memory values.

The issue of memory consistency between multiple processors is discussed at
length in many computer science papers; however, from a practical point of
view, in order to deal with this issue, shared memory machines all provide
some mechanism to enforce memory consistency when it is needed. The exact
mechanism employed will vary between systems. For communication using shared
memory, the WRAPPER provides a place to invoke the appropriate mechanism to
ensure memory consistency for a particular platform.
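
To make the hazard concrete, the following fragment sketches the flag-based
hand-off implied by figure \ref{fig:simple_assign} and marks the point at which
a platform-specific consistency mechanism must be invoked. This is a schematic
only, not the WRAPPER's actual implementation; the routine name {\em
MEMORY\_FENCE()} and the shared variables are purely illustrative.
\begin{verbatim}
      INTEGER flag(1)
      REAL*8  data(1), computedValue, result
C     flag and data live in storage visible to both CPUs

C     --- CPU1: produce the data, then publish it by setting the flag
      data(1) = computedValue
C     Without some memory-consistency action here, the write to flag(1)
C     may become visible to CPU2 before the write to data(1) does.
      CALL MEMORY_FENCE
      flag(1) = 1

C     --- CPU2: wait for the flag; only then is it safe to read the data
   10 CONTINUE
      IF ( flag(1) .NE. 1 ) GOTO 10
      CALL MEMORY_FENCE
      result = data(1)
\end{verbatim}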

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have memory caches local to each processor
which contain mirrored copies of main memory. Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors. These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks). The coherency protocols employed can be expensive relative to other
memory accesses, and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.
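
The fragment below sketches the padding idea; the sizes and names are
illustrative and are not the WRAPPER's actual data structures.
\begin{verbatim}
C     Illustrative padding of per-thread synchronization flags. Each
C     thread only ever touches column myThid, and the first dimension
C     pads each column out to a full coherence unit (here assumed to
C     be 256 bytes, i.e. 64 four-byte integers), so that two threads
C     never update words that share a cache line.
      INTEGER cacheLineWords, maxThreads
      PARAMETER ( cacheLineWords = 64, maxThreads = 32 )
      INTEGER syncFlag( cacheLineWords, maxThreads )
      COMMON /ILLUSTRATIVE_SYNC/ syncFlag
C     ...
      syncFlag(1,myThid) = 1
\end{verbatim}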

\subsubsection{Operating system support for shared memory}

Applications running under multiple threads within a single process can
use shared memory communication. In this case {\it all} the memory locations
in an application are potentially visible to all the compute threads. Multiple
threads operating within a single process is the standard mechanism for
supporting shared memory that the WRAPPER utilizes. Configuring and launching
code to run in multi-threaded mode on specific platforms is discussed in
section \ref{sect:running_with_threads}. However, on many systems, potentially
very efficient mechanisms for using shared memory communication between
multiple processes (in contrast to multiple threads within a single
process) also exist. In most cases this works by making a limited region of
memory shared between processes. The MMAP \ref{magicgarden} and
IPC \ref{magicgarden} facilities in UNIX systems provide this capability, as do
vendor specific tools like LAPI \ref{IBMLAPI} and IMC \ref{Memorychannel}.
Extensions exist for the WRAPPER that allow these mechanisms
to be used for shared memory communication. However, these mechanisms are not
distributed with the default WRAPPER sources, because of their proprietary
nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example, cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
at the system level. For this sort of system the WRAPPER provides support
for communication based on a bespoke communication library
(see figure \ref{fig:comm_msg}). The default communication library used is MPI
\ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform specific communication libraries. For example the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/comm-primm.eps}
}
\end{center}
\caption{Three performance critical parallel primitives are provided
by the WRAPPER. These primitives are always used to communicate data
between tiles. The figure shows four tiles. The curved arrows indicate
exchange primitives which transfer data between the overlap regions at tile
edges and interior regions for nearest-neighbor tiles.
The straight arrows symbolize global sum operations which connect all tiles.
The global sum operation provides a key arithmetic primitive and can also
serve as a synchronization primitive. A third primitive, the barrier, is also
provided; it behaves much like the global sum primitive.
} \label{fig:communication_primitives}
\end{figure}


It is assumed that optimized communication support may be available for a
small number of communication operations, and therefore that communication
performance can be addressed by optimizing a small number of communication
primitives.
Three optimizable primitives are provided by the WRAPPER:
\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between interior
and overlap regions of neighboring tiles. A number of different forms of this
operation are supported. These different forms handle
\begin{itemize}
\item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
separately.
\item Bindings to different communication methods.
Exchange primitives select between using shared memory or distributed
memory communication.
\item Transformation operations required when transporting
data between different grid regions. Transferring data
between faces of a cube-sphere grid, for example, involves a rotation
of vector components.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the exchange primitives.
\end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
operation for the pressure inversion phase of the MITgcm algorithm.
For certain configurations scaling can be highly sensitive to
the performance of the global sum primitive. This operation is a collective
operation involving all tiles of the simulated domain. Different forms
of the global sum primitive exist for handling
\begin{itemize}
\item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
separately.
\item Bindings to different communication methods.
Global sum primitives select between using shared memory or distributed
memory communication.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the global sum primitives.
\end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization function
called barrier. This is used to synchronize computations over all tiles.
The {\bf BARRIER} and {\bf GLOBAL SUM} primitives have much in common and in
some cases use the same underlying code.
\end{itemize}
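
As an illustration of how numerical code invokes these primitives, the fragment
below sketches a typical calling pattern: tile-local computation inside a
{\em bi},{\em bj} loop, an exchange to refresh overlap regions, a global
reduction, and a barrier. The entry point {\em GLOBAL\_SUM\_R8()} appears in
the solver extract shown later in this chapter; {\em EXCH\_XY\_RL()} and
{\em BARRIER()} are given here only as representative exchange and barrier
entry points and should be checked against the eesupp sources for the release
in use.
\begin{verbatim}
      REAL*8  phi(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  phiSum
      INTEGER bi, bj, i, j

C     Tile-local computation: each thread loops over the tiles it owns
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
C       ... update interior points of phi for tile bi,bj ...
       ENDDO
      ENDDO

C     EXCHANGE: bring the overlap regions of phi up to date
      CALL EXCH_XY_RL( phi, myThid )

C     GLOBAL SUM: collective reduction over all tiles and processes
      phiSum = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO j=1,sNy
         DO i=1,sNx
          phiSum = phiSum + phi(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
      CALL GLOBAL_SUM_R8( phiSum, myThid )

C     BARRIER: synchronize all threads and processes before proceeding
      CALL BARRIER( myThid )
\end{verbatim}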


\subsection{Memory architecture}

The WRAPPER machine model is aimed at efficiently targeting systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.


\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/tiling_detail.eps}
}
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
to be shaped to suit the underlying system memory architecture.
Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies, while long
tiles with large inner loops can be used to exploit vector systems having
highly pipelined memory systems.
} \label{fig:tiling-strategy}
\end{figure}

\newpage
\subsection{Summary}
Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:

\begin{itemize}
\item The machine consists of one or more logical processors. \vspace{-3mm}
\item Each processor operates on tiles that it owns.\vspace{-3mm}
\item A processor may own more than one tile.\vspace{-3mm}
\item Processors may compute concurrently.\vspace{-3mm}
\item Exchange of information between tiles is handled by the
machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each other
using shared memory. \vspace{-3mm}
\item An alternative communication mechanism based on a relatively
simple inter-process communication API may be required.\vspace{-3mm}
\item Shared memory may not necessarily obey sequential consistency;
however, some mechanism will exist for enforcing memory consistency.
\vspace{-3mm}
\item Memory consistency that is enforced at the hardware level
may be expensive. Unnecessary triggering of consistency protocols
should be avoided. \vspace{-3mm}
\item Memory access patterns may need to be either repetitive or highly
pipelined for optimum hardware performance. \vspace{-3mm}
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}

In order to support maximum portability the WRAPPER is implemented primarily
in sequential Fortran 77. At a practical level the key steps provided by the
WRAPPER are:
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
\item controlling communication between tiles and between concurrently
computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how to express the
way in which a domain is decomposed (or composed). Section
\ref{sect:starting_the_code} describes practical details of running codes
in various different parallel modes on contemporary computer systems.
Section \ref{sect:controlling_communication} explains the internal information
that the WRAPPER uses to control how information is communicated between
tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
code operating within the WRAPPER. Application code is written as a series
of compute operations, each of which operates on a single tile. If
application code needs to perform operations involving data
associated with another tile, it uses a WRAPPER function to obtain
that data.
The specification of how a global domain is constructed from tiles, or
alternatively how a global domain is decomposed into tiles, is made in the
file {\em SIZE.h}.
This file defines the following parameters \\

\fbox{
\begin{minipage}{4.75in}
Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
File: {\em model/inc/SIZE.h}
\end{minipage}
} \\

Together these parameters define a tiling decomposition of the style shown in
figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computation stencil that the numerical code's finite-difference operations
require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
not cause any other problems.
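
As a rough indication of this cost, consider a hypothetical $30 \times 30$ tile
with {\em OLx}={\em OLy}=3: the allocated extent is $36 \times 36 = 1296$
points, of which $900$ are interior, so roughly $30\%$ of the storage (and of
any computation carried out over the full extent) lies in the overlap region;
doubling the overlap to six points raises this fraction to nearly $50\%$.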

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/size_h.eps}
}
\end{center}
\caption{ The three-level domain decomposition hierarchy employed by the
WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
to a single process. Multiple processes can exist, each with multiple tiles.
Tiles within a process can be spread over multiple compute threads.
} \label{fig:labelled_tile}
\end{figure}

The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
be created within a single process. Each of these tiles will have internal
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}), then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
If the decomposition is run in a single process running multiple threads
but attached to a single physical processor, then, in general, the computation
for different tiles will be interleaved by system level software.
This too is a valid mode of operation.

The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
are used to form the loop ranges for many numerical calculations and
to provide dimensions for arrays holding numerical state.
The parameters {\em nSx} and {\em nSy} are used in conjunction with the thread number
parameter {\em myThid}. Much of the numerical code operating within the
WRAPPER takes the form
\begin{verbatim}
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          a block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO

      communication code to sum a number or maybe update
      tile overlap regions

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          another block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO
\end{verbatim}
The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
schematic. These variables specify the subset of the tiles in
the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
thread number {\em myThid} owns. The thread number variable {\em myThid}
ranges from 1 to the total number of threads requested at execution time.
For each value of {\em myThid} the loop scheme above will step sequentially
through the tiles owned by that thread. However, different threads will
have different ranges of tiles assigned to them, so that separate threads can
compute iterations of the {\em bi}, {\em bj} loop concurrently.
Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs over
{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are contained within
a single {\em bi},{\em bj} loop. For example the evaluation of all
the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.

\begin{figure}
\begin{verbatim}
      REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  err
         :
         :
       other computations
         :
         :
      err = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO J=1,sNy
         DO I=1,sNx
          err = err +
     &          cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO

      CALL GLOBAL_SUM_R8( err , myThid )
      err = SQRT(err)

\end{verbatim}
\caption{Example of numerical code for calculating
the l2-norm of a vector within the WRAPPER. Notice that
under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These right-most indices are tile indices. Different
threads within a single process operate on different ranges of tile
index, as controlled by the settings of
{\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
} \label{fig:bibj_extract}
\end{figure}

The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
are used to indicate to the WRAPPER level how many processes (each with
{\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
This information is needed during initialization and during I/O phases.
However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy},
the values of {\em nPx} and {\em nPy} are absent
from the core numerical and support code.

\subsubsection{Examples of {\em SIZE.h} specifications}

The following different {\em SIZE.h} parameter settings illustrate how to
interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
and {\em nPy}.
\begin{enumerate}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  40,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up a single tile with x-dimension of ninety grid points, y-dimension of
forty grid points, and x and y overlaps of three grid points each.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with x-dimension of forty-five grid points, y-dimension of
twenty grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
arranged so that the global domain size is again ninety grid points in x and
forty grid points in y. In general the formula for global grid size (held in
model variables {\em Nx} and {\em Ny}) is
\begin{verbatim}
      Nx  = sNx*nSx*nPx
      Ny  = sNy*nSy*nPy
\end{verbatim}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  10,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   2,
     &           nPx =   1,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with x-dimension of ninety grid points, y-dimension of
ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two domains in y may be computed concurrently.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
thirty-two grid points, and x and y overlaps of three grid points each.
There are six tiles allocated to six separate logical processors ({\em nSx=6}).
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}
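
For concreteness, the second example above would appear in {\em SIZE.h} in
roughly the following form. This is a sketch only: the header distributed with
the code lays out its declarations somewhat differently and sets additional
parameters, but the global sizes {\em Nx} and {\em Ny} are derived from the
tiling parameters exactly as in the formula given above.
\begin{verbatim}
C     Illustrative SIZE.h fragment for example 2 above (sketch only)
      INTEGER sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy
      INTEGER Nx, Ny
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   2 )
C     Global grid sizes follow from the tiling parameters
      PARAMETER (
     &           Nx  = sNx*nSx*nPx,
     &           Ny  = sNy*nSy*nPy )
\end{verbatim}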


\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
once the WRAPPER has initialized correctly and has created the necessary variables
to support subsequent calls to communication routines
by the application code. The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
       |  |                     allow basic I/O.
       |  |-- EEINTRO_MSG    :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS    :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV  :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS      :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |  |
       |  |-- INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |  |                              :: communication data structures
       |
       |
       |--CHECK_THREADS   :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN  :: Numerical code top-level driver routine


\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
} \label{fig:wrapper_startup}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse-grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the following
steps.\\

\paragraph{Compilation}
First the code must be compiled with appropriate multi-threading directives
active in the file {\em main.F} and with appropriate compiler flags
to request multi-threading support. The header files
{\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
contain directives compatible with compilers for Sun, Compaq, SGI,
Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
activated by using the compile-time
options {\em -DTARGET\_SUN},
{\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
for invoking multi-threaded compilation vary from system to system
and from compiler to compiler. The options will be described
in the individual compiler documentation. For the Fortran compiler
from Sun the following options are needed to correctly compile
multi-threaded code
\begin{verbatim}
-stackvar -explicitpar -vpara -noautopar
\end{verbatim}
These options are specific to the Sun compiler. Other compilers
will use different syntax that will be described in their
documentation. The effect of these options is as follows
\begin{enumerate}
\item {\bf -stackvar} Causes all local variables to be allocated in stack
storage. This is necessary to ensure that local variables are private
to their thread. Note that when using this option it may be necessary to override
the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the setting of the command shell's
{\em stack-size} limit variable. However, on some systems changing this limit
will require privileged administrator access to modify system parameters.

\item {\bf -explicitpar} Requests that multiple threads be spawned
in response to explicit directives in the application code. These
directives are inserted with syntax appropriate to the particular target
platform when, for example, the {\em -DTARGET\_SUN} flag is selected.

\item {\bf -vpara} This causes the compiler to describe the multi-threaded
configuration it is creating. This is not required
but it can be useful when troubleshooting.

\item {\bf -noautopar} This inhibits any automatic multi-threaded
parallelization the compiler may otherwise generate.

\end{enumerate}


An example of valid settings for the {\em eedata} file for a
domain with two subdomains in y and running with two threads is shown
below:
\begin{verbatim}
nTx=1,nTy=2
\end{verbatim}
This set of values will cause computations to stay within a single
thread when moving across the {\em nSx} sub-domains. In the y-direction,
however, sub-domains will be split equally between two threads.

\paragraph{Multi-threading files and parameters} The following
files and variables are used in setting up multi-threaded execution.\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP: {\em TARGET\_SUN}\\
CPP: {\em TARGET\_DEC}\\
CPP: {\em TARGET\_HP }\\
CPP: {\em TARGET\_SGI}\\
CPP: {\em TARGET\_CRAY\_VECTOR}\\
Parameter: {\em nTx}\\
Parameter: {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
is that many system libraries are still not ``thread-safe''. This means that, for
example, on some systems it is not safe to call system routines to
do I/O when running in multi-threaded mode, except in a limited set of
circumstances. Another reason is that support for multi-threaded programming
models varies between systems.

Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration a decomposition
specification (see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
appropriate compile time and run time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that portable MPI libraries are available
for controlling the start-up of multiple processes. The MPI libraries
are not strictly required for performance critical communication, although
they are usually used for it. However, in order to simplify the task
of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. Currently MPI libraries are
invoked by
specifying the appropriate options file with the
{\tt -of} flag when running the {\em genmake2}
script, which generates the Makefile for compiling and linking MITgcm.
(Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
{\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More
detailed information about the use of {\em genmake2} for specifying
local compiler flags is located in section \ref{sect:genmake}.\\


\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
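
As an illustration, generating an MPI-enabled Makefile and building the model
might proceed along the following lines. The options file name is a
placeholder: choose a file from {\em tools/build\_options} that matches the
compiler and MPI installation on the target machine, and adjust paths to suit
the build directory in use.
\begin{verbatim}
genmake2 -of=tools/build_options/my_platform_optfile
make depend
make
\end{verbatim}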
\paragraph{Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.
For the free, open-source MPICH system the MITgcm program is started
using a command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the argument {\em -np 64} specifies the number of processes
that will be created. The numeric value {\em 64} must be equal to the
product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The argument {\em mf} specifies that a text file
called ``mf'' will be read to get a list of processor names on
which the sixty-four processes will execute. The syntax of this file
is specified by the MPI distribution.
\\

\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\


\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate
the number of threads to be used in the x and y directions.
The variables {\em nTx} and {\em nTy} in this file are used to
specify the information required. The product of {\em nTx} and
{\em nTy} must be equal to the number of threads spawned, i.e.\
the setting of the environment variable PARALLEL.
The value of {\em nTx} must subdivide the number of sub-domains
in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
number of sub-domains in y ({\em nSy}) exactly.
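
Putting these constraints together, a hypothetical four-thread configuration
(not one of the distributed examples) could look as follows, with the thread
counts dividing the tile counts and their product matching the environment
variable:
\begin{verbatim}
SIZE.h :  nSx = 2, nSy = 4      (8 tiles per process)
eedata :  nTx = 2, nTy = 2      (2 x 2 = 4 threads; each thread
                                 owns 1 x 2 tiles)
shell  :  setenv PARALLEL 4     (csh syntax; 4 = nTx*nTy)
          ./mitgcmuv
\end{verbatim}
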
1045 cnh 1.1 The multiprocess startup of the MITgcm executable {\em mitgcmuv}
1046     is controlled by the routines {\em EEBOOT\_MINIMAL()} and
1047     {\em INI\_PROCS()}. The first routine performs basic steps required
1048     to make sure each process is started and has a textual output
1049     stream associated with it. By default two output files are opened
1050     for each process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.
1051     The {\bf NNNNN} part of the name is filled in with the process
1052     number so that process number 0 will create output files
1053     {\bf STDOUT.0000} and {\bf STDERR.0000}, process number 1 will create
1054     output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
1055     are used for reporting status and configuration information and
1056     for reporting error conditions on a process by process basis.
1057 adcroft 1.2 The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
1058 cnh 1.1 {\em myProcId} and {\em MPI\_COMM\_MODEL}.
These variables are related
to processor identification and are used later in the routine
{\em INI\_PROCS()} to allocate tiles to processes.
1062    
Allocation of processes to tiles is controlled by the routine
1064     {\em INI\_PROCS()}. For each process this routine sets
1065     the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
1066 afe 1.15 These variables specify in index space the coordinates
1067     of the southernmost and westernmost corner of the
1068     southernmost and westernmost tile owned by this process.
1069 cnh 1.1 The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
1070     are also set in this routine. These are used to identify
1071     processes holding tiles to the west, east, south and north
1072     of this process. These values are stored in global storage
1073     in the header file {\em EESUPPORT.h} for use by
1074 afe 1.13 communication routines. The above does not hold when the
1075     exch2 package is used -- exch2 sets its own parameters to
1076     specify the global indices of tiles and their relationships
1077 afe 1.15 to each other. See the documentation on the exch2 package
1078     (\ref{sec:exch2}) for
1079     details.
1080 cnh 1.1 \\
1081    
1082     \fbox{
1083     \begin{minipage}{4.75in}
1084     File: {\em eesupp/src/eeboot\_minimal.F}\\
1085     File: {\em eesupp/src/ini\_procs.F} \\
1086     File: {\em eesupp/inc/EESUPPORT.h} \\
1087     Parameter: {\em myProcId} \\
1088     Parameter: {\em MPI\_COMM\_MODEL} \\
1089     Parameter: {\em myXGlobalLo} \\
1090     Parameter: {\em myYGlobalLo} \\
1091     Parameter: {\em pidW } \\
1092     Parameter: {\em pidE } \\
1093     Parameter: {\em pidS } \\
1094     Parameter: {\em pidN }
1095     \end{minipage}
1096     } \\
1097    
1098    
1099     \subsection{Controlling communication}
1100     The WRAPPER maintains internal information that is used for communication
1101     operations and that can be customized for different platforms. This section
1102     describes the information that is held and used.
1103 adcroft 1.2
1104 cnh 1.1 \begin{enumerate}
1105 afe 1.15 \item {\bf Tile-tile connectivity information}
For each tile the WRAPPER records the tile number of the neighboring
tile to the north, south, east and
west of that tile. Tile numbers are unique over all tiles in a
configuration. Except when using the cubed sphere and the exch2 package,
these numbers are held in the variables {\em tileNo}
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
1113     {\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
1114     that specifies the type of communication that is used between tiles.
1115     This information is held in the variables {\em tileCommModeN},
1116     {\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.
1117     This latter set of variables can take one of the following values
1118     {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
1119     A value of {\em COMM\_NONE} is used to indicate that a tile has no
1120 cnh 1.4 neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
1122     memory communication is required to communicate between
1123 adcroft 1.6 these tile faces ( see section \ref{sect:distributed_memory_communication}).
1124 cnh 1.1 A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
1125     forms of shared memory communication ( see section
1126 adcroft 1.6 \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
1127 cnh 1.1 that a CPU should communicate by writing to data structures owned by another
1128     CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
1129     from data structures owned by another CPU. These flags affect the behavior
1130     of the WRAPPER exchange primitive
1131     (see figure \ref{fig:communication_primitives}). The routine
1132     {\em ini\_communication\_patterns()} is responsible for setting the
1133     communication mode values for each tile.
1134 afe 1.14
1135     When using the cubed sphere configuration with the exch2 package, the
1136     relationships between tiles and their communication methods are set
1137 afe 1.15 by the package in other variables. See the exch2 package documentation
(\ref{sec:exch2}) for details.
1139 afe 1.14
1140    
1141 cnh 1.1
1142     \fbox{
1143     \begin{minipage}{4.75in}
1144     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1145     File: {\em eesupp/inc/EESUPPORT.h} \\
1146     Parameter: {\em tileNo} \\
1147     Parameter: {\em tileNoE} \\
1148     Parameter: {\em tileNoW} \\
1149     Parameter: {\em tileNoN} \\
1150     Parameter: {\em tileNoS} \\
1151     Parameter: {\em tileCommModeE} \\
1152     Parameter: {\em tileCommModeW} \\
1153     Parameter: {\em tileCommModeN} \\
1154     Parameter: {\em tileCommModeS} \\
1155     \end{minipage}
1156     } \\
1157    
1158     \item {\bf MP directives}
1159     The WRAPPER transfers control to numerical application code through
1160     the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
1161     that allows for it to be invoked by several threads. Support for this
1162     is based on using multi-processing (MP) compiler directives.
1163     Most commercially available Fortran compilers support the generation
1164     of code to spawn multiple threads through some form of compiler directives.
1165     As this is generally much more convenient than writing code to interface
1166     to operating system libraries to explicitly spawn threads, and on some systems
this may be the only method available, the WRAPPER is distributed with
1168     template MP directives for a number of systems.
1169    
1170     These directives are inserted into the code just before and after the
1171     transfer of control to numerical algorithm code through the routine
1172     {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows an example of
1173     the code that performs this process for a Silicon Graphics system.
1174     This code is extracted from the files {\em main.F} and
1175     {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads} specifies
1176     how many instances of the routine {\em THE\_MODEL\_MAIN} will
1177     be created. The value of {\em nThreads} is set in the routine
1178     {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
product of the parameters {\em nTx} and {\em nTy} that
1180     are read from the file {\em eedata}. If the value of {\em nThreads}
1181     is inconsistent with the number of threads requested from the
1182     operating system (for example by using an environment
1183 adcroft 1.6 variable as described in section \ref{sect:multi_threaded_execution})
1184 cnh 1.1 then usually an error will be reported by the routine
1185     {\em CHECK\_THREADS}.\\
1186    
1187     \fbox{
1188     \begin{minipage}{4.75in}
1189     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1190     File: {\em eesupp/src/check\_threads.F} \\
1191     File: {\em eesupp/src/main.F} \\
1192     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1193     File: {\em eedata } \\
1194     Parameter: {\em nThreads} \\
1195     Parameter: {\em nTx} \\
1196     Parameter: {\em nTy} \\
1197     \end{minipage}
1198     }
1199    
1200     \item {\bf memsync flags}
1201 adcroft 1.6 As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
1203     The routine {\em MEMSYNC()} is used for this purpose. This routine should
1204     not need modifying and the information below is only provided for
1205     completeness. A logical parameter {\em exchNeedsMemSync} set
1206     in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether
1207     the {\em MEMSYNC()} primitive is called. In general this
1208     routine is only used for multi-threaded execution.
1209     The code that goes into the {\em MEMSYNC()}
1210     routine is specific to the compiler and
1211     processor being used for multi-threaded execution and in general
1212     must be written using a short code snippet of assembly language.
For an UltraSPARC system the following code snippet is used
1214     \begin{verbatim}
1215     asm("membar #LoadStore|#StoreStore");
1216     \end{verbatim}
1217 cnh 1.4 for an Alpha based system the equivalent code reads
1218 cnh 1.1 \begin{verbatim}
1219     asm("mb");
1220     \end{verbatim}
1221     while on an x86 system the following code is required
1222     \begin{verbatim}
1223     asm("lock; addl $0,0(%%esp)": : :"memory")
1224     \end{verbatim}
1225    
1226     \item {\bf Cache line size}
1227 adcroft 1.6 As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
1229 cnh 1.5 coherence traffic on an SMP system. To do this the shared memory data structures
1230 cnh 1.1 used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
1231     are padded. The variables that control the padding are set in the
1232     header file {\em EEPARAMS.h}. These variables are called
1233     {\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
1234     {\em lShare8}. The default values should not normally need changing.
1235     \item {\bf \_BARRIER}
1236     This is a CPP macro that is expanded to a call to a routine
1237 cnh 1.5 which synchronizes all the logical processors running under the
1238 cnh 1.1 WRAPPER. Using a macro here preserves flexibility to insert
1239     a specialized call in-line into application code. By default this
1240     resolves to calling the procedure {\em BARRIER()}. The default
1241     setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1242    
1243     \item {\bf \_GSUM}
1244     This is a CPP macro that is expanded to a call to a routine
1245 cnh 1.5 which sums up a floating point number
1246 cnh 1.1 over all the logical processors running under the
1247     WRAPPER. Using a macro here provides extra flexibility to insert
1248     a specialized call in-line into application code. By default this
1249 cnh 1.5 resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
1250     64-bit floating point operands)
1251     or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1252 cnh 1.1 setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
1253     The \_GSUM macro is a performance critical operation, especially for
1254     large processor count, small tile size configurations.
1255 adcroft 1.6 The custom communication example discussed in section \ref{sect:jam_example}
1256 cnh 1.1 shows how the macro is used to invoke a custom global sum routine
1257     for a specific set of hardware.
1258    
1259     \item {\bf \_EXCH}
1260     The \_EXCH CPP macro is used to update tile overlap regions.
1261     It is qualified by a suffix indicating whether overlap updates are for
1262     two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1263     physical fields and whether fields are 32-bit floating point
1264     ( \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point
1265     ( \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined
1266     in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
1267     \_EXCH operation plays a crucial role in scaling to small tile,
1268     large logical and physical processor count configurations.
1269 adcroft 1.6 The example in section \ref{sect:jam_example} discusses defining an
1270 cnh 1.5 optimized and specialized form on the \_EXCH operation.
1271 cnh 1.1
1272     The \_EXCH operation is also central to supporting grids such as
1273     the cube-sphere grid. In this class of grid a rotation may be required
1274     between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
1276 afe 1.16 be embedded within a custom form of the \_EXCH primitive. In these
1277     cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
1278     package documentation \ref{sec:exch2}.
1279 cnh 1.1
1280     \item {\bf Reverse Mode}
1281     The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
1283     These reverse mode forms can be found in the
1284 cnh 1.5 source code directory {\em pkg/autodiff}.
1285 cnh 1.1 For the global sum primitive the reverse mode form
1286     calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
1289 cnh 1.1 prefixed {\em ADEXCH}. The exchange routines make calls to
1290     the same low-level communication primitives as the forward mode
1291     operations. However, the routine argument {\em simulationMode}
1292     is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
1294     appropriate communication operation should be performed.
1295 afe 1.16
1296 cnh 1.1 \item {\bf MAX\_NO\_THREADS}
1297     The variable {\em MAX\_NO\_THREADS} is used to indicate the
1298     maximum number of OS threads that a code will use. This
1299     value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
1300     For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.
1303    
For certain physical parametrization schemes it is necessary to have
a substantial number of work arrays. Where these arrays are held
in statically allocated storage (for example COMMON blocks), multi-threaded
execution will require multiple instances of the COMMON block data.
This can be achieved using a Fortran 90 module construct; however,
if that is unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} (as described in section
\ref{sect:specifying_a_decomposition}). However, if the configuration
being specified involves many more tiles than OS threads then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
will be used and to declare the physical parameterization
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
1318 cnh 1.1 An example of this is given in the verification experiment
1319     {\em aim.5l\_cs}. Here the default setting of
1320     {\em MAX\_NO\_THREADS} is altered to
1321     \begin{verbatim}
1322     INTEGER MAX_NO_THREADS
1323     PARAMETER ( MAX_NO_THREADS = 6 )
1324     \end{verbatim}
1325     and several work arrays for storing intermediate calculations are
created with declarations of the form:
1327     \begin{verbatim}
1328     common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1329     \end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent rather than temporary storage of state information.
In the case of permanent state information this approach cannot be used
because there has to be enough storage allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterizations; a brief sketch of how
such thread-dimensioned work arrays are indexed is given after this list.
1336 adcroft 1.2 \end{enumerate}
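The following fragment sketches how a work array declared with a
{\em MAX\_NO\_THREADS} extra dimension, as in the {\em aim.5l\_cs}
example above, is typically used: each thread passes its own thread
number {\em myThid} and therefore reads and writes only its private
slice of the shared array. The subroutine name, the array name and the
size {\em ngp} used here are illustrative only.
\begin{verbatim}
      SUBROUTINE ZERO_WORK_ARRAY( myThid )
      IMPLICIT NONE
C     myThid :: thread number of the calling thread
      INTEGER myThid
      INTEGER MAX_NO_THREADS, ngp
      PARAMETER ( MAX_NO_THREADS = 6 )
      PARAMETER ( ngp = 128 )
      REAL*8 sst1(ngp,MAX_NO_THREADS)
      COMMON /FORCIN/ sst1
      INTEGER I
C     Each thread initializes only its own slice of the shared array,
C     so no synchronization is needed for this loop.
      DO I = 1, ngp
       sst1(I,myThid) = 0.D0
      ENDDO
      RETURN
      END
\end{verbatim}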
1337    
1338     \begin{figure}
1339     \begin{verbatim}
1340     C--
1341     C-- Parallel directives for MIPS Pro Fortran compiler
1342     C--
1343     C Parallel compiler directives for SGI with IRIX
1344     C$PAR PARALLEL DO
1345     C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1346     C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1347     C
1348     DO I=1,nThreads
1349     myThid = I
1350    
1351     C-- Invoke nThreads instances of the numerical model
1352     CALL THE_MODEL_MAIN(myThid)
1353    
1354     ENDDO
1355     \end{verbatim}
1356     \caption{Prior to transferring control to
1357     the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
1358     MP directives to spawn multiple threads.
1359     } \label{fig:mp_directives}
1360     \end{figure}
1361 cnh 1.1
1362    
1363     \subsubsection{Specializing the Communication Code}
1364    
1365     The isolation of performance critical communication primitives and the
1366     sub-division of the simulation domain into tiles is a powerful tool.
1367     Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.
1369 cnh 1.1
1370     \subsubsection{JAM example}
1371 adcroft 1.6 \label{sect:jam_example}
1372 cnh 1.1 On some platforms a big performance boost can be obtained by
1373     binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example the
shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
1376     is used as an illustration of a specialized communication configuration
1377     that substitutes for standard, portable forms of {\em \_EXCH} and
1378     {\em \_GSUM}. It affects three source files {\em eeboot.F},
1379     {\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
it has the following effects.
1381     \begin{itemize}
1382     \item An extra phase is included at boot time to initialize the custom
1383     communications library ( see {\em ini\_jam.F}).
1384     \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines (see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
1387 cnh 1.4 for overlap regions of width one) is substituted into the elliptic
1388 cnh 1.1 solver routine {\em cg2d.F}.
1389     \end{itemize}
1390     Developing specialized code for other libraries follows a similar
1391     pattern.
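The essence of the substitution is a conditional macro definition of
the kind sketched below. The argument list and the name of the
substituted routine are shown schematically here and are not taken from
the distributed headers; the point is simply the pattern of redirecting
the generic primitive to a platform specific routine when the flag is
defined.
\begin{verbatim}
#ifdef LETS_MAKE_JAM
C     Route the global sum through the specialized library (schematic)
#define _GSUM(a,myThid) CALL GSUM_JAM(a,myThid)
#else
C     Portable default (schematic)
#define _GSUM(a,myThid) CALL GLOBAL_SUM_R8(a,myThid)
#endif
\end{verbatim}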
1392    
1393     \subsubsection{Cube sphere communication}
1394 adcroft 1.6 \label{sect:cube_sphere_communication}
1395 cnh 1.1 Actual {\em \_EXCH} routine code is generated automatically from
1396     a series of template files, for example {\em exch\_rx.template}.
1397     This is done to allow a large number of variations on the exchange
1398     process to be maintained. One set of variations supports the
1399 cnh 1.4 cube sphere grid. Support for a cube sphere grid in MITgcm is based
on representing each face of the cube as one or more separate tiles.
1401 cnh 1.4 The exchange routines are then able to absorb much of the
1402 cnh 1.1 detailed rotation and reorientation required when moving around the
1403     cube grid. The set of {\em \_EXCH} routines that contain the
1404     word cube in their name perform these transformations.
1405     They are invoked when the run-time logical parameter
1406     {\em useCubedSphereExchange} is set true. To facilitate the
1407     transformations on a staggered C-grid, exchange operations are defined
separately for vector and scalar quantities and for
grid-centered, grid-face and grid-corner quantities.
1410     Three sets of exchange routines are defined. Routines
1411     with names of the form {\em exch\_rx} are used to exchange
1412     cell centered scalar quantities. Routines with names of the form
1413     {\em exch\_uv\_rx} are used to exchange vector quantities located at
1414     the C-grid velocity points. The vector quantities exchanged by the
1415     {\em exch\_uv\_rx} routines can either be signed (for example velocity
1416     components) or un-signed (for example grid-cell separations).
1417     Routines with names of the form {\em exch\_z\_rx} are used to exchange
1418     quantities at the C-grid vorticity point locations.
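At run time these cube sphere exchange routines are selected by setting
the flag true in the execution environment parameter file, for example
as the {\em eedata}-style namelist entry sketched below (the precise
file and namelist group depend on the configuration):
\begin{verbatim}
 &EEPARMS
 useCubedSphereExchange=.TRUE.,
 &
\end{verbatim}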
1419    
1420    
1421    
1422    
1423     \section{MITgcm execution under WRAPPER}
1424    
1425     Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.
1428 cnh 1.1
1429     \subsection{Annotated call tree for MITgcm and WRAPPER}
1430 adcroft 1.6 \label{sect:calling_sequence}
1431 cnh 1.1
The call tree for the WRAPPER layer is as follows:
1433    
1434 heimbach 1.8 {\footnotesize
1435 cnh 1.1 \begin{verbatim}
1436    
1437     MAIN
1438     |
1439     |--EEBOOT :: WRAPPER initialization
1440     | |
1441     | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1442     | | allow basic I/O.
1443     | |-- EEINTRO_MSG :: Write startup greeting.
1444     | |
1445     | |-- EESET_PARMS :: Set WRAPPER parameters
1446     | |
1447     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1448     | |
1449     | |-- INI_PROCS :: Associate processes with grid regions.
1450     | |
1451     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1452     | |
1453     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1454     | :: communication data structures
1455     |
1456     |
1457     |--CHECK_THREADS :: Validate multiple thread start up.
1458     |
1459     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1460    
1461     \end{verbatim}
1462 heimbach 1.8 }
1463 cnh 1.1
The call tree for the core equations plus packages is as follows:
1465    
1466 heimbach 1.8 {\footnotesize
1467 cnh 1.1 \begin{verbatim}
1468     C
1469     C
1470     C Invocation from WRAPPER level...
1471     C :
1472     C :
1473     C |
1474     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1475     C | :: Called from WRAPPER level numerical
1476 cnh 1.4 C | :: code invocation routine. On entry
1477 cnh 1.1 C | :: to THE_MODEL_MAIN separate thread and
1478     C | :: separate processes will have been established.
1479     C | :: Each thread and process will have a unique ID
1480     C | :: but as yet it will not be associated with a
1481     C | :: specific region in decomposed discrete space.
1482     C |
1483     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1484     C | | :: grid, solver matrices etc..
1485     C | |
1486     C | |-INI_PARMS :: Routine to set kernel model parameters.
1487     C | | :: By default kernel parameters are read from file
1488     C | | :: "data" in directory in which code executes.
1489     C | |
1490 cnh 1.4 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1491 cnh 1.1 C | |
1492 cnh 1.4 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1493 cnh 1.1 C | | | :: Grid arrays are held and described in GRID.h.
1494     C | | |
1495 cnh 1.4 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1496 cnh 1.1 C | | |
1497 cnh 1.4 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1498 cnh 1.1 C | | | :: (calculate grid from kernel parameters).
1499     C | | |
1500     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1501 cnh 1.4 C | | | :: initialization (calculate grid from
1502 cnh 1.1 C | | | :: kernel parameters).
1503     C | | |
1504     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1505 cnh 1.4 C | | :: grid initializations. ( input from raw
1506 cnh 1.1 C | | :: grid files, LONC.bin, DXF.bin etc... )
1507     C | |
1508     C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1509     C | |
1510     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1511     C | | :: land masking for solid-fluid boundaries.
1512     C | |
1513     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1514     C | |
1515     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1516 cnh 1.4 C | | :: sphere options are coded.
1517 cnh 1.1 C | |
1518     C | |-PACKAGES_BOOT :: Start up the optional package environment.
1519     C | | :: Runtime selection of active packages.
1520     C | |
1521     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1522     C | | |
1523     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1524     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1525     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1526     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1527     C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1528     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1529     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1530     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1531     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1532     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1533     C | |
1534     C | |-PACKAGES_CHECK
1535     C | | |
1536     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1537 cnh 1.4 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1538 cnh 1.1 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1539     C | |
1540     C | |-PACKAGES_INIT_FIXED
1541     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1542     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1543     C | |
1544     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1545     C | |
1546     C | |-INI_CG2D :: 2d con. grad solver initialisation.
1547     C | |
1548     C | |-INI_CG3D :: 3d con. grad solver initialisation.
1549     C | |
1550     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1551     C | :: Includes annotated table of kernel
1552     C | :: parameter settings.
1553     C |
1554     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1555     C |
1556     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1557 cnh 1.4 C ! :: Automatically generated by TAMC/TAF.
1558 cnh 1.1 C |
1559     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1560     C |
1561     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1562     C |
1563     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1564     C | |
1565     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1566     C | | | :: variables
1567     C | | |
1568     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1569     C | | |
1570     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1571 cnh 1.4 C | | | :: sphere options are coded.
1572 cnh 1.1 C | | |
1573     C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1574     C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1575     C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1576     C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1577     C | | | :: fields).
1578     C | | |
1579 cnh 1.4 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1580 cnh 1.1 C | | | |-INI_VEL :: Initialize 3D flow field.
1581     C | | | |-INI_THETA :: Set model initial temperature field.
1582     C | | | |-INI_SALT :: Set model initial salinity field.
1583     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1584     C | | |
1585     C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1586     C | | |
1587     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1588     C | | | | :: Here applied to move restart conditions
1589     C | | | | :: (saved in mid timestep) to correct level in
1590     C | | | | :: time (only used for pre-c35).
1591     C | | | |
1592     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1593     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1594     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1595     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1596     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1597     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1598     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1599     C | | | | |-FIND_RHO :: Find adjacent densities.
1600     C | | | | |-CONVECT :: Mix static instability.
1601     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1602     C | | | |
1603     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1604     C | | |
1605     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1606     C | | | | :: Extra time history interactions.
1607     C | | | |
1608     C | | | |-FIND_RHO :: Find adjacent densities.
1609     C | | | |-CONVECT :: Mix static instability.
1610     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1611     C | | |
1612     C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1613     C | | | | :: package data.
1614     C | | | |
1615     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1616     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1617     C | | | |-KPP_OPEN_DIAGS
1618     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1619     C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1620     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1621     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1622     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1623     C | | | |-INI_FORCING :: Set model initial forcing fields.
1624     C | | | | :: Either set in-line or from file as shown.
1625     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1626     C | | | |-READ_FLD_XY_RS(meridWindFile)
1627     C | | | |-READ_FLD_XY_RS(surfQFile)
1628     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1629     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1630     C | | | |-READ_FLD_XY_RS(saltClimFile)
1631     C | | | |-READ_FLD_XY_RS(surfQswFile)
1632     C | | |
1633     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1634     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1635     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1636     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1637     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1638     C | |
1639     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1640     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1641     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1642     C | |
1643     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1644     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1645     C====|>|
1646     C====|>| ****************************
1647     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1648     C====|>| ****************************
1649     C====|>|
1650     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1651     C/\ | | |
1652     C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1653     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1654     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1655     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1656     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1657 cnh 1.4 C/\ | | | | :: Simple interpolation between end-points
1658 cnh 1.1 C/\ | | | | :: for forcing datasets.
1659     C/\ | | | |
1660     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1661     C/\ | | |
1662     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1663     C/\ | | | |
1664     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1665     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1666     C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1667     C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1668     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1669     C/\ | | | |
1670     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1671     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1672     C/\ | | | | :: forcing terms.
1673     C/\ | | | |
1674     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1675     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1676     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1677     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1678     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1679     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1680     C/\ | | | | advection case) (see pkg/gad).
1681     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1682     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1683     C/\ | | | | |
1684     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1685     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1686     C/\ | | | |
1687     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1688     C/\ | | | | |
1689     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1690     C/\ | | | | | :: ( see pkg/gad )
1691     C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1692     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1693     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1694     C/\ | | | |
1695     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1696     C/\ | | | |
1697     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1698     C/\ | | | | |
1699     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1700     C/\ | | | | | :: ( see pkg/gad )
1701     C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1702     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1703     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1704     C/\ | | | |
1705     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1706     C/\ | | | |
1707     C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1708     C/\ | | | | |
1709     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1710     C/\ | | | | | :: ( see pkg/gad )
1711     C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1712     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1713     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1714     C/\ | | | |
1715     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1716     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1717     C/\ | | | |-FREEZE :: Limit range of temperature.
1718     C/\ | | | |
1719     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1720     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1721     C/\ | | | |
1722     C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1723     C/\ | | | |-EXCH :: Update overlaps
1724     C/\ | | |
1725     C/\ | | |-DYNAMICS :: Momentum equations driver.
1726     C/\ | | | |
1727     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1728     C/\ | | | | Potential anomaly.
1729     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1730     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1731     C/\ | | | |
1732     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1733     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1734     C/\ | | | | pkg/mom_fluxform ).
1735     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1736     C/\ | | | | pkg/mom_vecinv ).
1737     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1738     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1739     C/\ | | | |
1740     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1741     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1742     C/\ | | | |
1743     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1744     C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1745     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1746     C/\ | | |
1747     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1748     C/\ | | |
1749     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1750     C/\ | | |
1751     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1752     C/\ | | |
1753     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1754     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1755     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1756     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1757     C/\ | | |
1758     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1759     C/\ | | | |
1760     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1761     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1762     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1763     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1764     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1765     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1766     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1767     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1768     C/\ | | | | |-CONVECT :: Mix static instability.
1769     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1770     C/\ | | | |
1771     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1772     C/\ | | |
1773     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1774     C/\ | | | |-EXCH
1775     C/\ | | |
1776     C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1777     C/\ | | |
1778     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1779     C/\ | | |
1780     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1781     C/\ | | | |-WRITE_STATE :: Core state I/O
1782     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1783     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1784     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1785     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1786     C/\ | | |
1787     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1788     C/\ | |
1789     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1790     C<===|=|
1791     C<===|=| **************************
1792     C<===|=| END MAIN TIMESTEPPING LOOP
1793     C<===|=| **************************
1794     C<===|=|
1795     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1796     C |
1797     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1798     C |
1799     C |-TIMER_PRINTALL :: Computational timing summary
1800     C |
1801     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1802     C :: events.
1803     C
1804     \end{verbatim}
1805 heimbach 1.8 }
1806 cnh 1.1
1807     \subsection{Measuring and Characterizing Performance}
1808    
1809     TO BE DONE (CNH)
1810    
1811     \subsection{Estimating Resource Requirements}
1812    
1813     TO BE DONE (CNH)
1814    
1815     \subsubsection{Atlantic 1/6 degree example}
1816     \subsubsection{Dry Run testing}
1817     \subsubsection{Adjoint Resource Requirements}
1818     \subsubsection{State Estimation Environment Resources}
1819    
