1 edhill 1.20 % $Header: /u/gcmpack/manual/part4/sarch.tex,v 1.19 2004/04/20 23:32:59 edhill Exp $
2 cnh 1.1
3 cnh 1.7 This chapter focuses on describing the {\bf WRAPPER} environment within which
4     both the core numerics and the pluggable packages operate. The description
5     presented here is intended to be a detailed exposition and contains significant
6     background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects, simply starting from an example code and adapting it to suit a
particular situation
will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating system
features that are employed.
18 cnh 1.1
19     \section{Overall architectural goals}
20 edhill 1.20 \begin{rawhtml}
21     <!-- CMIREDIR:overall_architectural_goals: -->
22     \end{rawhtml}
23 cnh 1.1
24     Broadly, the goals of the software architecture employed in MITgcm are
three-fold:
26    
27     \begin{itemize}
28     \item We wish to be able to study a very broad range
29     of interesting and challenging rotating fluids problems.
30     \item We wish the model code to be readily targeted to
31     a wide range of platforms
a wide range of platforms.
33     able to achieve performance comparable to an implementation
34     developed and specialized specifically for that platform.
35     \end{itemize}
36    
These points are summarized in figure \ref{fig:mitgcm_architecture_goals},
which conveys the goals of the MITgcm design. These goals lead to
a software architecture which, at a high level, can be viewed as consisting
of
41    
42     \begin{enumerate}
43     \item A core set of numerical and support code. This is discussed in detail in
44 adcroft 1.6 section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
46     for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
47     These packages are used both to overlay alternate dynamics and to introduce
48     specialized physical content onto the core numerical code. An overview of
49     the {\bf package} scheme is given at the start of part \ref{part:packages}.
50     \item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
51     Programming Environment Resource), within which the core numerics and pluggable
52     packages operate.
53     \end{enumerate}
54    
64    
65 adcroft 1.2
66 cnh 1.1 \begin{figure}
67     \begin{center}
68 adcroft 1.2 \resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
69 cnh 1.1 \end{center}
70 adcroft 1.2 \caption{
71     The MITgcm architecture is designed to allow simulation of a wide
72 cnh 1.1 range of physical problems on a wide range of hardware. The computational
73     resource requirements of the applications targeted range from around
74     $10^7$ bytes ( $\approx 10$ megabytes ) of memory to $10^{11}$ bytes
75     ( $\approx 100$ gigabytes). Arithmetic operation counts for the applications of
76     interest range from $10^{9}$ floating point operations to more than $10^{17}$
77 adcroft 1.2 floating point operations.}
78     \label{fig:mitgcm_architecture_goals}
79 cnh 1.1 \end{figure}
80    
81     \section{WRAPPER}
82 afe 1.17 \begin{rawhtml}
83 afe 1.18 <!-- CMIREDIR:wrapper: -->
84 afe 1.17 \end{rawhtml}
85 cnh 1.1
86     A significant element of the software architecture utilized in
87     MITgcm is a software superstructure and substructure collectively
88     called the WRAPPER (Wrappable Application Parallel Programming
89     Environment Resource). All numerical and support code in MITgcm is written
90     to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
91     the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).
94 cnh 1.1
95     The approach taken by the WRAPPER is illustrated in figure
96     \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
97     that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.
99 adcroft 1.2
100    
101 cnh 1.1 \begin{figure}
102     \begin{center}
103 adcroft 1.2 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
104 cnh 1.1 \end{center}
105 adcroft 1.2 \caption{
106 edhill 1.10 Numerical code is written to fit within a software support
107 cnh 1.1 infrastructure called WRAPPER. The WRAPPER is portable and
108 cnh 1.4 can be specialized for a wide range of specific target hardware and
109 cnh 1.1 programming environments, without impacting numerical code that fits
110     within the WRAPPER. Codes that fit within the WRAPPER can generally be
111     made to run as fast on a particular platform as codes specially
112 adcroft 1.2 optimized for that platform.}
113     \label{fig:fit_in_wrapper}
114 cnh 1.1 \end{figure}
115    
116     \subsection{Target hardware}
117 adcroft 1.6 \label{sect:target_hardware}
118 cnh 1.1
The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
multi-processor, CRAY Y-MP system. On that system, numerical code performance
and scaling under the WRAPPER exceeded that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
micro-tasking approach. Later developments have been carried out on
125     uniprocessor and multi-processor Sun systems with both uniform memory access
126     (UMA) and non-uniform memory access (NUMA) designs. Significant work has also
127     been undertaken on x86 cluster systems, Alpha processor based clustered SMP
128     systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
129 edhill 1.10 The MITgcm code, operating within the WRAPPER, is also routinely used on
130 cnh 1.1 large scale MPP systems (for example T3E systems and IBM SP systems). In all
131     cases numerical code, operating within the WRAPPER, performs and scales very
132     competitively with equivalent numerical code that has been modified to contain
133     native optimizations for a particular system \ref{ref hoe and hill, ecmwf}.
134    
135     \subsection{Supporting hardware neutrality}
136    
137 adcroft 1.6 The different systems listed in section \ref{sect:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMPs, PVPs) and distributed-memory
parallel systems (for example x86 clusters and large MPP systems). This is one
example of a difference between compute platforms that can impact an
application. Another common distinction is between vector processing systems
with highly specialized CPUs and memory subsystems and commodity
microprocessor-based systems. There are numerous other differences, especially
145     in relation to how parallel execution is supported. To capture the essential
146     differences between different platforms the WRAPPER uses a {\it machine model}.
147    
148     \subsection{WRAPPER machine model}
149    
150     Applications using the WRAPPER are not written to target just one
151     particular machine (for example an IBM SP2) or just one particular family or
152     class of machines (for example Parallel Vector Processor Systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.
157    
158     \subsection{Machine model parallelism}
159 afe 1.17 \begin{rawhtml}
160 afe 1.18 <!-- CMIREDIR:domain_decomp: -->
161 afe 1.17 \end{rawhtml}
162 cnh 1.1
163     Codes operating under the WRAPPER target an abstract machine that is assumed to
164     consist of one or more logical processors that can compute concurrently.
165 cnh 1.5 Computational work is divided among the logical
166 cnh 1.1 processors by allocating ``ownership'' to
167     each processor of a certain set (or sets) of calculations. Each set of
168     calculations owned by a particular processor is associated with a specific
region of the physical space that is being simulated; only one processor will
be associated with each such region (domain decomposition).
171    
172     In a strict sense the logical processors over which work is divided do not need
173     to correspond to physical processors. It is perfectly possible to execute a
174     configuration decomposed for multiple logical processors on a single physical
175     processor. This helps ensure that numerical code that is written to fit
176     within the WRAPPER will parallelize with no additional effort and is
177     also useful when debugging codes. Generally, however,
178     the computational domain will be subdivided over multiple logical
179     processors in order to then bind those logical processors to physical
180     processor resources that can compute in parallel.
181    
182     \subsubsection{Tiles}
183    
184     Computationally, associated with each region of physical
185     space allocated to a particular logical processor, there will be data
186     structures (arrays, scalar variables etc...) that hold the simulated state of
187     that region. We refer to these data structures as being {\bf owned} by the
188 cnh 1.4 processor to which their
189 cnh 1.1 associated region of physical space has been allocated. Individual
190     regions that are allocated to processors are called {\bf tiles}. A
191     processor can own more
than one tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a single
region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously, working
only with data from the tile (or tiles) that the processor owns. When multiple
tiles are allotted to a single processor, each tile is computed
independently of the other tiles, in a sequential fashion.
199    
200     \begin{figure}
201     \begin{center}
202 adcroft 1.2 \resizebox{5in}{!}{
203     \includegraphics{part4/domain_decomp.eps}
204 cnh 1.1 }
205     \end{center}
206     \caption{ The WRAPPER provides support for one and two dimensional
207     decompositions of grid-point domains. The figure shows a hypothetical domain of
208     total size $N_{x}N_{y}N_{z}$. This hypothetical domain is decomposed in
209     two-dimensions along the $N_{x}$ and $N_{y}$ directions. The resulting {\bf
210     tiles} are {\bf owned} by different processors. The {\bf owning}
211     processors perform the
212     arithmetic operations associated with a {\bf tile}. Although not illustrated
213     here, a single processor can {\bf own} several {\bf tiles}.
214     Whenever a processor wishes to transfer data between tiles or
215     communicate with other processors it calls a WRAPPER supplied
216     function.
217     } \label{fig:domaindecomp}
218     \end{figure}
219    
220    
221     \subsubsection{Tile layout}
222    
223     Tiles consist of an interior region and an overlap region. The overlap region
224     of a tile corresponds to the interior region of an adjacent tile.
225     In figure \ref{fig:tiledworld} each tile would own the region
226     within the black square and hold duplicate information for overlap
227     regions extending into the tiles to the north, south, east and west.
228     During
229     computational phases a processor will reference data in an overlap region
whenever it requires values that lie outside the domain it owns. Periodically
231     processors will make calls to WRAPPER functions to communicate data between
232     tiles, in order to keep the overlap regions up to date (see section
233 adcroft 1.6 \ref{sect:communication_primitives}). The WRAPPER functions can use a
234 cnh 1.1 variety of different mechanisms to communicate data between tiles.
235    
236     \begin{figure}
237     \begin{center}
238 adcroft 1.2 \resizebox{5in}{!}{
239     \includegraphics{part4/tiled-world.eps}
240 cnh 1.1 }
241     \end{center}
242     \caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
244     Overlap regions are periodically updated from neighboring tiles.
245     } \label{fig:tiledworld}
246     \end{figure}
247    
248     \subsection{Communication mechanisms}
249    
250     Logical processors are assumed to be able to exchange information
251     between tiles and between each other using at least one of two possible
252     mechanisms.
253    
254     \begin{itemize}
255     \item {\bf Shared memory communication}.
256     Under this mode of communication data transfers are assumed to be possible
257     using direct addressing of regions of memory. In this case a CPU is able to read
(and write) directly to regions of memory ``owned'' by another CPU
using simple programming-language-level assignment operations of
the sort shown in figure \ref{fig:simple_assign}. In this way one CPU
261     (CPU1 in the figure) can communicate information to another CPU (CPU2 in the
262     figure) by assigning a particular value to a particular memory location.
263    
264     \item {\bf Distributed memory communication}.
265     Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned by and visible to another CPU. Instead
a communication library must be used, as illustrated in figure
\ref{fig:comm_msg}. In this case a CPU must call a function in the API of the
communication library to communicate data from a tile that it owns to a tile
270     that another CPU owns. By default the WRAPPER binds to the MPI communication
271     library \ref{MPI} for this style of communication.
272     \end{itemize}
273    
274     The WRAPPER assumes that communication will use one of these two styles
275     of communication. The underlying hardware and operating system support
276     for the style used is not specified and can vary from system to system.
277    
278     \begin{figure}
279     \begin{verbatim}
280    
281     CPU1 | CPU2
282     ==== | ====
283     |
284     a(3) = 8 | WHILE ( a(3) .NE. 8 )
285     | WAIT
286     | END WHILE
287     |
288     \end{verbatim}
289     \caption{ In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level.
For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
294     This provides a very low latency and high bandwidth communication
295     mechanism.
296     } \label{fig:simple_assign}
297     \end{figure}
298    
299     \begin{figure}
300     \begin{verbatim}
301    
302     CPU1 | CPU2
303     ==== | ====
304     |
305     a(3) = 8 | WHILE ( a(3) .NE. 8 )
306     CALL SEND( CPU2,a(3) ) | CALL RECV( CPU1, a(3) )
307     | END WHILE
308     |
309     \end{verbatim}
310     \caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
316     } \label{fig:comm_msg}
317     \end{figure}
318    
319     \subsection{Shared memory communication}
320 adcroft 1.6 \label{sect:shared_memory_communication}
321 cnh 1.1
Under shared memory communication independent CPUs operate
on the exact same global address space at the application level.
This means that CPU 1 can directly write into global
data structures that CPU 2 ``owns'' using a simple
assignment at the application level.
This model of memory access is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (the SGI Origin).
On such systems the WRAPPER will generally use simple read and write statements
to directly access application data structures when communicating between CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to
hardware instructions that transport data between CPU and memory banks, this
can be a very efficient mechanism for communication. In this case two CPUs,
CPU1 and CPU2, can communicate simply by reading and writing to an
agreed location and following a few basic rules. The latency of this sort
of communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPUs communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
communication very efficient provided it is used appropriately.
344    
345     \subsubsection{Memory consistency}
346 adcroft 1.6 \label{sect:memory_consistency}
347 cnh 1.1
348     When using shared memory communication between
349     multiple processors the WRAPPER level shields user applications from
350     certain counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general the order
352     of reads and writes expressed by the textual order of an application code may
353     not be the ordering of instructions executed by the processor performing the
354     application. The processor performing the application instructions will always
355     operate so that, for the application instructions the processor is executing,
any reordering is not apparent. However, machines are often
designed so that reordering of instructions is not hidden from other
processors. This means that, in general, even on a shared memory system two
359     processors can observe inconsistent memory values.
360    
361     The issue of memory consistency between multiple processors is discussed at
362     length in many computer science papers, however, from a practical point of
363     view, in order to deal with this issue, shared memory machines all provide
364     some mechanism to enforce memory consistency when it is needed. The exact
365     mechanism employed will vary between systems. For communication using shared
366     memory, the WRAPPER provides a place to invoke the appropriate mechanism to
367     ensure memory consistency for a particular platform.
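
The following schematic, written in the same spirit as figure
\ref{fig:simple_assign}, shows where such a mechanism is needed. The
routine name {\em MEMSYNC} is illustrative only, standing in for
whatever platform-specific consistency call the WRAPPER invokes.
\begin{verbatim}
C     Thread 1 fills a shared buffer and then publishes a flag.
C     Without a memory consistency call between the two writes,
C     another thread could observe ready = .TRUE. while still
C     reading stale values from sharedBuffer.
      sharedBuffer(1) = newValue
      CALL MEMSYNC
      ready = .TRUE.

C     Thread 2 waits for the flag before using the buffer.
   10 CONTINUE
      IF ( .NOT. ready ) GOTO 10
      localCopy = sharedBuffer(1)
\end{verbatim}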
368    
369     \subsubsection{Cache effects and false sharing}
370 adcroft 1.6 \label{sect:cache_effects_and_false_sharing}
371 cnh 1.1
Shared-memory machines often have memory caches local to each processor
which contain mirrored copies of main memory. Automatic cache-coherence
374     protocols are used to maintain consistency between caches on different
375     processors. These cache-coherence protocols typically enforce consistency
376     between regions of memory with large granularity (typically 128 or 256 byte
377     chunks). The coherency protocols employed can be expensive relative to other
378     memory accesses and so care is taken in the WRAPPER (by padding synchronization
379     structures appropriately) to avoid unnecessary coherence traffic.
380    
\subsubsection{Operating system support for shared memory}
382    
383     Applications running under multiple threads within a single process can
384     use shared memory communication. In this case {\it all} the memory locations
385     in an application are potentially visible to all the compute threads. Multiple
386     threads operating within a single process is the standard mechanism for
387     supporting shared memory that the WRAPPER utilizes. Configuring and launching
388     code to run in multi-threaded mode on specific platforms is discussed in
389 adcroft 1.6 section \ref{sect:running_with_threads}. However, on many systems, potentially
390 cnh 1.1 very efficient mechanisms for using shared memory communication between
391     multiple processes (in contrast to multiple threads within a single
392     process) also exist. In most cases this works by making a limited region of
393     memory shared between processes. The MMAP \ref{magicgarden} and
394     IPC \ref{magicgarden} facilities in UNIX systems provide this capability as do
395     vendor specific tools like LAPI \ref{IBMLAPI} and IMC \ref{Memorychannel}.
396     Extensions exist for the WRAPPER that allow these mechanisms
397     to be used for shared memory communication. However, these mechanisms are not
398     distributed with the default WRAPPER sources, because of their proprietary
399     nature.
400    
401     \subsection{Distributed memory communication}
402 adcroft 1.6 \label{sect:distributed_memory_communication}
403 cnh 1.1 Many parallel systems are not constructed in a way where it is
404     possible or practical for an application to use shared memory
405     for communication. For example cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
407     at the system level. For this sort of system the WRAPPER provides support
408     for communication based on a bespoke communication library
409     (see figure \ref{fig:comm_msg}). The default communication library used is MPI
410     \ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform-specific communication libraries. For example, the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.
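
The short, self-contained sketch below (it is not taken from the MITgcm
sources) illustrates the library-call style of communication shown in
figure \ref{fig:comm_msg}: process 0 transfers one element of an array
to process 1 through the MPI library.
\begin{verbatim}
      PROGRAM COMM_SKETCH
C     Process 0 sends a(3) to process 1 using MPI point-to-point
C     calls, mirroring the SEND/RECV schematic above.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER myRank, ierr, status(MPI_STATUS_SIZE)
      REAL*8  a(5)
      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myRank, ierr )
      IF ( myRank .EQ. 0 ) THEN
        a(3) = 8.
        CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION, 1, 0,
     &                 MPI_COMM_WORLD, ierr )
      ELSEIF ( myRank .EQ. 1 ) THEN
        CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION, 0, 0,
     &                 MPI_COMM_WORLD, status, ierr )
      ENDIF
      CALL MPI_FINALIZE( ierr )
      END
\end{verbatim}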
414    
415     \subsection{Communication primitives}
416 adcroft 1.6 \label{sect:communication_primitives}
417 cnh 1.1
418     \begin{figure}
419     \begin{center}
420 adcroft 1.2 \resizebox{5in}{!}{
421     \includegraphics{part4/comm-primm.eps}
422 cnh 1.1 }
423     \end{center}
424 cnh 1.5 \caption{Three performance critical parallel primitives are provided
425     by the WRAPPER. These primitives are always used to communicate data
426 cnh 1.1 between tiles. The figure shows four tiles. The curved arrows indicate
427     exchange primitives which transfer data between the overlap regions at tile
428     edges and interior regions for nearest-neighbor tiles.
429     The straight arrows symbolize global sum operations which connect all tiles.
430     The global sum operation provides both a key arithmetic primitive and can
serve as a synchronization primitive. A third primitive, a barrier, is also
provided; it behaves much like the global sum primitive.
433     } \label{fig:communication_primitives}
434     \end{figure}
435    
436    
The WRAPPER assumes that optimized communication support may be available
for a small number of communication operations, and that communication
performance optimizations can therefore
be achieved by optimizing a small number of communication primitives.
Three optimizable primitives are provided by the WRAPPER; a sketch of
their typical call forms follows the list below.
442     \begin{itemize}
443     \item{\bf EXCHANGE} This operation is used to transfer data between interior
444     and overlap regions of neighboring tiles. A number of different forms of this
445     operation are supported. These different forms handle
446     \begin{itemize}
447     \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
448     separately.
449     \item Bindings to different communication methods.
450     Exchange primitives select between using shared memory or distributed
451     memory communication.
452     \item Transformation operations required when transporting
453     data between different grid regions. Transferring data
454     between faces of a cube-sphere grid, for example, involves a rotation
455     of vector components.
456     \item Forward and reverse mode computations. Derivative calculations require
457     tangent linear and adjoint forms of the exchange primitives.
458    
459     \end{itemize}
460    
461     \item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
462     operation for the pressure inversion phase of the MITgcm algorithm.
463     For certain configurations scaling can be highly sensitive to
464     the performance of the global sum primitive. This operation is a collective
465     operation involving all tiles of the simulated domain. Different forms
466     of the global sum primitive exist for handling
467     \begin{itemize}
468     \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
469     separately.
\item Bindings to different communication methods.
Global sum primitives select between using shared memory or distributed
memory communication.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the global sum primitives.
475     \end{itemize}
476    
477     \item{\bf BARRIER} The WRAPPER provides a global synchronization function
478     called barrier. This is used to synchronize computations over all tiles.
479     The {\bf BARRIER} and {\bf GLOBAL SUM} primitives have much in common and in
480     some cases use the same underlying code.
481     \end{itemize}
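
The sketch below shows the style in which these primitives appear
within numerical code. The global sum call follows the form used in
figure \ref{fig:bibj_extract} later in this chapter; the exchange
routine name is illustrative only, since the actual exchange entry
points differ by data type and grid configuration.
\begin{verbatim}
C     Illustrative calls to the three WRAPPER primitives. Here phi
C     is a tiled field, err a scalar to be summed over all tiles,
C     and myThid the calling thread number.
      CALL EXCH_XY_RL( phi, myThid )
      CALL GLOBAL_SUM_R8( err, myThid )
      CALL BARRIER( myThid )
\end{verbatim}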
482    
483    
484     \subsection{Memory architecture}
485    
The WRAPPER machine model aims to target efficiently both systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.
497    
498    
499     \begin{figure}
500     \begin{center}
501 adcroft 1.2 \resizebox{5in}{!}{
502     \includegraphics{part4/tiling_detail.eps}
503 cnh 1.1 }
504     \end{center}
505     \caption{The tiling strategy that the WRAPPER supports allows tiles
506     to be shaped to suit the underlying system memory architecture.
Compact tiles that lead to greater memory reuse can be used on cache-based
systems (upper half of figure) with deep memory hierarchies; long tiles
with large inner loops can be used to exploit vector systems having
510     highly pipelined memory systems.
511     } \label{fig:tiling-strategy}
512     \end{figure}
513    
514     \newpage
515     \subsection{Summary}
516     Following the discussion above, the machine model that the WRAPPER
517     presents to an application has the following characteristics
518    
519     \begin{itemize}
520     \item The machine consists of one or more logical processors. \vspace{-3mm}
521     \item Each processor operates on tiles that it owns.\vspace{-3mm}
522     \item A processor may own more than one tile.\vspace{-3mm}
523     \item Processors may compute concurrently.\vspace{-3mm}
524     \item Exchange of information between tiles is handled by the
525     machine (WRAPPER) not by the application.
526     \end{itemize}
527     Behind the scenes this allows the WRAPPER to adapt the machine model
528     functions to exploit hardware on which
529     \begin{itemize}
530     \item Processors may be able to communicate very efficiently with each other
531     using shared memory. \vspace{-3mm}
532     \item An alternative communication mechanism based on a relatively
533     simple inter-process communication API may be required.\vspace{-3mm}
534     \item Shared memory may not necessarily obey sequential consistency,
535     however some mechanism will exist for enforcing memory consistency.
536     \vspace{-3mm}
537     \item Memory consistency that is enforced at the hardware level
538     may be expensive. Unnecessary triggering of consistency protocols
539     should be avoided. \vspace{-3mm}
\item Memory access patterns may need to be either repetitive or highly
pipelined for optimum hardware performance. \vspace{-3mm}
542     \end{itemize}
543    
544     This generic model captures the essential hardware ingredients
545     of almost all successful scientific computer systems designed in the
546     last 50 years.
547    
548     \section{Using the WRAPPER}
549 edhill 1.20 \begin{rawhtml}
550     <!-- CMIREDIR:using_the_wrapper: -->
551     \end{rawhtml}
552 cnh 1.1
553     In order to support maximum portability the WRAPPER is implemented primarily
554     in sequential Fortran 77. At a practical level the key steps provided by the
555     WRAPPER are
556     \begin{enumerate}
557     \item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
559     \item controlling communication between tiles and between concurrently
560     computing CPU's.
561     \end{enumerate}
562     This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how to specify the
way in which a domain is decomposed (or composed). Section
565 adcroft 1.6 \ref{sect:starting_a_code} describes practical details of running codes
566 cnh 1.1 in various different parallel modes on contemporary computer systems.
567 adcroft 1.6 Section \ref{sect:controlling_communication} explains the internal information
568 cnh 1.1 that the WRAPPER uses to control how information is communicated between
569     tiles.
570    
571     \subsection{Specifying a domain decomposition}
572 adcroft 1.6 \label{sect:specifying_a_decomposition}
573 cnh 1.1
574     At its heart much of the WRAPPER works only in terms of a collection of tiles
575     which are interconnected to each other. This is also true of application
576     code operating within the WRAPPER. Application code is written as a series
577     of compute operations, each of which operates on a single tile. If
578     application code needs to perform operations involving data
579     associated with another tile, it uses a WRAPPER function to obtain
580     that data.
581     The specification of how a global domain is constructed from tiles or alternatively
582     how a global domain is decomposed into tiles is made in the file {\em SIZE.h}.
583     This file defines the following parameters \\
584    
585     \fbox{
586     \begin{minipage}{4.75in}
587     Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
588     File: {\em model/inc/SIZE.h}
589     \end{minipage}
590     } \\
591    
592     Together these parameters define a tiling decomposition of the style shown in
593     figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
594     the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computation stencil that the numerical code's finite-difference operations
require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
not cause any other problems.
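
These parameters appear directly in the declarations of the arrays that
hold per-tile model state. As a minimal sketch (the field name {\em phi}
is illustrative; the dimensioning follows the solver extract shown in
figure \ref{fig:bibj_extract} later in this section), a two-dimensional
tiled field including its overlap points is declared as
\begin{verbatim}
C     The interior of each tile is sNx by sNy points, surrounded by
C     OLx and OLy overlap points; the last two dimensions index the
C     nSx*nSy tiles held by this process.
      REAL*8 phi(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
\end{verbatim}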
605    
606     \begin{figure}
607     \begin{center}
608 adcroft 1.2 \resizebox{5in}{!}{
609     \includegraphics{part4/size_h.eps}
610 cnh 1.1 }
611     \end{center}
612     \caption{ The three level domain decomposition hierarchy employed by the
613     WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
614     to a single process. Multiple processes can exist, each with multiple tiles.
615     Tiles within a process can be spread over multiple compute threads.
616     } \label{fig:labelled_tile}
617     \end{figure}
618    
619     The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
620     be created within a single process. Each of these tiles will have internal
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
628     If the decomposition is run in a single process running multiple threads
629     but attached to a single physical processor, then, in general, the computation
630     for different tiles will be interleaved by system level software.
631     This too is a valid mode of operation.
632    
The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
634     numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
635     are used to form the loop ranges for many numerical calculations and
636     to provide dimensions for arrays holding numerical state.
The parameters {\em nSx} and {\em nSy} are used in conjunction with the thread number
638     parameter {\em myThid}. Much of the numerical code operating within the
639     WRAPPER takes the form
640     \begin{verbatim}
641     DO bj=myByLo(myThid),myByHi(myThid)
642     DO bi=myBxLo(myThid),myBxHi(myThid)
643     :
644     a block of computations ranging
645     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
646     :
647     ENDDO
648     ENDDO
649    
650     communication code to sum a number or maybe update
651     tile overlap regions
652    
653     DO bj=myByLo(myThid),myByHi(myThid)
654     DO bi=myBxLo(myThid),myBxHi(myThid)
655     :
656     another block of computations ranging
657     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
658     :
659     ENDDO
660     ENDDO
661     \end{verbatim}
662     The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
663     myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
664     schematic. These variables specify the subset of the tiles in
665     the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
666     thread number {\em myThid} owns. The thread number variable {\em myThid}
667     ranges from 1 to the total number of threads requested at execution time.
668     For each value of {\em myThid} the loop scheme above will step sequentially
669     through the tiles owned by that thread. However, different threads will
670     have different ranges of tiles assigned to them, so that separate threads can
671     compute iterations of the {\em bi}, {\em bj} loop concurrently.
672     Within a {\em bi}, {\em bj} loop
673     computation is performed concurrently over as many processes and threads
674     as there are physical processors available to compute.
675    
An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs over
{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
680     which is then used to reference exchange parameters.
681    
The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example the evaluation of all
the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.
696    
697     \begin{figure}
698     \begin{verbatim}
699     REAL*8 cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
700     REAL*8 err
701     :
702     :
703     other computations
704     :
705     :
706     err = 0.
707     DO bj=myByLo(myThid),myByHi(myThid)
708     DO bi=myBxLo(myThid),myBxHi(myThid)
709     DO J=1,sNy
710     DO I=1,sNx
711     err = err +
712     & cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
713     ENDDO
714     ENDDO
715     ENDDO
716     ENDDO
717    
718     CALL GLOBAL_SUM_R8( err , myThid )
719     err = SQRT(err)
720    
721     \end{verbatim}
722     \caption{Example of numerical code for calculating
723     the l2-norm of a vector within the WRAPPER. Notice that
under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These rightmost indices are tile indexes. Different
threads within a single process operate on different ranges of tile
index, as controlled by the settings of
728     {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
729     } \label{fig:bibj_extract}
730     \end{figure}
731    
732     The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
733     are used to indicate to the WRAPPER level how many processes (each with
734     {\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
735     This information is needed during initialization and during I/O phases.
736     However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy}
737     the values of {\em nPx} and {\em nPy} are absent
738     from the core numerical and support code.
739    
740     \subsubsection{Examples of {\em SIZE.h} specifications}
741    
The following different {\em SIZE.h} parameter settings illustrate how to
743     interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
744     and {\em nPy}.
745     \begin{enumerate}
746     \item
747     \begin{verbatim}
748     PARAMETER (
749     & sNx = 90,
750     & sNy = 40,
751     & OLx = 3,
752     & OLy = 3,
753     & nSx = 1,
754     & nSy = 1,
755     & nPx = 1,
756     & nPy = 1)
757     \end{verbatim}
758     This sets up a single tile with x-dimension of ninety grid points, y-dimension of
759     forty grid points, and x and y overlaps of three grid points each.
760     \item
761     \begin{verbatim}
762     PARAMETER (
763     & sNx = 45,
764     & sNy = 20,
765     & OLx = 3,
766     & OLy = 3,
767     & nSx = 1,
768     & nSy = 1,
769     & nPx = 2,
770     & nPy = 2)
771     \end{verbatim}
772     This sets up tiles with x-dimension of forty-five grid points, y-dimension of
773     twenty grid points, and x and y overlaps of three grid points each. There are
774     four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
775     arranged so that the global domain size is again ninety grid points in x and
776     forty grid points in y. In general the formula for global grid size (held in
777     model variables {\em Nx} and {\em Ny}) is
778     \begin{verbatim}
779     Nx = sNx*nSx*nPx
780     Ny = sNy*nSy*nPy
781     \end{verbatim}
782     \item
783     \begin{verbatim}
784     PARAMETER (
785     & sNx = 90,
786     & sNy = 10,
787     & OLx = 3,
788     & OLy = 3,
789     & nSx = 1,
790     & nSy = 2,
791     & nPx = 1,
792     & nPy = 2)
793     \end{verbatim}
794     This sets up tiles with x-dimension of ninety grid points, y-dimension of
ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
800     sequentially if they are given to a single thread within a single process.
801     Alternatively if the code is invoked with multiple threads per process
802 edhill 1.10 the two domains in y may be computed concurrently.
803 cnh 1.1 \item
804     \begin{verbatim}
805     PARAMETER (
806     & sNx = 32,
807     & sNy = 32,
808     & OLx = 3,
809     & OLy = 3,
810     & nSx = 6,
811     & nSy = 1,
812     & nPx = 1,
813     & nPy = 1)
814     \end{verbatim}
815     This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
816     thirty-two grid points, and x and y overlaps of three grid points each.
817     There are six tiles allocated to six separate logical processors ({\em nSx=6}).
818     This set of values can be used for a cube sphere calculation.
819     Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
823     tile-tile communication code.
824     \end{enumerate}
825    
826    
827     \subsection{Starting the code}
828 adcroft 1.6 \label{sect:starting_the_code}
829 cnh 1.1 When code is started under the WRAPPER, execution begins in a main routine {\em
830     eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
831     to the application through a routine called {\em THE\_MODEL\_MAIN()}
832     once the WRAPPER has initialized correctly and has created the necessary variables
833     to support subsequent calls to communication routines
834     by the application code. The startup calling sequence followed by the
835     WRAPPER is shown in figure \ref{fig:wrapper_startup}.
836    
837     \begin{figure}
838 heimbach 1.8 {\footnotesize
839 cnh 1.1 \begin{verbatim}

MAIN
|
|--EEBOOT :: WRAPPER initialization
|  |
|  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
|  |                     allow basic I/O.
|  |-- EEINTRO_MSG :: Write startup greeting.
|  |
|  |-- EESET_PARMS :: Set WRAPPER parameters
|  |
|  |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
|  |
|  |-- INI_PROCS :: Associate processes with grid regions.
|  |
|  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
|  |
|  |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
|  |                             :: communication data structures
|
|
|--CHECK_THREADS :: Validate multiple thread start up.
|
|--THE_MODEL_MAIN :: Numerical code top-level driver routine

866     \end{verbatim}
867 heimbach 1.8 }
868 cnh 1.1 \caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
871     } \label{fig:wrapper_startup}
872     \end{figure}
873    
874     \subsubsection{Multi-threaded execution}
875 adcroft 1.6 \label{sect:multi-threaded-execution}
876 cnh 1.1 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
877     WRAPPER may cause several coarse grain threads to be initialized. The routine
878     {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
879     stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the following
steps.\\
884    
885     \paragraph{Compilation}
886     First the code must be compiled with appropriate multi-threading directives
887     active in the file {\em main.F} and with appropriate compiler flags
888     to request multi-threading support. The header files
889     {\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
890     contain directives compatible with compilers for Sun, Compaq, SGI,
891     Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
activated by using the compile-time
flags {\em -DTARGET\_SUN},
{\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
896     for invoking multi-threaded compilation vary from system to system
897     and from compiler to compiler. The options will be described
898     in the individual compiler documentation. For the Fortran compiler
899     from Sun the following options are needed to correctly compile
900     multi-threaded code
901     \begin{verbatim}
902     -stackvar -explicitpar -vpara -noautopar
903     \end{verbatim}
904     These options are specific to the Sun compiler. Other compilers
905     will use different syntax that will be described in their
906     documentation. The effect of these options is as follows
907     \begin{enumerate}
908     \item {\bf -stackvar} Causes all local variables to be allocated in stack
909     storage. This is necessary for local variables to ensure that they are private
910     to their thread. Note, when using this option it may be necessary to override
911     the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the setting of the command shell's
{\em stack-size} limit variable. However, on some systems changing this limit
914     will require privileged administrator access to modify system parameters.
915    
916     \item {\bf -explicitpar} Requests that multiple threads be spawned
917     in response to explicit directives in the application code. These
918     directives are inserted with syntax appropriate to the particular target
919     platform when, for example, the {\em -DTARGET\_SUN} flag is selected.
920    
921     \item {\bf -vpara} This causes the compiler to describe the multi-threaded
922     configuration it is creating. This is not required
but it can be useful when troubleshooting.
924    
925     \item {\bf -noautopar} This inhibits any automatic multi-threaded
926     parallelization the compiler may otherwise generate.
927    
928     \end{enumerate}
929    
930    
931     An example of valid settings for the {\em eedata} file for a
932     domain with two subdomains in y and running with two threads is shown
933     below
934     \begin{verbatim}
935     nTx=1,nTy=2
936     \end{verbatim}
937     This set of values will cause computations to stay within a single
938     thread when moving across the {\em nSx} sub-domains. In the y-direction,
939     however, sub-domains will be split equally between two threads.
940    
941     \paragraph{Multi-threading files and parameters} The following
942     files and variables are used in setting up multi-threaded execution.\\
943    
944     \fbox{
945     \begin{minipage}{4.75in}
946     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
947     File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
948     File: {\em model/src/THE\_MODEL\_MAIN.F}\\
949     File: {\em eesupp/src/MAIN.F}\\
950 afe 1.11 File: {\em tools/genmake2}\\
951 cnh 1.1 File: {\em eedata}\\
952     CPP: {\em TARGET\_SUN}\\
953     CPP: {\em TARGET\_DEC}\\
954     CPP: {\em TARGET\_HP }\\
955     CPP: {\em TARGET\_SGI}\\
956     CPP: {\em TARGET\_CRAY\_VECTOR}\\
957     Parameter: {\em nTx}\\
958     Parameter: {\em nTy}
959     \end{minipage}
960     } \\
961    
962     \subsubsection{Multi-process execution}
963 adcroft 1.6 \label{sect:multi-process-execution}
964 cnh 1.1
Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
is that many system libraries are still not ``thread-safe''. This means that, for
example, on some systems it is not safe to call system routines to
do I/O when running in multi-threaded mode, except in a limited set of
circumstances. Another reason is that support for multi-threaded programming
971     models varies between systems.
972    
973     Multi-process execution is more ubiquitous.
974     In order to run code in a multi-process configuration a decomposition
specification (see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
appropriate compile-time and run-time steps must be taken.
980    
981     \paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that portable MPI libraries are available
for controlling the start-up of multiple processes. The MPI libraries
are not required for performance-critical
communication, although they are usually used for that as well.
However, in order to simplify the task
of controlling and coordinating the start-up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. Currently MPI libraries are
invoked by
specifying the appropriate options file with the
{\tt -of} flag when running the {\em genmake2}
993 afe 1.12 script, which generates the Makefile for compiling and linking MITgcm.
994     (Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
995     {\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More
996     detailed information about the use of {\em genmake2} for specifying
997 afe 1.15 local compiler flags is located in section \ref{sect:genmake}.\\
998 cnh 1.1
999    
1000     \fbox{
1001     \begin{minipage}{4.75in}
1002 afe 1.15 Directory: {\em tools/build\_options}\\
1003 afe 1.11 File: {\em tools/genmake2}
1004 cnh 1.1 \end{minipage}
1005     } \\
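
As an illustration of the build sequence this describes, an MPI-enabled
executable is typically produced with a command sequence of the form
shown below. The options file name is a placeholder; choose one
appropriate to your platform from {\em tools/build\_options}.
\begin{verbatim}
genmake2 -of my_platform_optionsfile
make depend
make
\end{verbatim}
The resulting executable {\em mitgcmuv} is then launched with the MPI
start-up mechanism described next.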
1006     \paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
1008     associated with the distribution of MPI installed on a system will
1009     describe how to start a program using that distribution.
1010     For the free, open-source MPICH system the MITgcm program is started
1011     using a command such as
1012     \begin{verbatim}
1013     mpirun -np 64 -machinefile mf ./mitgcmuv
1014     \end{verbatim}
1015 cnh 1.5 In this example the text {\em -np 64} specifies the number of processes
1016 cnh 1.1 that will be created. The numeric value {\em 64} must be equal to the
1017     product of the processor grid settings of {\em nPx} and {\em nPy}
1018     in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
1019     called ``mf'' will be read to get a list of processor names on
1020     which the sixty-four processes will execute. The syntax of this file
1021 afe 1.15 is specified by the MPI distribution.
1022 cnh 1.1 \\
1023    
1024     \fbox{
1025     \begin{minipage}{4.75in}
1026     File: {\em SIZE.h}\\
1027     Parameter: {\em nPx}\\
1028     Parameter: {\em nPy}
1029     \end{minipage}
1030     } \\
1031    
1032 adcroft 1.2
1033     \paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.
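
For example, on a system whose compiler uses the PARALLEL variable, a
two-thread run matching the {\em eedata} example given earlier
({\em nTx=1, nTy=2}) might be prepared as follows (csh syntax; the
variable name and syntax vary between platforms):
\begin{verbatim}
setenv PARALLEL 2
\end{verbatim}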
1040    
1041     \paragraph{Runtime input parameters}
1042     Finally the file {\em eedata} needs to be configured to indicate
1043     the number of threads to be used in the x and y directions.
1044     The variables {\em nTx} and {\em nTy} in this file are used to
1045     specify the information required. The product of {\em nTx} and
{\em nTy} must be equal to the number of threads spawned, i.e.,
the setting of the environment variable PARALLEL.
1048     The value of {\em nTx} must subdivide the number of sub-domains
1049     in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
1050     number of sub-domains in y ({\em nSy}) exactly.
1051 cnh 1.1 The multiprocess startup of the MITgcm executable {\em mitgcmuv}
1052     is controlled by the routines {\em EEBOOT\_MINIMAL()} and
1053     {\em INI\_PROCS()}. The first routine performs basic steps required
1054     to make sure each process is started and has a textual output
1055     stream associated with it. By default two output files are opened
1056     for each process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.
The {\bf NNNN} part of the name is filled in with the process
1058     number so that process number 0 will create output files
1059     {\bf STDOUT.0000} and {\bf STDERR.0000}, process number 1 will create
1060     output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
1061     are used for reporting status and configuration information and
1062     for reporting error conditions on a process by process basis.
1063 adcroft 1.2 The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
1064 cnh 1.1 {\em myProcId} and {\em MPI\_COMM\_MODEL}.
1065     These variables are related
1066     to processor identification are are used later in the routine
1067     {\em INI\_PROCS()} to allocate tiles to processes.
1068    
Allocation of processes to tiles is controlled by the routine
{\em INI\_PROCS()}. For each process this routine sets
the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
These variables specify, in index space, the coordinates
of the south-western corner of the south-westernmost tile
owned by this process.
1075 cnh 1.1 The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
1076     are also set in this routine. These are used to identify
1077     processes holding tiles to the west, east, south and north
1078     of this process. These values are stored in global storage
1079     in the header file {\em EESUPPORT.h} for use by
1080 afe 1.13 communication routines. The above does not hold when the
1081     exch2 package is used -- exch2 sets its own parameters to
1082     specify the global indices of tiles and their relationships
1083 afe 1.15 to each other. See the documentation on the exch2 package
1084     (\ref{sec:exch2}) for
1085     details.
1086 cnh 1.1 \\
1087    
1088     \fbox{
1089     \begin{minipage}{4.75in}
1090     File: {\em eesupp/src/eeboot\_minimal.F}\\
1091     File: {\em eesupp/src/ini\_procs.F} \\
1092     File: {\em eesupp/inc/EESUPPORT.h} \\
1093     Parameter: {\em myProcId} \\
1094     Parameter: {\em MPI\_COMM\_MODEL} \\
1095     Parameter: {\em myXGlobalLo} \\
1096     Parameter: {\em myYGlobalLo} \\
1097     Parameter: {\em pidW } \\
1098     Parameter: {\em pidE } \\
1099     Parameter: {\em pidS } \\
1100     Parameter: {\em pidN }
1101     \end{minipage}
1102     } \\
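To illustrate the meaning of these variables, for a regular decomposition
(without exch2) the coordinates follow directly from the position of the
process in the {\em nPx} $\times$ {\em nPy} process grid. The fragment
below is a simplified sketch of such a calculation and is not the actual
{\em INI\_PROCS()} code; the layout order of processes is assumed for the
purpose of illustration.
\begin{verbatim}
C     Illustrative sketch only -- not the actual INI_PROCS() code.
C     iPx, iPy are the (0-based) coordinates of this process in the
C     nPx x nPy process grid; sNx, sNy are the tile extents and
C     nSx, nSy the tiles per process (see SIZE.h).
      iPx = MOD( myProcId, nPx )
      iPy =      myProcId / nPx
      myXGlobalLo = 1 + iPx*sNx*nSx
      myYGlobalLo = 1 + iPy*sNy*nSy
\end{verbatim}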
1103    
1104    
1105     \subsection{Controlling communication}
1106     The WRAPPER maintains internal information that is used for communication
1107     operations and that can be customized for different platforms. This section
1108     describes the information that is held and used.
1109 adcroft 1.2
1110 cnh 1.1 \begin{enumerate}
\item {\bf Tile-tile connectivity information}
For each tile the WRAPPER
records the tile number of the tiles to the north,
south, east and
west of that tile. Each tile number is unique over all tiles in a
configuration. Except when using the cubed sphere and the exch2 package,
the numbers are held in the variables {\em tileNo}
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
1120     that specifies the type of communication that is used between tiles.
1121     This information is held in the variables {\em tileCommModeN},
1122     {\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.
1123     This latter set of variables can take one of the following values
1124     {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
1125     A value of {\em COMM\_NONE} is used to indicate that a tile has no
1126 cnh 1.4 neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
1128     memory communication is required to communicate between
1129 adcroft 1.6 these tile faces ( see section \ref{sect:distributed_memory_communication}).
1130 cnh 1.1 A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
1131     forms of shared memory communication ( see section
1132 adcroft 1.6 \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
1133 cnh 1.1 that a CPU should communicate by writing to data structures owned by another
1134     CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
1135     from data structures owned by another CPU. These flags affect the behavior
1136     of the WRAPPER exchange primitive
1137     (see figure \ref{fig:communication_primitives}). The routine
1138     {\em ini\_communication\_patterns()} is responsible for setting the
1139     communication mode values for each tile.
1140 afe 1.14
1141     When using the cubed sphere configuration with the exch2 package, the
1142     relationships between tiles and their communication methods are set
by the package in other variables. See the exch2 package documentation
(\ref{sec:exch2}) for details.
1145 afe 1.14
1146    
1147 cnh 1.1
1148     \fbox{
1149     \begin{minipage}{4.75in}
1150     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1151     File: {\em eesupp/inc/EESUPPORT.h} \\
1152     Parameter: {\em tileNo} \\
1153     Parameter: {\em tileNoE} \\
1154     Parameter: {\em tileNoW} \\
1155     Parameter: {\em tileNoN} \\
1156     Parameter: {\em tileNoS} \\
1157     Parameter: {\em tileCommModeE} \\
1158     Parameter: {\em tileCommModeW} \\
1159     Parameter: {\em tileCommModeN} \\
1160     Parameter: {\em tileCommModeS} \\
1161     \end{minipage}
1162     } \\
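The fragment below sketches, purely for illustration, how these flags
steer the choice of communication mechanism within an exchange. The
{\em (bi,bj)} tile indexing is an assumption made for this sketch and the
fragment is not the actual WRAPPER exchange source.
\begin{verbatim}
C     Illustrative sketch only -- not the actual exchange code.
C     Select the mechanism used to fill the western overlap region
C     of tile (bi,bj).
      IF     ( tileCommModeW(bi,bj) .EQ. COMM_MSG ) THEN
C       Distributed memory: exchange messages with the process
C       holding tile tileNoW(bi,bj).
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT ) THEN
C       Shared memory: write into the neighboring tile's buffer.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_GET ) THEN
C       Shared memory: read directly from the neighboring tile.
      ENDIF
\end{verbatim}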
1163    
1164     \item {\bf MP directives}
1165     The WRAPPER transfers control to numerical application code through
1166     the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
1167     that allows for it to be invoked by several threads. Support for this
1168     is based on using multi-processing (MP) compiler directives.
1169     Most commercially available Fortran compilers support the generation
1170     of code to spawn multiple threads through some form of compiler directives.
Since this is generally much more convenient than writing code that
interfaces directly to operating-system libraries to spawn threads
explicitly, and since on some systems it may be the only method available,
the WRAPPER is distributed with template MP directives for a number of
systems.
1175    
1176     These directives are inserted into the code just before and after the
1177     transfer of control to numerical algorithm code through the routine
1178     {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows an example of
1179     the code that performs this process for a Silicon Graphics system.
1180     This code is extracted from the files {\em main.F} and
1181     {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads} specifies
1182     how many instances of the routine {\em THE\_MODEL\_MAIN} will
1183     be created. The value of {\em nThreads} is set in the routine
{\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to
the product of the parameters {\em nTx} and {\em nTy} that
1186     are read from the file {\em eedata}. If the value of {\em nThreads}
1187     is inconsistent with the number of threads requested from the
1188     operating system (for example by using an environment
1189 adcroft 1.6 variable as described in section \ref{sect:multi_threaded_execution})
1190 cnh 1.1 then usually an error will be reported by the routine
1191     {\em CHECK\_THREADS}.\\
1192    
1193     \fbox{
1194     \begin{minipage}{4.75in}
1195     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1196     File: {\em eesupp/src/check\_threads.F} \\
1197     File: {\em eesupp/src/main.F} \\
1198     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1199     File: {\em eedata } \\
1200     Parameter: {\em nThreads} \\
1201     Parameter: {\em nTx} \\
1202     Parameter: {\em nTy} \\
1203     \end{minipage}
1204     }
1205    
1206     \item {\bf memsync flags}
1207 adcroft 1.6 As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
1209     The routine {\em MEMSYNC()} is used for this purpose. This routine should
1210     not need modifying and the information below is only provided for
1211     completeness. A logical parameter {\em exchNeedsMemSync} set
1212     in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether
1213     the {\em MEMSYNC()} primitive is called. In general this
1214     routine is only used for multi-threaded execution.
1215     The code that goes into the {\em MEMSYNC()}
1216     routine is specific to the compiler and
1217     processor being used for multi-threaded execution and in general
1218     must be written using a short code snippet of assembly language.
1219     For an Ultra Sparc system the following code snippet is used
1220     \begin{verbatim}
1221     asm("membar #LoadStore|#StoreStore");
1222     \end{verbatim}
for an Alpha-based system the equivalent code reads
1224 cnh 1.1 \begin{verbatim}
1225     asm("mb");
1226     \end{verbatim}
1227     while on an x86 system the following code is required
1228     \begin{verbatim}
1229     asm("lock; addl $0,0(%%esp)": : :"memory")
1230     \end{verbatim}
1231    
1232     \item {\bf Cache line size}
1233 adcroft 1.6 As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
1235 cnh 1.5 coherence traffic on an SMP system. To do this the shared memory data structures
1236 cnh 1.1 used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
1237     are padded. The variables that control the padding are set in the
1238     header file {\em EEPARAMS.h}. These variables are called
1239     {\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
1240     {\em lShare8}. The default values should not normally need changing.
1241     \item {\bf \_BARRIER}
1242     This is a CPP macro that is expanded to a call to a routine
1243 cnh 1.5 which synchronizes all the logical processors running under the
1244 cnh 1.1 WRAPPER. Using a macro here preserves flexibility to insert
1245     a specialized call in-line into application code. By default this
1246     resolves to calling the procedure {\em BARRIER()}. The default
1247     setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1248    
1249     \item {\bf \_GSUM}
1250     This is a CPP macro that is expanded to a call to a routine
1251 cnh 1.5 which sums up a floating point number
1252 cnh 1.1 over all the logical processors running under the
1253     WRAPPER. Using a macro here provides extra flexibility to insert
1254     a specialized call in-line into application code. By default this
1255 cnh 1.5 resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
1256     64-bit floating point operands)
1257     or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1258 cnh 1.1 setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
1259     The \_GSUM macro is a performance critical operation, especially for
1260     large processor count, small tile size configurations.
1261 adcroft 1.6 The custom communication example discussed in section \ref{sect:jam_example}
1262 cnh 1.1 shows how the macro is used to invoke a custom global sum routine
1263     for a specific set of hardware.
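Schematically, the idea behind these macro mappings is shown in the
fragment below. This is a simplified sketch and the argument lists are
illustrative assumptions only; consult {\em CPP\_EEMACROS.h} for the
actual definitions.
\begin{verbatim}
C     Simplified sketch only -- see CPP_EEMACROS.h for the real mappings.
#define _BARRIER        CALL BARRIER( myThid )
#define _GSUM( a, b )   CALL GLOBAL_SUM_R8( a, b )
\end{verbatim}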
1264    
1265     \item {\bf \_EXCH}
1266     The \_EXCH CPP macro is used to update tile overlap regions.
1267     It is qualified by a suffix indicating whether overlap updates are for
1268     two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1269     physical fields and whether fields are 32-bit floating point
1270     ( \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point
1271     ( \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined
1272     in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
1273     \_EXCH operation plays a crucial role in scaling to small tile,
1274     large logical and physical processor count configurations.
1275 adcroft 1.6 The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.
1277 cnh 1.1
1278     The \_EXCH operation is also central to supporting grids such as
1279     the cube-sphere grid. In this class of grid a rotation may be required
1280     between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
1282 afe 1.16 be embedded within a custom form of the \_EXCH primitive. In these
1283     cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
package documentation (\ref{sec:exch2}).
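As a schematic illustration of usage (the field name and argument list
here are assumed rather than quoted from the source), updating the
overlap regions of a three-dimensional 64-bit field takes the form
\begin{verbatim}
C     Illustrative usage sketch (field name assumed):
      _EXCH_XYZ_R8( theta, myThid )
\end{verbatim}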
1285 cnh 1.1
1286     \item {\bf Reverse Mode}
1287     The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
1289     These reverse mode forms can be found in the
1290 cnh 1.5 source code directory {\em pkg/autodiff}.
1291 cnh 1.1 For the global sum primitive the reverse mode form
1292     calls are to {\em GLOBAL\_ADSUM\_R4} and
1293     {\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the
1294 cnh 1.5 exchange primitives are found in routines
1295 cnh 1.1 prefixed {\em ADEXCH}. The exchange routines make calls to
1296     the same low-level communication primitives as the forward mode
1297     operations. However, the routine argument {\em simulationMode}
1298     is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
1300     appropriate communication operation should be performed.
1301 afe 1.16
1302 cnh 1.1 \item {\bf MAX\_NO\_THREADS}
1303     The variable {\em MAX\_NO\_THREADS} is used to indicate the
1304     maximum number of OS threads that a code will use. This
1305     value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
1306     For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.
1309    
1310     For certain physical parametrization schemes it is necessary to have
1311     a substantial number of work arrays. Where these arrays are allocated
1312     in heap storage ( for example COMMON blocks ) multi-threaded
1313     execution will require multiple instances of the COMMON block data.
This can be achieved using a Fortran 90 module construct; if that is
unavailable, the work arrays can instead be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} (as described in section
1318 adcroft 1.6 \ref{sect:specifying_a_decomposition}). However, if the configuration
1319 cnh 1.1 being specified involves many more tiles than OS threads then
1320     it can save memory resources to reduce the variable
1321     {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
1322 cnh 1.5 will be used and to declare the physical parameterization
1323     work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
1324 cnh 1.1 An example of this is given in the verification experiment
1325     {\em aim.5l\_cs}. Here the default setting of
1326     {\em MAX\_NO\_THREADS} is altered to
1327     \begin{verbatim}
1328     INTEGER MAX_NO_THREADS
1329     PARAMETER ( MAX_NO_THREADS = 6 )
1330     \end{verbatim}
1331     and several work arrays for storing intermediate calculations are
created with declarations of the form
1333     \begin{verbatim}
1334     common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1335     \end{verbatim}
1336 cnh 1.5 This declaration scheme is not used widely, because most global data
is used for permanent, not temporary, storage of state information.
1338     In the case of permanent state information this approach cannot be used
1339     because there has to be enough storage allocated for all tiles.
1340     However, the technique can sometimes be a useful scheme for reducing memory
1341 cnh 1.5 requirements in complex physical parameterizations.
1342 adcroft 1.2 \end{enumerate}
1343    
1344     \begin{figure}
1345     \begin{verbatim}
1346     C--
1347     C-- Parallel directives for MIPS Pro Fortran compiler
1348     C--
1349     C Parallel compiler directives for SGI with IRIX
1350     C$PAR PARALLEL DO
1351     C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1352     C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1353     C
1354     DO I=1,nThreads
1355     myThid = I
1356    
1357     C-- Invoke nThreads instances of the numerical model
1358     CALL THE_MODEL_MAIN(myThid)
1359    
1360     ENDDO
1361     \end{verbatim}
1362     \caption{Prior to transferring control to
1363     the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
1364     MP directives to spawn multiple threads.
1365     } \label{fig:mp_directives}
1366     \end{figure}
1367 cnh 1.1
1368    
1369     \subsubsection{Specializing the Communication Code}
1370    
1371     The isolation of performance critical communication primitives and the
1372     sub-division of the simulation domain into tiles is a powerful tool.
1373     Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.
1375 cnh 1.1
1376     \subsubsection{JAM example}
1377 adcroft 1.6 \label{sect:jam_example}
1378 cnh 1.1 On some platforms a big performance boost can be obtained by
1379     binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example the
shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
1382     is used as an illustration of a specialized communication configuration
1383     that substitutes for standard, portable forms of {\em \_EXCH} and
1384     {\em \_GSUM}. It affects three source files {\em eeboot.F},
1385     {\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
it has the following effects.
1387     \begin{itemize}
1388     \item An extra phase is included at boot time to initialize the custom
1389     communications library ( see {\em ini\_jam.F}).
1390     \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
1391     with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F})
\item A highly specialized form of the exchange operator (optimized
1393 cnh 1.4 for overlap regions of width one) is substituted into the elliptic
1394 cnh 1.1 solver routine {\em cg2d.F}.
1395     \end{itemize}
1396     Developing specialized code for other libraries follows a similar
1397     pattern.
1398    
1399     \subsubsection{Cube sphere communication}
1400 adcroft 1.6 \label{sect:cube_sphere_communication}
1401 cnh 1.1 Actual {\em \_EXCH} routine code is generated automatically from
1402     a series of template files, for example {\em exch\_rx.template}.
1403     This is done to allow a large number of variations on the exchange
1404     process to be maintained. One set of variations supports the
1405 cnh 1.4 cube sphere grid. Support for a cube sphere grid in MITgcm is based
1406 afe 1.16 on having each face of the cube as a separate tile or tiles.
1407 cnh 1.4 The exchange routines are then able to absorb much of the
1408 cnh 1.1 detailed rotation and reorientation required when moving around the
1409     cube grid. The set of {\em \_EXCH} routines that contain the
1410     word cube in their name perform these transformations.
1411     They are invoked when the run-time logical parameter
1412     {\em useCubedSphereExchange} is set true. To facilitate the
1413     transformations on a staggered C-grid, exchange operations are defined
1414 cnh 1.4 separately for both vector and scalar quantities and for
1415 cnh 1.1 grid-centered and for grid-face and corner quantities.
1416     Three sets of exchange routines are defined. Routines
1417     with names of the form {\em exch\_rx} are used to exchange
1418     cell centered scalar quantities. Routines with names of the form
1419     {\em exch\_uv\_rx} are used to exchange vector quantities located at
1420     the C-grid velocity points. The vector quantities exchanged by the
1421     {\em exch\_uv\_rx} routines can either be signed (for example velocity
1422     components) or un-signed (for example grid-cell separations).
1423     Routines with names of the form {\em exch\_z\_rx} are used to exchange
1424     quantities at the C-grid vorticity point locations.
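Schematically, and with argument lists deliberately omitted because they
vary with the field type being exchanged, the three families are used as
follows ({\em rx} stands for the R4/R8 variants generated from the
templates; the field names are illustrative only):
\begin{verbatim}
C     Illustrative pattern only (argument lists omitted):
C     exch_rx(    theta, ... )         cell-centered scalar
C     exch_uv_rx( uVel, vVel, ... )    C-grid vector pair at velocity points
C     exch_z_rx(  vort3, ... )         quantity at vorticity points
\end{verbatim}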
1425    
1426    
1427    
1428    
1429     \section{MITgcm execution under WRAPPER}
1430 edhill 1.19 \begin{rawhtml}
1431     <!-- CMIREDIR:mitgcm_wrapper: -->
1432     \end{rawhtml}
1433 cnh 1.1
1434     Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.
1437 cnh 1.1
1438     \subsection{Annotated call tree for MITgcm and WRAPPER}
1439 adcroft 1.6 \label{sect:calling_sequence}
1440 cnh 1.1
The WRAPPER layer:
1442    
1443 heimbach 1.8 {\footnotesize
1444 cnh 1.1 \begin{verbatim}
1445    
1446     MAIN
1447     |
1448     |--EEBOOT :: WRAPPER initialization
1449     | |
 | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1451     | | allow basic I/O.
1452     | |-- EEINTRO_MSG :: Write startup greeting.
1453     | |
1454     | |-- EESET_PARMS :: Set WRAPPER parameters
1455     | |
1456     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1457     | |
1458     | |-- INI_PROCS :: Associate processes with grid regions.
1459     | |
1460     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1461     | |
1462     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1463     | :: communication data structures
1464     |
1465     |
1466     |--CHECK_THREADS :: Validate multiple thread start up.
1467     |
1468     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1469    
1470     \end{verbatim}
1471 heimbach 1.8 }
1472 cnh 1.1
Core equations plus packages:
1474    
1475 heimbach 1.8 {\footnotesize
1476 cnh 1.1 \begin{verbatim}
1477     C
1478     C
1479     C Invocation from WRAPPER level...
1480     C :
1481     C :
1482     C |
1483     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1484     C | :: Called from WRAPPER level numerical
1485 cnh 1.4 C | :: code invocation routine. On entry
1486 cnh 1.1 C | :: to THE_MODEL_MAIN separate thread and
1487     C | :: separate processes will have been established.
1488     C | :: Each thread and process will have a unique ID
1489     C | :: but as yet it will not be associated with a
1490     C | :: specific region in decomposed discrete space.
1491     C |
1492     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1493     C | | :: grid, solver matrices etc..
1494     C | |
1495     C | |-INI_PARMS :: Routine to set kernel model parameters.
1496     C | | :: By default kernel parameters are read from file
1497     C | | :: "data" in directory in which code executes.
1498     C | |
1499 cnh 1.4 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1500 cnh 1.1 C | |
1501 cnh 1.4 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1502 cnh 1.1 C | | | :: Grid arrays are held and described in GRID.h.
1503     C | | |
1504 cnh 1.4 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1505 cnh 1.1 C | | |
1506 cnh 1.4 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1507 cnh 1.1 C | | | :: (calculate grid from kernel parameters).
1508     C | | |
1509     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1510 cnh 1.4 C | | | :: initialization (calculate grid from
1511 cnh 1.1 C | | | :: kernel parameters).
1512     C | | |
1513     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1514 cnh 1.4 C | | :: grid initializations. ( input from raw
1515 cnh 1.1 C | | :: grid files, LONC.bin, DXF.bin etc... )
1516     C | |
C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1518     C | |
1519     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1520     C | | :: land masking for solid-fluid boundaries.
1521     C | |
1522     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1523     C | |
1524     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1525 cnh 1.4 C | | :: sphere options are coded.
1526 cnh 1.1 C | |
C | |-PACKAGES_BOOT :: Start up the optional package environment.
1528     C | | :: Runtime selection of active packages.
1529     C | |
1530     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1531     C | | |
1532     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1533     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1534     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1535     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1537     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1538     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1539     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1540     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1541     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1542     C | |
1543     C | |-PACKAGES_CHECK
1544     C | | |
1545     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1546 cnh 1.4 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1547 cnh 1.1 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1548     C | |
1549     C | |-PACKAGES_INIT_FIXED
1550     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1551     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1552     C | |
1553     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1554     C | |
1555     C | |-INI_CG2D :: 2d con. grad solver initialisation.
1556     C | |
1557     C | |-INI_CG3D :: 3d con. grad solver initialisation.
1558     C | |
1559     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1560     C | :: Includes annotated table of kernel
1561     C | :: parameter settings.
1562     C |
1563     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1564     C |
1565     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
C ! :: Automatically generated by TAMC/TAF.
1567 cnh 1.1 C |
1568     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1569     C |
1570     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1571     C |
1572     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1573     C | |
1574     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1575     C | | | :: variables
1576     C | | |
1577     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1578     C | | |
1579     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1580 cnh 1.4 C | | | :: sphere options are coded.
1581 cnh 1.1 C | | |
1582     C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1583     C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1584     C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1585     C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1586     C | | | :: fields).
1587     C | | |
1588 cnh 1.4 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1589 cnh 1.1 C | | | |-INI_VEL :: Initialize 3D flow field.
1590     C | | | |-INI_THETA :: Set model initial temperature field.
1591     C | | | |-INI_SALT :: Set model initial salinity field.
1592     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1593     C | | |
1594     C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1595     C | | |
1596     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1597     C | | | | :: Here applied to move restart conditions
1598     C | | | | :: (saved in mid timestep) to correct level in
1599     C | | | | :: time (only used for pre-c35).
1600     C | | | |
1601     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1602     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1603     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1604     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1605     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1606     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1607     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1608     C | | | | |-FIND_RHO :: Find adjacent densities.
1609     C | | | | |-CONVECT :: Mix static instability.
1610     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1611     C | | | |
1612     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1613     C | | |
1614     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1615     C | | | | :: Extra time history interactions.
1616     C | | | |
1617     C | | | |-FIND_RHO :: Find adjacent densities.
1618     C | | | |-CONVECT :: Mix static instability.
1619     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1620     C | | |
1621     C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1622     C | | | | :: package data.
1623     C | | | |
1624     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1625     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1626     C | | | |-KPP_OPEN_DIAGS
1627     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1628     C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1629     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1630     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1631     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1632     C | | | |-INI_FORCING :: Set model initial forcing fields.
1633     C | | | | :: Either set in-line or from file as shown.
1634     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1635     C | | | |-READ_FLD_XY_RS(meridWindFile)
1636     C | | | |-READ_FLD_XY_RS(surfQFile)
1637     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1638     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1639     C | | | |-READ_FLD_XY_RS(saltClimFile)
1640     C | | | |-READ_FLD_XY_RS(surfQswFile)
1641     C | | |
1642     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1643     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1644     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1645     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1646     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1647     C | |
1648     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1649     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1650     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1651     C | |
1652     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1653     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1654     C====|>|
1655     C====|>| ****************************
1656     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1657     C====|>| ****************************
1658     C====|>|
1659     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1660     C/\ | | |
C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1662     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1663     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1664     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1665     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1666 cnh 1.4 C/\ | | | | :: Simple interpolation between end-points
1667 cnh 1.1 C/\ | | | | :: for forcing datasets.
1668     C/\ | | | |
1669     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1670     C/\ | | |
1671     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1672     C/\ | | | |
1673     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1674     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1675     C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1676     C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1677     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1678     C/\ | | | |
1679     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1680     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1681     C/\ | | | | :: forcing terms.
1682     C/\ | | | |
1683     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1684     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1685     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1686     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1687     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1688     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1689     C/\ | | | | advection case) (see pkg/gad).
1690     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1691     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1692     C/\ | | | | |
1693     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1694     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1695     C/\ | | | |
1696     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1697     C/\ | | | | |
1698     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1699     C/\ | | | | | :: ( see pkg/gad )
1700     C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1701     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1702     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1703     C/\ | | | |
1704     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1705     C/\ | | | |
1706     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1707     C/\ | | | | |
1708     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1709     C/\ | | | | | :: ( see pkg/gad )
1710     C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1711     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1712     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1713     C/\ | | | |
1714     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1715     C/\ | | | |
1716     C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1717     C/\ | | | | |
1718     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1719     C/\ | | | | | :: ( see pkg/gad )
1720     C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1721     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1722     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1723     C/\ | | | |
1724     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1725     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1726     C/\ | | | |-FREEZE :: Limit range of temperature.
1727     C/\ | | | |
1728     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1729     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1730     C/\ | | | |
C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1732     C/\ | | | |-EXCH :: Update overlaps
1733     C/\ | | |
1734     C/\ | | |-DYNAMICS :: Momentum equations driver.
1735     C/\ | | | |
1736     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1737     C/\ | | | | Potential anomaly.
1738     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1739     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1740     C/\ | | | |
1741     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1742     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1743     C/\ | | | | pkg/mom_fluxform ).
1744     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1745     C/\ | | | | pkg/mom_vecinv ).
1746     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1747     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1748     C/\ | | | |
1749     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1750     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1751     C/\ | | | |
1752     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1754     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1755     C/\ | | |
1756     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1757     C/\ | | |
1758     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1759     C/\ | | |
1760     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1761     C/\ | | |
1762     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1763     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1764     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1765     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1766     C/\ | | |
1767     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1768     C/\ | | | |
1769     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1770     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1771     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1772     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1773     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1774     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1775     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1776     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1777     C/\ | | | | |-CONVECT :: Mix static instability.
1778     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1779     C/\ | | | |
1780     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1781     C/\ | | |
1782     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1783     C/\ | | | |-EXCH
1784     C/\ | | |
1785     C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1786     C/\ | | |
1787     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1788     C/\ | | |
1789     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1790     C/\ | | | |-WRITE_STATE :: Core state I/O
1791     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1792     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1793     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1794     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1795     C/\ | | |
1796     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1797     C/\ | |
1798     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1799     C<===|=|
1800     C<===|=| **************************
1801     C<===|=| END MAIN TIMESTEPPING LOOP
1802     C<===|=| **************************
1803     C<===|=|
1804     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1805     C |
1806     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1807     C |
1808     C |-TIMER_PRINTALL :: Computational timing summary
1809     C |
1810     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1811     C :: events.
1812     C
1813     \end{verbatim}
1814 heimbach 1.8 }
1815 cnh 1.1
1816     \subsection{Measuring and Characterizing Performance}
1817    
1818     TO BE DONE (CNH)
1819    
1820     \subsection{Estimating Resource Requirements}
1821    
1822     TO BE DONE (CNH)
1823    
1824     \subsubsection{Atlantic 1/6 degree example}
1825     \subsubsection{Dry Run testing}
1826     \subsubsection{Adjoint Resource Requirements}
1827     \subsubsection{State Estimation Environment Resources}
1828    
