1 cnh 1.4 % $Header: /u/u0/gcmpack/mitgcmdoc/part4/sarch.tex,v 1.3 2001/10/22 03:28:33 cnh Exp $
2 cnh 1.1
In this chapter we describe the software architecture and
implementation strategy for the MITgcm code. The first part of this
chapter discusses the MITgcm architecture at an abstract level. In the second
part of the chapter we describe practical details of the MITgcm implementation
and of the current tools and operating system features that are employed.
8    
9     \section{Overall architectural goals}
10    
Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

\begin{itemize}
\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to
a wide range of platforms.
\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.
\end{itemize}
23    
These points are summarized in figure \ref{fig:mitgcm_architecture_goals},
which conveys the goals of the MITgcm design. These goals lead to
a software architecture which, at a high level, can be viewed as consisting
of
28    
29     \begin{enumerate}
30     \item A core set of numerical and support code. This is discussed in detail in
31     section \ref{sec:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
33     for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
34     These packages are used both to overlay alternate dynamics and to introduce
35     specialized physical content onto the core numerical code. An overview of
36     the {\bf package} scheme is given at the start of part \ref{part:packages}.
37     \item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
38     Programming Environment Resource), within which the core numerics and pluggable
39     packages operate.
40     \end{enumerate}
41    
42     This chapter focuses on describing the {\bf WRAPPER} environment under which
43     both the core numerics and the pluggable packages function. The description
44 cnh 1.4 presented here is intended to be a detailed exposition and contains significant
45 cnh 1.1 background material, as well as advanced details on working with the WRAPPER.
46     The examples section of this manual (part \ref{part:example}) contains more
47     succinct, step-by-step instructions on running basic numerical
48     experiments both sequentially and in parallel. For many projects simply
49     starting from an example code and adapting it to suit a particular situation
50     will be all that is required.
51    
52 adcroft 1.2
53 cnh 1.1 \begin{figure}
54     \begin{center}
55 adcroft 1.2 \resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
56 cnh 1.1 \end{center}
57 adcroft 1.2 \caption{
58     The MITgcm architecture is designed to allow simulation of a wide
59 cnh 1.1 range of physical problems on a wide range of hardware. The computational
60     resource requirements of the applications targeted range from around
61     $10^7$ bytes ( $\approx 10$ megabytes ) of memory to $10^{11}$ bytes
62     ( $\approx 100$ gigabytes). Arithmetic operation counts for the applications of
63     interest range from $10^{9}$ floating point operations to more than $10^{17}$
64 adcroft 1.2 floating point operations.}
65     \label{fig:mitgcm_architecture_goals}
66 cnh 1.1 \end{figure}
67    
68     \section{WRAPPER}
69    
70     A significant element of the software architecture utilized in
71     MITgcm is a software superstructure and substructure collectively
72     called the WRAPPER (Wrappable Application Parallel Programming
73     Environment Resource). All numerical and support code in MITgcm is written
74     to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
75     the WRAPPER means that coding has to follow certain, relatively
76     straightforward, rules and conventions ( these are discussed further in
77     section \ref{sec:specifying_a_decomposition} ).
78    
The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to insulate code
that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.
83 adcroft 1.2
84    
85 cnh 1.1 \begin{figure}
86     \begin{center}
87 adcroft 1.2 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
88 cnh 1.1 \end{center}
89 adcroft 1.2 \caption{
Numerical code is written to fit within a software support
91 cnh 1.1 infrastructure called WRAPPER. The WRAPPER is portable and
92 cnh 1.4 can be specialized for a wide range of specific target hardware and
93 cnh 1.1 programming environments, without impacting numerical code that fits
94     within the WRAPPER. Codes that fit within the WRAPPER can generally be
95     made to run as fast on a particular platform as codes specially
96 adcroft 1.2 optimized for that platform.}
97     \label{fig:fit_in_wrapper}
98 cnh 1.1 \end{figure}
99    
100     \subsection{Target hardware}
101     \label{sec:target_hardware}
102    
The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
multi-processor, CRAY Y-MP system. On that system, numerical code performance
and scaling under the WRAPPER was in excess of that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
micro-tasking approach. Later developments have been carried out on
uniprocessor and multi-processor Sun systems with both uniform memory access
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also routinely used on
large scale MPP systems (for example T3E systems and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain
native optimizations for a particular system \ref{ref hoe and hill, ecmwf}.
118    
119     \subsection{Supporting hardware neutrality}
120    
121     The different systems listed in section \ref{sec:target_hardware} can be
122     categorized in many different ways. For example, one common distinction is
123     between shared-memory parallel systems (SMP's, PVP's) and distributed memory
124     parallel systems (for example x86 clusters and large MPP systems). This is one
125     example of a difference between compute platforms that can impact an
126     application. Another common distinction is between vector processing systems
127     with highly specialized CPU's and memory subsystems and commodity
128     microprocessor based systems. There are numerous other differences, especially
129     in relation to how parallel execution is supported. To capture the essential
130     differences between different platforms the WRAPPER uses a {\it machine model}.
131    
132     \subsection{WRAPPER machine model}
133    
Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular family or
class of machines (for example Parallel Vector Processor systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.
141    
142     \subsection{Machine model parallelism}
143    
Codes operating under the WRAPPER target an abstract machine that is assumed to
consist of one or more logical processors that can compute concurrently.
Computational work is divided amongst the logical
processors by allocating ``ownership'' to
each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific
region of the physical space that is being simulated; only one processor will
be associated with each such region (domain decomposition).
152    
153     In a strict sense the logical processors over which work is divided do not need
154     to correspond to physical processors. It is perfectly possible to execute a
155     configuration decomposed for multiple logical processors on a single physical
156     processor. This helps ensure that numerical code that is written to fit
157     within the WRAPPER will parallelize with no additional effort and is
158     also useful when debugging codes. Generally, however,
159     the computational domain will be subdivided over multiple logical
160     processors in order to then bind those logical processors to physical
161     processor resources that can compute in parallel.
162    
163     \subsubsection{Tiles}
164    
Computationally, associated with each region of physical
space allocated to a particular logical processor, there will be data
structures (arrays, scalar variables, etc.) that hold the simulated state of
that region. We refer to these data structures as being {\bf owned} by the
processor to which their
associated region of physical space has been allocated. Individual
regions that are allocated to processors are called {\bf tiles}. A
processor can own more
than one tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a single
region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously, working
only with data from the tile (or tiles) that the processor owns. When multiple
tiles are allotted to a single processor, each tile is computed
independently of the other tiles, in a sequential fashion.
180    
181     \begin{figure}
182     \begin{center}
183 adcroft 1.2 \resizebox{5in}{!}{
184     \includegraphics{part4/domain_decomp.eps}
185 cnh 1.1 }
186     \end{center}
187     \caption{ The WRAPPER provides support for one and two dimensional
188     decompositions of grid-point domains. The figure shows a hypothetical domain of
189     total size $N_{x}N_{y}N_{z}$. This hypothetical domain is decomposed in
190     two-dimensions along the $N_{x}$ and $N_{y}$ directions. The resulting {\bf
191     tiles} are {\bf owned} by different processors. The {\bf owning}
192     processors perform the
193     arithmetic operations associated with a {\bf tile}. Although not illustrated
194     here, a single processor can {\bf own} several {\bf tiles}.
195     Whenever a processor wishes to transfer data between tiles or
196     communicate with other processors it calls a WRAPPER supplied
197     function.
198     } \label{fig:domaindecomp}
199     \end{figure}
200    
201    
202     \subsubsection{Tile layout}
203    
204     Tiles consist of an interior region and an overlap region. The overlap region
205     of a tile corresponds to the interior region of an adjacent tile.
206     In figure \ref{fig:tiledworld} each tile would own the region
207     within the black square and hold duplicate information for overlap
208     regions extending into the tiles to the north, south, east and west.
209     During
210     computational phases a processor will reference data in an overlap region
whenever it requires values that lie outside the domain it owns. Periodically
212     processors will make calls to WRAPPER functions to communicate data between
213     tiles, in order to keep the overlap regions up to date (see section
214     \ref{sec:communication_primitives}). The WRAPPER functions can use a
215     variety of different mechanisms to communicate data between tiles.
216    
217     \begin{figure}
218     \begin{center}
219 adcroft 1.2 \resizebox{5in}{!}{
220     \includegraphics{part4/tiled-world.eps}
221 cnh 1.1 }
222     \end{center}
223     \caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
225     Overlap regions are periodically updated from neighboring tiles.
226     } \label{fig:tiledworld}
227     \end{figure}
228    
229     \subsection{Communication mechanisms}
230    
231     Logical processors are assumed to be able to exchange information
232     between tiles and between each other using at least one of two possible
233     mechanisms.
234    
235     \begin{itemize}
236     \item {\bf Shared memory communication}.
237     Under this mode of communication data transfers are assumed to be possible
238     using direct addressing of regions of memory. In this case a CPU is able to read
239     (and write) directly to regions of memory "owned" by another CPU
240     using simple programming language level assignment operations of the
241     the sort shown in figure \ref{fig:simple_assign}. In this way one CPU
242     (CPU1 in the figure) can communicate information to another CPU (CPU2 in the
243     figure) by assigning a particular value to a particular memory location.
244    
245     \item {\bf Distributed memory communication}.
246     Under this mode of communication there is no mechanism, at the application code level,
247     for directly addressing regions of memory owned and visible to another CPU. Instead
248     a communication library must be used as illustrated in figure
\ref{fig:comm_msg}. In this case a CPU must call a function in the API of the
communication library to communicate data from a tile that it owns to a tile
that another CPU owns. By default the WRAPPER binds to the MPI communication
252     library \ref{MPI} for this style of communication.
253     \end{itemize}
254    
255     The WRAPPER assumes that communication will use one of these two styles
256     of communication. The underlying hardware and operating system support
257     for the style used is not specified and can vary from system to system.
258    
259     \begin{figure}
260     \begin{verbatim}
261    
262     CPU1 | CPU2
263     ==== | ====
264     |
265     a(3) = 8 | WHILE ( a(3) .NE. 8 )
266     | WAIT
267     | END WHILE
268     |
269     \end{verbatim}
270     \caption{ In the WRAPPER shared memory communication model, simple writes to an
271     array can be made to be visible to other CPU's at the application code level.
For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
273     element $3$ of array $a$, then other CPU's (for example CPU2 in the figure above)
274     will be able to see the value $8$ when they read from $a(3)$.
275     This provides a very low latency and high bandwidth communication
276     mechanism.
277     } \label{fig:simple_assign}
278     \end{figure}
279    
280     \begin{figure}
281     \begin{verbatim}
282    
283     CPU1 | CPU2
284     ==== | ====
285     |
286     a(3) = 8 | WHILE ( a(3) .NE. 8 )
287     CALL SEND( CPU2,a(3) ) | CALL RECV( CPU1, a(3) )
288     | END WHILE
289     |
290     \end{verbatim}
\caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPU's.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPU's.
297     } \label{fig:comm_msg}
298     \end{figure}
299    
300     \subsection{Shared memory communication}
301     \label{sec:shared_memory_communication}
302    
Under shared memory communication independent CPU's operate
on the exact same global address space at the application level.
This means that CPU 1 can directly write into global
data structures that CPU 2 ``owns'' using a simple
assignment at the application level.
This model of memory access is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (for example the SGI Origin).
On such systems the WRAPPER will generally use simple read and write statements
to directly access application data structures when communicating between CPU's.
313    
In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to
hardware instructions that transport data between CPU and memory banks, this
can be a very efficient mechanism for communication. In this case two CPU's,
CPU1 and CPU2, can communicate simply by reading and writing to an
agreed location and following a few basic rules. The latency of this sort
of communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPU's communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
communication very efficient provided it is used appropriately.
325    
326     \subsubsection{Memory consistency}
327     \label{sec:memory_consistency}
328    
When using shared memory communication between
multiple processors the WRAPPER level shields user applications from
certain counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general the order
of reads and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will always
operate so that, for the application instructions the processor is executing,
any reordering is not apparent. However, machines are often
designed so that reordering of instructions is not hidden from other
processors. This means that, in general, even on a shared memory system two
processors can observe inconsistent memory values.
341    
The issue of memory consistency between multiple processors is discussed at
length in many computer science papers; however, from a practical point of
view, in order to deal with this issue, shared memory machines all provide
some mechanism to enforce memory consistency when it is needed. The exact
mechanism employed will vary between systems. For communication using shared
memory, the WRAPPER provides a place to invoke the appropriate mechanism to
ensure memory consistency for a particular platform.
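
The fragment below sketches, in the style of figure \ref{fig:simple_assign},
where such a consistency-enforcing call would be placed. The routine name
{\em MEMSYNC} is an illustrative stand-in for whatever platform specific
mechanism the WRAPPER invokes and is not prescriptive.
\begin{verbatim}
C     Illustrative sketch only: CPU2 polls a flag written by CPU1.
C     MEMSYNC stands in for the platform specific memory-consistency
C     mechanism invoked before the flag value is re-read.
      INTEGER a(10)
      COMMON /SHARED_FLAGS/ a
   10 CONTINUE
      CALL MEMSYNC
      IF ( a(3) .NE. 8 ) GOTO 10
\end{verbatim}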
349    
350     \subsubsection{Cache effects and false sharing}
351     \label{sec:cache_effects_and_false_sharing}
352    
Shared-memory machines often have processor-local memory caches
354     which contain mirrored copies of main memory. Automatic cache-coherence
355     protocols are used to maintain consistency between caches on different
356     processors. These cache-coherence protocols typically enforce consistency
357     between regions of memory with large granularity (typically 128 or 256 byte
358     chunks). The coherency protocols employed can be expensive relative to other
359     memory accesses and so care is taken in the WRAPPER (by padding synchronization
360     structures appropriately) to avoid unnecessary coherence traffic.
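
As an illustration of the kind of padding involved (the names and the assumed
128 byte cache line below are purely illustrative), a per-thread
synchronization counter might be declared so that each thread's entry
occupies its own cache line:
\begin{verbatim}
C     Illustrative only: pad a shared per-thread counter so that each
C     thread's entry sits on its own (assumed 128 byte) cache line.
C     Thread myThid only ever touches syncCounter(1,myThid); the
C     remaining 15 words exist purely as padding.
      INTEGER MAX_NO_THREADS, CACHE_LINE_WORDS
      PARAMETER ( MAX_NO_THREADS = 32, CACHE_LINE_WORDS = 16 )
      REAL*8 syncCounter( CACHE_LINE_WORDS, MAX_NO_THREADS )
      COMMON /WRAPPER_SYNC/ syncCounter
\end{verbatim}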
361    
\subsubsection{Operating system support for shared memory}
363    
364     Applications running under multiple threads within a single process can
365     use shared memory communication. In this case {\it all} the memory locations
366     in an application are potentially visible to all the compute threads. Multiple
367     threads operating within a single process is the standard mechanism for
368     supporting shared memory that the WRAPPER utilizes. Configuring and launching
369     code to run in multi-threaded mode on specific platforms is discussed in
370     section \ref{sec:running_with_threads}. However, on many systems, potentially
371     very efficient mechanisms for using shared memory communication between
372     multiple processes (in contrast to multiple threads within a single
373     process) also exist. In most cases this works by making a limited region of
374     memory shared between processes. The MMAP \ref{magicgarden} and
375     IPC \ref{magicgarden} facilities in UNIX systems provide this capability as do
376     vendor specific tools like LAPI \ref{IBMLAPI} and IMC \ref{Memorychannel}.
377     Extensions exist for the WRAPPER that allow these mechanisms
378     to be used for shared memory communication. However, these mechanisms are not
379     distributed with the default WRAPPER sources, because of their proprietary
380     nature.
381    
382     \subsection{Distributed memory communication}
383     \label{sec:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
at the system level. For this sort of system the WRAPPER provides support
for communication based on a bespoke communication library
(see figure \ref{fig:comm_msg}). The default communication library used is MPI
\ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform specific communication libraries. For example the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.
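
For concreteness, the message passing case of figure \ref{fig:comm_msg} could
be written with standard MPI calls roughly as sketched below. The ranks, tag
and communicator are illustrative; in practice these details are hidden
behind the WRAPPER communication routines.
\begin{verbatim}
C     Sketch of figure fig:comm_msg using standard MPI calls.
C     Ranks, tag and communicator are illustrative only.
      INCLUDE 'mpif.h'
      REAL*8  a(10)
      INTEGER tag, ierr, status(MPI_STATUS_SIZE)
      tag = 1
C     On CPU1 (MPI rank 0):
      a(3) = 8.
      CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION, 1, tag,
     &               MPI_COMM_WORLD, ierr )
C     On CPU2 (MPI rank 1):
      CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION, 0, tag,
     &               MPI_COMM_WORLD, status, ierr )
\end{verbatim}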
395    
396     \subsection{Communication primitives}
397     \label{sec:communication_primitives}
398    
399     \begin{figure}
400     \begin{center}
401 adcroft 1.2 \resizebox{5in}{!}{
402     \includegraphics{part4/comm-primm.eps}
403 cnh 1.1 }
404     \end{center}
\caption{Three performance critical parallel primitives are provided
by the WRAPPER. These primitives are always used to communicate data
between tiles. The figure shows four tiles. The curved arrows indicate
exchange primitives which transfer data between the overlap regions at tile
edges and interior regions for nearest-neighbor tiles.
The straight arrows symbolize global sum operations which connect all tiles.
The global sum operation provides a key arithmetic primitive and can also
serve as a synchronization primitive. A third primitive, the barrier, is also
provided; it behaves much like the global sum primitive.
414     } \label{fig:communication_primitives}
415     \end{figure}
416    
417    
It is assumed that optimized communication support may be available for a
small number of communication operations, and that communication performance
optimizations can therefore be achieved by optimizing this small set of
communication primitives. Three optimizable primitives are provided by the
WRAPPER; a schematic calling sequence for them is sketched after the
following list.
423     \begin{itemize}
424     \item{\bf EXCHANGE} This operation is used to transfer data between interior
425     and overlap regions of neighboring tiles. A number of different forms of this
426     operation are supported. These different forms handle
427     \begin{itemize}
428     \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
429     separately.
430     \item Bindings to different communication methods.
431     Exchange primitives select between using shared memory or distributed
432     memory communication.
433     \item Transformation operations required when transporting
434     data between different grid regions. Transferring data
435     between faces of a cube-sphere grid, for example, involves a rotation
436     of vector components.
437     \item Forward and reverse mode computations. Derivative calculations require
438     tangent linear and adjoint forms of the exchange primitives.
439    
440     \end{itemize}
441    
442     \item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
443     operation for the pressure inversion phase of the MITgcm algorithm.
444     For certain configurations scaling can be highly sensitive to
445     the performance of the global sum primitive. This operation is a collective
446     operation involving all tiles of the simulated domain. Different forms
447     of the global sum primitive exist for handling
448     \begin{itemize}
449     \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
450     separately.
451     \item Bindings to different communication methods.
Global sum primitives select between using shared memory or distributed
memory communication.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the global sum primitive.
456     \end{itemize}
457    
458     \item{\bf BARRIER} The WRAPPER provides a global synchronization function
459     called barrier. This is used to synchronize computations over all tiles.
460     The {\bf BARRIER} and {\bf GLOBAL SUM} primitives have much in common and in
461     some cases use the same underlying code.
462     \end{itemize}
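
As a schematic illustration of how these primitives appear in numerical code,
the fragment below shows one call of each kind. {\em GLOBAL\_SUM\_R8} also
appears in figure \ref{fig:bibj_extract}; the exchange and barrier routine
names used here are illustrative stand-ins for the WRAPPER supplied forms.
\begin{verbatim}
C     Illustrative calling sequence only; routine names other than
C     GLOBAL_SUM_R8 (see figure fig:bibj_extract) are stand-ins.
C     Update the overlap regions of a tiled field phi
      CALL EXCH_XY_R8( phi, myThid )
C     Form a sum over all tiles of the simulated domain
      CALL GLOBAL_SUM_R8( err, myThid )
C     Synchronize all threads and processes
      CALL BARRIER( myThid )
\end{verbatim}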
463    
464    
465     \subsection{Memory architecture}
466    
The WRAPPER machine model is aimed at efficiently targeting systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.
478    
479    
480     \begin{figure}
481     \begin{center}
482 adcroft 1.2 \resizebox{5in}{!}{
483     \includegraphics{part4/tiling_detail.eps}
484 cnh 1.1 }
485     \end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
to be shaped to suit the underlying system memory architecture.
Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies, while long
tiles with large inner loops can be used to exploit vector systems having
highly pipelined memory systems.
} \label{fig:tiling-strategy}
493     \end{figure}
494    
495     \newpage
496     \subsection{Summary}
Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:
499    
500     \begin{itemize}
501     \item The machine consists of one or more logical processors. \vspace{-3mm}
502     \item Each processor operates on tiles that it owns.\vspace{-3mm}
503     \item A processor may own more than one tile.\vspace{-3mm}
504     \item Processors may compute concurrently.\vspace{-3mm}
505     \item Exchange of information between tiles is handled by the
506     machine (WRAPPER) not by the application.
507     \end{itemize}
508     Behind the scenes this allows the WRAPPER to adapt the machine model
509     functions to exploit hardware on which
510     \begin{itemize}
511     \item Processors may be able to communicate very efficiently with each other
512     using shared memory. \vspace{-3mm}
513     \item An alternative communication mechanism based on a relatively
514     simple inter-process communication API may be required.\vspace{-3mm}
515     \item Shared memory may not necessarily obey sequential consistency,
516     however some mechanism will exist for enforcing memory consistency.
517     \vspace{-3mm}
518     \item Memory consistency that is enforced at the hardware level
519     may be expensive. Unnecessary triggering of consistency protocols
520     should be avoided. \vspace{-3mm}
\item Memory access patterns may need to be either repetitive or highly
pipelined for optimum hardware performance. \vspace{-3mm}
523     \end{itemize}
524    
525     This generic model captures the essential hardware ingredients
526     of almost all successful scientific computer systems designed in the
527     last 50 years.
528    
529     \section{Using the WRAPPER}
530    
531     In order to support maximum portability the WRAPPER is implemented primarily
532     in sequential Fortran 77. At a practical level the key steps provided by the
533     WRAPPER are
534     \begin{enumerate}
535     \item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
537     \item controlling communication between tiles and between concurrently
538     computing CPU's.
539     \end{enumerate}
540     This section describes the details of each of these operations.
Section \ref{sec:specifying_a_decomposition} explains how the decomposition
(or composition) of a domain is expressed. Section
\ref{sec:starting_the_code} describes practical details of running codes
544     in various different parallel modes on contemporary computer systems.
545     Section \ref{sec:controlling_communication} explains the internal information
546     that the WRAPPER uses to control how information is communicated between
547     tiles.
548    
549     \subsection{Specifying a domain decomposition}
550     \label{sec:specifying_a_decomposition}
551    
552     At its heart much of the WRAPPER works only in terms of a collection of tiles
553     which are interconnected to each other. This is also true of application
554     code operating within the WRAPPER. Application code is written as a series
555     of compute operations, each of which operates on a single tile. If
556     application code needs to perform operations involving data
557     associated with another tile, it uses a WRAPPER function to obtain
558     that data.
559     The specification of how a global domain is constructed from tiles or alternatively
560     how a global domain is decomposed into tiles is made in the file {\em SIZE.h}.
561     This file defines the following parameters \\
562    
563     \fbox{
564     \begin{minipage}{4.75in}
565     Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
566     File: {\em model/inc/SIZE.h}
567     \end{minipage}
568     } \\
569    
570     Together these parameters define a tiling decomposition of the style shown in
571     figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
572     the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computation stencil that the numerical code's finite-difference operations
require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
not cause any other problems.
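
To see how the stencil width translates into an overlap requirement, consider
the sketch below (the field names are illustrative). A five-point Laplacian
evaluated over the interior of a tile reads one point into the overlap in
each direction and therefore needs {\em OLx} and {\em OLy} of at least one
between overlap updates; wider stencils need correspondingly larger overlaps.
\begin{verbatim}
C     Illustrative sketch for a single tile. Arrays are dimensioned
C     with the overlap, so the I+/-1 and J+/-1 references below fall
C     in the overlap region when I=1, I=sNx, J=1 or J=sNy.
      REAL*8 phi   (1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      REAL*8 lapPhi(1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      INTEGER I, J
      DO J=1,sNy
       DO I=1,sNx
        lapPhi(I,J) =  phi(I+1,J) + phi(I-1,J)
     &               + phi(I,J+1) + phi(I,J-1)
     &               - 4.*phi(I,J)
       ENDDO
      ENDDO
\end{verbatim}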
583    
584     \begin{figure}
585     \begin{center}
586 adcroft 1.2 \resizebox{5in}{!}{
587     \includegraphics{part4/size_h.eps}
588 cnh 1.1 }
589     \end{center}
590     \caption{ The three level domain decomposition hierarchy employed by the
591     WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
592     to a single process. Multiple processes can exist, each with multiple tiles.
593     Tiles within a process can be spread over multiple compute threads.
594     } \label{fig:labelled_tile}
595     \end{figure}
596    
597     The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
598     be created within a single process. Each of these tiles will have internal
599     dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
600     allocated to different threads of a process that are then bound to
601     different physical processors ( see the multi-threaded
602     execution discussion in section \ref{sec:starting_the_code} ) then
603     computation will be performed concurrently on each tile. However, it is also
604     possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
606     If the decomposition is run in a single process running multiple threads
607     but attached to a single physical processor, then, in general, the computation
608     for different tiles will be interleaved by system level software.
609     This too is a valid mode of operation.
610    
The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
612     numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
613     are used to form the loop ranges for many numerical calculations and
614     to provide dimensions for arrays holding numerical state.
The parameters {\em nSx} and {\em nSy} are used in conjunction with the thread number
616     parameter {\em myThid}. Much of the numerical code operating within the
617     WRAPPER takes the form
618     \begin{verbatim}
619     DO bj=myByLo(myThid),myByHi(myThid)
620     DO bi=myBxLo(myThid),myBxHi(myThid)
621     :
622     a block of computations ranging
623     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
624     :
625     ENDDO
626     ENDDO
627    
628     communication code to sum a number or maybe update
629     tile overlap regions
630    
631     DO bj=myByLo(myThid),myByHi(myThid)
632     DO bi=myBxLo(myThid),myBxHi(myThid)
633     :
634     another block of computations ranging
635     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
636     :
637     ENDDO
638     ENDDO
639     \end{verbatim}
640     The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
641     myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
642     schematic. These variables specify the subset of the tiles in
643     the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
644     thread number {\em myThid} owns. The thread number variable {\em myThid}
645     ranges from 1 to the total number of threads requested at execution time.
646     For each value of {\em myThid} the loop scheme above will step sequentially
647     through the tiles owned by that thread. However, different threads will
648     have different ranges of tiles assigned to them, so that separate threads can
649     compute iterations of the {\em bi}, {\em bj} loop concurrently.
650     Within a {\em bi}, {\em bj} loop
651     computation is performed concurrently over as many processes and threads
652     as there are physical processors available to compute.
653    
The amount of computation that can be embedded in
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example the evaluation of all
the momentum equation prognostic terms ( see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.
668    
669     \begin{figure}
670     \begin{verbatim}
671     REAL*8 cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
672     REAL*8 err
673     :
674     :
675     other computations
676     :
677     :
678     err = 0.
679     DO bj=myByLo(myThid),myByHi(myThid)
680     DO bi=myBxLo(myThid),myBxHi(myThid)
681     DO J=1,sNy
682     DO I=1,sNx
683     err = err +
684     & cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
685     ENDDO
686     ENDDO
687     ENDDO
688     ENDDO
689    
690     CALL GLOBAL_SUM_R8( err , myThid )
691     err = SQRT(err)
692    
693     \end{verbatim}
694     \caption{Example of numerical code for calculating
695     the l2-norm of a vector within the WRAPPER. Notice that
696     under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These rightmost indices are tile indexes. Different
threads within a single process operate on different ranges of tile
699     index, as controlled by the settings of
700     {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
701     } \label{fig:bibj_extract}
702     \end{figure}
703    
704     The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
705     are used to indicate to the WRAPPER level how many processes (each with
706     {\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
707     This information is needed during initialization and during I/O phases.
708     However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy}
709     the values of {\em nPx} and {\em nPy} are absent
710     from the core numerical and support code.
711    
712     \subsubsection{Examples of {\em SIZE.h} specifications}
713    
The following different {\em SIZE.h} parameter settings illustrate how to
interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
and {\em nPy}.
717     \begin{enumerate}
718     \item
719     \begin{verbatim}
720     PARAMETER (
721     & sNx = 90,
722     & sNy = 40,
723     & OLx = 3,
724     & OLy = 3,
725     & nSx = 1,
726     & nSy = 1,
727     & nPx = 1,
728     & nPy = 1)
729     \end{verbatim}
730     This sets up a single tile with x-dimension of ninety grid points, y-dimension of
731     forty grid points, and x and y overlaps of three grid points each.
732     \item
733     \begin{verbatim}
734     PARAMETER (
735     & sNx = 45,
736     & sNy = 20,
737     & OLx = 3,
738     & OLy = 3,
739     & nSx = 1,
740     & nSy = 1,
741     & nPx = 2,
742     & nPy = 2)
743     \end{verbatim}
744     This sets up tiles with x-dimension of forty-five grid points, y-dimension of
745     twenty grid points, and x and y overlaps of three grid points each. There are
746     four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
747     arranged so that the global domain size is again ninety grid points in x and
748     forty grid points in y. In general the formula for global grid size (held in
749     model variables {\em Nx} and {\em Ny}) is
750     \begin{verbatim}
751     Nx = sNx*nSx*nPx
752     Ny = sNy*nSy*nPy
753     \end{verbatim}
754     \item
755     \begin{verbatim}
756     PARAMETER (
757     & sNx = 90,
758     & sNy = 10,
759     & OLx = 3,
760     & OLy = 3,
761     & nSx = 1,
762     & nSy = 2,
763     & nPx = 1,
764     & nPy = 2)
765     \end{verbatim}
This sets up tiles with x-dimension of ninety grid points, y-dimension of
ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively, if the code is invoked with multiple threads per process,
the two domains in y may be computed concurrently.
775     \item
776     \begin{verbatim}
777     PARAMETER (
778     & sNx = 32,
779     & sNy = 32,
780     & OLx = 3,
781     & OLy = 3,
782     & nSx = 6,
783     & nSy = 1,
784     & nPx = 1,
785     & nPy = 1)
786     \end{verbatim}
787     This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
788     thirty-two grid points, and x and y overlaps of three grid points each.
789     There are six tiles allocated to six separate logical processors ({\em nSx=6}).
790     This set of values can be used for a cube sphere calculation.
791     Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly ( see section
\ref{sec:cube_sphere_communication} ) allows the rotations associated with
794     moving between the six cube faces to be embedded within the
795     tile-tile communication code.
796     \end{enumerate}
797    
798    
799     \subsection{Starting the code}
800     \label{sec:starting_the_code}
801     When code is started under the WRAPPER, execution begins in a main routine {\em
802     eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
803     to the application through a routine called {\em THE\_MODEL\_MAIN()}
804     once the WRAPPER has initialized correctly and has created the necessary variables
805     to support subsequent calls to communication routines
806     by the application code. The startup calling sequence followed by the
807     WRAPPER is shown in figure \ref{fig:wrapper_startup}.
808    
809     \begin{figure}
810     \begin{verbatim}
811    
812     MAIN
813     |
814     |--EEBOOT :: WRAPPER initialization
815     | |
| |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
817     | | allow basic I/O.
818     | |-- EEINTRO_MSG :: Write startup greeting.
819     | |
820     | |-- EESET_PARMS :: Set WRAPPER parameters
821     | |
822     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
823     | |
824     | |-- INI_PROCS :: Associate processes with grid regions.
825     | |
826     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
827     | |
828     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
829     | :: communication data structures
830     |
831     |
832     |--CHECK_THREADS :: Validate multiple thread start up.
833     |
834     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
835    
836    
837     \end{verbatim}
838     \caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
840     occurs through the procedure {\em THE\_MODEL\_MAIN()}.
841     } \label{fig:wrapper_startup}
842     \end{figure}
843    
844     \subsubsection{Multi-threaded execution}
845 cnh 1.3 \label{sec:multi-threaded-execution}
846 cnh 1.1 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
847     WRAPPER may cause several coarse grain threads to be initialized. The routine
848     {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
849     stack argument which is the thread number, stored in the
850     variable {\em myThid}. In addition to specifying a decomposition with
851     multiple tiles per process ( see section \ref{sec:specifying_a_decomposition})
852     configuring and starting a code to run using multiple threads requires the following
853     steps.\\
854    
855     \paragraph{Compilation}
856     First the code must be compiled with appropriate multi-threading directives
857     active in the file {\em main.F} and with appropriate compiler flags
858     to request multi-threading support. The header files
859     {\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
860     contain directives compatible with compilers for Sun, Compaq, SGI,
861     Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
862     activated by using compile time
863     directives {\em -DTARGET\_SUN},
864     {\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
865     or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
866     for invoking multi-threaded compilation vary from system to system
867     and from compiler to compiler. The options will be described
868     in the individual compiler documentation. For the Fortran compiler
869     from Sun the following options are needed to correctly compile
870     multi-threaded code
871     \begin{verbatim}
872     -stackvar -explicitpar -vpara -noautopar
873     \end{verbatim}
874     These options are specific to the Sun compiler. Other compilers
875     will use different syntax that will be described in their
876     documentation. The effect of these options is as follows
877     \begin{enumerate}
878     \item {\bf -stackvar} Causes all local variables to be allocated in stack
879     storage. This is necessary for local variables to ensure that they are private
880     to their thread. Note, when using this option it may be necessary to override
881     the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the settings of the command shell's
883     {\em stack-size} limit variable. However, on some systems changing this limit
884     will require privileged administrator access to modify system parameters.
885    
886     \item {\bf -explicitpar} Requests that multiple threads be spawned
887     in response to explicit directives in the application code. These
888     directives are inserted with syntax appropriate to the particular target
889     platform when, for example, the {\em -DTARGET\_SUN} flag is selected.
890    
891     \item {\bf -vpara} This causes the compiler to describe the multi-threaded
892     configuration it is creating. This is not required
but it can be useful when troubleshooting.
894    
895     \item {\bf -noautopar} This inhibits any automatic multi-threaded
896     parallelization the compiler may otherwise generate.
897    
898     \end{enumerate}
899    
900    
901     An example of valid settings for the {\em eedata} file for a
902     domain with two subdomains in y and running with two threads is shown
903     below
904     \begin{verbatim}
905     nTx=1,nTy=2
906     \end{verbatim}
907     This set of values will cause computations to stay within a single
908     thread when moving across the {\em nSx} sub-domains. In the y-direction,
909     however, sub-domains will be split equally between two threads.
910    
911     \paragraph{Multi-threading files and parameters} The following
912     files and variables are used in setting up multi-threaded execution.\\
913    
914     \fbox{
915     \begin{minipage}{4.75in}
916     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
917     File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
918     File: {\em model/src/THE\_MODEL\_MAIN.F}\\
919     File: {\em eesupp/src/MAIN.F}\\
920     File: {\em tools/genmake}\\
921     File: {\em eedata}\\
922     CPP: {\em TARGET\_SUN}\\
923     CPP: {\em TARGET\_DEC}\\
924     CPP: {\em TARGET\_HP }\\
925     CPP: {\em TARGET\_SGI}\\
926     CPP: {\em TARGET\_CRAY\_VECTOR}\\
927     Parameter: {\em nTx}\\
928     Parameter: {\em nTy}
929     \end{minipage}
930     } \\
931    
932     \subsubsection{Multi-process execution}
933 cnh 1.3 \label{sec:multi-process-execution}
934 cnh 1.1
Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
is that many system libraries are still not ``thread-safe''. This means that, for
example, on some systems it is not safe to call system routines to
do I/O when running in multi-threaded mode, except in a limited set of
circumstances. Another reason is that support for multi-threaded programming
models varies between systems.
942    
Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration a decomposition
specification ( see section \ref{sec:specifying_a_decomposition})
is given in which at least one of the
parameters {\em nPx} or {\em nPy} is greater than one.
Then, as for multi-threaded operation,
appropriate compile time and run time steps must be taken.
950    
\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable MPI libraries are available
for controlling the start-up of multiple processes. The MPI libraries
are not required for performance
critical communication, although they are usually used for it as well.
However, in order to simplify the task
of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. This is done
by setting the {\em ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI}
flags in the {\em CPP\_EEOPTIONS.h} file.\\
962    
963     \fbox{
964     \begin{minipage}{4.75in}
965     File: {\em eesupp/inc/CPP\_EEOPTIONS.h}\\
966     CPP: {\em ALLOW\_USE\_MPI}\\
967     CPP: {\em ALWAYS\_USE\_MPI}\\
968     Parameter: {\em nPx}\\
969     Parameter: {\em nPy}
970     \end{minipage}
971     } \\
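
The {\em ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags are CPP symbols,
so an illustrative excerpt of {\em CPP\_EEOPTIONS.h} with MPI enabled might
read as follows (to build without MPI the corresponding lines would use
{\em \#undef} instead):
\begin{verbatim}
#define ALLOW_USE_MPI
#define ALWAYS_USE_MPI
\end{verbatim}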
972    
Additionally, compile time options are required to link in the
MPI libraries and header files. Examples of these options
can be found in the {\em genmake} script that creates makefiles
for compilation. When this script is executed with the {\bf -mpi}
flag it will generate a makefile that includes
paths to search for MPI header files and for linking in
MPI libraries. For example the {\bf -mpi} flag on a
Silicon Graphics IRIX system causes a
Makefile with the compilation command
\begin{verbatim}
983     mpif77 -I/usr/local/mpi/include -DALLOW_USE_MPI -DALWAYS_USE_MPI
984     \end{verbatim}
985     to be generated.
986     This is the correct set of options for using the MPICH open-source
987     version of MPI, when it has been installed under the subdirectory
988     /usr/local/mpi.
989     However, on many systems there may be several
990     versions of MPI installed. For example many systems have both
991     the open source MPICH set of libraries and a vendor specific native form
992     of the MPI libraries. The correct setup to use will depend on the
993     local configuration of your system.\\
994    
995     \fbox{
996     \begin{minipage}{4.75in}
997     File: {\em tools/genmake}
998     \end{minipage}
999     } \\
\paragraph{Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.
For the free, open-source MPICH system the MITgcm program is started
using a command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of processes
that will be created. The numeric value {\em 64} must be equal to the
product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The argument {\em mf} specifies that a text file
called ``mf'' will be read to get a list of processor names on
which the sixty-four processes will execute. The syntax of this file
is specified by the MPI distribution.
1016     \\
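
For MPICH the machine file is typically just a plain list of host names, one
per line, for example (the host names are of course illustrative):
\begin{verbatim}
node001
node002
node003
node004
\end{verbatim}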
1017    
1018     \fbox{
1019     \begin{minipage}{4.75in}
1020     File: {\em SIZE.h}\\
1021     Parameter: {\em nPx}\\
1022     Parameter: {\em nPy}
1023     \end{minipage}
1024     } \\
1025    
1026 adcroft 1.2
\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.
1034    
1035     \paragraph{Runtime input parameters}
1036     Finally the file {\em eedata} needs to be configured to indicate
1037     the number of threads to be used in the x and y directions.
1038     The variables {\em nTx} and {\em nTy} in this file are used to
1039     specify the information required. The product of {\em nTx} and
1040     {\em nTy} must be equal to the number of threads spawned i.e.
1041     the setting of the environment variable PARALLEL.
1042     The value of {\em nTx} must subdivide the number of sub-domains
1043     in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
1044     number of sub-domains in y ({\em nSy}) exactly.
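
As a worked example, consider the third {\em SIZE.h} example of section
\ref{sec:specifying_a_decomposition} ({\em nSx=1}, {\em nSy=2}). Running it
with two threads per process means setting the threading environment variable
(PARALLEL, on machines that use that name) to 2 and placing
\begin{verbatim}
 nTx=1,nTy=2
\end{verbatim}
in the {\em eedata} file: the product {\em nTx}$\times${\em nTy}$=2$ matches
the number of threads spawned, {\em nTx=1} divides {\em nSx=1} exactly, and
{\em nTy=2} divides {\em nSy=2} exactly.
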
1045 cnh 1.1 The multiprocess startup of the MITgcm executable {\em mitgcmuv}
1046     is controlled by the routines {\em EEBOOT\_MINIMAL()} and
1047     {\em INI\_PROCS()}. The first routine performs basic steps required
1048     to make sure each process is started and has a textual output
1049     stream associated with it. By default two output files are opened
1050     for each process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.
The {\bf NNNN} part of the name is filled in with the process
1052     number so that process number 0 will create output files
1053     {\bf STDOUT.0000} and {\bf STDERR.0000}, process number 1 will create
1054     output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
1055     are used for reporting status and configuration information and
1056     for reporting error conditions on a process by process basis.
1057 adcroft 1.2 The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
1058 cnh 1.1 {\em myProcId} and {\em MPI\_COMM\_MODEL}.
These variables are related
to processor identification and are used later in the routine
1061     {\em INI\_PROCS()} to allocate tiles to processes.
1062    
Allocation of processes to tiles is controlled by the routine
1064     {\em INI\_PROCS()}. For each process this routine sets
1065     the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
1066     These variables specify (in index space) the coordinate
of the southernmost and westernmost corner of the
southernmost and westernmost tile owned by this process.
1069     The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
1070     are also set in this routine. These are used to identify
1071     processes holding tiles to the west, east, south and north
1072     of this process. These values are stored in global storage
1073     in the header file {\em EESUPPORT.h} for use by
1074     communication routines.
1075     \\
1076    
1077     \fbox{
1078     \begin{minipage}{4.75in}
1079     File: {\em eesupp/src/eeboot\_minimal.F}\\
1080     File: {\em eesupp/src/ini\_procs.F} \\
1081     File: {\em eesupp/inc/EESUPPORT.h} \\
1082     Parameter: {\em myProcId} \\
1083     Parameter: {\em MPI\_COMM\_MODEL} \\
1084     Parameter: {\em myXGlobalLo} \\
1085     Parameter: {\em myYGlobalLo} \\
1086     Parameter: {\em pidW } \\
1087     Parameter: {\em pidE } \\
1088     Parameter: {\em pidS } \\
1089     Parameter: {\em pidN }
1090     \end{minipage}
1091     } \\
1092    
1093    
1094     \subsection{Controlling communication}
1095     The WRAPPER maintains internal information that is used for communication
1096     operations and that can be customized for different platforms. This section
1097     describes the information that is held and used.
1098 adcroft 1.2
1099 cnh 1.1 \begin{enumerate}
1100     \item {\bf Tile-tile connectivity information} For each tile the WRAPPER
records the tile numbers of the tiles to the north, south, east and
west of that tile. Each tile number is unique over all tiles in a
configuration. The numbers are held in the variables {\em tileNo}
( this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
1105     {\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
1106     that specifies the type of communication that is used between tiles.
1107     This information is held in the variables {\em tileCommModeN},
1108     {\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.
1109     This latter set of variables can take one of the following values
1110     {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
1111     A value of {\em COMM\_NONE} is used to indicate that a tile has no
1112 cnh 1.4 neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
1114     memory communication is required to communicate between
1115     these tile faces ( see section \ref{sec:distributed_memory_communication}).
1116     A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
1117     forms of shared memory communication ( see section
1118     \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value indicates
1119     that a CPU should communicate by writing to data structures owned by another
1120     CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
1121     from data structures owned by another CPU. These flags affect the behavior
1122     of the WRAPPER exchange primitive
1123     (see figure \ref{fig:communication_primitives}). The routine
1124     {\em ini\_communication\_patterns()} is responsible for setting the
1125     communication mode values for each tile.
1126     \\
1127    
1128     \fbox{
1129     \begin{minipage}{4.75in}
1130     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1131     File: {\em eesupp/inc/EESUPPORT.h} \\
1132     Parameter: {\em tileNo} \\
1133     Parameter: {\em tileNoE} \\
1134     Parameter: {\em tileNoW} \\
1135     Parameter: {\em tileNoN} \\
1136     Parameter: {\em tileNoS} \\
1137     Parameter: {\em tileCommModeE} \\
1138     Parameter: {\em tileCommModeW} \\
1139     Parameter: {\em tileCommModeN} \\
1140     Parameter: {\em tileCommModeS} \\
1141     \end{minipage}
1142     } \\
1143    
1144     \item {\bf MP directives}
1145     The WRAPPER transfers control to numerical application code through
1146     the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
that allows it to be invoked by several threads. Support for this
1148     is based on using multi-processing (MP) compiler directives.
1149     Most commercially available Fortran compilers support the generation
1150     of code to spawn multiple threads through some form of compiler directives.
1151     As this is generally much more convenient than writing code to interface
1152     to operating system libraries to explicitly spawn threads, and on some systems
this may be the only method available, the WRAPPER is distributed with
1154     template MP directives for a number of systems.
1155    
1156     These directives are inserted into the code just before and after the
1157     transfer of control to numerical algorithm code through the routine
1158     {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows an example of
1159     the code that performs this process for a Silicon Graphics system.
1160     This code is extracted from the files {\em main.F} and
1161     {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads} specifies
1162     how many instances of the routine {\em THE\_MODEL\_MAIN} will
1163     be created. The value of {\em nThreads} is set in the routine
{\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to
the product of the parameters {\em nTx} and {\em nTy} that
1166     are read from the file {\em eedata}. If the value of {\em nThreads}
1167     is inconsistent with the number of threads requested from the
1168     operating system (for example by using an environment
1169 cnh 1.4 variable as described in section \ref{sec:multi_threaded_execution})
1170 cnh 1.1 then usually an error will be reported by the routine
1171     {\em CHECK\_THREADS}.\\
1172    
1173     \fbox{
1174     \begin{minipage}{4.75in}
1175     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1176     File: {\em eesupp/src/check\_threads.F} \\
1177     File: {\em eesupp/src/main.F} \\
1178     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1179     File: {\em eedata } \\
1180     Parameter: {\em nThreads} \\
1181     Parameter: {\em nTx} \\
1182     Parameter: {\em nTy} \\
1183     \end{minipage}
1184     }
1185    
1186     \item {\bf memsync flags}
1187     As discussed in section \ref{sec:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
1189     The routine {\em MEMSYNC()} is used for this purpose. This routine should
1190     not need modifying and the information below is only provided for
1191     completeness. A logical parameter {\em exchNeedsMemSync} set
1192     in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether
1193     the {\em MEMSYNC()} primitive is called. In general this
1194     routine is only used for multi-threaded execution.
1195     The code that goes into the {\em MEMSYNC()}
1196     routine is specific to the compiler and
1197     processor being used for multi-threaded execution and in general
1198     must be written using a short code snippet of assembly language.
For an UltraSPARC system the following code snippet is used
1200     \begin{verbatim}
1201     asm("membar #LoadStore|#StoreStore");
1202     \end{verbatim}
for an Alpha-based system the equivalent code reads
1204 cnh 1.1 \begin{verbatim}
1205     asm("mb");
1206     \end{verbatim}
1207     while on an x86 system the following code is required
1208     \begin{verbatim}
1209     asm("lock; addl $0,0(%%esp)": : :"memory")
1210     \end{verbatim}
1211    
1212     \item {\bf Cache line size}
1213     As discussed in section \ref{sec:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
coherence traffic on an SMP system. To do this the shared memory data structures
1216     used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
1217     are padded. The variables that control the padding are set in the
1218     header file {\em EEPARAMS.h}. These variables are called
1219     {\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
1220     {\em lShare8}. The default values should not normally need changing.
1221     \item {\bf \_BARRIER}
1222     This is a CPP macro that is expanded to a call to a routine
1223     which synchronises all the logical processors running under the
1224     WRAPPER. Using a macro here preserves flexibility to insert
1225     a specialized call in-line into application code. By default this
1226     resolves to calling the procedure {\em BARRIER()}. The default
1227     setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1228    
1229     \item {\bf \_GSUM}
1230     This is a CPP macro that is expanded to a call to a routine
which sums up a floating point number
1232     over all the logical processors running under the
1233     WRAPPER. Using a macro here provides extra flexibility to insert
1234     a specialized call in-line into application code. By default this
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
64-bit floating point operands)
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1238     setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
1239     The \_GSUM macro is a performance critical operation, especially for
1240     large processor count, small tile size configurations.
1241     The custom communication example discussed in section \ref{sec:jam_example}
1242     shows how the macro is used to invoke a custom global sum routine
1243     for a specific set of hardware.
1244    
1245     \item {\bf \_EXCH}
1246     The \_EXCH CPP macro is used to update tile overlap regions.
1247     It is qualified by a suffix indicating whether overlap updates are for
two-dimensional ( \_EXCH\_XY ) or three-dimensional ( \_EXCH\_XYZ )
1249     physical fields and whether fields are 32-bit floating point
1250     ( \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point
1251     ( \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined
1252     in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
1253     \_EXCH operation plays a crucial role in scaling to small tile,
1254     large logical and physical processor count configurations.
1255     The example in section \ref{sec:jam_example} discusses defining an
optimised and specialized form of the \_EXCH operation.
1257    
1258     The \_EXCH operation is also central to supporting grids such as
1259     the cube-sphere grid. In this class of grid a rotation may be required
1260     between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
1262     be embedded within a custom form of the \_EXCH primitive.
1263    
1264     \item {\bf Reverse Mode}
1265     The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
These reverse mode forms can be found in the
source code directory {\em pkg/autodiff}.
1269     For the global sum primitive the reverse mode form
1270     calls are to {\em GLOBAL\_ADSUM\_R4} and
1271     {\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the
1272     exchamge primitives are found in routines
1273     prefixed {\em ADEXCH}. The exchange routines make calls to
1274     the same low-level communication primitives as the forward mode
1275     operations. However, the routine argument {\em simulationMode}
1276     is set to the value {\em REVERSE\_SIMULATION}. This signifies
1277     ti the low-level routines that the adjoint forms of the
1278     appropriate communication operation should be performed.
1279     \item {\bf MAX\_NO\_THREADS}
1280     The variable {\em MAX\_NO\_THREADS} is used to indicate the
1281     maximum number of OS threads that a code will use. This
1282     value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
1283     For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference it, except in the following scenario.
1286    
For certain physical parameterisation schemes it is necessary to have
a substantial number of work arrays. Where these arrays are allocated
in static storage ( for example COMMON blocks ) multi-threaded
1290     execution will require multiple instances of the COMMON block data.
This can be achieved using a Fortran 90 module construct. However,
if this is unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
1294     and {\em nSy} ( as described in section
1295     \ref{sec:specifying_a_decomposition}). However, if the configuration
1296     being specified involves many more tiles than OS threads then
1297     it can save memory resources to reduce the variable
1298     {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
1299     will be used and to declare the physical parameterisation
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
1301     An example of this is given in the verification experiment
1302     {\em aim.5l\_cs}. Here the default setting of
1303     {\em MAX\_NO\_THREADS} is altered to
1304     \begin{verbatim}
1305     INTEGER MAX_NO_THREADS
1306     PARAMETER ( MAX_NO_THREADS = 6 )
1307     \end{verbatim}
1308     and several work arrays for storing intermediate calculations are
1309     created with declarations of the form.
1310     \begin{verbatim}
1311     common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1312     \end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent rather than temporary storage of state information.
1315     In the case of permanent state information this approach cannot be used
1316     because there has to be enough storage allocated for all tiles.
1317     However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterisations.
1319 adcroft 1.2 \end{enumerate}
1320    
1321     \begin{figure}
1322     \begin{verbatim}
1323     C--
1324     C-- Parallel directives for MIPS Pro Fortran compiler
1325     C--
1326     C Parallel compiler directives for SGI with IRIX
1327     C$PAR PARALLEL DO
1328     C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1329     C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1330     C
1331     DO I=1,nThreads
1332     myThid = I
1333    
1334     C-- Invoke nThreads instances of the numerical model
1335     CALL THE_MODEL_MAIN(myThid)
1336    
1337     ENDDO
1338     \end{verbatim}
1339     \caption{Prior to transferring control to
1340     the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
1341     MP directives to spawn multiple threads.
1342     } \label{fig:mp_directives}
1343     \end{figure}
1344 cnh 1.1
1345    
1346     \subsubsection{Specializing the Communication Code}
1347    
1348     The isolation of performance critical communication primitives and the
1349     sub-division of the simulation domain into tiles is a powerful tool.
1350     Here we show how it can be used to improve application performance and
1351     how it can be used to adapt to new gridding approaches.
1352    
1353     \subsubsection{JAM example}
1354     \label{sec:jam_example}
1355     On some platforms a big performance boost can be obtained by
1356     binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example the
1358     shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
1359     is used as an illustration of a specialized communication configuration
1360     that substitutes for standard, portable forms of {\em \_EXCH} and
1361     {\em \_GSUM}. It affects three source files {\em eeboot.F},
1362     {\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
it has the following effects.
1364     \begin{itemize}
1365     \item An extra phase is included at boot time to initialize the custom
1366     communications library ( see {\em ini\_jam.F}).
1367     \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
1370 cnh 1.4 for overlap regions of width one) is substituted into the elliptic
1371 cnh 1.1 solver routine {\em cg2d.F}.
1372     \end{itemize}
1373     Developing specialized code for other libraries follows a similar
1374     pattern.
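
As a concrete but hypothetical illustration of the re-binding step, the
macro definitions in {\em CPP\_EEMACROS.h} might be switched along the
following lines. The routine names below are placeholders modeled on
the {\em gsum\_jam.F} and {\em exch\_jam.F} routines mentioned above;
they are not the exact definitions used in the distributed code.
\begin{verbatim}
#ifdef LETS_MAKE_JAM
C     Bind the global sum and exchange macros to the JAM library forms.
#define _GSUM(a,b)       CALL GSUM_JAM(a,b)
#define _EXCH_XY_R8(a,b) CALL EXCH_JAM_XY_R8(a,b)
#endif
\end{verbatim}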
1375    
1376     \subsubsection{Cube sphere communication}
1377     \label{sec:cube_sphere_communication}
1378     Actual {\em \_EXCH} routine code is generated automatically from
1379     a series of template files, for example {\em exch\_rx.template}.
1380     This is done to allow a large number of variations on the exchange
1381     process to be maintained. One set of variations supports the
1382 cnh 1.4 cube sphere grid. Support for a cube sphere grid in MITgcm is based
1383 cnh 1.1 on having each face of the cube as a separate tile (or tiles).
1384 cnh 1.4 The exchange routines are then able to absorb much of the
1385 cnh 1.1 detailed rotation and reorientation required when moving around the
1386     cube grid. The set of {\em \_EXCH} routines that contain the
1387     word cube in their name perform these transformations.
1388     They are invoked when the run-time logical parameter
1389     {\em useCubedSphereExchange} is set true. To facilitate the
1390     transformations on a staggered C-grid, exchange operations are defined
separately for vector and scalar quantities and for
grid-centered, grid-face and grid-corner quantities.
1393     Three sets of exchange routines are defined. Routines
1394     with names of the form {\em exch\_rx} are used to exchange
1395     cell centered scalar quantities. Routines with names of the form
1396     {\em exch\_uv\_rx} are used to exchange vector quantities located at
1397     the C-grid velocity points. The vector quantities exchanged by the
1398     {\em exch\_uv\_rx} routines can either be signed (for example velocity
1399     components) or un-signed (for example grid-cell separations).
1400     Routines with names of the form {\em exch\_z\_rx} are used to exchange
1401     quantities at the C-grid vorticity point locations.
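
The sketch below illustrates the distinction for a hypothetical piece of
model code. The argument lists are indicative only; the authoritative
interfaces are the generated routines themselves and the calls made from
the model source.
\begin{verbatim}
C     Cell centered scalar field (e.g. potential temperature):
C     a plain scalar exchange fills the overlap regions.
      _EXCH_XYZ_R8( theta, myThid )
C     C-grid vector pair (e.g. horizontal velocity): the paired
C     exchange applies the cube-face rotations and, because the
C     third argument flags the field as signed, the associated
C     sign changes.
      CALL EXCH_UV_XYZ_RL( uVel, vVel, .TRUE., myThid )
\end{verbatim}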
1402    
1403    
1404    
1405    
1406     \section{MITgcm execution under WRAPPER}
1407    
1408     Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sec:calling_sequence}.
1411    
1412     \subsection{Annotated call tree for MITgcm and WRAPPER}
1413     \label{sec:calling_sequence}
1414    
1415     WRAPPER layer.
1416    
1417     \begin{verbatim}
1418    
1419     MAIN
1420     |
1421     |--EEBOOT :: WRAPPER initialization
1422     | |
| |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1424     | | allow basic I/O.
1425     | |-- EEINTRO_MSG :: Write startup greeting.
1426     | |
1427     | |-- EESET_PARMS :: Set WRAPPER parameters
1428     | |
1429     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1430     | |
1431     | |-- INI_PROCS :: Associate processes with grid regions.
1432     | |
1433     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1434     | |
1435     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1436     | :: communication data structures
1437     |
1438     |
1439     |--CHECK_THREADS :: Validate multiple thread start up.
1440     |
1441     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1442    
1443     \end{verbatim}
1444    
1445     Core equations plus packages.
1446    
1447     \begin{verbatim}
1448     C
1449     C
1450     C Invocation from WRAPPER level...
1451     C :
1452     C :
1453     C |
1454     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1455     C | :: Called from WRAPPER level numerical
1456 cnh 1.4 C | :: code invocation routine. On entry
C | :: to THE_MODEL_MAIN separate threads and
1458     C | :: separate processes will have been established.
1459     C | :: Each thread and process will have a unique ID
1460     C | :: but as yet it will not be associated with a
1461     C | :: specific region in decomposed discrete space.
1462     C |
1463     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1464     C | | :: grid, solver matrices etc..
1465     C | |
1466     C | |-INI_PARMS :: Routine to set kernel model parameters.
1467     C | | :: By default kernel parameters are read from file
1468     C | | :: "data" in directory in which code executes.
1469     C | |
1470 cnh 1.4 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1471 cnh 1.1 C | |
1472 cnh 1.4 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1473 cnh 1.1 C | | | :: Grid arrays are held and described in GRID.h.
1474     C | | |
1475 cnh 1.4 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1476 cnh 1.1 C | | |
1477 cnh 1.4 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1478 cnh 1.1 C | | | :: (calculate grid from kernel parameters).
1479     C | | |
1480     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1481 cnh 1.4 C | | | :: initialization (calculate grid from
1482 cnh 1.1 C | | | :: kernel parameters).
1483     C | | |
1484     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1485 cnh 1.4 C | | :: grid initializations. ( input from raw
1486 cnh 1.1 C | | :: grid files, LONC.bin, DXF.bin etc... )
1487     C | |
C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1489     C | |
1490     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1491     C | | :: land masking for solid-fluid boundaries.
1492     C | |
1493     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1494     C | |
1495     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1496 cnh 1.4 C | | :: sphere options are coded.
1497 cnh 1.1 C | |
C | |-PACKAGES_BOOT :: Start up the optional package environment.
1499     C | | :: Runtime selection of active packages.
1500     C | |
1501     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1502     C | | |
1503     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1504     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1505     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1506     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1508     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1509     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1510     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1511     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1512     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1513     C | |
1514     C | |-PACKAGES_CHECK
1515     C | | |
1516     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1517 cnh 1.4 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1518 cnh 1.1 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1519     C | |
1520     C | |-PACKAGES_INIT_FIXED
1521     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1522     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1523     C | |
1524     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1525     C | |
1526     C | |-INI_CG2D :: 2d con. grad solver initialisation.
1527     C | |
1528     C | |-INI_CG3D :: 3d con. grad solver initialisation.
1529     C | |
1530     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1531     C | :: Includes annotated table of kernel
1532     C | :: parameter settings.
1533     C |
1534     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1535     C |
1536     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
C | :: Automatically generated by TAMC/TAF.
1538 cnh 1.1 C |
1539     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1540     C |
1541     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1542     C |
1543     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1544     C | |
1545     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1546     C | | | :: variables
1547     C | | |
1548     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1549     C | | |
1550     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1551 cnh 1.4 C | | | :: sphere options are coded.
1552 cnh 1.1 C | | |
1553     C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1554     C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1555     C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1556     C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1557     C | | | :: fields).
1558     C | | |
1559 cnh 1.4 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1560 cnh 1.1 C | | | |-INI_VEL :: Initialize 3D flow field.
1561     C | | | |-INI_THETA :: Set model initial temperature field.
1562     C | | | |-INI_SALT :: Set model initial salinity field.
1563     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1564     C | | |
1565     C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1566     C | | |
1567     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1568     C | | | | :: Here applied to move restart conditions
1569     C | | | | :: (saved in mid timestep) to correct level in
1570     C | | | | :: time (only used for pre-c35).
1571     C | | | |
1572     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1573     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1574     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1575     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1576     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1577     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1578     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1579     C | | | | |-FIND_RHO :: Find adjacent densities.
1580     C | | | | |-CONVECT :: Mix static instability.
1581     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1582     C | | | |
1583     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1584     C | | |
1585     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1586     C | | | | :: Extra time history interactions.
1587     C | | | |
1588     C | | | |-FIND_RHO :: Find adjacent densities.
1589     C | | | |-CONVECT :: Mix static instability.
1590     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1591     C | | |
1592     C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1593     C | | | | :: package data.
1594     C | | | |
1595     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1596     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1597     C | | | |-KPP_OPEN_DIAGS
1598     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1599     C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1600     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1601     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1602     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1603     C | | | |-INI_FORCING :: Set model initial forcing fields.
1604     C | | | | :: Either set in-line or from file as shown.
1605     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1606     C | | | |-READ_FLD_XY_RS(meridWindFile)
1607     C | | | |-READ_FLD_XY_RS(surfQFile)
1608     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1609     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1610     C | | | |-READ_FLD_XY_RS(saltClimFile)
1611     C | | | |-READ_FLD_XY_RS(surfQswFile)
1612     C | | |
1613     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1614     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1615     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1616     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1617     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1618     C | |
1619     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1620     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1621     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1622     C | |
1623     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1624     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1625     C====|>|
1626     C====|>| ****************************
1627     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1628     C====|>| ****************************
1629     C====|>|
1630     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1631     C/\ | | |
C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1633     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1634     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1635     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1636     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1637 cnh 1.4 C/\ | | | | :: Simple interpolation between end-points
1638 cnh 1.1 C/\ | | | | :: for forcing datasets.
1639     C/\ | | | |
1640     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1641     C/\ | | |
1642     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1643     C/\ | | | |
1644     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1645     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1646     C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1647     C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1648     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1649     C/\ | | | |
1650     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1651     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1652     C/\ | | | | :: forcing terms.
1653     C/\ | | | |
1654     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1655     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1656     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1657     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1658     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1659     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1660     C/\ | | | | advection case) (see pkg/gad).
1661     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1662     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1663     C/\ | | | | |
1664     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1665     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1666     C/\ | | | |
1667     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1668     C/\ | | | | |
1669     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1670     C/\ | | | | | :: ( see pkg/gad )
1671     C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1672     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1673     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1674     C/\ | | | |
1675     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1676     C/\ | | | |
1677     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1678     C/\ | | | | |
1679     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1680     C/\ | | | | | :: ( see pkg/gad )
1681     C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1682     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1683     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1684     C/\ | | | |
1685     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1686     C/\ | | | |
1687     C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1688     C/\ | | | | |
1689     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1690     C/\ | | | | | :: ( see pkg/gad )
1691     C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1692     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1693     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1694     C/\ | | | |
1695     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1696     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1697     C/\ | | | |-FREEZE :: Limit range of temperature.
1698     C/\ | | | |
1699     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1700     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1701     C/\ | | | |
C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1703     C/\ | | | |-EXCH :: Update overlaps
1704     C/\ | | |
1705     C/\ | | |-DYNAMICS :: Momentum equations driver.
1706     C/\ | | | |
1707     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1708     C/\ | | | | Potential anomaly.
1709     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1710     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1711     C/\ | | | |
1712     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1713     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1714     C/\ | | | | pkg/mom_fluxform ).
1715     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1716     C/\ | | | | pkg/mom_vecinv ).
1717     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1718     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1719     C/\ | | | |
1720     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1721     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1722     C/\ | | | |
1723     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1725     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1726     C/\ | | |
1727     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1728     C/\ | | |
1729     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1730     C/\ | | |
1731     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1732     C/\ | | |
1733     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1734     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1735     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1736     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1737     C/\ | | |
1738     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1739     C/\ | | | |
1740     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1741     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1742     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1743     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1744     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1745     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1746     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1747     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1748     C/\ | | | | |-CONVECT :: Mix static instability.
1749     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1750     C/\ | | | |
1751     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1752     C/\ | | |
1753     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1754     C/\ | | | |-EXCH
1755     C/\ | | |
1756     C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1757     C/\ | | |
1758     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1759     C/\ | | |
1760     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1761     C/\ | | | |-WRITE_STATE :: Core state I/O
1762     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1763     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1764     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1765     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1766     C/\ | | |
1767     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1768     C/\ | |
1769     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1770     C<===|=|
1771     C<===|=| **************************
1772     C<===|=| END MAIN TIMESTEPPING LOOP
1773     C<===|=| **************************
1774     C<===|=|
1775     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1776     C |
1777     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1778     C |
1779     C |-TIMER_PRINTALL :: Computational timing summary
1780     C |
1781     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1782     C :: events.
1783     C
1784     \end{verbatim}
1785    
1786     \subsection{Measuring and Characterizing Performance}
1787    
1788     TO BE DONE (CNH)
1789    
1790     \subsection{Estimating Resource Requirements}
1791    
1792     TO BE DONE (CNH)
1793    
1794     \subsubsection{Atlantic 1/6 degree example}
1795     \subsubsection{Dry Run testing}
1796     \subsubsection{Adjoint Resource Requirements}
1797     \subsubsection{State Estimation Environment Resources}
1798    
