In this chapter we describe the software architecture and
implementation strategy for the MITgcm code. The first part of this
chapter discusses the MITgcm architecture at an abstract level. In the second
part of the chapter we describe practical details of the MITgcm implementation
and of current tools and operating system features that are employed.

\section{Overall architectural goals}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

\begin{itemize}

\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.

\item We wish the model code to be readily targeted to
a wide range of platforms.

\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.

\end{itemize}

These points are summarized in figure \ref{fig:mitgcm_architecture_goals},
which conveys the goals of the MITgcm design. The goals lead to
a software architecture which at the high level can be viewed as consisting
of

\begin{enumerate}

\item A core set of numerical and support code. This is discussed in detail in
section \ref{sec:partII}.

\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing,
for example, mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
specialized physical content onto the core numerical code. An overview of
the {\bf package} scheme is given at the start of part \ref{part:packages}.

\item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
Programming Environment Resource), within which the core numerics and pluggable
packages operate.

\end{enumerate}

This chapter focuses on describing the {\bf WRAPPER} environment under which
both the core numerics and the pluggable packages function. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The examples section of this manual (part \ref{part:example}) contains more
succinct, step-by-step instructions on running basic numerical
experiments both sequentially and in parallel. For many projects simply
starting from an example code and adapting it to suit a particular situation
will be all that is required.

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{
\includegraphics*[1.5in,2.4in][9.5in,6.3in]{part4/mitgcm_goals.eps}
}
\end{center}
\caption{The MITgcm architecture is designed to allow simulation of a wide
range of physical problems on a wide range of hardware. The computational
resource requirements of the applications targeted range from around
$10^7$ bytes ($\approx 10$ megabytes) of memory to $10^{11}$ bytes
($\approx 100$ gigabytes). Arithmetic operation counts for the applications of
interest range from $10^{9}$ floating point operations to more than $10^{17}$
floating point operations.} \label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}

A significant element of the software architecture utilized in
MITgcm is a software superstructure and substructure collectively
called the WRAPPER (Wrappable Application Parallel Programming
Environment Resource). All numerical and support code in MITgcm is written
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sec:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to insulate code
that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.
\begin{figure}
\begin{center}
\resizebox{6in}{4.5in}{
\includegraphics*[0.6in,0.7in][9.0in,8.5in]{part4/fit_in_wrapper.eps}
}
\end{center}
\caption{ Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.
} \label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sec:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
multi-processor, CRAY Y-MP system. On that system, numerical code performance
and scaling under the WRAPPER exceeded that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
micro-tasking approach. Later developments have been carried out on
uniprocessor and multi-processor Sun systems with both uniform memory access
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also used routinely on
large scale MPP systems (for example, T3E and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain
native optimizations for a particular system \ref{ref hoe and hill, ecmwf}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sec:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMP's, PVP's) and distributed memory
parallel systems (for example x86 clusters and large MPP systems). This is one
example of a difference between compute platforms that can impact an
application. Another common distinction is between vector processing systems
with highly specialized CPU's and memory subsystems and commodity
microprocessor based systems. There are numerous other differences, especially
in relation to how parallel execution is supported. To capture the essential
differences between different platforms the WRAPPER uses a {\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular family or
class of machines (for example Parallel Vector Processor systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.

\subsection{Machine model parallelism}

Codes operating under the WRAPPER target an abstract machine that is assumed to
consist of one or more logical processors that can compute concurrently.
Computational work is divided amongst the logical
processors by allocating ``ownership'' to
each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific
region of the physical space that is being simulated; only one processor will
be associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do not need
to correspond to physical processors. It is perfectly possible to execute a
configuration decomposed for multiple logical processors on a single physical
processor. This helps ensure that numerical code that is written to fit
within the WRAPPER will parallelize with no additional effort and is
also useful when debugging codes. Generally, however,
the computational domain will be subdivided over multiple logical
processors in order to then bind those logical processors to physical
processor resources that can compute in parallel.

\subsubsection{Tiles}

Computationally, associated with each region of physical
space allocated to a particular logical processor, there will be data
structures (arrays, scalar variables, etc.) that hold the simulated state of
that region. We refer to these data structures as being {\bf owned} by the
processor to which their
associated region of physical space has been allocated. Individual
regions that are allocated to processors are called {\bf tiles}. A
processor can own more
than one tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a single
region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously, working
only with data from the tile (or tiles) that the processor owns. When multiple
tiles are allotted to a single processor, each tile is computed on
independently of the other tiles, in a sequential fashion.

\begin{figure}
\begin{center}
\resizebox{7in}{3in}{
\includegraphics*[0.5in,2.7in][12.5in,6.4in]{part4/domain_decomp.eps}
}
\end{center}
\caption{ The WRAPPER provides support for one and two dimensional
decompositions of grid-point domains. The figure shows a hypothetical domain of
total size $N_{x}N_{y}N_{z}$. This hypothetical domain is decomposed in
two dimensions along the $N_{x}$ and $N_{y}$ directions. The resulting {\bf
tiles} are {\bf owned} by different processors. The {\bf owning}
processors perform the
arithmetic operations associated with a {\bf tile}. Although not illustrated
here, a single processor can {\bf own} several {\bf tiles}.
Whenever a processor wishes to transfer data between tiles or
communicate with other processors it calls a WRAPPER supplied
function.
} \label{fig:domaindecomp}
\end{figure}


\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region. The overlap region
of a tile corresponds to the interior region of an adjacent tile.
In figure \ref{fig:tiledworld} each tile would own the region
within the black square and hold duplicate information for overlap
regions extending into the tiles to the north, south, east and west.
During
computational phases a processor will reference data in an overlap region
whenever it requires values that lie outside the domain it owns. Periodically
processors will make calls to WRAPPER functions to communicate data between
tiles, in order to keep the overlap regions up to date (see section
\ref{sec:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.

\begin{figure}
\begin{center}
\resizebox{7in}{3in}{
\includegraphics*[4.5in,3.7in][12.5in,6.7in]{part4/tiled-world.eps}
}
\end{center}
\caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
Overlap regions are periodically updated from neighboring tiles.
} \label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two possible
mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}.
Under this mode of communication data transfers are assumed to be possible
using direct addressing of regions of memory. In this case a CPU is able to read
(and write) directly to regions of memory ``owned'' by another CPU
using simple programming language level assignment operations of the
sort shown in figure \ref{fig:simple_assign}. In this way one CPU
(CPU1 in the figure) can communicate information to another CPU (CPU2 in the
figure) by assigning a particular value to a particular memory location.

\item {\bf Distributed memory communication}.
Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned by and visible to another CPU. Instead
a communication library must be used, as illustrated in figure
\ref{fig:comm_msg}. In this case CPU's must call a function in the API of the
communication library to communicate data from a tile that it owns to a tile
that another CPU owns. By default the WRAPPER binds to the MPI communication
library \ref{MPI} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication. The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

           CPU1                     |        CPU2
           ====                     |        ====
                                    |
     a(3) = 8                       |   WHILE ( a(3) .NE. 8 )
                                    |    WAIT
                                    |   END WHILE
                                    |
\end{verbatim}
\caption{ In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPU's at the application code level,
so that, for example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPU's (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

           CPU1                     |        CPU2
           ====                     |        ====
                                    |
     a(3) = 8                       |   WHILE ( a(3) .NE. 8 )
     CALL SEND( CPU2,a(3) )         |    CALL RECV( CPU1, a(3) )
                                    |   END WHILE
                                    |
\end{verbatim}
\caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPU's.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and/or CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPU's.
} \label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sec:shared_memory_communication}

Under shared memory communication independent CPU's operate
on the exact same global address space at the application level.
This means that CPU 1 can directly write into global
data structures that CPU 2 ``owns'' using a simple
assignment at the application level.
This model of memory access is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (the SGI Origin).
On such systems the WRAPPER will generally use simple read and write statements
to directly access application data structures when communicating between CPU's.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to
hardware instructions that transport data between CPU and memory banks, this
can be a very efficient mechanism for communication. In this case two CPU's,
CPU1 and CPU2, can communicate simply by reading and writing to an
agreed location and following a few basic rules. The latency of this sort
of communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPU's communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
communication very efficient provided it is used appropriately.

\subsubsection{Memory consistency}
\label{sec:memory_consistency}

When using shared memory communication between
multiple processors the WRAPPER level shields user applications from
certain counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general the order
of reads and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will always
operate so that, for the application instructions the processor is executing,
any reordering is not apparent. However, machines are often
designed so that reordering of instructions is not hidden from other
processors. This means that, in general, even on a shared memory system two
processors can observe inconsistent memory values.

The issue of memory consistency between multiple processors is discussed at
length in many computer science papers. From a practical point of
view, in order to deal with this issue, shared memory machines all provide
some mechanism to enforce memory consistency when it is needed. The exact
mechanism employed will vary between systems. For communication using shared
memory, the WRAPPER provides a place to invoke the appropriate mechanism to
ensure memory consistency for a particular platform.

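To make the ordering issue concrete, the sketch below shows the kind of
producer/consumer pattern that needs such a mechanism, in the same schematic
style as figure \ref{fig:simple_assign}. The call {\em MEMORY\_FENCE()} is a
hypothetical placeholder for whatever platform-specific consistency mechanism
is invoked at that point; it is not an actual WRAPPER routine name.
\begin{verbatim}

           CPU1                     |        CPU2
           ====                     |        ====
     data(1) = result               |   WHILE ( ready .NE. 1 )
     CALL MEMORY_FENCE()            |    WAIT
     ready = 1                      |   END WHILE
                                    |   CALL MEMORY_FENCE()
                                    |   answer = data(1)
\end{verbatim}
Without the two fence calls, CPU2 could observe {\em ready} $=1$ while still
reading a stale value of {\em data(1)} on a system that does not enforce
sequential consistency.
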
\subsubsection{Cache effects and false sharing}
\label{sec:cache_effects_and_false_sharing}

Shared-memory machines often have processor-local memory caches
which contain mirrored copies of main memory. Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors. These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks). The coherency protocols employed can be expensive relative to other
memory accesses and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.

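As a hypothetical illustration of the padding idea (the declarations below are
not the actual WRAPPER data structures), per-thread synchronization counters
can be spaced out so that each thread's counter sits in its own coherence
granule. Here 64 INTEGER words (256 bytes) of spacing are assumed; the
appropriate padding is platform dependent.
\begin{verbatim}
C     Hypothetical padded synchronization structure. Each thread
C     only updates syncCounter(1,myThid); the remaining words in
C     that column are padding, so no two threads share a 256 byte
C     coherence granule.
      INTEGER padWords
      PARAMETER ( padWords = 64 )
      INTEGER maxThreads
      PARAMETER ( maxThreads = 32 )
      INTEGER syncCounter( padWords, maxThreads )
      COMMON /SYNC_EXAMPLE/ syncCounter
\end{verbatim}
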
\subsubsection{Operating system support for shared memory}

Applications running under multiple threads within a single process can
use shared memory communication. In this case {\it all} the memory locations
in an application are potentially visible to all the compute threads. Multiple
threads operating within a single process is the standard mechanism for
supporting shared memory that the WRAPPER utilizes. Configuring and launching
code to run in multi-threaded mode on specific platforms is discussed in
section \ref{sec:running_with_threads}. However, on many systems, potentially
very efficient mechanisms for using shared memory communication between
multiple processes (in contrast to multiple threads within a single
process) also exist. In most cases this works by making a limited region of
memory shared between processes. The MMAP \ref{magicgarden} and
IPC \ref{magicgarden} facilities in UNIX systems provide this capability, as do
vendor specific tools like LAPI \ref{IBMLAPI} and IMC \ref{Memorychannel}.
Extensions exist for the WRAPPER that allow these mechanisms
to be used for shared memory communication. However, these mechanisms are not
distributed with the default WRAPPER sources, because of their proprietary
nature.

\subsection{Distributed memory communication}
\label{sec:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
at the system level. For this sort of system the WRAPPER provides support
for communication based on a bespoke communication library
(see figure \ref{fig:comm_msg}). The default communication library used is MPI
\ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform specific communication libraries. For example the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.

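For readers unfamiliar with message passing, the sketch below shows what the
transfer in figure \ref{fig:comm_msg} might look like when written with
standard MPI calls from Fortran. The buffer, tag and communicator choices are
arbitrary illustrative values, not the bindings actually used inside the
WRAPPER.
\begin{verbatim}
C     Hypothetical point-to-point transfer of one REAL*8 value
C     using standard MPI calls.
      INCLUDE 'mpif.h'
      REAL*8  a(3)
      INTEGER ierr, status(MPI_STATUS_SIZE)

C     On the sending process (rank 0):
      a(3) = 8.
      CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION, 1, 99,
     &               MPI_COMM_WORLD, ierr )

C     On the receiving process (rank 1):
      CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION, 0, 99,
     &               MPI_COMM_WORLD, status, ierr )
\end{verbatim}
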
\subsection{Communication primitives}
\label{sec:communication_primitives}

\begin{figure}
\begin{center}
\resizebox{5in}{3in}{
\includegraphics*[1.5in,0.7in][7.9in,4.4in]{part4/comm-primm.eps}
}
\end{center}
\caption{Three performance critical parallel primitives are provided
by the WRAPPER. These primitives are always used to communicate data
between tiles. The figure shows four tiles. The curved arrows indicate
exchange primitives which transfer data between the overlap regions at tile
edges and interior regions for nearest-neighbor tiles.
The straight arrows symbolize global sum operations which connect all tiles.
The global sum operation provides a key arithmetic primitive and can also
serve as a synchronization primitive. A third, barrier primitive is also
provided; it behaves much like the global sum primitive.
} \label{fig:communication_primitives}
\end{figure}

Optimized communication support is assumed to be potentially available
for a small number of communication operations.
That is, it is assumed that communication performance optimizations can
be achieved by optimizing a small number of communication primitives.
Three optimizable primitives are provided by the WRAPPER:
\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between interior
and overlap regions of neighboring tiles. A number of different forms of this
operation are supported. These different forms handle
\begin{itemize}
\item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
separately.
\item Bindings to different communication methods.
Exchange primitives select between using shared memory or distributed
memory communication.
\item Transformation operations required when transporting
data between different grid regions. Transferring data
between faces of a cube-sphere grid, for example, involves a rotation
of vector components.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the exchange primitives.
\end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
operation for the pressure inversion phase of the MITgcm algorithm.
For certain configurations scaling can be highly sensitive to
the performance of the global sum primitive. This operation is a collective
operation involving all tiles of the simulated domain. Different forms
of the global sum primitive exist for handling
\begin{itemize}
\item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
separately.
\item Bindings to different communication methods.
Global sum primitives select between using shared memory or distributed
memory communication.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the global sum primitives.
\end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization function
called barrier. This is used to synchronize computations over all tiles.
The {\bf BARRIER} and {\bf GLOBAL SUM} primitives have much in common and in
some cases use the same underlying code. A usage sketch of all three
primitives is given after this list.
\end{itemize}

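The fragment below sketches how numerical code typically invokes the three
primitives; {\em GLOBAL\_SUM\_R8} and the thread number argument {\em myThid}
appear again later in this chapter, while the exchange and barrier routine
names shown here should be read as illustrative placeholders rather than the
definitive WRAPPER entry points.
\begin{verbatim}
C     Illustrative calling sequence only; phi stands for any tiled
C     model field and phiSum for a scalar to be summed globally.

C     Bring the overlap regions of phi up to date on every tile.
      CALL EXCH_XY_R8( phi, myThid )

C     Collective sum over all tiles, threads and processes.
      CALL GLOBAL_SUM_R8( phiSum, myThid )

C     Synchronize all threads and processes before proceeding.
      CALL BARRIER( myThid )
\end{verbatim}
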
\subsection{Memory architecture}

The WRAPPER machine model aims to efficiently target systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.

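As a concrete, hypothetical illustration of this reshaping, the two fragments
below use the {\em SIZE.h} tile-size parameters described in section
\ref{sec:specifying_a_decomposition} to decompose the same $256 \times 256$
horizontal domain within a single process, first as a few long, thin tiles
suited to a vector system and then as many compact tiles suited to a
cache-based system. The specific numbers are illustrative only.
\begin{verbatim}
C     Long, thin tiles for a vector system:
C     4 tiles of 256 x 64 points  (sNx*nSx = 256, sNy*nSy = 256)
      PARAMETER ( sNx = 256, sNy =  64, nSx = 1, nSy = 4 )

C     Compact tiles for a cache-based system:
C     64 tiles of 32 x 32 points covering the same domain
      PARAMETER ( sNx =  32, sNy =  32, nSx = 8, nSy = 8 )
\end{verbatim}
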
\begin{figure}
\begin{center}
\resizebox{5in}{3in}{
\includegraphics*[0.5in,1.3in][7.9in,5.7in]{part4/tiling_detail.eps}
}
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
to be shaped to suit the underlying system memory architecture.
Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies; long tiles
with large inner loops can be used to exploit vector systems having
highly pipelined memory systems.
} \label{fig:tiling-strategy}
\end{figure}

\newpage
\subsection{Summary}
Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:

\begin{itemize}
\item The machine consists of one or more logical processors. \vspace{-3mm}
\item Each processor operates on tiles that it owns.\vspace{-3mm}
\item A processor may own more than one tile.\vspace{-3mm}
\item Processors may compute concurrently.\vspace{-3mm}
\item Exchange of information between tiles is handled by the
machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each other
using shared memory. \vspace{-3mm}
\item An alternative communication mechanism based on a relatively
simple inter-process communication API may be required.\vspace{-3mm}
\item Shared memory may not necessarily obey sequential consistency;
however, some mechanism will exist for enforcing memory consistency.
\vspace{-3mm}
\item Memory consistency that is enforced at the hardware level
may be expensive. Unnecessary triggering of consistency protocols
should be avoided. \vspace{-3mm}
\item Memory access patterns may need to be either repetitive or highly
pipelined for optimum hardware performance. \vspace{-3mm}
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}

In order to support maximum portability the WRAPPER is implemented primarily
in sequential Fortran 77. At a practical level the key steps provided by the
WRAPPER are
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
\item controlling communication between tiles and between concurrently
computing CPU's.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sec:specifying_a_decomposition} explains how the way in which
a domain is decomposed (or composed) is expressed. Section
\ref{sec:starting_the_code} describes practical details of running codes
in various different parallel modes on contemporary computer systems.
Section \ref{sec:controlling_communication} explains the internal information
that the WRAPPER uses to control how information is communicated between
tiles.

\subsection{Specifying a domain decomposition}
\label{sec:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
code operating within the WRAPPER. Application code is written as a series
of compute operations, each of which operates on a single tile. If
application code needs to perform operations involving data
associated with another tile, it uses a WRAPPER function to obtain
that data.
The specification of how a global domain is constructed from tiles, or alternatively
how a global domain is decomposed into tiles, is made in the file {\em SIZE.h}.
This file defines the following parameters \\

\fbox{
\begin{minipage}{4.75in}
Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
File: {\em model/inc/SIZE.h}
\end{minipage}
} \\

Together these parameters define a tiling decomposition of the style shown in
figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computation stencil that the numerical code's finite-difference operations
require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed), however it will
not cause any other problems.

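The need for the overlap is easy to see in a simple finite-difference loop.
The hypothetical fragment below evaluates a five-point Laplacian over the
interior of a single tile; the references to {\em I-1}, {\em I+1}, {\em J-1}
and {\em J+1} reach one point into the overlap region, so this particular
stencil requires {\em OLx} and {\em OLy} of at least one. (In the MITgcm code
itself such arrays also carry tile indices, as shown later in figure
\ref{fig:bibj_extract}.)
\begin{verbatim}
C     Hypothetical five-point stencil on a single tile. Points read
C     at I-1, I+1, J-1 and J+1 may lie in the overlap region.
      REAL*8 phi   (1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      REAL*8 lapPhi(1-OLx:sNx+OLx,1-OLy:sNy+OLy)
      DO J=1,sNy
       DO I=1,sNx
        lapPhi(I,J) =   phi(I-1,J) + phi(I+1,J)
     &                + phi(I,J-1) + phi(I,J+1)
     &                - 4.*phi(I,J)
       ENDDO
      ENDDO
\end{verbatim}
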
\begin{figure}
\begin{center}
\resizebox{5in}{7in}{
\includegraphics*[0.5in,0.3in][7.9in,10.7in]{part4/size_h.eps}
}
\end{center}
\caption{ The three level domain decomposition hierarchy employed by the
WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
to a single process. Multiple processes can exist, each with multiple tiles.
Tiles within a process can be spread over multiple compute threads.
} \label{fig:labelled_tile}
\end{figure}

The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
be created within a single process. Each of these tiles will have internal
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sec:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
If the decomposition is run in a single process running multiple threads
but attached to a single physical processor, then, in general, the computation
for different tiles will be interleaved by system level software.
This too is a valid mode of operation.

The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
are used to form the loop ranges for many numerical calculations and
to provide dimensions for arrays holding numerical state.
The {\em nSx} and {\em nSy} are used in conjunction with the thread number
parameter {\em myThid}. Much of the numerical code operating within the
WRAPPER takes the form
\begin{verbatim}
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          a block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO

      communication code to sum a number or maybe update
      tile overlap regions

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          another block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO
\end{verbatim}
The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
schematic. These variables specify the subset of the tiles in
the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
thread number {\em myThid} owns. The thread number variable {\em myThid}
ranges from 1 to the total number of threads requested at execution time.
For each value of {\em myThid} the loop scheme above will step sequentially
through the tiles owned by that thread. However, different threads will
have different ranges of tiles assigned to them, so that separate threads can
compute iterations of the {\em bi}, {\em bj} loop concurrently.
Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example the evaluation of all
the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.

\begin{figure}
\begin{verbatim}
      REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  err
         :
         :
      other computations
         :
         :
      err = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO J=1,sNy
         DO I=1,sNx
          err = err +
     &     cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO

      CALL GLOBAL_SUM_R8( err , myThid )
      err = SQRT(err)

\end{verbatim}
\caption{Example of numerical code for calculating
the l2-norm of a vector within the WRAPPER. Notice that
under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These rightmost indices are tile indexes. Different
threads within a single process operate on different ranges of tile
index, as controlled by the settings of
{\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
} \label{fig:bibj_extract}
\end{figure}

The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
are used to indicate to the WRAPPER level how many processes (each with
{\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
This information is needed during initialization and during I/O phases.
However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy},
the values of {\em nPx} and {\em nPy} are absent
from the core numerical and support code.

\subsubsection{Examples of {\em SIZE.h} specifications}

The following different {\em SIZE.h} parameter settings illustrate how to
interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
and {\em nPy}.
\begin{enumerate}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  40,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up a single tile with x-dimension of ninety grid points, y-dimension of
forty grid points, and x and y overlaps of three grid points each.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with x-dimension of forty-five grid points, y-dimension of
twenty grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
arranged so that the global domain size is again ninety grid points in x and
forty grid points in y. In general the formula for global grid size (held in
model variables {\em Nx} and {\em Ny}) is
\begin{verbatim}
      Nx  = sNx*nSx*nPx
      Ny  = sNy*nSy*nPy
\end{verbatim}
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  10,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   2,
     &           nPx =   1,
     &           nPy =   2)
\end{verbatim}
This sets up tiles with x-dimension of ninety grid points, y-dimension of
ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two sub-domains in y may be computed concurrently.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
thirty-two grid points, and x and y overlaps of three grid points each.
There are six tiles allocated to six separate logical processors ({\em nSx=6}).
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sec:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}

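As a quick check of the global grid size formula given above, example 2 gives
{\em Nx} $= 45 \times 1 \times 2 = 90$ and {\em Ny} $= 20 \times 1 \times 2 = 40$,
while example 3 gives {\em Nx} $= 90 \times 1 \times 1 = 90$ and
{\em Ny} $= 10 \times 2 \times 2 = 40$; both therefore describe the same
$90 \times 40$ global domain as example 1.
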
\subsection{Starting the code}
\label{sec:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
once the WRAPPER has initialized correctly and has created the necessary variables
to support subsequent calls to communication routines
by the application code. The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
\begin{verbatim}

       MAIN
       |
       |--EEBOOT :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
       |  |                     allow basic I/O.
       |  |-- EEINTRO_MSG    :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS    :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV  :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS      :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |  |
       |  |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                :: communication data structures
       |
       |
       |--CHECK_THREADS    :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN   :: Numerical code top-level driver routine


\end{verbatim}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
} \label{fig:wrapper_startup}
\end{figure}

\subsubsection{Multi-threaded execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sec:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the following
steps.\\

\paragraph{Compilation}
First the code must be compiled with appropriate multi-threading directives
active in the file {\em main.F} and with appropriate compiler flags
to request multi-threading support. The header files
{\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
contain directives compatible with compilers for Sun, Compaq, SGI,
Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
activated by using the compile time
flags {\em -DTARGET\_SUN},
{\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
for invoking multi-threaded compilation vary from system to system
and from compiler to compiler. The options will be described
in the individual compiler documentation. For the Fortran compiler
from Sun the following options are needed to correctly compile
multi-threaded code
\begin{verbatim}
     -stackvar -explicitpar -vpara -noautopar
\end{verbatim}
These options are specific to the Sun compiler. Other compilers
will use different syntax that will be described in their
documentation. The effect of these options is as follows
\begin{enumerate}
\item {\bf -stackvar} Causes all local variables to be allocated in stack
storage. This is necessary for local variables to ensure that they are private
to their thread. Note, when using this option it may be necessary to override
the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the settings of the command shell's
{\em stack-size} limit variable. However, on some systems changing this limit
will require privileged administrator access to modify system parameters.

\item {\bf -explicitpar} Requests that multiple threads be spawned
in response to explicit directives in the application code. These
directives are inserted with syntax appropriate to the particular target
platform when, for example, the {\em -DTARGET\_SUN} flag is selected.

\item {\bf -vpara} This causes the compiler to describe the multi-threaded
configuration it is creating. This is not required
but it can be useful when troubleshooting.

\item {\bf -noautopar} This inhibits any automatic multi-threaded
parallelization the compiler may otherwise generate.

\end{enumerate}

\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.

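For example, on a system where the variable is called PARALLEL, requesting two
threads from a csh-style shell might look like
\begin{verbatim}
     setenv PARALLEL 2
\end{verbatim}
The exact variable name and the way it is set are platform and compiler
dependent, so the local documentation should be checked.
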
\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate
the number of threads to be used in the x and y directions.
The variables {\em nTx} and {\em nTy} in this file are used to
specify the information required. The product of {\em nTx} and
{\em nTy} must be equal to the number of threads spawned, i.e.\
the setting of the environment variable PARALLEL.
The value of {\em nTx} must subdivide the number of sub-domains
in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
number of sub-domains in y ({\em nSy}) exactly.

An example of valid settings for the {\em eedata} file for a
domain with two subdomains in y and running with two threads is shown
below
\begin{verbatim}
     nTx=1,nTy=2
\end{verbatim}
This set of values will cause computations to stay within a single
thread when moving across the {\em nSx} sub-domains. In the y-direction,
however, sub-domains will be split equally between two threads.

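As a further illustration (the numbers here are hypothetical), a decomposition
with {\em nSx=2} and {\em nSy=2} (four tiles per process) could be computed by
four threads with the settings
\begin{verbatim}
     nTx=2,nTy=2
\end{verbatim}
since $2 \times 2 = 4$ matches the required setting of the environment
variable PARALLEL, {\em nTx} divides {\em nSx} exactly and {\em nTy} divides
{\em nSy} exactly.
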
\paragraph{Multi-threading files and parameters} The following
files and variables are used in setting up multi-threaded execution.\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
CPP:  {\em TARGET\_HP }\\
CPP:  {\em TARGET\_SGI}\\
CPP:  {\em TARGET\_CRAY\_VECTOR}\\
Parameter:  {\em nTx}\\
Parameter:  {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}

Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
is that many system libraries are still not ``thread-safe''. This means that, for
example, on some systems it is not safe to call system routines to
do I/O when running in multi-threaded mode, except in a limited set of
circumstances. Another reason is that support for multi-threaded programming
models varies between systems.

Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration a decomposition
specification is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
appropriate compile time and run time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable MPI libraries are available
for controlling the start-up of multiple processes. The MPI libraries
are not required, although they are usually used, for performance
critical communication. However, in order to simplify the task
of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. This is done
by setting the {\em ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI}
flags in the {\em CPP\_EEOPTIONS.h} file.\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/CPP\_EEOPTIONS.h}\\
CPP:  {\em ALLOW\_USE\_MPI}\\
CPP:  {\em ALWAYS\_USE\_MPI}\\
Parameter:  {\em nPx}\\
Parameter:  {\em nPy}
\end{minipage}
} \\

Additionally, compile time options are required to link in the
MPI libraries and header files. Examples of these options
can be found in the {\em genmake} script that creates makefiles
for compilation. When this script is executed with the {\bf -mpi}
flag it will generate a makefile that includes
paths to search for MPI header files and for linking in
MPI libraries. For example the {\bf -mpi} flag on a
Silicon Graphics IRIX system causes a
Makefile with the compilation command
\begin{verbatim}
     mpif77 -I/usr/local/mpi/include -DALLOW_USE_MPI -DALWAYS_USE_MPI
\end{verbatim}
to be generated.
This is the correct set of options for using the MPICH open-source
version of MPI, when it has been installed under the subdirectory
/usr/local/mpi.
However, on many systems there may be several
versions of MPI installed. For example many systems have both
the open source MPICH set of libraries and a vendor specific native form
of the MPI libraries. The correct setup to use will depend on the
local configuration of your system.\\

\fbox{
\begin{minipage}{4.75in}
File: {\em tools/genmake}
\end{minipage}
} \\
\paragraph{Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.
For the free, open-source MPICH system the MITgcm program is started
using a command such as
\begin{verbatim}
     mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of processes
that will be created. The numeric value {\em 64} must be equal to the
product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
called ``mf'' will be read to get a list of processor names on
which the sixty-four processes will execute. The syntax of this file
is specified by the MPI distribution.
\\

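For example (the numbers here are purely illustrative), a {\em SIZE.h}
configured with {\em nPx=8} and {\em nPy=8} would be run with {\em -np 64},
since $8 \times 8 = 64$.
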
\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter:  {\em nPx}\\
Parameter:  {\em nPy}
\end{minipage}
} \\

The multiprocess startup of the MITgcm executable {\em mitgcmuv}
is controlled by the routines {\em EEBOOT\_MINIMAL()} and
{\em INI\_PROCS()}. The first routine performs basic steps required
to make sure each process is started and has a textual output
stream associated with it. By default two output files are opened
for each process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.
The {\bf NNNN} part of the name is filled in with the process
number so that process number 0 will create output files
{\bf STDOUT.0000} and {\bf STDERR.0000}, process number 1 will create
output files {\bf STDOUT.0001} and {\bf STDERR.0001}, etc. These files
are used for reporting status and configuration information and
for reporting error conditions on a process by process basis.
The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
{\em myProcId} and {\em MPI\_COMM\_MODEL}.
These variables are related
to processor identification and are used later in the routine
{\em INI\_PROCS()} to allocate tiles to processes.

Allocation of tiles to processes is controlled by the routine
{\em INI\_PROCS()}. For each process this routine sets
the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
These variables specify (in index space) the coordinates
of the southern most and western most corner of the
southern most and western most tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
are also set in this routine. These are used to identify
the processes holding tiles to the west, east, south and north
of this process. These values are stored in global storage
in the header file {\em EESUPPORT.h} for use by
communication routines.
\\

1081     \fbox{
1082     \begin{minipage}{4.75in}
1083     File: {\em eesupp/src/eeboot\_minimal.F}\\
1084     File: {\em eesupp/src/ini\_procs.F} \\
1085     File: {\em eesupp/inc/EESUPPORT.h} \\
1086     Parameter: {\em myProcId} \\
1087     Parameter: {\em MPI\_COMM\_MODEL} \\
1088     Parameter: {\em myXGlobalLo} \\
1089     Parameter: {\em myYGlobalLo} \\
1090     Parameter: {\em pidW } \\
1091     Parameter: {\em pidE } \\
1092     Parameter: {\em pidS } \\
1093     Parameter: {\em pidN }
1094     \end{minipage}
1095     } \\
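To make the index-space convention concrete, consider a purely
illustrative decomposition (not taken from any supplied configuration)
with a global domain of 180 points in $x$, {\em nPx}=2, {\em nPy}=1
and a single tile of width 90 per process:
\begin{verbatim}
C     Illustrative values only.
C     Global x-index range 1..180, nPx=2, nSx=1, tile width sNx=90:
C       process 0 owns x-indices   1.. 90  =>  myXGlobalLo =  1
C       process 1 owns x-indices  91..180  =>  myXGlobalLo = 91
C     Both processes have myYGlobalLo = 1 since nPy=1.
\end{verbatim}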
1096    
1097    
1098     \subsection{Controlling communication}
1099     The WRAPPER maintains internal information that is used for communication
1100     operations and that can be customized for different platforms. This section
1101     describes the information that is held and used.
1102     \begin{enumerate}
\item {\bf Tile-tile connectivity information}
For each tile the WRAPPER records the tile numbers of the tiles to the
north, south, east and west of that tile. Each tile number is unique
over all tiles in a configuration. The numbers are held in the variables
{\em tileNo} (this holds the tile's own number), {\em tileNoN},
{\em tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
stored with each tile that specifies the type of communication that is
used between tiles. This information is held in the variables
{\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
{\em tileCommModeW}. This latter set of variables can take one of the
following values: {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT}
and {\em COMM\_GET}.
A value of {\em COMM\_NONE} is used to indicate that a tile has no
neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
memory communication is required to communicate between
these tile faces (see section \ref{sec:distributed_memory_communication}).
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
forms of shared memory communication (see section
\ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value indicates
that a CPU should communicate by writing to data structures owned by another
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
from data structures owned by another CPU. These flags affect the behavior
of the WRAPPER exchange primitive
(see figure \ref{fig:communication_primitives}). The routine
{\em ini\_communication\_patterns()} is responsible for setting the
communication mode values for each tile.
1129     \\
1130    
1131     \fbox{
1132     \begin{minipage}{4.75in}
1133     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1134     File: {\em eesupp/inc/EESUPPORT.h} \\
1135     Parameter: {\em tileNo} \\
1136     Parameter: {\em tileNoE} \\
1137     Parameter: {\em tileNoW} \\
1138     Parameter: {\em tileNoN} \\
1139     Parameter: {\em tileNoS} \\
1140     Parameter: {\em tileCommModeE} \\
1141     Parameter: {\em tileCommModeW} \\
1142     Parameter: {\em tileCommModeN} \\
1143     Parameter: {\em tileCommModeS} \\
1144     \end{minipage}
1145     } \\
1146    
\item {\bf MP directives}
The WRAPPER transfers control to numerical application code through
the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
that allows for it to be invoked by several threads. Support for this
is based on using multi-processing (MP) compiler directives.
Most commercially available Fortran compilers support the generation
of code to spawn multiple threads through some form of compiler directives.
Since this is generally much more convenient than writing code that
interfaces to operating system libraries to explicitly spawn threads,
and since on some systems it may be the only method available, the
WRAPPER is distributed with template MP directives for a number of
systems.

These directives are inserted into the code just before and after the
transfer of control to numerical algorithm code through the routine
{\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows an example of
the code that performs this process for a Silicon Graphics system.
This code is extracted from the files {\em main.F} and
{\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads} specifies
how many instances of the routine {\em THE\_MODEL\_MAIN} will
be created. The value of {\em nThreads} is set in the routine
{\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
product of the parameters {\em nTx} and {\em nTy} that
are read from the file {\em eedata} (an illustrative {\em eedata}
fragment is shown after this list). If the value of {\em nThreads}
is inconsistent with the number of threads requested from the
operating system (for example by using an environment
variable as described in section \ref{sec:multi_threaded_execution})
then an error will usually be reported by the routine
{\em CHECK\_THREADS}.\\
1175    
1176     \fbox{
1177     \begin{minipage}{4.75in}
1178     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1179     File: {\em eesupp/src/check\_threads.F} \\
1180     File: {\em eesupp/src/main.F} \\
1181     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1182     File: {\em eedata } \\
1183     Parameter: {\em nThreads} \\
1184     Parameter: {\em nTx} \\
1185     Parameter: {\em nTy} \\
1186     \end{minipage}
1187     }
1188    
\begin{figure}
\begin{verbatim}
C--
C--   Parallel directives for MIPS Pro Fortran compiler
C--
C     Parallel compiler directives for SGI with IRIX
C$PAR PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
\caption{Prior to transferring control to
the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
MP directives to spawn multiple threads.
} \label{fig:mp_directives}
\end{figure}
1212    
1213    
\item {\bf memsync flags}
As discussed in section \ref{sec:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
The routine {\em MEMSYNC()} is used for this purpose. This routine should
not need modifying and the information below is only provided for
completeness. A logical parameter {\em exchNeedsMemSync} set
in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether
the {\em MEMSYNC()} primitive is called. In general this
routine is only used for multi-threaded execution.
The code that goes into the {\em MEMSYNC()}
routine is specific to the compiler and
processor being used for multi-threaded execution and in general
must be written using a short code snippet of assembly language.
For an UltraSPARC system the following code snippet is used
\begin{verbatim}
asm("membar #LoadStore|#StoreStore");
\end{verbatim}
for an Alpha based system the equivalent code reads
\begin{verbatim}
asm("mb");
\end{verbatim}
while on an x86 system the following code is required
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory")
\end{verbatim}
1239    
\item {\bf Cache line size}
As discussed in section \ref{sec:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
coherence traffic on an SMP system. To do this the shared memory data
structures used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and
{\em BARRIER} routines are padded. The variables that control the padding
are set in the header file {\em EEPARAMS.h}. These variables are called
{\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
{\em lShare8}. The default values should not normally need changing.
1249     \item {\bf \_BARRIER}
1250     This is a CPP macro that is expanded to a call to a routine
1251     which synchronises all the logical processors running under the
1252     WRAPPER. Using a macro here preserves flexibility to insert
1253     a specialized call in-line into application code. By default this
1254     resolves to calling the procedure {\em BARRIER()}. The default
1255     setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1256    
\item {\bf \_GSUM}
This is a CPP macro that is expanded to a call to a routine
which sums a floating point number
over all the logical processors running under the
WRAPPER. Using a macro here provides extra flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} (for
64-bit floating point operands)
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
The \_GSUM macro is a performance critical operation, especially for
large processor count, small tile size configurations.
The custom communication example discussed in section \ref{sec:jam_example}
shows how the macro is used to invoke a custom global sum routine
for a specific set of hardware.
1272    
\item {\bf \_EXCH}
The \_EXCH CPP macro is used to update tile overlap regions.
It is qualified by a suffix indicating whether overlap updates are for
two-dimensional (\_EXCH\_XY) or three-dimensional (\_EXCH\_XYZ)
physical fields and whether fields are 32-bit floating point
(\_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4) or 64-bit floating point
(\_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8). The macro mappings are defined
in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
\_EXCH operation plays a crucial role in scaling to small tile,
large logical and physical processor count configurations.
The example in section \ref{sec:jam_example} discusses defining an
optimised and specialized form of the \_EXCH operation.

The \_EXCH operation is also central to supporting grids such as
the cube-sphere grid. In this class of grid a rotation may be required
between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
be embedded within a custom form of the \_EXCH primitive.
1291    
\item {\bf Reverse Mode}
The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
These reverse mode forms can be found in the
source code directory {\em pkg/autodiff}.
For the global sum primitive the reverse mode form
calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
prefixed {\em ADEXCH}. The exchange routines make calls to
the same low-level communication primitives as the forward mode
operations. However, the routine argument {\em simulationMode}
is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
appropriate communication operation should be performed.
\item {\bf MAX\_NO\_THREADS}
The variable {\em MAX\_NO\_THREADS} is used to indicate the
maximum number of OS threads that a code will use. This
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.

For certain physical parametrization schemes it is necessary to have
a substantial number of work arrays. Where these arrays are allocated
in global storage (for example COMMON blocks) multi-threaded
execution will require multiple instances of the COMMON block data.
This can be achieved using a Fortran 90 module construct; however,
if this is unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} (as described in section
\ref{sec:specifying_a_decomposition}). However, if the configuration
being specified involves many more tiles than OS threads then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
will be used and to declare the physical parameterisation
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
An example of this is given in the verification experiment
{\em aim.5l\_cs}. Here the default setting of
{\em MAX\_NO\_THREADS} is altered to
\begin{verbatim}
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS = 6 )
\end{verbatim}
and several work arrays for storing intermediate calculations are
created with declarations of the form
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent rather than temporary storage of state information.
In the case of permanent state information this approach cannot be used
because there has to be enough storage allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterisations.
1347    
1348     \end{enumerate}
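As referenced in the MP directives item above, the value of {\em nThreads}
is the product of {\em nTx} and {\em nTy} read from the file {\em eedata}.
The fragment below is an illustrative sketch only (it assumes the
{\em EEPARMS} namelist group and a particular two-thread layout, neither
of which is a recommendation); here two threads are requested, decomposed
as two threads in $x$ and one in $y$.
\begin{verbatim}
# Illustrative "eedata" fragment: nThreads = nTx * nTy = 2 * 1 = 2.
 &EEPARMS
 nTx = 2,
 nTy = 1,
 &
\end{verbatim}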
1349    
1350     \subsubsection{Specializing the Communication Code}
1351    
The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles are powerful tools.
Here we show how they can be used to improve application performance and
how they can be used to adapt to new gridding approaches.
1356    
1357     \subsubsection{JAM example}
1358     \label{sec:jam_example}
On some platforms a big performance boost can be obtained by
binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example the
shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
is used as an illustration of a specialized communication configuration
that substitutes for standard, portable forms of {\em \_EXCH} and
{\em \_GSUM}. It affects three source files {\em eeboot.F},
{\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
it has the following effects.
\begin{itemize}
\item An extra phase is included at boot time to initialize the custom
communications library (see {\em ini\_jam.F}).
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines (see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
for overlap regions of width one) is substituted into the elliptic
solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern.
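The mechanism behind these substitutions is ordinary CPP macro
redefinition. The snippet below sketches the kind of conditional
definition involved; the macro argument lists and the routine name
{\em GSUM\_JAM} are placeholders rather than the actual MITgcm
interfaces.
\begin{verbatim}
C     Sketch only: how a CPP flag can redirect the _GSUM macro to a
C     platform specific routine. Argument lists are placeholders.
#ifdef LETS_MAKE_JAM
#define _GSUM(a,b)  CALL GSUM_JAM(a,b)
#else
#define _GSUM(a,b)  CALL GLOBAL_SUM_R8(a,b)
#endif
\end{verbatim}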
1379    
1380     \subsubsection{Cube sphere communication}
1381     \label{sec:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from
a series of template files, for example {\em exch\_rx.template}.
This is done to allow a large number of variations on the exchange
process to be maintained. One set of variations supports the
cube sphere grid. Support for a cube sphere grid in MITgcm is based
on having each face of the cube as a separate tile (or tiles).
The exchange routines are then able to absorb much of the
detailed rotation and reorientation required when moving around the
cube grid. The set of {\em \_EXCH} routines that contain the
word cube in their name perform these transformations.
They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set to true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for
grid-centered and for grid-face and corner quantities.
Three sets of exchange routines are defined. Routines
with names of the form {\em exch\_rx} are used to exchange
cell centered scalar quantities. Routines with names of the form
{\em exch\_uv\_rx} are used to exchange vector quantities located at
the C-grid velocity points. The vector quantities exchanged by the
{\em exch\_uv\_rx} routines can either be signed (for example velocity
components) or un-signed (for example grid-cell separations).
Routines with names of the form {\em exch\_z\_rx} are used to exchange
quantities at the C-grid vorticity point locations.
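Because {\em useCubedSphereExchange} is a run-time logical parameter, a
cubed-sphere configuration enables it in its execution environment
parameter file. The fragment below is an illustrative sketch only, and
assumes the {\em eedata} file and {\em EEPARMS} namelist group used in
the threading example earlier.
\begin{verbatim}
# Illustrative "eedata" fragment: turn on cube sphere exchanges.
 &EEPARMS
 useCubedSphereExchange=.TRUE.,
 &
\end{verbatim}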
1406    
1407    
1408    
1409    
1410     \section{MITgcm execution under WRAPPER}
1411    
Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sec:calling_sequence}.
1415    
1416     \subsection{Annotated call tree for MITgcm and WRAPPER}
1417     \label{sec:calling_sequence}
1418    
1419     WRAPPER layer.
1420    
1421     \begin{verbatim}
1422    
1423     MAIN
1424     |
1425     |--EEBOOT :: WRAPPER initialization
1426     | |
1427     | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1428     | | allow basic I/O.
1429     | |-- EEINTRO_MSG :: Write startup greeting.
1430     | |
1431     | |-- EESET_PARMS :: Set WRAPPER parameters
1432     | |
1433     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1434     | |
1435     | |-- INI_PROCS :: Associate processes with grid regions.
1436     | |
1437     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1438     | |
1439     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1440     | :: communication data structures
1441     |
1442     |
1443     |--CHECK_THREADS :: Validate multiple thread start up.
1444     |
1445     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1446    
1447     \end{verbatim}
1448    
1449     Core equations plus packages.
1450    
1451     \begin{verbatim}
1452     C
1453     C
1454     C Invocation from WRAPPER level...
1455     C :
1456     C :
1457     C |
1458     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1459     C | :: Called from WRAPPER level numerical
1460     C | :: code invocation routine. On entry
1461     C | :: to THE_MODEL_MAIN separate threads and
1462     C | :: separate processes will have been established.
1463     C | :: Each thread and process will have a unique ID
1464     C | :: but as yet it will not be associated with a
1465     C | :: specific region in decomposed discrete space.
1466     C |
1467     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1468     C | | :: grid, solver matrices etc..
1469     C | |
1470     C | |-INI_PARMS :: Routine to set kernel model parameters.
1471     C | | :: By default kernel parameters are read from file
1472     C | | :: "data" in directory in which code executes.
1473     C | |
1474     C | |-MON_INIT :: Initialises monitor package ( see pkg/monitor )
1475     C | |
1476     C | |-INI_GRID :: Control grid array (vert. and hori.) initialisation.
1477     C | | | :: Grid arrays are held and described in GRID.h.
1478     C | | |
1479     C | | |-INI_VERTICAL_GRID :: Initialise vertical grid arrays.
1480     C | | |
1481     C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialisation
1482     C | | | :: (calculate grid from kernel parameters).
1483     C | | |
1484     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1485     C | | | :: initialisation (calculate grid from
1486     C | | | :: kernel parameters).
1487     C | | |
1488     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1489     C | | :: grid initialisations. ( input from raw
1490     C | | :: grid files, LONC.bin, DXF.bin etc... )
1491     C | |
1492     C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1493     C | |
1494     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1495     C | | :: land masking for solid-fluid boundaries.
1496     C | |
1497     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1498     C | |
1499     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1500     C | | :: sphere options are coded.
1501     C | |
1502     C | |-PACKAGES_BOOT :: Start up the optional package environment.
1503     C | | :: Runtime selection of active packages.
1504     C | |
1505     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1506     C | | |
1507     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1508     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1509     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1510     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1511     C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1512     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1513     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1514     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1515     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1516     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1517     C | |
1518     C | |-PACKAGES_CHECK
1519     C | | |
1520     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1521     C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1522     C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1523     C | |
1524     C | |-PACKAGES_INIT_FIXED
1525     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1526     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1527     C | |
1528     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1529     C | |
1530     C | |-INI_CG2D :: 2d con. grad solver initialisation.
1531     C | |
1532     C | |-INI_CG3D :: 3d con. grad solver initialisation.
1533     C | |
1534     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1535     C | :: Includes annotated table of kernel
1536     C | :: parameter settings.
1537     C |
1538     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1539     C |
1540     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1541     C | :: Automatically generated by TAMC/TAF.
1542     C |
1543     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1544     C |
1545     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1546     C |
1547     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1548     C | |
1549     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1550     C | | | :: variables
1551     C | | |
1552     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1553     C | | |
1554     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1555     C | | | :: sphere options are coded.
1556     C | | |
1557     C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1558     C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1559     C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1560     C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1561     C | | | :: fields).
1562     C | | |
1563     C | | |-INI_FIELDS :: Control initialising model fields to non-zero
1564     C | | | |-INI_VEL :: Initialize 3D flow field.
1565     C | | | |-INI_THETA :: Set model initial temperature field.
1566     C | | | |-INI_SALT :: Set model initial salinity field.
1567     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1568     C | | |
1569     C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1570     C | | |
1571     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1572     C | | | | :: Here applied to move restart conditions
1573     C | | | | :: (saved in mid timestep) to correct level in
1574     C | | | | :: time (only used for pre-c35).
1575     C | | | |
1576     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1577     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1578     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1579     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1580     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1581     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1582     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1583     C | | | | |-FIND_RHO :: Find adjacent densities.
1584     C | | | | |-CONVECT :: Mix static instability.
1585     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1586     C | | | |
1587     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1588     C | | |
1589     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1590     C | | | | :: Extra time history interactions.
1591     C | | | |
1592     C | | | |-FIND_RHO :: Find adjacent densities.
1593     C | | | |-CONVECT :: Mix static instability.
1594     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1595     C | | |
1596     C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1597     C | | | | :: package data.
1598     C | | | |
1599     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1600     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1601     C | | | |-KPP_OPEN_DIAGS
1602     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1603     C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1604     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1605     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1606     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1607     C | | | |-INI_FORCING :: Set model initial forcing fields.
1608     C | | | | :: Either set in-line or from file as shown.
1609     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1610     C | | | |-READ_FLD_XY_RS(meridWindFile)
1611     C | | | |-READ_FLD_XY_RS(surfQFile)
1612     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1613     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1614     C | | | |-READ_FLD_XY_RS(saltClimFile)
1615     C | | | |-READ_FLD_XY_RS(surfQswFile)
1616     C | | |
1617     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1618     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1619     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1620     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1621     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1622     C | |
1623     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1624     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1625     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1626     C | |
1627     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1628     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1629     C====|>|
1630     C====|>| ****************************
1631     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1632     C====|>| ****************************
1633     C====|>|
1634     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1635     C/\ | | |
1636     C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1637     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1638     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1639     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1640     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1641     C/\ | | | | :: Simple interpolation between end-points
1642     C/\ | | | | :: for forcing datasets.
1643     C/\ | | | |
1644     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1645     C/\ | | |
1646     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1647     C/\ | | | |
1648     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1649     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1650     C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1651     C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1652     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1653     C/\ | | | |
1654     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1655     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1656     C/\ | | | | :: forcing terms.
1657     C/\ | | | |
1658     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1659     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1660     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1661     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1662     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1663     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1664     C/\ | | | | advection case) (see pkg/gad).
1665     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1666     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1667     C/\ | | | | |
1668     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1669     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1670     C/\ | | | |
1671     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1672     C/\ | | | | |
1673     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1674     C/\ | | | | | :: ( see pkg/gad )
1675     C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1676     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1677     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1678     C/\ | | | |
1679     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1680     C/\ | | | |
1681     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1682     C/\ | | | | |
1683     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1684     C/\ | | | | | :: ( see pkg/gad )
1685     C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1686     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1687     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1688     C/\ | | | |
1689     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1690     C/\ | | | |
1691     C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1692     C/\ | | | | |
1693     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1694     C/\ | | | | | :: ( see pkg/gad )
1695     C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1696     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1697     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1698     C/\ | | | |
1699     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1700     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1701     C/\ | | | |-FREEZE :: Limit range of temperature.
1702     C/\ | | | |
1703     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1704     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1705     C/\ | | | |
1706     C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1707     C/\ | | | |-EXCH :: Update overlaps
1708     C/\ | | |
1709     C/\ | | |-DYNAMICS :: Momentum equations driver.
1710     C/\ | | | |
1711     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1712     C/\ | | | | Potential anomaly.
1713     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1714     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1715     C/\ | | | |
1716     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1717     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1718     C/\ | | | | pkg/mom_fluxform ).
1719     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1720     C/\ | | | | pkg/mom_vecinv ).
1721     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1722     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1723     C/\ | | | |
1724     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1725     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1726     C/\ | | | |
1727     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1728     C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1729     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1730     C/\ | | |
1731     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1732     C/\ | | |
1733     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1734     C/\ | | |
1735     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1736     C/\ | | |
1737     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1738     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1739     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1740     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1741     C/\ | | |
1742     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1743     C/\ | | | |
1744     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1745     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1746     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1747     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1748     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1749     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1750     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1751     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1752     C/\ | | | | |-CONVECT :: Mix static instability.
1753     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1754     C/\ | | | |
1755     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1756     C/\ | | |
1757     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1758     C/\ | | | |-EXCH
1759     C/\ | | |
1760     C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1761     C/\ | | |
1762     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1763     C/\ | | |
1764     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1765     C/\ | | | |-WRITE_STATE :: Core state I/O
1766     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1767     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1768     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1769     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1770     C/\ | | |
1771     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1772     C/\ | |
1773     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1774     C<===|=|
1775     C<===|=| **************************
1776     C<===|=| END MAIN TIMESTEPPING LOOP
1777     C<===|=| **************************
1778     C<===|=|
1779     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1780     C |
1781     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1782     C |
1783     C |-TIMER_PRINTALL :: Computational timing summary
1784     C |
1785     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1786     C :: events.
1787     C
1788     \end{verbatim}
1789    
1790     \subsection{Measuring and Characterizing Performance}
1791    
1792     TO BE DONE (CNH)
1793    
1794     \subsection{Estimating Resource Requirements}
1795    
1796     TO BE DONE (CNH)
1797    
1798     \subsubsection{Atlantic 1/6 degree example}
1799     \subsubsection{Dry Run testing}
1800     \subsubsection{Adjoint Resource Requirements}
1801     \subsubsection{State Estimation Environment Resources}
1802    
