% $Header: /u/u3/gcmpack/manual/part4/sarch.tex,v 1.14 2004/01/28 20:38:28 afe Exp $

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects, simply starting from an example code and adapting it to suit a
particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating system
features that are employed.
18 cnh 1.1
19     \section{Overall architectural goals}
20    
Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

\begin{itemize}
\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to
a wide range of platforms.
\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.
\end{itemize}
33    
These points are summarized in figure \ref{fig:mitgcm_architecture_goals},
which conveys the goals of the MITgcm design. The goals lead to
a software architecture which at a high level can be viewed as consisting
of:

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in detail in
section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing,
for example, mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
specialized physical content onto the core numerical code. An overview of
the {\bf package} scheme is given at the start of part \ref{part:packages}.
\item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
Programming Environment Resource), within which the core numerics and pluggable
packages operate.
\end{enumerate}
51    
61    
62 adcroft 1.2
63 cnh 1.1 \begin{figure}
64     \begin{center}
65 adcroft 1.2 \resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
66 cnh 1.1 \end{center}
67 adcroft 1.2 \caption{
68     The MITgcm architecture is designed to allow simulation of a wide
69 cnh 1.1 range of physical problems on a wide range of hardware. The computational
70     resource requirements of the applications targeted range from around
71     $10^7$ bytes ( $\approx 10$ megabytes ) of memory to $10^{11}$ bytes
72     ( $\approx 100$ gigabytes). Arithmetic operation counts for the applications of
73     interest range from $10^{9}$ floating point operations to more than $10^{17}$
74 adcroft 1.2 floating point operations.}
75     \label{fig:mitgcm_architecture_goals}
76 cnh 1.1 \end{figure}
77    
78     \section{WRAPPER}
79    
80     A significant element of the software architecture utilized in
81     MITgcm is a software superstructure and substructure collectively
82     called the WRAPPER (Wrappable Application Parallel Programming
83     Environment Resource). All numerical and support code in MITgcm is written
84     to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).
88 cnh 1.1
89     The approach taken by the WRAPPER is illustrated in figure
90     \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
91     that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.
93 adcroft 1.2
94    
95 cnh 1.1 \begin{figure}
96     \begin{center}
97 adcroft 1.2 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
98 cnh 1.1 \end{center}
99 adcroft 1.2 \caption{
100 edhill 1.10 Numerical code is written to fit within a software support
101 cnh 1.1 infrastructure called WRAPPER. The WRAPPER is portable and
102 cnh 1.4 can be specialized for a wide range of specific target hardware and
103 cnh 1.1 programming environments, without impacting numerical code that fits
104     within the WRAPPER. Codes that fit within the WRAPPER can generally be
105     made to run as fast on a particular platform as codes specially
106 adcroft 1.2 optimized for that platform.}
107     \label{fig:fit_in_wrapper}
108 cnh 1.1 \end{figure}
109    
110     \subsection{Target hardware}
111 adcroft 1.6 \label{sect:target_hardware}
112 cnh 1.1
113     The WRAPPER is designed to target as broad as possible a range of computer
114     systems. The original development of the WRAPPER took place on a
115     multi-processor, CRAY Y-MP system. On that system, numerical code performance
116     and scaling under the WRAPPER was in excess of that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
118     micro-tasking approach. Later developments have been carried out on
119     uniprocessor and multi-processor Sun systems with both uniform memory access
120     (UMA) and non-uniform memory access (NUMA) designs. Significant work has also
121     been undertaken on x86 cluster systems, Alpha processor based clustered SMP
122     systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
123 edhill 1.10 The MITgcm code, operating within the WRAPPER, is also routinely used on
124 cnh 1.1 large scale MPP systems (for example T3E systems and IBM SP systems). In all
125     cases numerical code, operating within the WRAPPER, performs and scales very
126     competitively with equivalent numerical code that has been modified to contain
native optimizations for a particular system \ref{hoe-hill:99}.
128    
129     \subsection{Supporting hardware neutrality}
130    
131 adcroft 1.6 The different systems listed in section \ref{sect:target_hardware} can be
132 cnh 1.1 categorized in many different ways. For example, one common distinction is
133     between shared-memory parallel systems (SMP's, PVP's) and distributed memory
134     parallel systems (for example x86 clusters and large MPP systems). This is one
135     example of a difference between compute platforms that can impact an
136     application. Another common distinction is between vector processing systems
137     with highly specialized CPU's and memory subsystems and commodity
138     microprocessor based systems. There are numerous other differences, especially
139     in relation to how parallel execution is supported. To capture the essential
140     differences between different platforms the WRAPPER uses a {\it machine model}.
141    
142     \subsection{WRAPPER machine model}
143    
144     Applications using the WRAPPER are not written to target just one
145     particular machine (for example an IBM SP2) or just one particular family or
146     class of machines (for example Parallel Vector Processor Systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.
151    
152     \subsection{Machine model parallelism}
153    
154     Codes operating under the WRAPPER target an abstract machine that is assumed to
155     consist of one or more logical processors that can compute concurrently.
156 cnh 1.5 Computational work is divided among the logical
157 cnh 1.1 processors by allocating ``ownership'' to
158     each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific
region of the physical space that is being simulated; only one processor will
be associated with each such region (domain decomposition).
162    
163     In a strict sense the logical processors over which work is divided do not need
164     to correspond to physical processors. It is perfectly possible to execute a
165     configuration decomposed for multiple logical processors on a single physical
166     processor. This helps ensure that numerical code that is written to fit
167     within the WRAPPER will parallelize with no additional effort and is
168     also useful when debugging codes. Generally, however,
169     the computational domain will be subdivided over multiple logical
170     processors in order to then bind those logical processors to physical
171     processor resources that can compute in parallel.
172    
173     \subsubsection{Tiles}
174    
175     Computationally, associated with each region of physical
176     space allocated to a particular logical processor, there will be data
177     structures (arrays, scalar variables etc...) that hold the simulated state of
178     that region. We refer to these data structures as being {\bf owned} by the
179 cnh 1.4 processor to which their
180 cnh 1.1 associated region of physical space has been allocated. Individual
181     regions that are allocated to processors are called {\bf tiles}. A
182     processor can own more
183     than one tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a single
region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously, working
only with data from the tile (or tiles) that the processor owns. When multiple
tiles are allotted to a single processor, each tile is computed
independently of the other tiles, in a sequential fashion.
190    
191     \begin{figure}
192     \begin{center}
193 adcroft 1.2 \resizebox{5in}{!}{
194     \includegraphics{part4/domain_decomp.eps}
195 cnh 1.1 }
196     \end{center}
197     \caption{ The WRAPPER provides support for one and two dimensional
198     decompositions of grid-point domains. The figure shows a hypothetical domain of
199     total size $N_{x}N_{y}N_{z}$. This hypothetical domain is decomposed in
200     two-dimensions along the $N_{x}$ and $N_{y}$ directions. The resulting {\bf
201     tiles} are {\bf owned} by different processors. The {\bf owning}
202     processors perform the
203     arithmetic operations associated with a {\bf tile}. Although not illustrated
204     here, a single processor can {\bf own} several {\bf tiles}.
205     Whenever a processor wishes to transfer data between tiles or
206     communicate with other processors it calls a WRAPPER supplied
207     function.
208     } \label{fig:domaindecomp}
209     \end{figure}
210    
211    
212     \subsubsection{Tile layout}
213    
214     Tiles consist of an interior region and an overlap region. The overlap region
215     of a tile corresponds to the interior region of an adjacent tile.
216     In figure \ref{fig:tiledworld} each tile would own the region
217     within the black square and hold duplicate information for overlap
218     regions extending into the tiles to the north, south, east and west.
219     During
220     computational phases a processor will reference data in an overlap region
whenever it requires values that lie outside the domain it owns. Periodically
222     processors will make calls to WRAPPER functions to communicate data between
223     tiles, in order to keep the overlap regions up to date (see section
224 adcroft 1.6 \ref{sect:communication_primitives}). The WRAPPER functions can use a
225 cnh 1.1 variety of different mechanisms to communicate data between tiles.
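
In the MITgcm code this tiling is reflected directly in the declarations of
model arrays. The sketch below is illustrative only (the field name {\em
myField} is hypothetical); it shows the generic shape of a tiled
two-dimensional field, with an interior of {\em sNx} by {\em sNy} points, a
halo of width {\em OLx} and {\em OLy}, and one copy for each tile owned by
the process. The parameters themselves are defined in {\em SIZE.h},
described in section \ref{sect:specifying_a_decomposition}.
\begin{verbatim}
C     Sketch of a per-process array holding one 2-D field for every
C     tile owned by the process.  Interior points run 1..sNx, 1..sNy;
C     the overlap (halo) extends OLx and OLy points beyond this.
      REAL*8 myField(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
\end{verbatim}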
226    
227     \begin{figure}
228     \begin{center}
229 adcroft 1.2 \resizebox{5in}{!}{
230     \includegraphics{part4/tiled-world.eps}
231 cnh 1.1 }
232     \end{center}
233     \caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
235     Overlap regions are periodically updated from neighboring tiles.
236     } \label{fig:tiledworld}
237     \end{figure}
238    
239     \subsection{Communication mechanisms}
240    
241     Logical processors are assumed to be able to exchange information
242     between tiles and between each other using at least one of two possible
243     mechanisms.
244    
245     \begin{itemize}
246     \item {\bf Shared memory communication}.
247     Under this mode of communication data transfers are assumed to be possible
248     using direct addressing of regions of memory. In this case a CPU is able to read
(and write) directly to regions of memory ``owned'' by another CPU
using simple programming language level assignment operations of
the sort shown in figure \ref{fig:simple_assign}. In this way one CPU
252     (CPU1 in the figure) can communicate information to another CPU (CPU2 in the
253     figure) by assigning a particular value to a particular memory location.
254    
255     \item {\bf Distributed memory communication}.
256     Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned by and visible to another CPU. Instead
a communication library must be used, as illustrated in figure
\ref{fig:comm_msg}. In this case a CPU must call a function in the API of the
260     communication library to communicate data from a tile that it owns to a tile
261     that another CPU owns. By default the WRAPPER binds to the MPI communication
262     library \ref{MPI} for this style of communication.
263     \end{itemize}
264    
265     The WRAPPER assumes that communication will use one of these two styles
266     of communication. The underlying hardware and operating system support
267     for the style used is not specified and can vary from system to system.
268    
269     \begin{figure}
270     \begin{verbatim}
271    
272     CPU1 | CPU2
273     ==== | ====
274     |
275     a(3) = 8 | WHILE ( a(3) .NE. 8 )
276     | WAIT
277     | END WHILE
278     |
279     \end{verbatim}
280     \caption{ In the WRAPPER shared memory communication model, simple writes to an
281     array can be made to be visible to other CPU's at the application code level.
For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
283     element $3$ of array $a$, then other CPU's (for example CPU2 in the figure above)
284     will be able to see the value $8$ when they read from $a(3)$.
285     This provides a very low latency and high bandwidth communication
286     mechanism.
287     } \label{fig:simple_assign}
288     \end{figure}
289    
290     \begin{figure}
291     \begin{verbatim}
292    
293     CPU1 | CPU2
294     ==== | ====
295     |
296     a(3) = 8 | WHILE ( a(3) .NE. 8 )
297     CALL SEND( CPU2,a(3) ) | CALL RECV( CPU1, a(3) )
298     | END WHILE
299     |
300     \end{verbatim}
301     \caption{ In the WRAPPER distributed memory communication model
302     data can not be made directly visible to other CPU's.
303     If one CPU writes the value $8$ to element $3$ of array $a$, then
304     at least one of CPU1 and/or CPU2 in the figure above will need
305     to call a bespoke communication library in order for the updated
306     value to be communicated between CPU's.
307     } \label{fig:comm_msg}
308     \end{figure}
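
As a concrete counterpart to the schematic in figure \ref{fig:comm_msg}, the
fragment below shows the same single-value transfer written directly against
the MPI library that the WRAPPER binds to by default. It is a minimal,
stand-alone sketch, not WRAPPER code: within the WRAPPER such calls are
hidden behind the communication primitives described in section
\ref{sect:communication_primitives}.
\begin{verbatim}
      PROGRAM COMM_SKETCH
C     Minimal MPI point-to-point transfer corresponding to the
C     schematic SEND/RECV above.  Run with exactly two processes.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER myRank, ierr, status(MPI_STATUS_SIZE)
      REAL*8  a(3)
      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myRank, ierr )
      IF ( myRank .EQ. 0 ) THEN
C       "CPU1": set the value, then send it explicitly.
        a(3) = 8.
        CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION, 1, 99,
     &                 MPI_COMM_WORLD, ierr )
      ELSE
C       "CPU2": the value only becomes visible via the receive.
        CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION, 0, 99,
     &                 MPI_COMM_WORLD, status, ierr )
      ENDIF
      CALL MPI_FINALIZE( ierr )
      END
\end{verbatim}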
309    
310     \subsection{Shared memory communication}
311 adcroft 1.6 \label{sect:shared_memory_communication}
312 cnh 1.1
Under shared memory communication, independent CPU's operate
on the exact same global address space at the application level.
This means that CPU 1 can directly write into global
data structures that CPU 2 ``owns'' using a simple
assignment at the application level.
This model of memory access is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (the SGI Origin).
On such systems the WRAPPER will generally use simple read and write statements
to directly access application data structures when communicating between CPU's.
323    
In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to
hardware instructions that transport data between CPU and memory banks, this
can be a very efficient mechanism for communication. In this case two CPU's,
CPU1 and CPU2, can communicate simply by reading and writing to an
agreed location and following a few basic rules. The latency of this sort
of communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPU's communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
334     communication very efficient provided it is used appropriately.
335    
336     \subsubsection{Memory consistency}
337 adcroft 1.6 \label{sect:memory_consistency}
338 cnh 1.1
When using shared memory communication between
multiple processors, the WRAPPER level shields user applications from
certain counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general the order
of reads and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will always
operate so that, for the application instructions the processor is executing,
any reordering is not apparent. However, machines are often
designed so that reordering of instructions is not hidden from other
processors. This means that, in general, even on a shared memory system two
processors can observe inconsistent memory values.
351    
352     The issue of memory consistency between multiple processors is discussed at
length in many computer science papers; however, from a practical point of
354     view, in order to deal with this issue, shared memory machines all provide
355     some mechanism to enforce memory consistency when it is needed. The exact
356     mechanism employed will vary between systems. For communication using shared
357     memory, the WRAPPER provides a place to invoke the appropriate mechanism to
358     ensure memory consistency for a particular platform.
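
The exact mechanism the WRAPPER invokes is platform specific. Purely as an
illustration of the kind of ordering constraint involved (and not
necessarily the mechanism the WRAPPER itself uses), the sketch below uses
the OpenMP {\em FLUSH} directive to guarantee that a thread spinning on a
flag never sees the flag set before the data the flag guards; the variable
names are hypothetical.
\begin{verbatim}
      PROGRAM FLUSH_DEMO
C     Illustration only: order a data write before a flag write so
C     that a spinning reader never sees the flag without the data.
      IMPLICIT NONE
      REAL*8  sharedVal
      LOGICAL ready
      sharedVal = 0.
      ready     = .FALSE.
!$OMP PARALLEL SECTIONS SHARED(sharedVal,ready)
!$OMP SECTION
C     producer: write the data, make it visible, then raise the flag
      sharedVal = 8.
!$OMP FLUSH(sharedVal)
      ready = .TRUE.
!$OMP FLUSH(ready)
!$OMP SECTION
C     consumer: spin until the flag is visible, then read the data
   10 CONTINUE
!$OMP FLUSH(ready)
      IF ( .NOT. ready ) GOTO 10
!$OMP FLUSH(sharedVal)
      WRITE(*,*) 'consumer sees ', sharedVal
!$OMP END PARALLEL SECTIONS
      END
\end{verbatim}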
359    
360     \subsubsection{Cache effects and false sharing}
361 adcroft 1.6 \label{sect:cache_effects_and_false_sharing}
362 cnh 1.1
Shared-memory machines often have processor-local memory caches
364     which contain mirrored copies of main memory. Automatic cache-coherence
365     protocols are used to maintain consistency between caches on different
366     processors. These cache-coherence protocols typically enforce consistency
367     between regions of memory with large granularity (typically 128 or 256 byte
368     chunks). The coherency protocols employed can be expensive relative to other
369     memory accesses and so care is taken in the WRAPPER (by padding synchronization
370     structures appropriately) to avoid unnecessary coherence traffic.
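
As an illustration of this padding idea (the names and the assumed 128-byte
cache line below are hypothetical, not taken from the WRAPPER sources),
consider a set of per-thread counters. Packing them contiguously places
counters belonging to different threads on the same cache line, so every
update by one thread invalidates the line held by the others; padding each
counter out to a cache-line-sized stride avoids this false sharing.
\begin{verbatim}
      INTEGER MAX_THREADS, CACHE_LINE_WORDS
      PARAMETER ( MAX_THREADS = 32, CACHE_LINE_WORDS = 16 )

C     bad: counters for different threads share 128-byte cache lines
      REAL*8 busyCountBad( MAX_THREADS )

C     good: a stride of 16*8 = 128 bytes between successive counters;
C     thread myThid only ever touches busyCountPadded(1,myThid)
      REAL*8 busyCountPadded( CACHE_LINE_WORDS, MAX_THREADS )
\end{verbatim}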
371    
\subsubsection{Operating system support for shared memory}
373    
374     Applications running under multiple threads within a single process can
375     use shared memory communication. In this case {\it all} the memory locations
376     in an application are potentially visible to all the compute threads. Multiple
377     threads operating within a single process is the standard mechanism for
378     supporting shared memory that the WRAPPER utilizes. Configuring and launching
379     code to run in multi-threaded mode on specific platforms is discussed in
380 adcroft 1.6 section \ref{sect:running_with_threads}. However, on many systems, potentially
381 cnh 1.1 very efficient mechanisms for using shared memory communication between
382     multiple processes (in contrast to multiple threads within a single
383     process) also exist. In most cases this works by making a limited region of
384     memory shared between processes. The MMAP \ref{magicgarden} and
385     IPC \ref{magicgarden} facilities in UNIX systems provide this capability as do
386     vendor specific tools like LAPI \ref{IBMLAPI} and IMC \ref{Memorychannel}.
387     Extensions exist for the WRAPPER that allow these mechanisms
388     to be used for shared memory communication. However, these mechanisms are not
389     distributed with the default WRAPPER sources, because of their proprietary
390     nature.
391    
392     \subsection{Distributed memory communication}
393 adcroft 1.6 \label{sect:distributed_memory_communication}
394 cnh 1.1 Many parallel systems are not constructed in a way where it is
395     possible or practical for an application to use shared memory
for communication. For example, cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
at the system level. For this sort of system the WRAPPER provides support
for communication based on a bespoke communication library
(see figure \ref{fig:comm_msg}). The default communication library used is MPI
\ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform specific communication libraries. For example, the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.
405    
406     \subsection{Communication primitives}
407 adcroft 1.6 \label{sect:communication_primitives}
408 cnh 1.1
409     \begin{figure}
410     \begin{center}
411 adcroft 1.2 \resizebox{5in}{!}{
412     \includegraphics{part4/comm-primm.eps}
413 cnh 1.1 }
414     \end{center}
415 cnh 1.5 \caption{Three performance critical parallel primitives are provided
416     by the WRAPPER. These primitives are always used to communicate data
417 cnh 1.1 between tiles. The figure shows four tiles. The curved arrows indicate
418     exchange primitives which transfer data between the overlap regions at tile
419     edges and interior regions for nearest-neighbor tiles.
420     The straight arrows symbolize global sum operations which connect all tiles.
The global sum operation provides a key arithmetic primitive and can also
serve as a synchronization primitive. A third, barrier primitive is also
provided; it behaves much like the global sum primitive.
424     } \label{fig:communication_primitives}
425     \end{figure}
426    
427    
Optimized communication support is assumed to be available
for a small number of communication operations.
It is assumed that communication performance optimizations can
be achieved by optimizing this small number of communication primitives.
Three optimizable primitives are provided by the WRAPPER; a schematic
example of their use follows the list below.
\begin{itemize}
434     \item{\bf EXCHANGE} This operation is used to transfer data between interior
435     and overlap regions of neighboring tiles. A number of different forms of this
436     operation are supported. These different forms handle
437     \begin{itemize}
438     \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
439     separately.
440     \item Bindings to different communication methods.
441     Exchange primitives select between using shared memory or distributed
442     memory communication.
443     \item Transformation operations required when transporting
444     data between different grid regions. Transferring data
445     between faces of a cube-sphere grid, for example, involves a rotation
446     of vector components.
447     \item Forward and reverse mode computations. Derivative calculations require
448     tangent linear and adjoint forms of the exchange primitives.
449    
450     \end{itemize}
451    
452     \item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
453     operation for the pressure inversion phase of the MITgcm algorithm.
454     For certain configurations scaling can be highly sensitive to
455     the performance of the global sum primitive. This operation is a collective
456     operation involving all tiles of the simulated domain. Different forms
457     of the global sum primitive exist for handling
458     \begin{itemize}
459     \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
460     separately.
461     \item Bindings to different communication methods.
Global sum primitives select between using shared memory or distributed
memory communication.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the global sum primitive.
466     \end{itemize}
467    
468     \item{\bf BARRIER} The WRAPPER provides a global synchronization function
469     called barrier. This is used to synchronize computations over all tiles.
470     The {\bf BARRIER} and {\bf GLOBAL SUM} primitives have much in common and in
471     some cases use the same underlying code.
472     \end{itemize}
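
The typical calling pattern alternates tile-local computation inside a {\em
bi},{\em bj} loop with calls to these primitives, as sketched schematically
below. The call to {\em GLOBAL\_SUM\_R8} follows the form used later in this
chapter (figure \ref{fig:bibj_extract}); the exchange and barrier routine
names and the field names are illustrative only, since the exact form used
depends on the data type and grid.
\begin{verbatim}
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
C       ... compute new interior values of phi(:,:,bi,bj) ...
       ENDDO
      ENDDO
C     EXCHANGE: refresh overlap regions from neighboring tiles
      CALL EXCH_XY_R8( phi, myThid )
C     GLOBAL SUM: collective arithmetic involving all tiles
      CALL GLOBAL_SUM_R8( phiSum, myThid )
C     BARRIER: synchronize all threads and processes
      CALL BARRIER( myThid )
\end{verbatim}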
473    
474    
475     \subsection{Memory architecture}
476    
The WRAPPER machine model aims to efficiently target both systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.
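
As a purely illustrative example, the same hypothetical $90 \times 40$
grid-point domain could be tiled in either of the following ways (the
parameters are those of {\em SIZE.h}, described in section
\ref{sect:specifying_a_decomposition}; the values are chosen only to
contrast the two tile shapes):
\begin{verbatim}
C     vector system, a few long thin tiles with long inner (x) loops:
C        sNx=90, sNy= 5, nSx=1, nSy=8     ( 8 tiles of 90 x  5 )
C
C     cache-based microprocessor, many compact tiles:
C        sNx=15, sNy=10, nSx=6, nSy=4     ( 24 tiles of 15 x 10 )
\end{verbatim}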
488    
489    
490     \begin{figure}
491     \begin{center}
492 adcroft 1.2 \resizebox{5in}{!}{
493     \includegraphics{part4/tiling_detail.eps}
494 cnh 1.1 }
495     \end{center}
496     \caption{The tiling strategy that the WRAPPER supports allows tiles
497     to be shaped to suit the underlying system memory architecture.
498     Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies, while long tiles
500     with large inner loops can be used to exploit vector systems having
501     highly pipelined memory systems.
502     } \label{fig:tiling-strategy}
503     \end{figure}
504    
505     \newpage
506     \subsection{Summary}
507     Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:
509    
510     \begin{itemize}
511     \item The machine consists of one or more logical processors. \vspace{-3mm}
512     \item Each processor operates on tiles that it owns.\vspace{-3mm}
513     \item A processor may own more than one tile.\vspace{-3mm}
514     \item Processors may compute concurrently.\vspace{-3mm}
515     \item Exchange of information between tiles is handled by the
516     machine (WRAPPER) not by the application.
517     \end{itemize}
518     Behind the scenes this allows the WRAPPER to adapt the machine model
519     functions to exploit hardware on which
520     \begin{itemize}
521     \item Processors may be able to communicate very efficiently with each other
522     using shared memory. \vspace{-3mm}
523     \item An alternative communication mechanism based on a relatively
524     simple inter-process communication API may be required.\vspace{-3mm}
525     \item Shared memory may not necessarily obey sequential consistency,
526     however some mechanism will exist for enforcing memory consistency.
527     \vspace{-3mm}
528     \item Memory consistency that is enforced at the hardware level
529     may be expensive. Unnecessary triggering of consistency protocols
530     should be avoided. \vspace{-3mm}
\item Memory access patterns may need to be either repetitive or highly
532     pipelined for optimum hardware performance. \vspace{-3mm}
533     \end{itemize}
534    
535     This generic model captures the essential hardware ingredients
536     of almost all successful scientific computer systems designed in the
537     last 50 years.
538    
539     \section{Using the WRAPPER}
540    
541     In order to support maximum portability the WRAPPER is implemented primarily
542     in sequential Fortran 77. At a practical level the key steps provided by the
WRAPPER are:
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
547     \item controlling communication between tiles and between concurrently
548     computing CPU's.
549     \end{enumerate}
550     This section describes the details of each of these operations.
551 adcroft 1.6 Section \ref{sect:specifying_a_decomposition} explains how the way in which
552 cnh 1.1 a domain is decomposed (or composed) is expressed. Section
553 adcroft 1.6 \ref{sect:starting_a_code} describes practical details of running codes
554 cnh 1.1 in various different parallel modes on contemporary computer systems.
555 adcroft 1.6 Section \ref{sect:controlling_communication} explains the internal information
556 cnh 1.1 that the WRAPPER uses to control how information is communicated between
557     tiles.
558    
559     \subsection{Specifying a domain decomposition}
560 adcroft 1.6 \label{sect:specifying_a_decomposition}
561 cnh 1.1
562     At its heart much of the WRAPPER works only in terms of a collection of tiles
563     which are interconnected to each other. This is also true of application
564     code operating within the WRAPPER. Application code is written as a series
565     of compute operations, each of which operates on a single tile. If
566     application code needs to perform operations involving data
567     associated with another tile, it uses a WRAPPER function to obtain
568     that data.
569     The specification of how a global domain is constructed from tiles or alternatively
570     how a global domain is decomposed into tiles is made in the file {\em SIZE.h}.
571     This file defines the following parameters \\
572    
573     \fbox{
574     \begin{minipage}{4.75in}
575     Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
576     File: {\em model/inc/SIZE.h}
577     \end{minipage}
578     } \\
579    
580     Together these parameters define a tiling decomposition of the style shown in
581     figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
582     the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
583     maximum size of the overlap extent. This must be set to the maximum width
584     of the computation stencil that the numerical code finite-difference operations
585     require between overlap region updates. The maximum overlap required
586     by any of the operations in the MITgcm code distributed with this release is three grid
587     points. This is set by the requirements of the $\nabla^4$ dissipation and
588     diffusion operator. Code modifications and enhancements that involve adding wide
589     finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
592     not cause any other problems.
593    
594     \begin{figure}
595     \begin{center}
596 adcroft 1.2 \resizebox{5in}{!}{
597     \includegraphics{part4/size_h.eps}
598 cnh 1.1 }
599     \end{center}
600     \caption{ The three level domain decomposition hierarchy employed by the
601     WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
602     to a single process. Multiple processes can exist, each with multiple tiles.
603     Tiles within a process can be spread over multiple compute threads.
604     } \label{fig:labelled_tile}
605     \end{figure}
606    
607     The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
608     be created within a single process. Each of these tiles will have internal
609     dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
610     allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
613 cnh 1.1 computation will be performed concurrently on each tile. However, it is also
614     possible to run the same decomposition within a process running a single thread on
615     a single processor. In this case the tiles will be computed over sequentially.
616     If the decomposition is run in a single process running multiple threads
617     but attached to a single physical processor, then, in general, the computation
618     for different tiles will be interleaved by system level software.
619     This too is a valid mode of operation.
620    
The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
622     numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
623     are used to form the loop ranges for many numerical calculations and
624     to provide dimensions for arrays holding numerical state.
The {\em nSx} and {\em nSy} are used in conjunction with the thread number
626     parameter {\em myThid}. Much of the numerical code operating within the
627     WRAPPER takes the form
628     \begin{verbatim}
629     DO bj=myByLo(myThid),myByHi(myThid)
630     DO bi=myBxLo(myThid),myBxHi(myThid)
631     :
632     a block of computations ranging
633     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
634     :
635     ENDDO
636     ENDDO
637    
638     communication code to sum a number or maybe update
639     tile overlap regions
640    
641     DO bj=myByLo(myThid),myByHi(myThid)
642     DO bi=myBxLo(myThid),myBxHi(myThid)
643     :
644     another block of computations ranging
645     over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
646     :
647     ENDDO
648     ENDDO
649     \end{verbatim}
650     The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
651     myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
652     schematic. These variables specify the subset of the tiles in
653     the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
654     thread number {\em myThid} owns. The thread number variable {\em myThid}
655     ranges from 1 to the total number of threads requested at execution time.
656     For each value of {\em myThid} the loop scheme above will step sequentially
657     through the tiles owned by that thread. However, different threads will
658     have different ranges of tiles assigned to them, so that separate threads can
659     compute iterations of the {\em bi}, {\em bj} loop concurrently.
660     Within a {\em bi}, {\em bj} loop
661     computation is performed concurrently over as many processes and threads
662     as there are physical processors available to compute.
663    
An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs over
{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
668     which is then used to reference exchange parameters.
669    
The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example the evaluation of all
the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.
684    
685     \begin{figure}
686     \begin{verbatim}
687     REAL*8 cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
688     REAL*8 err
689     :
690     :
691     other computations
692     :
693     :
694     err = 0.
695     DO bj=myByLo(myThid),myByHi(myThid)
696     DO bi=myBxLo(myThid),myBxHi(myThid)
697     DO J=1,sNy
698     DO I=1,sNx
699     err = err +
700     & cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
701     ENDDO
702     ENDDO
703     ENDDO
704     ENDDO
705    
706     CALL GLOBAL_SUM_R8( err , myThid )
707     err = SQRT(err)
708    
709     \end{verbatim}
710     \caption{Example of numerical code for calculating
711     the l2-norm of a vector within the WRAPPER. Notice that
712     under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These right-most indices are tile indexes. Different
threads within a single process operate on different ranges of tile
index, as controlled by the settings of
716     {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
717     } \label{fig:bibj_extract}
718     \end{figure}
719    
720     The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
721     are used to indicate to the WRAPPER level how many processes (each with
722     {\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
723     This information is needed during initialization and during I/O phases.
724     However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy}
725     the values of {\em nPx} and {\em nPy} are absent
726     from the core numerical and support code.
727    
728     \subsubsection{Examples of {\em SIZE.h} specifications}
729    
The following different {\em SIZE.h} parameter settings illustrate how to
731     interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
732     and {\em nPy}.
733     \begin{enumerate}
734     \item
735     \begin{verbatim}
736     PARAMETER (
737     & sNx = 90,
738     & sNy = 40,
739     & OLx = 3,
740     & OLy = 3,
741     & nSx = 1,
742     & nSy = 1,
743     & nPx = 1,
744     & nPy = 1)
745     \end{verbatim}
746     This sets up a single tile with x-dimension of ninety grid points, y-dimension of
747     forty grid points, and x and y overlaps of three grid points each.
748     \item
749     \begin{verbatim}
750     PARAMETER (
751     & sNx = 45,
752     & sNy = 20,
753     & OLx = 3,
754     & OLy = 3,
755     & nSx = 1,
756     & nSy = 1,
757     & nPx = 2,
758     & nPy = 2)
759     \end{verbatim}
760     This sets up tiles with x-dimension of forty-five grid points, y-dimension of
761     twenty grid points, and x and y overlaps of three grid points each. There are
762     four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
763     arranged so that the global domain size is again ninety grid points in x and
764     forty grid points in y. In general the formula for global grid size (held in
765     model variables {\em Nx} and {\em Ny}) is
766     \begin{verbatim}
767     Nx = sNx*nSx*nPx
768     Ny = sNy*nSy*nPy
769     \end{verbatim}
770     \item
771     \begin{verbatim}
772     PARAMETER (
773     & sNx = 90,
774     & sNy = 10,
775     & OLx = 3,
776     & OLy = 3,
777     & nSx = 1,
778     & nSy = 2,
779     & nPx = 1,
780     & nPy = 2)
781     \end{verbatim}
782     This sets up tiles with x-dimension of ninety grid points, y-dimension of
783     ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively, if the code is invoked with multiple threads per process,
the two domains in y may be computed concurrently.
791 cnh 1.1 \item
792     \begin{verbatim}
793     PARAMETER (
794     & sNx = 32,
795     & sNy = 32,
796     & OLx = 3,
797     & OLy = 3,
798     & nSx = 6,
799     & nSy = 1,
800     & nPx = 1,
801     & nPy = 1)
802     \end{verbatim}
803     This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
804     thirty-two grid points, and x and y overlaps of three grid points each.
805     There are six tiles allocated to six separate logical processors ({\em nSx=6}).
806     This set of values can be used for a cube sphere calculation.
807     Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
810 cnh 1.1 moving between the six cube faces to be embedded within the
811     tile-tile communication code.
812     \end{enumerate}
813    
814    
815     \subsection{Starting the code}
816 adcroft 1.6 \label{sect:starting_the_code}
817 cnh 1.1 When code is started under the WRAPPER, execution begins in a main routine {\em
818     eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
819     to the application through a routine called {\em THE\_MODEL\_MAIN()}
820     once the WRAPPER has initialized correctly and has created the necessary variables
821     to support subsequent calls to communication routines
822     by the application code. The startup calling sequence followed by the
823     WRAPPER is shown in figure \ref{fig:wrapper_startup}.
824    
825     \begin{figure}
826 heimbach 1.8 {\footnotesize
827 cnh 1.1 \begin{verbatim}
828    
MAIN
|
|--EEBOOT :: WRAPPER initialization
|  |
|  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
|  |                     allow basic I/O.
|  |-- EEINTRO_MSG :: Write startup greeting.
|  |
|  |-- EESET_PARMS :: Set WRAPPER parameters
|  |
|  |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
|  |
|  |-- INI_PROCS :: Associate processes with grid regions.
|  |
|  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
|  |
|  |-- INI_COMMUNICATION_PATTERNS :: Initialize between tile
|                                 :: communication data structures
|
|
|--CHECK_THREADS :: Validate multiple thread start up.
|
|--THE_MODEL_MAIN :: Numerical code top-level driver routine
852    
853    
854     \end{verbatim}
855 heimbach 1.8 }
856 cnh 1.1 \caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
858     occurs through the procedure {\em THE\_MODEL\_MAIN()}.
859     } \label{fig:wrapper_startup}
860     \end{figure}
861    
862     \subsubsection{Multi-threaded execution}
863 adcroft 1.6 \label{sect:multi-threaded-execution}
864 cnh 1.1 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse-grain threads to be initialized. The routine
866     {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
867     stack argument which is the thread number, stored in the
868     variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the following
871     steps.\\
872    
873     \paragraph{Compilation}
874     First the code must be compiled with appropriate multi-threading directives
875     active in the file {\em main.F} and with appropriate compiler flags
876     to request multi-threading support. The header files
877     {\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
878     contain directives compatible with compilers for Sun, Compaq, SGI,
879     Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
880     activated by using compile time
881     directives {\em -DTARGET\_SUN},
882     {\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
883     or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
884     for invoking multi-threaded compilation vary from system to system
885     and from compiler to compiler. The options will be described
886     in the individual compiler documentation. For the Fortran compiler
887     from Sun the following options are needed to correctly compile
multi-threaded code:
889     \begin{verbatim}
890     -stackvar -explicitpar -vpara -noautopar
891     \end{verbatim}
892     These options are specific to the Sun compiler. Other compilers
893     will use different syntax that will be described in their
894     documentation. The effect of these options is as follows
895     \begin{enumerate}
896     \item {\bf -stackvar} Causes all local variables to be allocated in stack
897     storage. This is necessary for local variables to ensure that they are private
898     to their thread. Note, when using this option it may be necessary to override
899     the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the settings of the command shell's
901     {\em stack-size} limit variable. However, on some systems changing this limit
902     will require privileged administrator access to modify system parameters.
903    
904     \item {\bf -explicitpar} Requests that multiple threads be spawned
905     in response to explicit directives in the application code. These
906     directives are inserted with syntax appropriate to the particular target
907     platform when, for example, the {\em -DTARGET\_SUN} flag is selected.
908    
909     \item {\bf -vpara} This causes the compiler to describe the multi-threaded
910     configuration it is creating. This is not required
but it can be useful when troubleshooting.
912    
913     \item {\bf -noautopar} This inhibits any automatic multi-threaded
914     parallelization the compiler may otherwise generate.
915    
916     \end{enumerate}
917    
918    
919     An example of valid settings for the {\em eedata} file for a
920     domain with two subdomains in y and running with two threads is shown
921     below
922     \begin{verbatim}
923     nTx=1,nTy=2
924     \end{verbatim}
925     This set of values will cause computations to stay within a single
926     thread when moving across the {\em nSx} sub-domains. In the y-direction,
927     however, sub-domains will be split equally between two threads.
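In the {\em eedata} files distributed with the model these values are set
within a namelist; a minimal sketch of such a file, assuming the standard
{\em EEPARMS} namelist layout, is
\begin{verbatim}
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}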
928    
929     \paragraph{Multi-threading files and parameters} The following
930     files and variables are used in setting up multi-threaded execution.\\
931    
932     \fbox{
933     \begin{minipage}{4.75in}
934     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
935     File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
936     File: {\em model/src/THE\_MODEL\_MAIN.F}\\
937     File: {\em eesupp/src/MAIN.F}\\
938 afe 1.11 File: {\em tools/genmake2}\\
939 cnh 1.1 File: {\em eedata}\\
940     CPP: {\em TARGET\_SUN}\\
941     CPP: {\em TARGET\_DEC}\\
942     CPP: {\em TARGET\_HP }\\
943     CPP: {\em TARGET\_SGI}\\
944     CPP: {\em TARGET\_CRAY\_VECTOR}\\
945     Parameter: {\em nTx}\\
946     Parameter: {\em nTy}
947     \end{minipage}
948     } \\
949    
950     \subsubsection{Multi-process execution}
951 adcroft 1.6 \label{sect:multi-process-execution}
952 cnh 1.1
953     Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
is that many system libraries are still not ``thread-safe''. This means that, for
example, on some systems it is not safe to call system routines to
do I/O when running in multi-threaded mode, except in a limited set of
958     circumstances. Another reason is that support for multi-threaded programming
959     models varies between systems.
960    
961     Multi-process execution is more ubiquitous.
962     In order to run code in a multi-process configuration a decomposition
963 adcroft 1.6 specification ( see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
965 cnh 1.1 parameters {\em nPx} or {\em nPy} will be greater than one)
966     and then, as for multi-threaded operation,
967     appropriate compile time and run time steps must be taken.
968    
969     \paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that portable MPI libraries are available
for controlling the start-up of multiple processes. The MPI libraries
are not required for performance
critical communication, although they are usually used for it. However, in order to simplify the task
of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. Currently MPI libraries are
invoked by
specifying the appropriate options file with the
{\tt -of} flag when running the {\em genmake2}
981 afe 1.12 script, which generates the Makefile for compiling and linking MITgcm.
982     (Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
983     {\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More
984     detailed information about the use of {\em genmake2} for specifying
985 afe 1.15 local compiler flags is located in section \ref{sect:genmake}.\\
986 cnh 1.1
987    
988     \fbox{
989     \begin{minipage}{4.75in}
990 afe 1.15 Directory: {\em tools/build\_options}\\
991 afe 1.11 File: {\em tools/genmake2}
992 cnh 1.1 \end{minipage}
993     } \\
994     \paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
996     associated with the distribution of MPI installed on a system will
997     describe how to start a program using that distribution.
998     For the free, open-source MPICH system the MITgcm program is started
999     using a command such as
1000     \begin{verbatim}
1001     mpirun -np 64 -machinefile mf ./mitgcmuv
1002     \end{verbatim}
1003 cnh 1.5 In this example the text {\em -np 64} specifies the number of processes
1004 cnh 1.1 that will be created. The numeric value {\em 64} must be equal to the
1005     product of the processor grid settings of {\em nPx} and {\em nPy}
1006     in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
1007     called ``mf'' will be read to get a list of processor names on
1008     which the sixty-four processes will execute. The syntax of this file
1009 afe 1.15 is specified by the MPI distribution.
1010 cnh 1.1 \\
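
For the MPICH example above, the machine file is typically just a list of
host names, one per line, each optionally followed by a colon and the number
of processes to place on that host. The file below is a hypothetical sketch;
the exact syntax is, as noted, defined by the MPI distribution in use.
\begin{verbatim}
node001
node002
node003:2
node004:2
\end{verbatim}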
1011    
1012     \fbox{
1013     \begin{minipage}{4.75in}
1014     File: {\em SIZE.h}\\
1015     Parameter: {\em nPx}\\
1016     Parameter: {\em nPy}
1017     \end{minipage}
1018     } \\
1019    
1020 adcroft 1.2
1021     \paragraph{Environment variables}
1022     On most systems multi-threaded execution also requires the setting
1023     of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
1025     of parallel threads required. Generally the help pages associated
1026     with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.
1028    
1029     \paragraph{Runtime input parameters}
1030     Finally the file {\em eedata} needs to be configured to indicate
1031     the number of threads to be used in the x and y directions.
1032     The variables {\em nTx} and {\em nTy} in this file are used to
1033     specify the information required. The product of {\em nTx} and
1034     {\em nTy} must be equal to the number of threads spawned i.e.
1035     the setting of the environment variable PARALLEL.
1036     The value of {\em nTx} must subdivide the number of sub-domains
1037     in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
1038     number of sub-domains in y ({\em nSy}) exactly.
1039 cnh 1.1 The multiprocess startup of the MITgcm executable {\em mitgcmuv}
1040     is controlled by the routines {\em EEBOOT\_MINIMAL()} and
1041     {\em INI\_PROCS()}. The first routine performs basic steps required
1042     to make sure each process is started and has a textual output
1043     stream associated with it. By default two output files are opened
1044     for each process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.
1045     The {\bf NNNNN} part of the name is filled in with the process
The {\bf NNNN} part of the name is filled in with the process
1047     {\bf STDOUT.0000} and {\bf STDERR.0000}, process number 1 will create
1048     output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
1049     are used for reporting status and configuration information and
1050     for reporting error conditions on a process by process basis.
1051 adcroft 1.2 The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
1052 cnh 1.1 {\em myProcId} and {\em MPI\_COMM\_MODEL}.
1053     These variables are related
to processor identification and are used later in the routine
1055     {\em INI\_PROCS()} to allocate tiles to processes.
1056    
Allocation of processes to tiles is controlled by the routine
1058     {\em INI\_PROCS()}. For each process this routine sets
1059     the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
1060 afe 1.15 These variables specify in index space the coordinates
1061     of the southernmost and westernmost corner of the
1062     southernmost and westernmost tile owned by this process.
1063 cnh 1.1 The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
1064     are also set in this routine. These are used to identify
1065     processes holding tiles to the west, east, south and north
1066     of this process. These values are stored in global storage
1067     in the header file {\em EESUPPORT.h} for use by
1068 afe 1.13 communication routines. The above does not hold when the
1069     exch2 package is used -- exch2 sets its own parameters to
1070     specify the global indices of tiles and their relationships
1071 afe 1.15 to each other. See the documentation on the exch2 package
1072     (\ref{sec:exch2}) for
1073     details.
1074 cnh 1.1 \\
1075    
1076     \fbox{
1077     \begin{minipage}{4.75in}
1078     File: {\em eesupp/src/eeboot\_minimal.F}\\
1079     File: {\em eesupp/src/ini\_procs.F} \\
1080     File: {\em eesupp/inc/EESUPPORT.h} \\
1081     Parameter: {\em myProcId} \\
1082     Parameter: {\em MPI\_COMM\_MODEL} \\
1083     Parameter: {\em myXGlobalLo} \\
1084     Parameter: {\em myYGlobalLo} \\
1085     Parameter: {\em pidW } \\
1086     Parameter: {\em pidE } \\
1087     Parameter: {\em pidS } \\
1088     Parameter: {\em pidN }
1089     \end{minipage}
1090     } \\
1091    
1092    
1093     \subsection{Controlling communication}
1094     The WRAPPER maintains internal information that is used for communication
1095     operations and that can be customized for different platforms. This section
1096     describes the information that is held and used.
1097 adcroft 1.2
1098 cnh 1.1 \begin{enumerate}
\item {\bf Tile-tile connectivity information}
For each tile the WRAPPER
records the tile numbers of the neighboring tiles to the north,
south, east and
west of that tile. Each tile number is unique over all tiles in a
configuration. Except when using the cubed sphere and the exch2 package,
these numbers are held in the variables {\em tileNo}
( this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
1108     that specifies the type of communication that is used between tiles.
1109     This information is held in the variables {\em tileCommModeN},
1110     {\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.
1111     This latter set of variables can take one of the following values
1112     {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
1113     A value of {\em COMM\_NONE} is used to indicate that a tile has no
1114 cnh 1.4 neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
1116     memory communication is required to communicate between
1117 adcroft 1.6 these tile faces ( see section \ref{sect:distributed_memory_communication}).
1118 cnh 1.1 A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
1119     forms of shared memory communication ( see section
1120 adcroft 1.6 \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
1121 cnh 1.1 that a CPU should communicate by writing to data structures owned by another
1122     CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
1123     from data structures owned by another CPU. These flags affect the behavior
1124     of the WRAPPER exchange primitive
1125     (see figure \ref{fig:communication_primitives}). The routine
1126     {\em ini\_communication\_patterns()} is responsible for setting the
1127     communication mode values for each tile.
1128 afe 1.14
When using the cubed sphere configuration with the exch2 package, the
relationships between tiles and their communication methods are set
by the package in its own variables. See the exch2 package documentation
(section \ref{sec:exch2}) for details.
1133 afe 1.14
1134    
1135 cnh 1.1
1136     \fbox{
1137     \begin{minipage}{4.75in}
1138     File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1139     File: {\em eesupp/inc/EESUPPORT.h} \\
1140     Parameter: {\em tileNo} \\
1141     Parameter: {\em tileNoE} \\
1142     Parameter: {\em tileNoW} \\
1143     Parameter: {\em tileNoN} \\
1144     Parameter: {\em tileNoS} \\
1145     Parameter: {\em tileCommModeE} \\
1146     Parameter: {\em tileCommModeW} \\
1147     Parameter: {\em tileCommModeN} \\
1148     Parameter: {\em tileCommModeS} \\
1149     \end{minipage}
1150     } \\
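To illustrate how these communication-mode flags are intended to be
used, the following schematic sketches how an exchange routine might
select the method for the western face of a tile. This is not an
extract of the WRAPPER source, and the (bi,bj) tile indexing of
{\em tileCommModeW} is an assumption made purely for illustration.
\begin{verbatim}
C     Schematic only: choose the communication method for the
C     western face of tile (bi,bj).
      IF     ( tileCommModeW(bi,bj) .EQ. COMM_NONE ) THEN
C       No neighbor on this face - nothing to exchange
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_MSG  ) THEN
C       Distributed memory: exchange overlap data by message passing
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT  ) THEN
C       Shared memory: write overlap data into the neighbor's buffers
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_GET  ) THEN
C       Shared memory: read overlap data from the neighbor's buffers
      ENDIF
\end{verbatim}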
1151    
1152     \item {\bf MP directives}
The WRAPPER transfers control to numerical application code through
the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
that allows it to be invoked by several threads. Support for this
is based on using multi-processing (MP) compiler directives.
Most commercially available Fortran compilers support the generation
of code to spawn multiple threads through some form of compiler directives.
Because this is generally much more convenient than writing code that
interfaces to operating system libraries to explicitly spawn threads, and
because on some systems it may be the only method available, the WRAPPER
is distributed with template MP directives for a number of systems.
1163    
1164     These directives are inserted into the code just before and after the
1165     transfer of control to numerical algorithm code through the routine
1166     {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows an example of
1167     the code that performs this process for a Silicon Graphics system.
1168     This code is extracted from the files {\em main.F} and
1169     {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads} specifies
1170     how many instances of the routine {\em THE\_MODEL\_MAIN} will
1171     be created. The value of {\em nThreads} is set in the routine
{\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
product of the parameters {\em nTx} and {\em nTy} that
1174     are read from the file {\em eedata}. If the value of {\em nThreads}
1175     is inconsistent with the number of threads requested from the
1176     operating system (for example by using an environment
1177 adcroft 1.6 variable as described in section \ref{sect:multi_threaded_execution})
1178 cnh 1.1 then usually an error will be reported by the routine
1179     {\em CHECK\_THREADS}.\\
1180    
1181     \fbox{
1182     \begin{minipage}{4.75in}
1183     File: {\em eesupp/src/ini\_threading\_environment.F}\\
1184     File: {\em eesupp/src/check\_threads.F} \\
1185     File: {\em eesupp/src/main.F} \\
1186     File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1187     File: {\em eedata } \\
1188     Parameter: {\em nThreads} \\
1189     Parameter: {\em nTx} \\
1190     Parameter: {\em nTy} \\
1191     \end{minipage}
1192     }
1193    
1194     \item {\bf memsync flags}
1195 adcroft 1.6 As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
1197     The routine {\em MEMSYNC()} is used for this purpose. This routine should
1198     not need modifying and the information below is only provided for
1199     completeness. A logical parameter {\em exchNeedsMemSync} set
1200     in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether
1201     the {\em MEMSYNC()} primitive is called. In general this
1202     routine is only used for multi-threaded execution.
1203     The code that goes into the {\em MEMSYNC()}
1204     routine is specific to the compiler and
1205     processor being used for multi-threaded execution and in general
1206     must be written using a short code snippet of assembly language.
1207     For an Ultra Sparc system the following code snippet is used
1208     \begin{verbatim}
1209     asm("membar #LoadStore|#StoreStore");
1210     \end{verbatim}
for an Alpha-based system the equivalent code reads
1212 cnh 1.1 \begin{verbatim}
1213     asm("mb");
1214     \end{verbatim}
1215     while on an x86 system the following code is required
1216     \begin{verbatim}
1217     asm("lock; addl $0,0(%%esp)": : :"memory")
1218     \end{verbatim}
1219    
1220     \item {\bf Cache line size}
1221 adcroft 1.6 As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
1223 cnh 1.5 coherence traffic on an SMP system. To do this the shared memory data structures
1224 cnh 1.1 used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
1225     are padded. The variables that control the padding are set in the
1226     header file {\em EEPARAMS.h}. These variables are called
1227     {\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
1228     {\em lShare8}. The default values should not normally need changing.
1229     \item {\bf \_BARRIER}
1230     This is a CPP macro that is expanded to a call to a routine
1231 cnh 1.5 which synchronizes all the logical processors running under the
1232 cnh 1.1 WRAPPER. Using a macro here preserves flexibility to insert
1233     a specialized call in-line into application code. By default this
1234     resolves to calling the procedure {\em BARRIER()}. The default
1235     setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1236    
1237     \item {\bf \_GSUM}
1238     This is a CPP macro that is expanded to a call to a routine
1239 cnh 1.5 which sums up a floating point number
1240 cnh 1.1 over all the logical processors running under the
1241     WRAPPER. Using a macro here provides extra flexibility to insert
1242     a specialized call in-line into application code. By default this
1243 cnh 1.5 resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
1244     64-bit floating point operands)
1245     or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1246 cnh 1.1 setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
1247     The \_GSUM macro is a performance critical operation, especially for
1248     large processor count, small tile size configurations.
1249 adcroft 1.6 The custom communication example discussed in section \ref{sect:jam_example}
1250 cnh 1.1 shows how the macro is used to invoke a custom global sum routine
1251     for a specific set of hardware.
1252    
1253     \item {\bf \_EXCH}
1254     The \_EXCH CPP macro is used to update tile overlap regions.
1255     It is qualified by a suffix indicating whether overlap updates are for
1256     two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1257     physical fields and whether fields are 32-bit floating point
1258     ( \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point
1259     ( \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined
1260     in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
1261     \_EXCH operation plays a crucial role in scaling to small tile,
1262     large logical and physical processor count configurations.
1263 adcroft 1.6 The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.
1265 cnh 1.1
1266     The \_EXCH operation is also central to supporting grids such as
1267     the cube-sphere grid. In this class of grid a rotation may be required
1268     between tiles. Aligning the coordinate requiring rotation with the
1269 cnh 1.5 tile decomposition, allows the coordinate transformation to
1270 afe 1.16 be embedded within a custom form of the \_EXCH primitive. In these
1271     cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
1272     package documentation \ref{sec:exch2}.
1273 cnh 1.1
1274     \item {\bf Reverse Mode}
1275     The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
1277     These reverse mode forms can be found in the
1278 cnh 1.5 source code directory {\em pkg/autodiff}.
1279 cnh 1.1 For the global sum primitive the reverse mode form
1280     calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
1283 cnh 1.1 prefixed {\em ADEXCH}. The exchange routines make calls to
1284     the same low-level communication primitives as the forward mode
1285     operations. However, the routine argument {\em simulationMode}
1286     is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
1288     appropriate communication operation should be performed.
1289 afe 1.16
1290 cnh 1.1 \item {\bf MAX\_NO\_THREADS}
1291     The variable {\em MAX\_NO\_THREADS} is used to indicate the
1292     maximum number of OS threads that a code will use. This
1293     value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
1294     For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference it, except in the following scenario.
1297    
For certain physical parametrization schemes it is necessary to have
a substantial number of work arrays. Where these arrays are allocated
in static storage ( for example COMMON blocks ) multi-threaded
execution will require multiple instances of the COMMON block data.
This can be achieved using a Fortran 90 module construct; however,
if that is unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
1305     and {\em nSy} ( as described in section
1306 adcroft 1.6 \ref{sect:specifying_a_decomposition}). However, if the configuration
1307 cnh 1.1 being specified involves many more tiles than OS threads then
1308     it can save memory resources to reduce the variable
1309     {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
1310 cnh 1.5 will be used and to declare the physical parameterization
1311     work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
1312 cnh 1.1 An example of this is given in the verification experiment
1313     {\em aim.5l\_cs}. Here the default setting of
1314     {\em MAX\_NO\_THREADS} is altered to
1315     \begin{verbatim}
1316     INTEGER MAX_NO_THREADS
1317     PARAMETER ( MAX_NO_THREADS = 6 )
1318     \end{verbatim}
1319     and several work arrays for storing intermediate calculations are
1320     created with declarations of the form.
1321     \begin{verbatim}
1322     common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1323     \end{verbatim}
1324 cnh 1.5 This declaration scheme is not used widely, because most global data
is used for permanent, not temporary, storage of state information.
1326     In the case of permanent state information this approach cannot be used
1327     because there has to be enough storage allocated for all tiles.
1328     However, the technique can sometimes be a useful scheme for reducing memory
1329 cnh 1.5 requirements in complex physical parameterizations.
1330 adcroft 1.2 \end{enumerate}
1331    
1332     \begin{figure}
1333     \begin{verbatim}
1334     C--
1335     C-- Parallel directives for MIPS Pro Fortran compiler
1336     C--
1337     C Parallel compiler directives for SGI with IRIX
1338     C$PAR PARALLEL DO
1339     C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1340     C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1341     C
1342     DO I=1,nThreads
1343     myThid = I
1344    
1345     C-- Invoke nThreads instances of the numerical model
1346     CALL THE_MODEL_MAIN(myThid)
1347    
1348     ENDDO
1349     \end{verbatim}
1350     \caption{Prior to transferring control to
1351     the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
1352     MP directives to spawn multiple threads.
1353     } \label{fig:mp_directives}
1354     \end{figure}
1355 cnh 1.1
1356    
1357     \subsubsection{Specializing the Communication Code}
1358    
The isolation of performance-critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.
1363 cnh 1.1
1364     \subsubsection{JAM example}
1365 adcroft 1.6 \label{sect:jam_example}
1366 cnh 1.1 On some platforms a big performance boost can be obtained by
1367     binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries ( for example the
shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
is used as an illustration of a specialized communication configuration
that substitutes for standard, portable forms of {\em \_EXCH} and
{\em \_GSUM}. It affects three source files: {\em eeboot.F},
{\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
it has the following effects.
1375     \begin{itemize}
1376     \item An extra phase is included at boot time to initialize the custom
1377     communications library ( see {\em ini\_jam.F}).
1378     \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
1381 cnh 1.4 for overlap regions of width one) is substituted into the elliptic
1382 cnh 1.1 solver routine {\em cg2d.F}.
1383     \end{itemize}
1384     Developing specialized code for other libraries follows a similar
1385     pattern.
1386    
1387     \subsubsection{Cube sphere communication}
1388 adcroft 1.6 \label{sect:cube_sphere_communication}
1389 cnh 1.1 Actual {\em \_EXCH} routine code is generated automatically from
1390     a series of template files, for example {\em exch\_rx.template}.
1391     This is done to allow a large number of variations on the exchange
1392     process to be maintained. One set of variations supports the
1393 cnh 1.4 cube sphere grid. Support for a cube sphere grid in MITgcm is based
1394 afe 1.16 on having each face of the cube as a separate tile or tiles.
1395 cnh 1.4 The exchange routines are then able to absorb much of the
1396 cnh 1.1 detailed rotation and reorientation required when moving around the
1397     cube grid. The set of {\em \_EXCH} routines that contain the
1398     word cube in their name perform these transformations.
1399     They are invoked when the run-time logical parameter
1400     {\em useCubedSphereExchange} is set true. To facilitate the
1401     transformations on a staggered C-grid, exchange operations are defined
1402 cnh 1.4 separately for both vector and scalar quantities and for
1403 cnh 1.1 grid-centered and for grid-face and corner quantities.
1404     Three sets of exchange routines are defined. Routines
1405     with names of the form {\em exch\_rx} are used to exchange
1406     cell centered scalar quantities. Routines with names of the form
1407     {\em exch\_uv\_rx} are used to exchange vector quantities located at
1408     the C-grid velocity points. The vector quantities exchanged by the
1409     {\em exch\_uv\_rx} routines can either be signed (for example velocity
1410     components) or un-signed (for example grid-cell separations).
1411     Routines with names of the form {\em exch\_z\_rx} are used to exchange
1412     quantities at the C-grid vorticity point locations.
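For illustration, a cubed-sphere configuration would typically enable
these exchange forms at run time with a namelist setting along the
following lines (shown here as it might appear in the {\em eedata}
parameter file; the exact file and namelist in which the flag is set
should be treated as an assumption that may vary between model versions):
\begin{verbatim}
 &EEPARMS
 useCubedSphereExchange=.TRUE.,
 &
\end{verbatim}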
1413    
1414    
1415    
1416    
1417     \section{MITgcm execution under WRAPPER}
1418    
1419     Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.
1422 cnh 1.1
1423     \subsection{Annotated call tree for MITgcm and WRAPPER}
1424 adcroft 1.6 \label{sect:calling_sequence}
1425 cnh 1.1
1426     WRAPPER layer.
1427    
1428 heimbach 1.8 {\footnotesize
1429 cnh 1.1 \begin{verbatim}
1430    
1431     MAIN
1432     |
1433     |--EEBOOT :: WRAPPER initialization
1434     | |
| |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1436     | | allow basic I/O.
1437     | |-- EEINTRO_MSG :: Write startup greeting.
1438     | |
1439     | |-- EESET_PARMS :: Set WRAPPER parameters
1440     | |
1441     | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1442     | |
1443     | |-- INI_PROCS :: Associate processes with grid regions.
1444     | |
1445     | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1446     | |
1447     | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1448     | :: communication data structures
1449     |
1450     |
1451     |--CHECK_THREADS :: Validate multiple thread start up.
1452     |
1453     |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1454    
1455     \end{verbatim}
1456 heimbach 1.8 }
1457 cnh 1.1
1458     Core equations plus packages.
1459    
1460 heimbach 1.8 {\footnotesize
1461 cnh 1.1 \begin{verbatim}
1462     C
1463     C
1464     C Invocation from WRAPPER level...
1465     C :
1466     C :
1467     C |
1468     C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1469     C | :: Called from WRAPPER level numerical
1470 cnh 1.4 C | :: code invocation routine. On entry
1471 cnh 1.1 C | :: to THE_MODEL_MAIN separate thread and
1472     C | :: separate processes will have been established.
1473     C | :: Each thread and process will have a unique ID
1474     C | :: but as yet it will not be associated with a
1475     C | :: specific region in decomposed discrete space.
1476     C |
1477     C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1478     C | | :: grid, solver matrices etc..
1479     C | |
1480     C | |-INI_PARMS :: Routine to set kernel model parameters.
1481     C | | :: By default kernel parameters are read from file
1482     C | | :: "data" in directory in which code executes.
1483     C | |
1484 cnh 1.4 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1485 cnh 1.1 C | |
1486 cnh 1.4 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1487 cnh 1.1 C | | | :: Grid arrays are held and described in GRID.h.
1488     C | | |
1489 cnh 1.4 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1490 cnh 1.1 C | | |
1491 cnh 1.4 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1492 cnh 1.1 C | | | :: (calculate grid from kernel parameters).
1493     C | | |
1494     C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1495 cnh 1.4 C | | | :: initialization (calculate grid from
1496 cnh 1.1 C | | | :: kernel parameters).
1497     C | | |
1498     C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1499 cnh 1.4 C | | :: grid initializations. ( input from raw
1500 cnh 1.1 C | | :: grid files, LONC.bin, DXF.bin etc... )
1501     C | |
C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1503     C | |
1504     C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1505     C | | :: land masking for solid-fluid boundaries.
1506     C | |
1507     C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1508     C | |
1509     C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1510 cnh 1.4 C | | :: sphere options are coded.
1511 cnh 1.1 C | |
C | |-PACKAGES_BOOT :: Start up the optional package environment.
1513     C | | :: Runtime selection of active packages.
1514     C | |
1515     C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1516     C | | |
1517     C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1518     C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1519     C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1520     C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1522     C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1523     C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1524     C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1525     C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1526     C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1527     C | |
1528     C | |-PACKAGES_CHECK
1529     C | | |
1530     C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1531 cnh 1.4 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1532 cnh 1.1 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1533     C | |
1534     C | |-PACKAGES_INIT_FIXED
1535     C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1536     C | | |-FLT_INIT :: Floats Package. see pkg/flt
1537     C | |
1538     C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1539     C | |
1540     C | |-INI_CG2D :: 2d con. grad solver initialisation.
1541     C | |
1542     C | |-INI_CG3D :: 3d con. grad solver initialisation.
1543     C | |
1544     C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1545     C | :: Includes annotated table of kernel
1546     C | :: parameter settings.
1547     C |
1548     C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1549     C |
1550     C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
C ! :: Automatically generated by TAMC/TAF.
1552 cnh 1.1 C |
1553     C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1554     C |
1555     C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1556     C |
1557     C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1558     C | |
1559     C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1560     C | | | :: variables
1561     C | | |
1562     C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1563     C | | |
1564     C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1565 cnh 1.4 C | | | :: sphere options are coded.
1566 cnh 1.1 C | | |
1567     C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1568     C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1569     C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1570     C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1571     C | | | :: fields).
1572     C | | |
1573 cnh 1.4 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1574 cnh 1.1 C | | | |-INI_VEL :: Initialize 3D flow field.
1575     C | | | |-INI_THETA :: Set model initial temperature field.
1576     C | | | |-INI_SALT :: Set model initial salinity field.
1577     C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1578     C | | |
1579     C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1580     C | | |
1581     C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1582     C | | | | :: Here applied to move restart conditions
1583     C | | | | :: (saved in mid timestep) to correct level in
1584     C | | | | :: time (only used for pre-c35).
1585     C | | | |
1586     C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1587     C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1588     C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1589     C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1590     C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1591     C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1592     C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1593     C | | | | |-FIND_RHO :: Find adjacent densities.
1594     C | | | | |-CONVECT :: Mix static instability.
1595     C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1596     C | | | |
1597     C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1598     C | | |
1599     C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1600     C | | | | :: Extra time history interactions.
1601     C | | | |
1602     C | | | |-FIND_RHO :: Find adjacent densities.
1603     C | | | |-CONVECT :: Mix static instability.
1604     C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1605     C | | |
1606     C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1607     C | | | | :: package data.
1608     C | | | |
1609     C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1610     C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1611     C | | | |-KPP_OPEN_DIAGS
1612     C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1613     C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1614     C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1615     C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1616     C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1617     C | | | |-INI_FORCING :: Set model initial forcing fields.
1618     C | | | | :: Either set in-line or from file as shown.
1619     C | | | |-READ_FLD_XY_RS(zonalWindFile)
1620     C | | | |-READ_FLD_XY_RS(meridWindFile)
1621     C | | | |-READ_FLD_XY_RS(surfQFile)
1622     C | | | |-READ_FLD_XY_RS(EmPmRfile)
1623     C | | | |-READ_FLD_XY_RS(thetaClimFile)
1624     C | | | |-READ_FLD_XY_RS(saltClimFile)
1625     C | | | |-READ_FLD_XY_RS(surfQswFile)
1626     C | | |
1627     C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1628     C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1629     C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1630     C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1631     C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1632     C | |
1633     C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1634     C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1635     C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1636     C | |
1637     C | |-MONITOR :: Monitor state ( see pkg/monitor )
1638     C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1639     C====|>|
1640     C====|>| ****************************
1641     C====|>| BEGIN MAIN TIMESTEPPING LOOP
1642     C====|>| ****************************
1643     C====|>|
1644     C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1645     C/\ | | |
C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1647     C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1648     C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1649     C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1650     C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1651 cnh 1.4 C/\ | | | | :: Simple interpolation between end-points
1652 cnh 1.1 C/\ | | | | :: for forcing datasets.
1653     C/\ | | | |
1654     C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1655     C/\ | | |
1656     C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1657     C/\ | | | |
1658     C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1659     C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1660     C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1661     C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1662     C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1663     C/\ | | | |
1664     C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1665     C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1666     C/\ | | | | :: forcing terms.
1667     C/\ | | | |
1668     C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1669     C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1670     C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1671     C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1672     C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1673     C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1674     C/\ | | | | advection case) (see pkg/gad).
1675     C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1676     C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1677     C/\ | | | | |
1678     C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1679     C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1680     C/\ | | | |
1681     C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1682     C/\ | | | | |
1683     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1684     C/\ | | | | | :: ( see pkg/gad )
1685     C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1686     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1687     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1688     C/\ | | | |
1689     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1690     C/\ | | | |
1691     C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1692     C/\ | | | | |
1693     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1694     C/\ | | | | | :: ( see pkg/gad )
1695     C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1696     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1697     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1698     C/\ | | | |
1699     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1700     C/\ | | | |
1701     C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1702     C/\ | | | | |
1703     C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1704     C/\ | | | | | :: ( see pkg/gad )
1705     C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1706     C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1707     C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1708     C/\ | | | |
1709     C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1710     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1711     C/\ | | | |-FREEZE :: Limit range of temperature.
1712     C/\ | | | |
1713     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1714     C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1715     C/\ | | | |
C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1717     C/\ | | | |-EXCH :: Update overlaps
1718     C/\ | | |
1719     C/\ | | |-DYNAMICS :: Momentum equations driver.
1720     C/\ | | | |
1721     C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1722     C/\ | | | | Potential anomaly.
1723     C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1724     C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1725     C/\ | | | |
1726     C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1727     C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1728     C/\ | | | | pkg/mom_fluxform ).
1729     C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1730     C/\ | | | | pkg/mom_vecinv ).
1731     C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1732     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1733     C/\ | | | |
1734     C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1735     C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1736     C/\ | | | |
1737     C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1739     C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1740     C/\ | | |
1741     C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1742     C/\ | | |
1743     C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1744     C/\ | | |
1745     C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1746     C/\ | | |
1747     C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1748     C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1749     C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1750     C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1751     C/\ | | |
1752     C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1753     C/\ | | | |
1754     C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1755     C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1756     C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1757     C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1758     C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1759     C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1760     C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1761     C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1762     C/\ | | | | |-CONVECT :: Mix static instability.
1763     C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1764     C/\ | | | |
1765     C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1766     C/\ | | |
1767     C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1768     C/\ | | | |-EXCH
1769     C/\ | | |
1770     C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1771     C/\ | | |
1772     C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1773     C/\ | | |
1774     C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1775     C/\ | | | |-WRITE_STATE :: Core state I/O
1776     C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1777     C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1778     C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1779     C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1780     C/\ | | |
1781     C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1782     C/\ | |
1783     C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1784     C<===|=|
1785     C<===|=| **************************
1786     C<===|=| END MAIN TIMESTEPPING LOOP
1787     C<===|=| **************************
1788     C<===|=|
1789     C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1790     C |
1791     C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1792     C |
1793     C |-TIMER_PRINTALL :: Computational timing summary
1794     C |
1795     C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1796     C :: events.
1797     C
1798     \end{verbatim}
1799 heimbach 1.8 }
1800 cnh 1.1
1801     \subsection{Measuring and Characterizing Performance}
1802    
1803     TO BE DONE (CNH)
1804    
1805     \subsection{Estimating Resource Requirements}
1806    
1807     TO BE DONE (CNH)
1808    
1809     \subsubsection{Atlantic 1/6 degree example}
1810     \subsubsection{Dry Run testing}
1811     \subsubsection{Adjoint Resource Requirements}
1812     \subsubsection{State Estimation Environment Resources}
1813    
