1 % $Header: /u/gcmpack/manual/part4/sarch.tex,v 1.18 2004/03/23 16:47:05 afe Exp $
2
3 This chapter focuses on describing the {\bf WRAPPER} environment within which
4 both the core numerics and the pluggable packages operate. The description
5 presented here is intended to be a detailed exposition and contains significant
6 background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects, simply starting from an example code and adapting it to suit a
particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of the current tools and operating system
features that are employed.
18
19 \section{Overall architectural goals}
20
Broadly, the goals of the software architecture employed in MITgcm are
threefold:

\begin{itemize}
\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to
a wide range of platforms.
\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.
\end{itemize}
33
These points are summarized in figure \ref{fig:mitgcm_architecture_goals},
which conveys the goals of the MITgcm design. The goals lead to
a software architecture which, at a high level, can be viewed as consisting
of:
38
39 \begin{enumerate}
40 \item A core set of numerical and support code. This is discussed in detail in
41 section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing,
for example, mixed-layer schemes, biogeochemical schemes, atmospheric physics).
44 These packages are used both to overlay alternate dynamics and to introduce
45 specialized physical content onto the core numerical code. An overview of
46 the {\bf package} scheme is given at the start of part \ref{part:packages}.
47 \item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
48 Programming Environment Resource), within which the core numerics and pluggable
49 packages operate.
50 \end{enumerate}
51
61
62
63 \begin{figure}
64 \begin{center}
65 \resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
66 \end{center}
67 \caption{
68 The MITgcm architecture is designed to allow simulation of a wide
69 range of physical problems on a wide range of hardware. The computational
70 resource requirements of the applications targeted range from around
$10^7$ bytes ($\approx 10$ megabytes) of memory to $10^{11}$ bytes
($\approx 100$ gigabytes). Arithmetic operation counts for the applications of
73 interest range from $10^{9}$ floating point operations to more than $10^{17}$
74 floating point operations.}
75 \label{fig:mitgcm_architecture_goals}
76 \end{figure}
77
78 \section{WRAPPER}
79 \begin{rawhtml}
80 <!-- CMIREDIR:wrapper: -->
81 \end{rawhtml}
82
83 A significant element of the software architecture utilized in
84 MITgcm is a software superstructure and substructure collectively
85 called the WRAPPER (Wrappable Application Parallel Programming
86 Environment Resource). All numerical and support code in MITgcm is written
87 to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
88 the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).
91
92 The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to insulate code
that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.
96
97
98 \begin{figure}
99 \begin{center}
100 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
101 \end{center}
102 \caption{
103 Numerical code is written to fit within a software support
104 infrastructure called WRAPPER. The WRAPPER is portable and
105 can be specialized for a wide range of specific target hardware and
106 programming environments, without impacting numerical code that fits
107 within the WRAPPER. Codes that fit within the WRAPPER can generally be
108 made to run as fast on a particular platform as codes specially
109 optimized for that platform.}
110 \label{fig:fit_in_wrapper}
111 \end{figure}
112
113 \subsection{Target hardware}
114 \label{sect:target_hardware}
115
The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
multi-processor CRAY Y-MP system. On that system, numerical code performance
and scaling under the WRAPPER exceeded that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
121 micro-tasking approach. Later developments have been carried out on
122 uniprocessor and multi-processor Sun systems with both uniform memory access
123 (UMA) and non-uniform memory access (NUMA) designs. Significant work has also
124 been undertaken on x86 cluster systems, Alpha processor based clustered SMP
125 systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
126 The MITgcm code, operating within the WRAPPER, is also routinely used on
127 large scale MPP systems (for example T3E systems and IBM SP systems). In all
128 cases numerical code, operating within the WRAPPER, performs and scales very
129 competitively with equivalent numerical code that has been modified to contain
130 native optimizations for a particular system \ref{ref hoe and hill, ecmwf}.
131
132 \subsection{Supporting hardware neutrality}
133
134 The different systems listed in section \ref{sect:target_hardware} can be
135 categorized in many different ways. For example, one common distinction is
136 between shared-memory parallel systems (SMP's, PVP's) and distributed memory
137 parallel systems (for example x86 clusters and large MPP systems). This is one
138 example of a difference between compute platforms that can impact an
139 application. Another common distinction is between vector processing systems
140 with highly specialized CPU's and memory subsystems and commodity
141 microprocessor based systems. There are numerous other differences, especially
142 in relation to how parallel execution is supported. To capture the essential
143 differences between different platforms the WRAPPER uses a {\it machine model}.
144
145 \subsection{WRAPPER machine model}
146
147 Applications using the WRAPPER are not written to target just one
148 particular machine (for example an IBM SP2) or just one particular family or
149 class of machines (for example Parallel Vector Processor Systems). Instead the
150 WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
152 easily be specialized to fit, in a computationally efficient manner, any
153 computer architecture currently available to the scientific computing community.
154
155 \subsection{Machine model parallelism}
156 \begin{rawhtml}
157 <!-- CMIREDIR:domain_decomp: -->
158 \end{rawhtml}
159
160 Codes operating under the WRAPPER target an abstract machine that is assumed to
161 consist of one or more logical processors that can compute concurrently.
162 Computational work is divided among the logical
163 processors by allocating ``ownership'' to
164 each processor of a certain set (or sets) of calculations. Each set of
165 calculations owned by a particular processor is associated with a specific
region of the physical space that is being simulated; only one processor will
be associated with each such region (domain decomposition).
168
169 In a strict sense the logical processors over which work is divided do not need
170 to correspond to physical processors. It is perfectly possible to execute a
171 configuration decomposed for multiple logical processors on a single physical
172 processor. This helps ensure that numerical code that is written to fit
173 within the WRAPPER will parallelize with no additional effort and is
174 also useful when debugging codes. Generally, however,
175 the computational domain will be subdivided over multiple logical
176 processors in order to then bind those logical processors to physical
177 processor resources that can compute in parallel.
178
179 \subsubsection{Tiles}
180
181 Computationally, associated with each region of physical
182 space allocated to a particular logical processor, there will be data
structures (arrays, scalar variables, etc.) that hold the simulated state of
184 that region. We refer to these data structures as being {\bf owned} by the
185 processor to which their
186 associated region of physical space has been allocated. Individual
187 regions that are allocated to processors are called {\bf tiles}. A
188 processor can own more
189 than one tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a single
region of the domain (a single tile). Except for periods of
communication and coordination, each processor computes autonomously, working
only with data from the tile (or tiles) that the processor owns. When multiple
tiles are allotted to a single processor, each tile is computed
independently of the other tiles, in a sequential fashion.
196
197 \begin{figure}
198 \begin{center}
199 \resizebox{5in}{!}{
200 \includegraphics{part4/domain_decomp.eps}
201 }
202 \end{center}
203 \caption{ The WRAPPER provides support for one and two dimensional
204 decompositions of grid-point domains. The figure shows a hypothetical domain of
205 total size $N_{x}N_{y}N_{z}$. This hypothetical domain is decomposed in
206 two-dimensions along the $N_{x}$ and $N_{y}$ directions. The resulting {\bf
207 tiles} are {\bf owned} by different processors. The {\bf owning}
208 processors perform the
209 arithmetic operations associated with a {\bf tile}. Although not illustrated
210 here, a single processor can {\bf own} several {\bf tiles}.
211 Whenever a processor wishes to transfer data between tiles or
212 communicate with other processors it calls a WRAPPER supplied
213 function.
214 } \label{fig:domaindecomp}
215 \end{figure}
216
217
218 \subsubsection{Tile layout}
219
220 Tiles consist of an interior region and an overlap region. The overlap region
221 of a tile corresponds to the interior region of an adjacent tile.
222 In figure \ref{fig:tiledworld} each tile would own the region
223 within the black square and hold duplicate information for overlap
224 regions extending into the tiles to the north, south, east and west.
225 During
226 computational phases a processor will reference data in an overlap region
whenever it requires values that lie outside the domain it owns. Periodically
228 processors will make calls to WRAPPER functions to communicate data between
229 tiles, in order to keep the overlap regions up to date (see section
230 \ref{sect:communication_primitives}). The WRAPPER functions can use a
231 variety of different mechanisms to communicate data between tiles.
232
233 \begin{figure}
234 \begin{center}
235 \resizebox{5in}{!}{
236 \includegraphics{part4/tiled-world.eps}
237 }
238 \end{center}
239 \caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
241 Overlap regions are periodically updated from neighboring tiles.
242 } \label{fig:tiledworld}
243 \end{figure}
244
245 \subsection{Communication mechanisms}
246
247 Logical processors are assumed to be able to exchange information
248 between tiles and between each other using at least one of two possible
249 mechanisms.
250
251 \begin{itemize}
252 \item {\bf Shared memory communication}.
253 Under this mode of communication data transfers are assumed to be possible
254 using direct addressing of regions of memory. In this case a CPU is able to read
(and write) directly to regions of memory ``owned'' by another CPU
using simple programming-language-level assignment operations of
the sort shown in figure \ref{fig:simple_assign}. In this way one CPU
258 (CPU1 in the figure) can communicate information to another CPU (CPU2 in the
259 figure) by assigning a particular value to a particular memory location.
260
261 \item {\bf Distributed memory communication}.
262 Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned by and visible to another CPU. Instead
264 a communication library must be used as illustrated in figure
265 \ref{fig:comm_msg}. In this case CPU's must call a function in the API of the
266 communication library to communicate data from a tile that it owns to a tile
267 that another CPU owns. By default the WRAPPER binds to the MPI communication
268 library \ref{MPI} for this style of communication.
269 \end{itemize}
270
The WRAPPER assumes that communication will use one of these two
styles. The underlying hardware and operating system support
273 for the style used is not specified and can vary from system to system.
274
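For concreteness, the sketch below shows what the SEND and RECV calls of
figure \ref{fig:comm_msg} might look like when the WRAPPER is bound to MPI.
This is an illustrative fragment rather than WRAPPER source: the rank
variables {\em cpu1Rank} and {\em cpu2Rank}, the message tag and the use of
{\em MPI\_COMM\_WORLD} are simplifying assumptions.

\begin{verbatim}
C     Schematic MPI equivalent of the distributed memory figure;
C     illustrative only, not taken from the WRAPPER source.
      INCLUDE 'mpif.h'
      REAL*8  a(10)
      INTEGER cpu1Rank, cpu2Rank, ierr, stat(MPI_STATUS_SIZE)
C     ... executed by CPU1:
      a(3) = 8.
      CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION,
     &               cpu2Rank, 0, MPI_COMM_WORLD, ierr )
C     ... executed by CPU2:
      CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION,
     &               cpu1Rank, 0, MPI_COMM_WORLD, stat, ierr )
\end{verbatim}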
275 \begin{figure}
276 \begin{verbatim}
277
       CPU1                     |      CPU2
       ====                     |      ====
                                |
   a(3) = 8                     |   WHILE ( a(3) .NE. 8 )
                                |    WAIT
                                |   END WHILE
                                |
285 \end{verbatim}
286 \caption{ In the WRAPPER shared memory communication model, simple writes to an
287 array can be made to be visible to other CPU's at the application code level.
So, for example, if one CPU (CPU1 in the figure above) writes the value $8$ to
289 element $3$ of array $a$, then other CPU's (for example CPU2 in the figure above)
290 will be able to see the value $8$ when they read from $a(3)$.
291 This provides a very low latency and high bandwidth communication
292 mechanism.
293 } \label{fig:simple_assign}
294 \end{figure}
295
296 \begin{figure}
297 \begin{verbatim}
298
       CPU1                     |      CPU2
       ====                     |      ====
                                |
   a(3) = 8                     |   WHILE ( a(3) .NE. 8 )
   CALL SEND( CPU2,a(3) )       |    CALL RECV( CPU1, a(3) )
                                |   END WHILE
                                |
306 \end{verbatim}
307 \caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPU's.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 or CPU2 in the figure above will need
311 to call a bespoke communication library in order for the updated
312 value to be communicated between CPU's.
313 } \label{fig:comm_msg}
314 \end{figure}
315
316 \subsection{Shared memory communication}
317 \label{sect:shared_memory_communication}
318
Under shared memory communication independent CPU's operate
on the exact same global address space at the application level.
This means that CPU 1 can directly write into global
data structures that CPU 2 ``owns'' using a simple
assignment at the application level.
This model of memory access is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (for example, the SGI Origin).
On such systems the WRAPPER will generally use simple read and write statements
to directly access application data structures when communicating between CPU's.
329
In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to
hardware instructions that transport data between CPU and memory banks, this
can be a very efficient mechanism for communication. In this case two CPU's,
CPU1 and CPU2, can communicate simply by reading and writing to an
335 agreed location and following a few basic rules. The latency of this sort
336 of communication is generally not that much higher than the hardware
337 latency of other memory accesses on the system. The bandwidth available
338 between CPU's communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
340 communication very efficient provided it is used appropriately.
341
342 \subsubsection{Memory consistency}
343 \label{sect:memory_consistency}
344
345 When using shared memory communication between
346 multiple processors the WRAPPER level shields user applications from
certain counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general the order
of reads and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will always
operate so that, for the application instructions the processor is executing,
any reordering is not apparent. However, machines are often
designed so that reordering of instructions is not hidden from other
processors. This means that, in general, even on a shared memory system two
356 processors can observe inconsistent memory values.
357
358 The issue of memory consistency between multiple processors is discussed at
length in many computer science papers; however, from a practical point of
view, in order to deal with this issue, shared memory machines all provide
361 some mechanism to enforce memory consistency when it is needed. The exact
362 mechanism employed will vary between systems. For communication using shared
363 memory, the WRAPPER provides a place to invoke the appropriate mechanism to
364 ensure memory consistency for a particular platform.
365
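As a concrete illustration, the fragment below sketches where a memory
consistency call would be placed so that a reader polling a flag, as in
figure \ref{fig:simple_assign}, never observes the flag before the data it
guards. The variable names are purely illustrative; {\em MEMSYNC} is the
WRAPPER consistency routine described later in this chapter.

\begin{verbatim}
C     Illustrative ordering of a data write and a flag write under
C     shared memory; variable names are hypothetical.
      REAL*8  sharedData(10)
      INTEGER sharedFlag
      COMMON / ILLUSTRATE_SYNC / sharedData, sharedFlag
C     Writer: publish the data, force consistency, then raise the
C     flag that the reader is polling.
      sharedData(3) = 8.
      CALL MEMSYNC
      sharedFlag    = 1
\end{verbatim}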
366 \subsubsection{Cache effects and false sharing}
367 \label{sect:cache_effects_and_false_sharing}
368
Shared-memory machines often have processor-local memory caches
which contain mirrored copies of main memory. Automatic cache-coherence
371 protocols are used to maintain consistency between caches on different
372 processors. These cache-coherence protocols typically enforce consistency
373 between regions of memory with large granularity (typically 128 or 256 byte
374 chunks). The coherency protocols employed can be expensive relative to other
375 memory accesses and so care is taken in the WRAPPER (by padding synchronization
376 structures appropriately) to avoid unnecessary coherence traffic.
377
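The fragment below is a minimal sketch, not the actual WRAPPER data
structures, of the padding idea: per-thread accumulators are spaced a full
cache line apart so that concurrent updates by different threads never touch
the same coherence unit. The 128-byte line size and the names used here are
assumptions made for illustration.

\begin{verbatim}
C     Hypothetical padded per-thread accumulators. With REAL*8
C     elements, 16 words span an assumed 128-byte cache line, so
C     thread i only ever touches the line holding threadSum(1,i).
      INTEGER maxThreads, padWords
      PARAMETER ( maxThreads = 32, padWords = 16 )
      REAL*8 threadSum( padWords, maxThreads )
      COMMON / ILLUSTRATE_PAD / threadSum
C     Each thread accumulates only into its own first element:
C       threadSum(1,myThid) = threadSum(1,myThid) + localValue
\end{verbatim}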
\subsubsection{Operating system support for shared memory}
379
380 Applications running under multiple threads within a single process can
381 use shared memory communication. In this case {\it all} the memory locations
382 in an application are potentially visible to all the compute threads. Multiple
383 threads operating within a single process is the standard mechanism for
384 supporting shared memory that the WRAPPER utilizes. Configuring and launching
385 code to run in multi-threaded mode on specific platforms is discussed in
386 section \ref{sect:running_with_threads}. However, on many systems, potentially
387 very efficient mechanisms for using shared memory communication between
388 multiple processes (in contrast to multiple threads within a single
389 process) also exist. In most cases this works by making a limited region of
390 memory shared between processes. The MMAP \ref{magicgarden} and
IPC \ref{magicgarden} facilities in UNIX systems provide this capability, as do
vendor-specific tools like LAPI \ref{IBMLAPI} and IMC \ref{Memorychannel}.
393 Extensions exist for the WRAPPER that allow these mechanisms
394 to be used for shared memory communication. However, these mechanisms are not
395 distributed with the default WRAPPER sources, because of their proprietary
396 nature.
397
398 \subsection{Distributed memory communication}
399 \label{sect:distributed_memory_communication}
400 Many parallel systems are not constructed in a way where it is
401 possible or practical for an application to use shared memory
for communication. For example, cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
404 at the system level. For this sort of system the WRAPPER provides support
405 for communication based on a bespoke communication library
406 (see figure \ref{fig:comm_msg}). The default communication library used is MPI
407 \ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform-specific communication libraries. For example, the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.
411
412 \subsection{Communication primitives}
413 \label{sect:communication_primitives}
414
415 \begin{figure}
416 \begin{center}
417 \resizebox{5in}{!}{
418 \includegraphics{part4/comm-primm.eps}
419 }
420 \end{center}
421 \caption{Three performance critical parallel primitives are provided
422 by the WRAPPER. These primitives are always used to communicate data
423 between tiles. The figure shows four tiles. The curved arrows indicate
424 exchange primitives which transfer data between the overlap regions at tile
425 edges and interior regions for nearest-neighbor tiles.
426 The straight arrows symbolize global sum operations which connect all tiles.
The global sum operation provides a key arithmetic primitive and can also
serve as a synchronization primitive. A third primitive, the barrier, is also
provided; it behaves much like the global sum primitive.
430 } \label{fig:communication_primitives}
431 \end{figure}
432
433
The WRAPPER assumes that optimized communication support may be available
for a small number of communication operations, and that overall
communication performance can be improved by optimizing this small set
of primitives. Three such optimizable primitives are provided by the
WRAPPER; a schematic example of their use follows the list below.
439 \begin{itemize}
440 \item{\bf EXCHANGE} This operation is used to transfer data between interior
441 and overlap regions of neighboring tiles. A number of different forms of this
442 operation are supported. These different forms handle
443 \begin{itemize}
444 \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
445 separately.
446 \item Bindings to different communication methods.
447 Exchange primitives select between using shared memory or distributed
448 memory communication.
449 \item Transformation operations required when transporting
450 data between different grid regions. Transferring data
451 between faces of a cube-sphere grid, for example, involves a rotation
452 of vector components.
453 \item Forward and reverse mode computations. Derivative calculations require
454 tangent linear and adjoint forms of the exchange primitives.
455
456 \end{itemize}
457
458 \item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
459 operation for the pressure inversion phase of the MITgcm algorithm.
460 For certain configurations scaling can be highly sensitive to
461 the performance of the global sum primitive. This operation is a collective
462 operation involving all tiles of the simulated domain. Different forms
463 of the global sum primitive exist for handling
464 \begin{itemize}
465 \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
466 separately.
467 \item Bindings to different communication methods.
Global sum primitives select between using shared memory or distributed
memory communication.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the global sum primitive.
472 \end{itemize}
473
474 \item{\bf BARRIER} The WRAPPER provides a global synchronization function
475 called barrier. This is used to synchronize computations over all tiles.
476 The {\bf BARRIER} and {\bf GLOBAL SUM} primitives have much in common and in
477 some cases use the same underlying code.
478 \end{itemize}
479
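The schematic below illustrates how these primitives typically separate
compute phases in application code. The exchange routine name and the field
{\em phi} are placeholders assumed for illustration; the global sum and
barrier calls follow the usage described elsewhere in this chapter.

\begin{verbatim}
C     Schematic use of the three WRAPPER primitives; EXCH_XY_RL and
C     phi are illustrative placeholders, not specific WRAPPER names.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
C       ... compute on tile (bi,bj), updating phi and localSum ...
       ENDDO
      ENDDO
C     Bring tile overlap regions up to date before the next phase
      CALL EXCH_XY_RL( phi, myThid )
C     Collective sum over all tiles, threads and processes
      CALL GLOBAL_SUM_R8( localSum, myThid )
C     Synchronize all threads before proceeding
      CALL BARRIER( myThid )
\end{verbatim}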
480
481 \subsection{Memory architecture}
482
The WRAPPER machine model aims to efficiently target both systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.
494
495
496 \begin{figure}
497 \begin{center}
498 \resizebox{5in}{!}{
499 \includegraphics{part4/tiling_detail.eps}
500 }
501 \end{center}
502 \caption{The tiling strategy that the WRAPPER supports allows tiles
503 to be shaped to suit the underlying system memory architecture.
Compact tiles that lead to greater memory reuse can be used on cache-based
systems (upper half of figure) with deep memory hierarchies; long tiles
with large inner loops can be used to exploit vector systems having
highly pipelined memory systems.
508 } \label{fig:tiling-strategy}
509 \end{figure}
510
511 \newpage
512 \subsection{Summary}
513 Following the discussion above, the machine model that the WRAPPER
514 presents to an application has the following characteristics
515
516 \begin{itemize}
517 \item The machine consists of one or more logical processors. \vspace{-3mm}
518 \item Each processor operates on tiles that it owns.\vspace{-3mm}
519 \item A processor may own more than one tile.\vspace{-3mm}
520 \item Processors may compute concurrently.\vspace{-3mm}
521 \item Exchange of information between tiles is handled by the
522 machine (WRAPPER) not by the application.
523 \end{itemize}
524 Behind the scenes this allows the WRAPPER to adapt the machine model
525 functions to exploit hardware on which
526 \begin{itemize}
527 \item Processors may be able to communicate very efficiently with each other
528 using shared memory. \vspace{-3mm}
529 \item An alternative communication mechanism based on a relatively
530 simple inter-process communication API may be required.\vspace{-3mm}
531 \item Shared memory may not necessarily obey sequential consistency,
532 however some mechanism will exist for enforcing memory consistency.
533 \vspace{-3mm}
534 \item Memory consistency that is enforced at the hardware level
535 may be expensive. Unnecessary triggering of consistency protocols
536 should be avoided. \vspace{-3mm}
\item Memory access patterns may need to be either repetitive or highly
538 pipelined for optimum hardware performance. \vspace{-3mm}
539 \end{itemize}
540
541 This generic model captures the essential hardware ingredients
542 of almost all successful scientific computer systems designed in the
543 last 50 years.
544
545 \section{Using the WRAPPER}
546
547 In order to support maximum portability the WRAPPER is implemented primarily
548 in sequential Fortran 77. At a practical level the key steps provided by the
549 WRAPPER are
550 \begin{enumerate}
551 \item specifying how a domain will be decomposed
552 \item starting a code in either sequential or parallel modes of operations
553 \item controlling communication between tiles and between concurrently
554 computing CPU's.
555 \end{enumerate}
556 This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the decomposition
(or composition) of a domain is expressed. Section
559 \ref{sect:starting_a_code} describes practical details of running codes
560 in various different parallel modes on contemporary computer systems.
561 Section \ref{sect:controlling_communication} explains the internal information
562 that the WRAPPER uses to control how information is communicated between
563 tiles.
564
565 \subsection{Specifying a domain decomposition}
566 \label{sect:specifying_a_decomposition}
567
568 At its heart much of the WRAPPER works only in terms of a collection of tiles
569 which are interconnected to each other. This is also true of application
570 code operating within the WRAPPER. Application code is written as a series
571 of compute operations, each of which operates on a single tile. If
572 application code needs to perform operations involving data
573 associated with another tile, it uses a WRAPPER function to obtain
574 that data.
575 The specification of how a global domain is constructed from tiles or alternatively
576 how a global domain is decomposed into tiles is made in the file {\em SIZE.h}.
577 This file defines the following parameters \\
578
579 \fbox{
580 \begin{minipage}{4.75in}
581 Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
582 File: {\em model/inc/SIZE.h}
583 \end{minipage}
584 } \\
585
586 Together these parameters define a tiling decomposition of the style shown in
587 figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
588 the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computational stencil that the numerical code's finite-difference operations
require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
not cause any other problems.
599
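To illustrate why the overlap must be at least as wide as the widest
stencil, the fragment below (a sketch with assumed array names, not MITgcm
source) evaluates a five-point Laplacian over the interior of a tile. When
{\em I} or {\em J} lies on the tile edge the references to neighboring
points fall in the overlap region, which must therefore hold valid copies
of the adjacent tiles' data.

\begin{verbatim}
C     Five-point Laplacian over the interior of tile (bi,bj);
C     array names are illustrative. Indices I-1, I+1, J-1, J+1
C     reach into the overlap when I or J is on the tile edge, so
C     OLx and OLy must be at least 1 for this stencil.
      REAL*8 phi(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8 lap(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      DO J=1,sNy
       DO I=1,sNx
        lap(I,J,bi,bj) = phi(I-1,J,bi,bj) + phi(I+1,J,bi,bj)
     &                 + phi(I,J-1,bi,bj) + phi(I,J+1,bi,bj)
     &                 - 4.*phi(I,J,bi,bj)
       ENDDO
      ENDDO
\end{verbatim}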
600 \begin{figure}
601 \begin{center}
602 \resizebox{5in}{!}{
603 \includegraphics{part4/size_h.eps}
604 }
605 \end{center}
606 \caption{ The three level domain decomposition hierarchy employed by the
607 WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
608 to a single process. Multiple processes can exist, each with multiple tiles.
609 Tiles within a process can be spread over multiple compute threads.
610 } \label{fig:labelled_tile}
611 \end{figure}
612
613 The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
614 be created within a single process. Each of these tiles will have internal
615 dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
616 allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
622 If the decomposition is run in a single process running multiple threads
623 but attached to a single physical processor, then, in general, the computation
624 for different tiles will be interleaved by system level software.
625 This too is a valid mode of operation.
626
The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
628 numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
629 are used to form the loop ranges for many numerical calculations and
630 to provide dimensions for arrays holding numerical state.
The parameters {\em nSx} and {\em nSy} are used in conjunction with the thread number
632 parameter {\em myThid}. Much of the numerical code operating within the
633 WRAPPER takes the form
634 \begin{verbatim}
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        :
        a block of computations ranging
        over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
        :
       ENDDO
      ENDDO

      communication code to sum a number or maybe update
      tile overlap regions

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        :
        another block of computations ranging
        over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
        :
       ENDDO
      ENDDO
655 \end{verbatim}
656 The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
657 myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
658 schematic. These variables specify the subset of the tiles in
659 the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
660 thread number {\em myThid} owns. The thread number variable {\em myThid}
661 ranges from 1 to the total number of threads requested at execution time.
662 For each value of {\em myThid} the loop scheme above will step sequentially
663 through the tiles owned by that thread. However, different threads will
664 have different ranges of tiles assigned to them, so that separate threads can
665 compute iterations of the {\em bi}, {\em bj} loop concurrently.
666 Within a {\em bi}, {\em bj} loop
667 computation is performed concurrently over as many processes and threads
668 as there are physical processors available to compute.
669
An exception to the use of {\em bi} and {\em bj} in loops arises in the
671 exchange routines used when the exch2 package is used with the cubed
672 sphere. In this case {\em bj} is generally set to 1 and the loop runs from
673 1,{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
674 which is then used to reference exchange parameters.
675
The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
the {\em bi},{\em bj} loop only contains one statement. This computation
phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example, the evaluation of all
the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
is within a single {\em bi},{\em bj} loop.
690
691 \begin{figure}
692 \begin{verbatim}
      REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  err
        :
        :
      other computations
        :
        :
      err = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO J=1,sNy
         DO I=1,sNx
          err = err +
     &          cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO

      CALL GLOBAL_SUM_R8( err , myThid )
      err = SQRT(err)

715 \end{verbatim}
716 \caption{Example of numerical code for calculating
717 the l2-norm of a vector within the WRAPPER. Notice that
718 under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These right-most indices are tile indices. Different
threads within a single process operate on different ranges of tile
721 index, as controlled by the settings of
722 {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
723 } \label{fig:bibj_extract}
724 \end{figure}
725
726 The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
727 are used to indicate to the WRAPPER level how many processes (each with
728 {\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
729 This information is needed during initialization and during I/O phases.
730 However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy}
731 the values of {\em nPx} and {\em nPy} are absent
732 from the core numerical and support code.
733
734 \subsubsection{Examples of {\em SIZE.h} specifications}
735
The following different {\em SIZE.h} parameter settings illustrate how to
737 interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
738 and {\em nPy}.
739 \begin{enumerate}
740 \item
741 \begin{verbatim}
742 PARAMETER (
743 & sNx = 90,
744 & sNy = 40,
745 & OLx = 3,
746 & OLy = 3,
747 & nSx = 1,
748 & nSy = 1,
749 & nPx = 1,
750 & nPy = 1)
751 \end{verbatim}
752 This sets up a single tile with x-dimension of ninety grid points, y-dimension of
753 forty grid points, and x and y overlaps of three grid points each.
754 \item
755 \begin{verbatim}
756 PARAMETER (
757 & sNx = 45,
758 & sNy = 20,
759 & OLx = 3,
760 & OLy = 3,
761 & nSx = 1,
762 & nSy = 1,
763 & nPx = 2,
764 & nPy = 2)
765 \end{verbatim}
766 This sets up tiles with x-dimension of forty-five grid points, y-dimension of
767 twenty grid points, and x and y overlaps of three grid points each. There are
768 four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
769 arranged so that the global domain size is again ninety grid points in x and
forty grid points in y. In general the formula for the global grid size (held in
model variables {\em Nx} and {\em Ny}) is as follows (a short stand-alone
check of this formula appears after this list):
772 \begin{verbatim}
773 Nx = sNx*nSx*nPx
774 Ny = sNy*nSy*nPy
775 \end{verbatim}
776 \item
777 \begin{verbatim}
778 PARAMETER (
779 & sNx = 90,
780 & sNy = 10,
781 & OLx = 3,
782 & OLy = 3,
783 & nSx = 1,
784 & nSy = 2,
785 & nPx = 1,
786 & nPy = 2)
787 \end{verbatim}
788 This sets up tiles with x-dimension of ninety grid points, y-dimension of
789 ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively, if the code is invoked with multiple threads per process,
the two domains in y may be computed concurrently.
797 \item
798 \begin{verbatim}
799 PARAMETER (
800 & sNx = 32,
801 & sNy = 32,
802 & OLx = 3,
803 & OLy = 3,
804 & nSx = 6,
805 & nSy = 1,
806 & nPx = 1,
807 & nPy = 1)
808 \end{verbatim}
809 This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
810 thirty-two grid points, and x and y overlaps of three grid points each.
811 There are six tiles allocated to six separate logical processors ({\em nSx=6}).
812 This set of values can be used for a cube sphere calculation.
813 Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
816 moving between the six cube faces to be embedded within the
817 tile-tile communication code.
818 \end{enumerate}
819
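As a quick check of the global-size formula above, the short stand-alone
program below (illustrative only, not part of MITgcm) reproduces the global
grid dimensions for the second example; changing the PARAMETER values lets
the same check be applied to any {\em SIZE.h} choice.

\begin{verbatim}
C     Stand-alone check of Nx = sNx*nSx*nPx and Ny = sNy*nSy*nPy
C     using the values from example 2 above; illustrative only.
      PROGRAM CHECKSIZE
      INTEGER sNx, sNy, nSx, nSy, nPx, nPy
      PARAMETER ( sNx = 45, sNy = 20,
     &            nSx =  1, nSy =  1,
     &            nPx =  2, nPy =  2 )
      INTEGER Nx, Ny
      Nx = sNx*nSx*nPx
      Ny = sNy*nSy*nPy
      WRITE(*,*) 'Global grid: Nx =', Nx, ' Ny =', Ny
      END
\end{verbatim}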
820
821 \subsection{Starting the code}
822 \label{sect:starting_the_code}
823 When code is started under the WRAPPER, execution begins in a main routine {\em
824 eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
825 to the application through a routine called {\em THE\_MODEL\_MAIN()}
826 once the WRAPPER has initialized correctly and has created the necessary variables
827 to support subsequent calls to communication routines
828 by the application code. The startup calling sequence followed by the
829 WRAPPER is shown in figure \ref{fig:wrapper_startup}.
830
831 \begin{figure}
832 {\footnotesize
833 \begin{verbatim}
834
MAIN
|
|--EEBOOT                 :: WRAPPER initialization
|  |
|  |-- EEBOOT_MINIMAL     :: Minimal startup. Just enough to
|  |                         allow basic I/O.
|  |-- EEINTRO_MSG        :: Write startup greeting.
|  |
|  |-- EESET_PARMS        :: Set WRAPPER parameters
|  |
|  |-- EEWRITE_EEENV      :: Print WRAPPER parameter settings
|  |
|  |-- INI_PROCS          :: Associate processes with grid regions.
|  |
|  |-- INI_THREADING_ENVIRONMENT   :: Associate threads with grid regions.
|  |
|  |-- INI_COMMUNICATION_PATTERNS  :: Initialize between tile
|                                  :: communication data structures
|
|
|--CHECK_THREADS          :: Validate multiple thread start up.
|
|--THE_MODEL_MAIN         :: Numerical code top-level driver routine
858
859
860 \end{verbatim}
861 }
862 \caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
864 occurs through the procedure {\em THE\_MODEL\_MAIN()}.
865 } \label{fig:wrapper_startup}
866 \end{figure}
867
868 \subsubsection{Multi-threaded execution}
869 \label{sect:multi-threaded-execution}
870 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
871 WRAPPER may cause several coarse grain threads to be initialized. The routine
872 {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
873 stack argument which is the thread number, stored in the
874 variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the following
877 steps.\\
878
879 \paragraph{Compilation}
880 First the code must be compiled with appropriate multi-threading directives
881 active in the file {\em main.F} and with appropriate compiler flags
882 to request multi-threading support. The header files
883 {\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
884 contain directives compatible with compilers for Sun, Compaq, SGI,
885 Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
886 activated by using compile time
887 directives {\em -DTARGET\_SUN},
888 {\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
889 or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
890 for invoking multi-threaded compilation vary from system to system
891 and from compiler to compiler. The options will be described
892 in the individual compiler documentation. For the Fortran compiler
893 from Sun the following options are needed to correctly compile
894 multi-threaded code
895 \begin{verbatim}
896 -stackvar -explicitpar -vpara -noautopar
897 \end{verbatim}
898 These options are specific to the Sun compiler. Other compilers
899 will use different syntax that will be described in their
900 documentation. The effect of these options is as follows
901 \begin{enumerate}
902 \item {\bf -stackvar} Causes all local variables to be allocated in stack
903 storage. This is necessary for local variables to ensure that they are private
904 to their thread. Note, when using this option it may be necessary to override
905 the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the setting of the command shell's
907 {\em stack-size} limit variable. However, on some systems changing this limit
908 will require privileged administrator access to modify system parameters.
909
910 \item {\bf -explicitpar} Requests that multiple threads be spawned
911 in response to explicit directives in the application code. These
912 directives are inserted with syntax appropriate to the particular target
913 platform when, for example, the {\em -DTARGET\_SUN} flag is selected.
914
915 \item {\bf -vpara} This causes the compiler to describe the multi-threaded
916 configuration it is creating. This is not required
but it can be useful when troubleshooting.
918
919 \item {\bf -noautopar} This inhibits any automatic multi-threaded
920 parallelization the compiler may otherwise generate.
921
922 \end{enumerate}
923
924
925 An example of valid settings for the {\em eedata} file for a
926 domain with two subdomains in y and running with two threads is shown
927 below
928 \begin{verbatim}
929 nTx=1,nTy=2
930 \end{verbatim}
931 This set of values will cause computations to stay within a single
932 thread when moving across the {\em nSx} sub-domains. In the y-direction,
933 however, sub-domains will be split equally between two threads.
934
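The fragment below is a schematic of how such a setting could translate
into per-thread tile ranges. The arithmetic is illustrative only (it assumes
{\em nSx} divides exactly by {\em nTx} and {\em nSy} by {\em nTy}); the
actual assignment is made in {\em INI\_THREADING\_ENVIRONMENT}.

\begin{verbatim}
C     Illustrative thread-to-tile assignment for an nTx x nTy grid
C     of threads; not the actual INI_THREADING_ENVIRONMENT code.
      INTEGER iTx, iTy
      iTy = (myThid-1)/nTx + 1
      iTx =  myThid - (iTy-1)*nTx
      myBxLo(myThid) = 1 + (iTx-1)*(nSx/nTx)
      myBxHi(myThid) = myBxLo(myThid) + nSx/nTx - 1
      myByLo(myThid) = 1 + (iTy-1)*(nSy/nTy)
      myByHi(myThid) = myByLo(myThid) + nSy/nTy - 1
\end{verbatim}

For the {\em eedata} example above ({\em nSx=1, nSy=2, nTx=1, nTy=2}) this
would give thread 1 the tile {\em bj=1} and thread 2 the tile {\em bj=2}.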
935 \paragraph{Multi-threading files and parameters} The following
936 files and variables are used in setting up multi-threaded execution.\\
937
938 \fbox{
939 \begin{minipage}{4.75in}
940 File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
941 File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
942 File: {\em model/src/THE\_MODEL\_MAIN.F}\\
943 File: {\em eesupp/src/MAIN.F}\\
944 File: {\em tools/genmake2}\\
945 File: {\em eedata}\\
946 CPP: {\em TARGET\_SUN}\\
947 CPP: {\em TARGET\_DEC}\\
948 CPP: {\em TARGET\_HP }\\
949 CPP: {\em TARGET\_SGI}\\
950 CPP: {\em TARGET\_CRAY\_VECTOR}\\
951 Parameter: {\em nTx}\\
952 Parameter: {\em nTy}
953 \end{minipage}
954 } \\
955
956 \subsubsection{Multi-process execution}
957 \label{sect:multi-process-execution}
958
959 Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
is that many system libraries are still not ``thread-safe''. This means that, for
example, on some systems it is not safe to call system routines to
do I/O when running in multi-threaded mode, except in a limited set of
964 circumstances. Another reason is that support for multi-threaded programming
965 models varies between systems.
966
967 Multi-process execution is more ubiquitous.
968 In order to run code in a multi-process configuration a decomposition
specification (see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
972 and then, as for multi-threaded operation,
973 appropriate compile time and run time steps must be taken.
974
975 \paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable MPI libraries are available
for controlling the start-up of multiple processes. The MPI libraries
are not strictly required for performance-critical communication,
although they are usually used for that as well. However, in order to simplify
the task of controlling and coordinating the start-up of a large number
(hundreds, and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
983 routines must be activated at compile time. Currently MPI libraries are
984 invoked by
985 specifying the appropriate options file with the
{\tt -of} flag when running the {\em genmake2}
987 script, which generates the Makefile for compiling and linking MITgcm.
988 (Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
989 {\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More
990 detailed information about the use of {\em genmake2} for specifying
991 local compiler flags is located in section \ref{sect:genmake}.\\
992
993
994 \fbox{
995 \begin{minipage}{4.75in}
996 Directory: {\em tools/build\_options}\\
997 File: {\em tools/genmake2}
998 \end{minipage}
999 } \\
\paragraph{Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
1002 associated with the distribution of MPI installed on a system will
1003 describe how to start a program using that distribution.
1004 For the free, open-source MPICH system the MITgcm program is started
1005 using a command such as
1006 \begin{verbatim}
1007 mpirun -np 64 -machinefile mf ./mitgcmuv
1008 \end{verbatim}
1009 In this example the text {\em -np 64} specifies the number of processes
1010 that will be created. The numeric value {\em 64} must be equal to the
1011 product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The argument {\em -machinefile mf} specifies that a text file
called ``mf'' will be read to get a list of processor names on
1014 which the sixty-four processes will execute. The syntax of this file
1015 is specified by the MPI distribution.
1016 \\
1017
1018 \fbox{
1019 \begin{minipage}{4.75in}
1020 File: {\em SIZE.h}\\
1021 Parameter: {\em nPx}\\
1022 Parameter: {\em nPy}
1023 \end{minipage}
1024 } \\
1025
1026
1027 \paragraph{Environment variables}
1028 On most systems multi-threaded execution also requires the setting
1029 of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.
1034
1035 \paragraph{Runtime input parameters}
1036 Finally the file {\em eedata} needs to be configured to indicate
1037 the number of threads to be used in the x and y directions.
1038 The variables {\em nTx} and {\em nTy} in this file are used to
1039 specify the information required. The product of {\em nTx} and
1040 {\em nTy} must be equal to the number of threads spawned i.e.
1041 the setting of the environment variable PARALLEL.
1042 The value of {\em nTx} must subdivide the number of sub-domains
1043 in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
1044 number of sub-domains in y ({\em nSy}) exactly.
1045 The multiprocess startup of the MITgcm executable {\em mitgcmuv}
1046 is controlled by the routines {\em EEBOOT\_MINIMAL()} and
1047 {\em INI\_PROCS()}. The first routine performs basic steps required
1048 to make sure each process is started and has a textual output
1049 stream associated with it. By default two output files are opened
1050 for each process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.
The {\bf NNNN} part of the name is filled in with the process
1052 number so that process number 0 will create output files
1053 {\bf STDOUT.0000} and {\bf STDERR.0000}, process number 1 will create
1054 output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
1055 are used for reporting status and configuration information and
1056 for reporting error conditions on a process by process basis.
1057 The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
1058 {\em myProcId} and {\em MPI\_COMM\_MODEL}.
These variables are related
to processor identification and are used later in the routine
1061 {\em INI\_PROCS()} to allocate tiles to processes.
1062
Allocation of processes to tiles is controlled by the routine
{\em INI\_PROCS()}. For each process this routine sets
the variables {\em myXGlobalLo} and {\em myYGlobalLo}
(a schematic of this calculation is given after the parameter summary below).
1066 These variables specify in index space the coordinates
1067 of the southernmost and westernmost corner of the
1068 southernmost and westernmost tile owned by this process.
1069 The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
1070 are also set in this routine. These are used to identify
1071 processes holding tiles to the west, east, south and north
1072 of this process. These values are stored in global storage
1073 in the header file {\em EESUPPORT.h} for use by
1074 communication routines. The above does not hold when the
1075 exch2 package is used -- exch2 sets its own parameters to
1076 specify the global indices of tiles and their relationships
1077 to each other. See the documentation on the exch2 package
1078 (\ref{sec:exch2}) for
1079 details.
1080 \\
1081
1082 \fbox{
1083 \begin{minipage}{4.75in}
1084 File: {\em eesupp/src/eeboot\_minimal.F}\\
1085 File: {\em eesupp/src/ini\_procs.F} \\
1086 File: {\em eesupp/inc/EESUPPORT.h} \\
1087 Parameter: {\em myProcId} \\
1088 Parameter: {\em MPI\_COMM\_MODEL} \\
1089 Parameter: {\em myXGlobalLo} \\
1090 Parameter: {\em myYGlobalLo} \\
1091 Parameter: {\em pidW } \\
1092 Parameter: {\em pidE } \\
1093 Parameter: {\em pidS } \\
1094 Parameter: {\em pidN }
1095 \end{minipage}
1096 } \\
1097
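For a simple, regular layout of processes the index calculation performed
by {\em INI\_PROCS()} amounts to something like the sketch below. The
process coordinates {\em pi} and {\em pj} are assumed here for illustration;
the real routine handles the general cases, including the exch2 exception
noted above.

\begin{verbatim}
C     Schematic computation of the global index origin of the tiles
C     owned by a process, for a regular nPx x nPy process grid;
C     illustrative only. myProcId runs from 0 to nPx*nPy-1.
      INTEGER pi, pj
      pj = myProcId/nPx + 1
      pi = myProcId - (pj-1)*nPx + 1
      myXGlobalLo = 1 + (pi-1)*sNx*nSx
      myYGlobalLo = 1 + (pj-1)*sNy*nSy
\end{verbatim}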
1098
1099 \subsection{Controlling communication}
1100 The WRAPPER maintains internal information that is used for communication
1101 operations and that can be customized for different platforms. This section
1102 describes the information that is held and used.
1103
1104 \begin{enumerate}
1105 \item {\bf Tile-tile connectivity information}
For each tile the WRAPPER
records the tile number of the tiles to the north,
south, east and
west of that tile. This number is unique over all tiles in a
configuration. Except when using the cubed sphere and the exch2 package,
the number is held in the variables {\em tileNo}
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
1113 {\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
1114 that specifies the type of communication that is used between tiles.
1115 This information is held in the variables {\em tileCommModeN},
1116 {\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.
1117 This latter set of variables can take one of the following values
1118 {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
1119 A value of {\em COMM\_NONE} is used to indicate that a tile has no
1120 neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
memory communication is required to communicate between
these tile faces (see section \ref{sect:distributed_memory_communication}).
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
forms of shared memory communication (see section
\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
1127 that a CPU should communicate by writing to data structures owned by another
1128 CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
1129 from data structures owned by another CPU. These flags affect the behavior
1130 of the WRAPPER exchange primitive
1131 (see figure \ref{fig:communication_primitives}). The routine
1132 {\em ini\_communication\_patterns()} is responsible for setting the
1133 communication mode values for each tile.
1134
1135 When using the cubed sphere configuration with the exch2 package, the
1136 relationships between tiles and their communication methods are set
1137 by the package in other variables. See the exch2 package documentation
(\ref{sec:exch2}) for details.
1139
1140
1141
1142 \fbox{
1143 \begin{minipage}{4.75in}
1144 File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1145 File: {\em eesupp/inc/EESUPPORT.h} \\
1146 Parameter: {\em tileNo} \\
1147 Parameter: {\em tileNoE} \\
1148 Parameter: {\em tileNoW} \\
1149 Parameter: {\em tileNoN} \\
1150 Parameter: {\em tileNoS} \\
1151 Parameter: {\em tileCommModeE} \\
1152 Parameter: {\em tileCommModeW} \\
1153 Parameter: {\em tileCommModeN} \\
1154 Parameter: {\em tileCommModeS} \\
1155 \end{minipage}
1156 } \\
1157
1158 \item {\bf MP directives}
1159 The WRAPPER transfers control to numerical application code through
the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
that allows it to be invoked by several threads. Support for this
is based on using multi-processing (MP) compiler directives.
Most commercially available Fortran compilers support the generation
of code to spawn multiple threads through some form of compiler directives.
Because this is generally much more convenient than writing code that
interfaces to operating system libraries to explicitly spawn threads, and
because on some systems it may be the only method available, the WRAPPER
is distributed with template MP directives for a number of systems.
1169
1170 These directives are inserted into the code just before and after the
1171 transfer of control to numerical algorithm code through the routine
1172 {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows an example of
1173 the code that performs this process for a Silicon Graphics system.
1174 This code is extracted from the files {\em main.F} and
1175 {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads} specifies
1176 how many instances of the routine {\em THE\_MODEL\_MAIN} will
1177 be created. The value of {\em nThreads} is set in the routine
{\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to
the product of the parameters {\em nTx} and {\em nTy} that
1180 are read from the file {\em eedata}. If the value of {\em nThreads}
1181 is inconsistent with the number of threads requested from the
1182 operating system (for example by using an environment
1183 variable as described in section \ref{sect:multi_threaded_execution})
then an error will usually be reported by the routine
1185 {\em CHECK\_THREADS}.\\
1186
1187 \fbox{
1188 \begin{minipage}{4.75in}
1189 File: {\em eesupp/src/ini\_threading\_environment.F}\\
1190 File: {\em eesupp/src/check\_threads.F} \\
1191 File: {\em eesupp/src/main.F} \\
1192 File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1193 File: {\em eedata } \\
1194 Parameter: {\em nThreads} \\
1195 Parameter: {\em nTx} \\
1196 Parameter: {\em nTy} \\
1197 \end{minipage}
1198 }
1199
1200 \item {\bf memsync flags}
1201 As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
The routine {\em MEMSYNC()} is used for this purpose. This routine should
not need modifying and the information below is provided only for
completeness. A logical parameter {\em exchNeedsMemSync} set
1206 in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether
1207 the {\em MEMSYNC()} primitive is called. In general this
1208 routine is only used for multi-threaded execution.
1209 The code that goes into the {\em MEMSYNC()}
1210 routine is specific to the compiler and
1211 processor being used for multi-threaded execution and in general
1212 must be written using a short code snippet of assembly language.
1213 For an Ultra Sparc system the following code snippet is used
1214 \begin{verbatim}
1215 asm("membar #LoadStore|#StoreStore");
1216 \end{verbatim}
1217 for an Alpha based system the equivalent code reads
1218 \begin{verbatim}
1219 asm("mb");
1220 \end{verbatim}
1221 while on an x86 system the following code is required
1222 \begin{verbatim}
1223 asm("lock; addl $0,0(%%esp)": : :"memory")
1224 \end{verbatim}
1225
1226 \item {\bf Cache line size}
1227 As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
1229 coherence traffic on an SMP system. To do this the shared memory data structures
1230 used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
1231 are padded. The variables that control the padding are set in the
1232 header file {\em EEPARAMS.h}. These variables are called
1233 {\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
1234 {\em lShare8}. The default values should not normally need changing.
1235 \item {\bf \_BARRIER}
1236 This is a CPP macro that is expanded to a call to a routine
1237 which synchronizes all the logical processors running under the
1238 WRAPPER. Using a macro here preserves flexibility to insert
1239 a specialized call in-line into application code. By default this
1240 resolves to calling the procedure {\em BARRIER()}. The default
1241 setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1242
1243 \item {\bf \_GSUM}
1244 This is a CPP macro that is expanded to a call to a routine
1245 which sums up a floating point number
1246 over all the logical processors running under the
1247 WRAPPER. Using a macro here provides extra flexibility to insert
1248 a specialized call in-line into application code. By default this
1249 resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
1250 64-bit floating point operands)
1251 or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1252 setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
1253 The \_GSUM macro is a performance critical operation, especially for
1254 large processor count, small tile size configurations.
1255 The custom communication example discussed in section \ref{sect:jam_example}
1256 shows how the macro is used to invoke a custom global sum routine
1257 for a specific set of hardware.
1258
1259 \item {\bf \_EXCH}
1260 The \_EXCH CPP macro is used to update tile overlap regions.
1261 It is qualified by a suffix indicating whether overlap updates are for
two-dimensional ( \_EXCH\_XY ) or three-dimensional ( \_EXCH\_XYZ )
1263 physical fields and whether fields are 32-bit floating point
1264 ( \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point
1265 ( \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined
1266 in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
1267 \_EXCH operation plays a crucial role in scaling to small tile,
1268 large logical and physical processor count configurations.
The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.
1271
1272 The \_EXCH operation is also central to supporting grids such as
1273 the cube-sphere grid. In this class of grid a rotation may be required
between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
1276 be embedded within a custom form of the \_EXCH primitive. In these
1277 cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
1278 package documentation \ref{sec:exch2}.
1279
1280 \item {\bf Reverse Mode}
The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
1283 These reverse mode forms can be found in the
1284 source code directory {\em pkg/autodiff}.
1285 For the global sum primitive the reverse mode form
1286 calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
1288 exchange primitives are found in routines
1289 prefixed {\em ADEXCH}. The exchange routines make calls to
1290 the same low-level communication primitives as the forward mode
1291 operations. However, the routine argument {\em simulationMode}
is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
appropriate communication operations should be performed.
1295
1296 \item {\bf MAX\_NO\_THREADS}
1297 The variable {\em MAX\_NO\_THREADS} is used to indicate the
1298 maximum number of OS threads that a code will use. This
1299 value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
For single-threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference it, except in the following scenario.
1303
For certain physical parameterization schemes it is necessary to have
a substantial number of work arrays. Where these arrays are held
in statically allocated storage ( for example COMMON blocks ) multi-threaded
execution will require multiple instances of the COMMON block data.
This can be achieved using a Fortran 90 module construct. However,
if that is unavailable, the work arrays can instead be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
1311 and {\em nSy} ( as described in section
1312 \ref{sect:specifying_a_decomposition}). However, if the configuration
1313 being specified involves many more tiles than OS threads then
1314 it can save memory resources to reduce the variable
1315 {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
1316 will be used and to declare the physical parameterization
1317 work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
1318 An example of this is given in the verification experiment
1319 {\em aim.5l\_cs}. Here the default setting of
1320 {\em MAX\_NO\_THREADS} is altered to
1321 \begin{verbatim}
1322 INTEGER MAX_NO_THREADS
1323 PARAMETER ( MAX_NO_THREADS = 6 )
1324 \end{verbatim}
and several work arrays for storing intermediate calculations are
created with declarations of the form
1327 \begin{verbatim}
1328 common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1329 \end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent rather than temporary storage of state information.
In the case of permanent state information this approach cannot be used
because enough storage must be allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing
memory requirements in complex physical parameterizations. A sketch
showing how such a thread-indexed work array is typically accessed is
given just after this list.
1336 \end{enumerate}
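
To make the thread-indexed work array pattern referred to in the final
item above more concrete, the sketch below shows how a parameterization
routine might address its own slice of such an array using its thread
number. It is a minimal illustration only: the routine name
{\em PHYS\_WORK\_SKETCH} and the array size {\em ngp} are hypothetical,
while the {\em FORCIN} common block and the {\em MAX\_NO\_THREADS}
setting follow the {\em aim.5l\_cs} example quoted above.
\begin{verbatim}
C     Illustrative sketch only: a parameterization routine selects a
C     private slice of a shared work array using its thread number
C     (myThid). The routine name and the size ngp are hypothetical;
C     the COMMON block follows the aim.5l_cs example above.
      SUBROUTINE PHYS_WORK_SKETCH( myThid )
      IMPLICIT NONE
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS = 6 )
      INTEGER ngp
      PARAMETER ( ngp = 96 )
      INTEGER myThid
      REAL*8 sst1(ngp,MAX_NO_THREADS)
      COMMON /FORCIN/ sst1
      INTEGER J
C     Each thread writes only to column myThid, so no two threads
C     ever touch the same storage location.
      DO J=1,ngp
       sst1(J,myThid) = 0.D0
      ENDDO
      RETURN
      END
\end{verbatim}
Because each thread only ever references column {\em myThid}, the shared
common block behaves as if each thread had private scratch storage, at the
cost of {\em MAX\_NO\_THREADS} copies rather than one copy per tile.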
1337
1338 \begin{figure}
1339 \begin{verbatim}
1340 C--
1341 C-- Parallel directives for MIPS Pro Fortran compiler
1342 C--
1343 C Parallel compiler directives for SGI with IRIX
1344 C$PAR PARALLEL DO
1345 C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1346 C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1347 C
1348 DO I=1,nThreads
1349 myThid = I
1350
1351 C-- Invoke nThreads instances of the numerical model
1352 CALL THE_MODEL_MAIN(myThid)
1353
1354 ENDDO
1355 \end{verbatim}
1356 \caption{Prior to transferring control to
1357 the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
1358 MP directives to spawn multiple threads.
1359 } \label{fig:mp_directives}
1360 \end{figure}
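
For compilers that implement the OpenMP standard, the same thread-spawning
step can be expressed with portable OpenMP directives instead of the
vendor-specific MIPS Pro directives shown in figure
\ref{fig:mp_directives}. The fragment below is a sketch of that idea only;
it is not an extract of the distributed {\em MAIN\_PDIRECTIVES1.h}
templates, which may differ in detail.
\begin{verbatim}
C--
C-- Illustrative OpenMP form of the thread spawning step
C-- (a sketch only; the distributed templates may differ).
C$OMP PARALLEL DO PRIVATE(I,myThid) SHARED(nThreads)
      DO I=1,nThreads
       myThid = I

C--    Invoke nThreads instances of the numerical model
       CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}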
1361
1362
1363 \subsubsection{Specializing the Communication Code}
1364
The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles are powerful tools.
Here we show how they can be used to improve application performance and
how they can be used to adapt to new gridding approaches.
1369
1370 \subsubsection{JAM example}
1371 \label{sect:jam_example}
On some platforms a significant performance boost can be obtained by
binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example, the
shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
is used as an illustration of a specialized communication configuration
that substitutes for standard, portable forms of {\em \_EXCH} and
{\em \_GSUM}. It affects three source files: {\em eeboot.F},
{\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
it has the following effects.
1381 \begin{itemize}
1382 \item An extra phase is included at boot time to initialize the custom
1383 communications library ( see {\em ini\_jam.F}).
1384 \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
1387 for overlap regions of width one) is substituted into the elliptic
1388 solver routine {\em cg2d.F}.
1389 \end{itemize}
1390 Developing specialized code for other libraries follows a similar
1391 pattern.
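
To give a flavor of what such a specialization involves, the fragment
below sketches how the macro bindings in {\em CPP\_EEMACROS.h} could be
switched to the custom JAM routines when {\em LETS\_MAKE\_JAM} is defined.
It is schematic rather than a verbatim extract of the MITgcm header: the
routine names {\em GSUM\_JAM} and {\em EXCH\_JAM\_XY\_R8} and the argument
lists are illustrative placeholders for the contents of
{\em gsum\_jam.F} and {\em exch\_jam.F}.
\begin{verbatim}
C     Schematic only, not a verbatim extract of CPP_EEMACROS.h.
C     When LETS_MAKE_JAM is defined the communication macros are
C     bound to the specialized JAM routines; otherwise they keep
C     their portable defaults. Routine names and argument lists
C     are illustrative placeholders.
#ifdef LETS_MAKE_JAM
#define _GSUM(a,myThid)       CALL GSUM_JAM( a, myThid )
#define _EXCH_XY_R8(a,myThid) CALL EXCH_JAM_XY_R8( a, myThid )
#else
#define _GSUM(a,myThid)       CALL GLOBAL_SUM_R8( a, myThid )
#define _EXCH_XY_R8(a,myThid) CALL EXCH_XY_R8( a, myThid )
#endif
\end{verbatim}
Because kernel and package code only ever refer to the macros, switching
the binding requires no changes to the numerical source itself.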
1392
1393 \subsubsection{Cube sphere communication}
1394 \label{sect:cube_sphere_communication}
1395 Actual {\em \_EXCH} routine code is generated automatically from
1396 a series of template files, for example {\em exch\_rx.template}.
1397 This is done to allow a large number of variations on the exchange
1398 process to be maintained. One set of variations supports the
1399 cube sphere grid. Support for a cube sphere grid in MITgcm is based
1400 on having each face of the cube as a separate tile or tiles.
1401 The exchange routines are then able to absorb much of the
1402 detailed rotation and reorientation required when moving around the
1403 cube grid. The set of {\em \_EXCH} routines that contain the
1404 word cube in their name perform these transformations.
1405 They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for scalar and vector quantities and for
grid-centered, grid-face and grid-corner quantities.
1410 Three sets of exchange routines are defined. Routines
1411 with names of the form {\em exch\_rx} are used to exchange
1412 cell centered scalar quantities. Routines with names of the form
1413 {\em exch\_uv\_rx} are used to exchange vector quantities located at
1414 the C-grid velocity points. The vector quantities exchanged by the
1415 {\em exch\_uv\_rx} routines can either be signed (for example velocity
components) or unsigned (for example grid-cell separations).
1417 Routines with names of the form {\em exch\_z\_rx} are used to exchange
1418 quantities at the C-grid vorticity point locations.
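
The fragment below sketches how these exchange families are typically
invoked from model-level code. The array dimensioning follows the tile
and overlap conventions described earlier in this chapter, and the
include files named are assumed to supply those parameters; the routine
and macro argument lists shown are schematic illustrations of the naming
patterns above rather than exact interfaces.
\begin{verbatim}
C     Schematic sketch only; routine and macro argument lists are
C     illustrative of the naming patterns described above.
      SUBROUTINE EXCH_SKETCH( theta, uVel, vVel, myThid )
      IMPLICIT NONE
C     SIZE.h supplies sNx, sNy, OLx, OLy, nSx, nSy and Nr;
C     CPP_EEMACROS.h supplies the _EXCH macro mappings.
#include "SIZE.h"
#include "CPP_EEMACROS.h"
      REAL*8 theta(1-OLx:sNx+OLx,1-OLy:sNy+OLy,Nr,nSx,nSy)
      REAL*8 uVel (1-OLx:sNx+OLx,1-OLy:sNy+OLy,Nr,nSx,nSy)
      REAL*8 vVel (1-OLx:sNx+OLx,1-OLy:sNy+OLy,Nr,nSx,nSy)
      INTEGER myThid

C     Cell-centered scalar: exchanged through the exch_rx family.
C     With useCubedSphereExchange set true the same call performs
C     the inter-face rotations described above.
      _EXCH_XYZ_R8( theta, myThid )

C     C-grid vector components are exchanged as a pair through the
C     exch_uv_rx family so that cube-edge rotations (and, for signed
C     fields such as velocity, sign changes) are applied consistently.
C     The final logical argument marking the pair as signed is
C     illustrative.
      CALL EXCH_UV_XYZ_R8( uVel, vVel, .TRUE., myThid )

      RETURN
      END
\end{verbatim}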
1419
1420
1421
1422
1423 \section{MITgcm execution under WRAPPER}
1424 \begin{rawhtml}
1425 <!-- CMIREDIR:mitgcm_wrapper: -->
1426 \end{rawhtml}
1427
Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.
1431
1432 \subsection{Annotated call tree for MITgcm and WRAPPER}
1433 \label{sect:calling_sequence}
1434
1435 WRAPPER layer.
1436
1437 {\footnotesize
1438 \begin{verbatim}
1439
1440 MAIN
1441 |
1442 |--EEBOOT :: WRAPPER initialization
1443 | |
1444 | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1445 | | allow basic I/O.
1446 | |-- EEINTRO_MSG :: Write startup greeting.
1447 | |
1448 | |-- EESET_PARMS :: Set WRAPPER parameters
1449 | |
1450 | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1451 | |
1452 | |-- INI_PROCS :: Associate processes with grid regions.
1453 | |
1454 | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1455 | |
1456 | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1457 | :: communication data structures
1458 |
1459 |
1460 |--CHECK_THREADS :: Validate multiple thread start up.
1461 |
1462 |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1463
1464 \end{verbatim}
1465 }
1466
1467 Core equations plus packages.
1468
1469 {\footnotesize
1470 \begin{verbatim}
1471 C
1472 C
1473 C Invocation from WRAPPER level...
1474 C :
1475 C :
1476 C |
1477 C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1478 C | :: Called from WRAPPER level numerical
1479 C | :: code invocation routine. On entry
1480 C | :: to THE_MODEL_MAIN separate threads and
1481 C | :: separate processes will have been established.
1482 C | :: Each thread and process will have a unique ID
1483 C | :: but as yet it will not be associated with a
1484 C | :: specific region in decomposed discrete space.
1485 C |
1486 C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1487 C | | :: grid, solver matrices etc..
1488 C | |
1489 C | |-INI_PARMS :: Routine to set kernel model parameters.
1490 C | | :: By default kernel parameters are read from file
1491 C | | :: "data" in directory in which code executes.
1492 C | |
1493 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1494 C | |
1495 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1496 C | | | :: Grid arrays are held and described in GRID.h.
1497 C | | |
1498 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1499 C | | |
1500 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1501 C | | | :: (calculate grid from kernel parameters).
1502 C | | |
1503 C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1504 C | | | :: initialization (calculate grid from
1505 C | | | :: kernel parameters).
1506 C | | |
1507 C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1508 C | | :: grid initializations. ( input from raw
1509 C | | :: grid files, LONC.bin, DXF.bin etc... )
1510 C | |
1511 C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1512 C | |
1513 C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1514 C | | :: land masking for solid-fluid boundaries.
1515 C | |
1516 C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1517 C | |
1518 C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1519 C | | :: sphere options are coded.
1520 C | |
1521 C | |-PACKAGES_BOOT :: Start up the optional package environment.
1522 C | | :: Runtime selection of active packages.
1523 C | |
1524 C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1525 C | | |
1526 C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1527 C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1528 C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1529 C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1530 C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1531 C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1532 C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1533 C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1534 C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1535 C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1536 C | |
1537 C | |-PACKAGES_CHECK
1538 C | | |
1539 C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1540 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1541 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1542 C | |
1543 C | |-PACKAGES_INIT_FIXED
1544 C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1545 C | | |-FLT_INIT :: Floats Package. see pkg/flt
1546 C | |
1547 C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1548 C | |
1549 C | |-INI_CG2D :: 2d con. grad solver initialisation.
1550 C | |
1551 C | |-INI_CG3D :: 3d con. grad solver initialisation.
1552 C | |
1553 C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1554 C | :: Includes annotated table of kernel
1555 C | :: parameter settings.
1556 C |
1557 C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1558 C |
1559 C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1560 C | :: Automatically generated by TAMC/TAF.
1561 C |
1562 C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1563 C |
1564 C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1565 C |
1566 C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1567 C | |
1568 C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1569 C | | | :: variables
1570 C | | |
1571 C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1572 C | | |
1573 C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1574 C | | | :: sphere options are coded.
1575 C | | |
1576 C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1577 C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1578 C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1579 C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1580 C | | | :: fields).
1581 C | | |
1582 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1583 C | | | |-INI_VEL :: Initialize 3D flow field.
1584 C | | | |-INI_THETA :: Set model initial temperature field.
1585 C | | | |-INI_SALT :: Set model initial salinity field.
1586 C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1587 C | | |
1588 C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1589 C | | |
1590 C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1591 C | | | | :: Here applied to move restart conditions
1592 C | | | | :: (saved in mid timestep) to correct level in
1593 C | | | | :: time (only used for pre-c35).
1594 C | | | |
1595 C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1596 C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1597 C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1598 C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1599 C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1600 C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1601 C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1602 C | | | | |-FIND_RHO :: Find adjacent densities.
1603 C | | | | |-CONVECT :: Mix static instability.
1604 C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1605 C | | | |
1606 C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1607 C | | |
1608 C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1609 C | | | | :: Extra time history interactions.
1610 C | | | |
1611 C | | | |-FIND_RHO :: Find adjacent densities.
1612 C | | | |-CONVECT :: Mix static instability.
1613 C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1614 C | | |
1615 C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1616 C | | | | :: package data.
1617 C | | | |
1618 C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1619 C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1620 C | | | |-KPP_OPEN_DIAGS
1621 C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1622 C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1623 C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1624 C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1625 C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1626 C | | | |-INI_FORCING :: Set model initial forcing fields.
1627 C | | | | :: Either set in-line or from file as shown.
1628 C | | | |-READ_FLD_XY_RS(zonalWindFile)
1629 C | | | |-READ_FLD_XY_RS(meridWindFile)
1630 C | | | |-READ_FLD_XY_RS(surfQFile)
1631 C | | | |-READ_FLD_XY_RS(EmPmRfile)
1632 C | | | |-READ_FLD_XY_RS(thetaClimFile)
1633 C | | | |-READ_FLD_XY_RS(saltClimFile)
1634 C | | | |-READ_FLD_XY_RS(surfQswFile)
1635 C | | |
1636 C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1637 C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1638 C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1639 C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1640 C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1641 C | |
1642 C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1643 C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1644 C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1645 C | |
1646 C | |-MONITOR :: Monitor state ( see pkg/monitor )
1647 C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1648 C====|>|
1649 C====|>| ****************************
1650 C====|>| BEGIN MAIN TIMESTEPPING LOOP
1651 C====|>| ****************************
1652 C====|>|
1653 C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1654 C/\ | | |
1655 C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1656 C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1657 C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1658 C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1659 C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1660 C/\ | | | | :: Simple interpolation between end-points
1661 C/\ | | | | :: for forcing datasets.
1662 C/\ | | | |
1663 C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1664 C/\ | | |
1665 C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1666 C/\ | | | |
1667 C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1668 C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1669 C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1670 C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1671 C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1672 C/\ | | | |
1673 C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1674 C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1675 C/\ | | | | :: forcing terms.
1676 C/\ | | | |
1677 C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1678 C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1679 C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1680 C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1681 C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1682 C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1683 C/\ | | | | advection case) (see pkg/gad).
1684 C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1685 C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1686 C/\ | | | | |
1687 C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1688 C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1689 C/\ | | | |
1690 C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1691 C/\ | | | | |
1692 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1693 C/\ | | | | | :: ( see pkg/gad )
1694 C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1695 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1696 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1697 C/\ | | | |
1698 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1699 C/\ | | | |
1700 C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1701 C/\ | | | | |
1702 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1703 C/\ | | | | | :: ( see pkg/gad )
1704 C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1705 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1706 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1707 C/\ | | | |
1708 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1709 C/\ | | | |
1710 C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1711 C/\ | | | | |
1712 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1713 C/\ | | | | | :: ( see pkg/gad )
1714 C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1715 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1716 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1717 C/\ | | | |
1718 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1719 C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1720 C/\ | | | |-FREEZE :: Limit range of temperature.
1721 C/\ | | | |
1722 C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1723 C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1724 C/\ | | | |
1725 C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1726 C/\ | | | |-EXCH :: Update overlaps
1727 C/\ | | |
1728 C/\ | | |-DYNAMICS :: Momentum equations driver.
1729 C/\ | | | |
1730 C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1731 C/\ | | | | potential anomaly.
1732 C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1733 C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1734 C/\ | | | |
1735 C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1736 C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1737 C/\ | | | | pkg/mom_fluxform ).
1738 C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1739 C/\ | | | | pkg/mom_vecinv ).
1740 C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1741 C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1742 C/\ | | | |
1743 C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1744 C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1745 C/\ | | | |
1746 C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1747 C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1748 C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1749 C/\ | | |
1750 C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1751 C/\ | | |
1752 C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1753 C/\ | | |
1754 C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1755 C/\ | | |
1756 C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1757 C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1758 C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1759 C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1760 C/\ | | |
1761 C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1762 C/\ | | | |
1763 C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1764 C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1765 C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1766 C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1767 C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1768 C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1769 C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1770 C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1771 C/\ | | | | |-CONVECT :: Mix static instability.
1772 C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1773 C/\ | | | |
1774 C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1775 C/\ | | |
1776 C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1777 C/\ | | | |-EXCH
1778 C/\ | | |
1779 C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1780 C/\ | | |
1781 C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1782 C/\ | | |
1783 C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1784 C/\ | | | |-WRITE_STATE :: Core state I/O
1785 C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1786 C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1787 C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1788 C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1789 C/\ | | |
1790 C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1791 C/\ | |
1792 C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1793 C<===|=|
1794 C<===|=| **************************
1795 C<===|=| END MAIN TIMESTEPPING LOOP
1796 C<===|=| **************************
1797 C<===|=|
1798 C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1799 C |
1800 C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1801 C |
1802 C |-TIMER_PRINTALL :: Computational timing summary
1803 C |
1804 C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1805 C :: events.
1806 C
1807 \end{verbatim}
1808 }
1809
1810 \subsection{Measuring and Characterizing Performance}
1811
1812 TO BE DONE (CNH)
1813
1814 \subsection{Estimating Resource Requirements}
1815
1816 TO BE DONE (CNH)
1817
1818 \subsubsection{Atlantic 1/6 degree example}
1819 \subsubsection{Dry Run testing}
1820 \subsubsection{Adjoint Resource Requirements}
1821 \subsubsection{State Estimation Environment Resources}
1822
