1 % $Header: /u/u0/gcmpack/manual/part4/sarch.tex,v 1.6 2001/11/13 20:13:55 adcroft Exp $
2
3 This chapter focuses on describing the {\bf WRAPPER} environment within which
4 both the core numerics and the pluggable packages operate. The description
5 presented here is intended to be a detailed exposition and contains significant
6 background material, as well as advanced details on working with the WRAPPER.
7 The tutorial sections of this manual (see Chapters
8 \ref{chap:tutorialI}, \ref{chap:tutorialII} and \ref{chap:tutorialIII})
9 contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
11 projects simply starting from an example code and adapting it to suit a
12 particular situation
13 will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of the tools and operating system
features that are currently employed.
18
19 \section{Overall architectural goals}
20
21 Broadly, the goals of the software architecture employed in MITgcm are
22 three-fold
23
24 \begin{itemize}
25 \item We wish to be able to study a very broad range
26 of interesting and challenging rotating fluids problems.
27 \item We wish the model code to be readily targeted to
a wide range of platforms.
29 \item On any given platform we would like to be
30 able to achieve performance comparable to an implementation
31 developed and specialized specifically for that platform.
32 \end{itemize}
33
34 These points are summarized in figure \ref{fig:mitgcm_architecture_goals}
35 which conveys the goals of the MITgcm design. The goals lead to
a software architecture which at a high level can be viewed as consisting
37 of
38
39 \begin{enumerate}
40 \item A core set of numerical and support code. This is discussed in detail in
41 section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
43 for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
44 These packages are used both to overlay alternate dynamics and to introduce
45 specialized physical content onto the core numerical code. An overview of
46 the {\bf package} scheme is given at the start of part \ref{part:packages}.
47 \item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
48 Programming Environment Resource), within which the core numerics and pluggable
49 packages operate.
50 \end{enumerate}
51
61
62
63 \begin{figure}
64 \begin{center}
65 \resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
66 \end{center}
67 \caption{
68 The MITgcm architecture is designed to allow simulation of a wide
69 range of physical problems on a wide range of hardware. The computational
70 resource requirements of the applications targeted range from around
71 $10^7$ bytes ( $\approx 10$ megabytes ) of memory to $10^{11}$ bytes
72 ( $\approx 100$ gigabytes). Arithmetic operation counts for the applications of
73 interest range from $10^{9}$ floating point operations to more than $10^{17}$
74 floating point operations.}
75 \label{fig:mitgcm_architecture_goals}
76 \end{figure}
77
78 \section{WRAPPER}
79
80 A significant element of the software architecture utilized in
81 MITgcm is a software superstructure and substructure collectively
82 called the WRAPPER (Wrappable Application Parallel Programming
83 Environment Resource). All numerical and support code in MITgcm is written
84 to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
85 the WRAPPER means that coding has to follow certain, relatively
86 straightforward, rules and conventions ( these are discussed further in
87 section \ref{sect:specifying_a_decomposition} ).
88
89 The approach taken by the WRAPPER is illustrated in figure
90 \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
91 that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.
93
94
95 \begin{figure}
96 \begin{center}
97 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
98 \end{center}
99 \caption{
Numerical code is written to fit within a software support
101 infrastructure called WRAPPER. The WRAPPER is portable and
102 can be specialized for a wide range of specific target hardware and
103 programming environments, without impacting numerical code that fits
104 within the WRAPPER. Codes that fit within the WRAPPER can generally be
105 made to run as fast on a particular platform as codes specially
106 optimized for that platform.}
107 \label{fig:fit_in_wrapper}
108 \end{figure}
109
110 \subsection{Target hardware}
111 \label{sect:target_hardware}
112
The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
multi-processor, CRAY Y-MP system. On that system, numerical code performance
and scaling under the WRAPPER exceeded that of an implementation that
was tightly bound to the CRAY system's proprietary multi-tasking and
micro-tasking approach. Later developments have been carried out on
uniprocessor and multi-processor Sun systems with both uniform memory access
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also used routinely on
large scale MPP systems (for example T3E systems and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain
native optimizations for a particular system \ref{ref hoe and hill, ecmwf}.
128
129 \subsection{Supporting hardware neutrality}
130
131 The different systems listed in section \ref{sect:target_hardware} can be
132 categorized in many different ways. For example, one common distinction is
133 between shared-memory parallel systems (SMP's, PVP's) and distributed memory
134 parallel systems (for example x86 clusters and large MPP systems). This is one
135 example of a difference between compute platforms that can impact an
136 application. Another common distinction is between vector processing systems
137 with highly specialized CPU's and memory subsystems and commodity
138 microprocessor based systems. There are numerous other differences, especially
139 in relation to how parallel execution is supported. To capture the essential
140 differences between different platforms the WRAPPER uses a {\it machine model}.
141
142 \subsection{WRAPPER machine model}
143
144 Applications using the WRAPPER are not written to target just one
145 particular machine (for example an IBM SP2) or just one particular family or
146 class of machines (for example Parallel Vector Processor Systems). Instead the
147 WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
149 easily be specialized to fit, in a computationally efficient manner, any
150 computer architecture currently available to the scientific computing community.
151
152 \subsection{Machine model parallelism}
153
154 Codes operating under the WRAPPER target an abstract machine that is assumed to
155 consist of one or more logical processors that can compute concurrently.
156 Computational work is divided among the logical
157 processors by allocating ``ownership'' to
158 each processor of a certain set (or sets) of calculations. Each set of
159 calculations owned by a particular processor is associated with a specific
region of the physical space that is being simulated; only one processor will
161 be associated with each such region (domain decomposition).
162
163 In a strict sense the logical processors over which work is divided do not need
164 to correspond to physical processors. It is perfectly possible to execute a
165 configuration decomposed for multiple logical processors on a single physical
166 processor. This helps ensure that numerical code that is written to fit
167 within the WRAPPER will parallelize with no additional effort and is
168 also useful when debugging codes. Generally, however,
169 the computational domain will be subdivided over multiple logical
170 processors in order to then bind those logical processors to physical
171 processor resources that can compute in parallel.
172
173 \subsubsection{Tiles}
174
175 Computationally, associated with each region of physical
176 space allocated to a particular logical processor, there will be data
177 structures (arrays, scalar variables etc...) that hold the simulated state of
178 that region. We refer to these data structures as being {\bf owned} by the
179 processor to which their
180 associated region of physical space has been allocated. Individual
181 regions that are allocated to processors are called {\bf tiles}. A
182 processor can own more
183 than one tile. Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a single
185 region of the domain (a single tile). Except for periods of
186 communication and coordination, each processor computes autonomously, working
only with data from the tile (or tiles) that the processor owns. When multiple
tiles are allotted to a single processor, each tile is computed
independently of the other tiles, in a sequential fashion.
190
191 \begin{figure}
192 \begin{center}
193 \resizebox{5in}{!}{
194 \includegraphics{part4/domain_decomp.eps}
195 }
196 \end{center}
197 \caption{ The WRAPPER provides support for one and two dimensional
198 decompositions of grid-point domains. The figure shows a hypothetical domain of
199 total size $N_{x}N_{y}N_{z}$. This hypothetical domain is decomposed in
200 two-dimensions along the $N_{x}$ and $N_{y}$ directions. The resulting {\bf
201 tiles} are {\bf owned} by different processors. The {\bf owning}
202 processors perform the
203 arithmetic operations associated with a {\bf tile}. Although not illustrated
204 here, a single processor can {\bf own} several {\bf tiles}.
205 Whenever a processor wishes to transfer data between tiles or
206 communicate with other processors it calls a WRAPPER supplied
207 function.
208 } \label{fig:domaindecomp}
209 \end{figure}
210
211
212 \subsubsection{Tile layout}
213
214 Tiles consist of an interior region and an overlap region. The overlap region
215 of a tile corresponds to the interior region of an adjacent tile.
216 In figure \ref{fig:tiledworld} each tile would own the region
217 within the black square and hold duplicate information for overlap
218 regions extending into the tiles to the north, south, east and west.
219 During
220 computational phases a processor will reference data in an overlap region
whenever it requires values that lie outside the domain it owns. Periodically
222 processors will make calls to WRAPPER functions to communicate data between
223 tiles, in order to keep the overlap regions up to date (see section
224 \ref{sect:communication_primitives}). The WRAPPER functions can use a
225 variety of different mechanisms to communicate data between tiles.
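
As an illustration of the tile layout described above, the declaration below
shows one plausible way a tiled two-dimensional field can be dimensioned. It
follows the same convention as the {\em cg2d\_r} array in the example of
figure \ref{fig:bibj_extract} later in this chapter, but the field name
{\em phiTile} is purely illustrative.
\begin{verbatim}
C     Illustrative declaration of a tiled two-dimensional field.
C     Indices 1..sNx and 1..sNy address the interior of a tile; the
C     additional OLx and OLy points on each side hold the overlap
C     region that mirrors the interior of neighboring tiles.
      REAL*8 phiTile(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
\end{verbatim}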
226
227 \begin{figure}
228 \begin{center}
229 \resizebox{5in}{!}{
230 \includegraphics{part4/tiled-world.eps}
231 }
232 \end{center}
233 \caption{ A global grid subdivided into tiles.
Tiles contain an interior region and an overlap region.
235 Overlap regions are periodically updated from neighboring tiles.
236 } \label{fig:tiledworld}
237 \end{figure}
238
239 \subsection{Communication mechanisms}
240
241 Logical processors are assumed to be able to exchange information
242 between tiles and between each other using at least one of two possible
243 mechanisms.
244
245 \begin{itemize}
246 \item {\bf Shared memory communication}.
247 Under this mode of communication data transfers are assumed to be possible
248 using direct addressing of regions of memory. In this case a CPU is able to read
(and write) directly to regions of memory ``owned'' by another CPU
using simple programming language level assignment operations of
the sort shown in figure \ref{fig:simple_assign}. In this way one CPU
252 (CPU1 in the figure) can communicate information to another CPU (CPU2 in the
253 figure) by assigning a particular value to a particular memory location.
254
255 \item {\bf Distributed memory communication}.
Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned by and visible to another CPU. Instead
a communication library must be used, as illustrated in figure
\ref{fig:comm_msg}. In this case a CPU must call a function in the API of the
communication library to communicate data from a tile that it owns to a tile
that another CPU owns. By default the WRAPPER binds to the MPI communication
library \ref{MPI} for this style of communication.
263 \end{itemize}
264
265 The WRAPPER assumes that communication will use one of these two styles
266 of communication. The underlying hardware and operating system support
267 for the style used is not specified and can vary from system to system.
268
269 \begin{figure}
\begin{verbatim}

             CPU1                    |        CPU2
             ====                    |        ====
                                     |
      a(3) = 8                       |    WHILE ( a(3) .NE. 8 )
                                     |     WAIT
                                     |    END WHILE
                                     |
\end{verbatim}
\caption{ In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level.
For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
287 } \label{fig:simple_assign}
288 \end{figure}
289
290 \begin{figure}
\begin{verbatim}

             CPU1                    |        CPU2
             ====                    |        ====
                                     |
      a(3) = 8                       |    WHILE ( a(3) .NE. 8 )
      CALL SEND( CPU2,a(3) )         |     CALL RECV( CPU1, a(3) )
                                     |    END WHILE
                                     |
\end{verbatim}
301 \caption{ In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
307 } \label{fig:comm_msg}
308 \end{figure}
309
310 \subsection{Shared memory communication}
311 \label{sect:shared_memory_communication}
312
Under shared memory communication independent CPUs operate
on the exact same global address space at the application level.
This means that CPU 1 can directly write into global
data structures that CPU 2 ``owns'' using a simple
assignment at the application level.
This model of memory access is supported at the basic system
design level in ``shared-memory'' systems such as PVP systems, SMP systems,
and on distributed shared memory systems (the SGI Origin).
On such systems the WRAPPER will generally use simple read and write statements
to directly access application data structures when communicating between CPUs.
323
In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to
hardware instructions that transport data between CPU and memory banks, this
can be a very efficient mechanism for communication. In this case two CPUs,
CPU1 and CPU2, can communicate simply by reading and writing to an
agreed location and following a few basic rules. The latency of this sort
of communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPUs communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
communication very efficient provided it is used appropriately.
335
336 \subsubsection{Memory consistency}
337 \label{sect:memory_consistency}
338
339 When using shared memory communication between
340 multiple processors the WRAPPER level shields user applications from
341 certain counter-intuitive system behaviors. In particular, one issue the
WRAPPER layer must deal with is a system's memory model. In general the order
of reads and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will always
operate so that, for the application instructions the processor is executing,
any reordering is not apparent. However, machines are often
designed so that reordering of instructions is not hidden from other
processors. This means that, in general, even on a shared memory system two
350 processors can observe inconsistent memory values.
351
The issue of memory consistency between multiple processors is discussed at
length in many computer science papers; however, from a practical point of
view, in order to deal with this issue, shared memory machines all provide
355 some mechanism to enforce memory consistency when it is needed. The exact
356 mechanism employed will vary between systems. For communication using shared
357 memory, the WRAPPER provides a place to invoke the appropriate mechanism to
358 ensure memory consistency for a particular platform.
359
360 \subsubsection{Cache effects and false sharing}
361 \label{sect:cache_effects_and_false_sharing}
362
Shared-memory machines often have processor-local memory caches
which contain mirrored copies of main memory. Automatic cache-coherence
365 protocols are used to maintain consistency between caches on different
366 processors. These cache-coherence protocols typically enforce consistency
367 between regions of memory with large granularity (typically 128 or 256 byte
368 chunks). The coherency protocols employed can be expensive relative to other
369 memory accesses and so care is taken in the WRAPPER (by padding synchronization
370 structures appropriately) to avoid unnecessary coherence traffic.
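
A minimal sketch of the padding idea is shown below. The array name, the
thread count and the assumed cache line size of sixteen REAL*8 words are all
illustrative assumptions; the padding constants actually used by the WRAPPER
are kept in the header file {\em EEPARAMS.h}.
\begin{verbatim}
C     Hedged illustration of padding to avoid false sharing: each
C     thread accumulates into its own cache line rather than into
C     adjacent words of the same line.  All names and sizes here
C     are assumptions made for the purpose of the example.
      INTEGER cacheLineWords, maxThreads
      PARAMETER ( cacheLineWords = 16, maxThreads = 32 )
      REAL*8 threadSum( cacheLineWords, maxThreads )
C     Thread myThid only ever touches threadSum(1,myThid); the
C     remaining cacheLineWords-1 words of each column are padding.
\end{verbatim}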
371
\subsubsection{Operating system support for shared memory}
373
374 Applications running under multiple threads within a single process can
375 use shared memory communication. In this case {\it all} the memory locations
376 in an application are potentially visible to all the compute threads. Multiple
377 threads operating within a single process is the standard mechanism for
378 supporting shared memory that the WRAPPER utilizes. Configuring and launching
379 code to run in multi-threaded mode on specific platforms is discussed in
380 section \ref{sect:running_with_threads}. However, on many systems, potentially
381 very efficient mechanisms for using shared memory communication between
382 multiple processes (in contrast to multiple threads within a single
383 process) also exist. In most cases this works by making a limited region of
384 memory shared between processes. The MMAP \ref{magicgarden} and
385 IPC \ref{magicgarden} facilities in UNIX systems provide this capability as do
386 vendor specific tools like LAPI \ref{IBMLAPI} and IMC \ref{Memorychannel}.
387 Extensions exist for the WRAPPER that allow these mechanisms
388 to be used for shared memory communication. However, these mechanisms are not
389 distributed with the default WRAPPER sources, because of their proprietary
390 nature.
391
392 \subsection{Distributed memory communication}
393 \label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example, cluster systems consist of individual computers
connected by a fast network. On such systems there is no notion of shared memory
at the system level. For this sort of system the WRAPPER provides support
for communication based on a bespoke communication library
(see figure \ref{fig:comm_msg}). The default communication library used is MPI
\ref{mpi}. However, it is relatively straightforward to implement bindings to
optimized platform specific communication libraries. For example the work
described in \ref{hoe-hill:99} replaced standard MPI communication with a
highly optimized library.
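
The stand-alone sketch below expands the schematic of figure \ref{fig:comm_msg}
into calls to the standard MPI API that the WRAPPER binds to by default. It is
an illustration rather than an extract from the WRAPPER source; the rank test,
message tag and array size are arbitrary choices.
\begin{verbatim}
C     Illustration of distributed memory communication with MPI:
C     rank 0 plays the role of CPU1 and rank 1 the role of CPU2
C     in the figure above.
      PROGRAM COMMDEMO
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      REAL*8  a(10)
      INTEGER myRank, ierr, stat(MPI_STATUS_SIZE)
      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myRank, ierr )
      IF ( myRank .EQ. 0 ) THEN
C       The value only becomes visible to the other process
C       through an explicit send.
        a(3) = 8.
        CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION, 1, 0,
     &                 MPI_COMM_WORLD, ierr )
      ELSEIF ( myRank .EQ. 1 ) THEN
        CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION, 0, 0,
     &                 MPI_COMM_WORLD, stat, ierr )
      ENDIF
      CALL MPI_FINALIZE( ierr )
      END
\end{verbatim}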
405
406 \subsection{Communication primitives}
407 \label{sect:communication_primitives}
408
409 \begin{figure}
410 \begin{center}
411 \resizebox{5in}{!}{
412 \includegraphics{part4/comm-primm.eps}
413 }
414 \end{center}
415 \caption{Three performance critical parallel primitives are provided
416 by the WRAPPER. These primitives are always used to communicate data
417 between tiles. The figure shows four tiles. The curved arrows indicate
418 exchange primitives which transfer data between the overlap regions at tile
419 edges and interior regions for nearest-neighbor tiles.
420 The straight arrows symbolize global sum operations which connect all tiles.
The global sum operation is both a key arithmetic primitive and can
serve as a synchronization primitive. A third primitive, the barrier, is also
provided; it behaves much like the global sum primitive.
424 } \label{fig:communication_primitives}
425 \end{figure}
426
427
The WRAPPER assumes that optimized communication support may be available
for a small number of communication operations, and that communication
performance optimizations can be achieved by optimizing just these few
primitives. Three optimizable primitives are provided by the WRAPPER; a
schematic calling sequence is sketched in the code fragment following the
list below.
433 \begin{itemize}
434 \item{\bf EXCHANGE} This operation is used to transfer data between interior
435 and overlap regions of neighboring tiles. A number of different forms of this
436 operation are supported. These different forms handle
437 \begin{itemize}
438 \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
439 separately.
440 \item Bindings to different communication methods.
441 Exchange primitives select between using shared memory or distributed
442 memory communication.
443 \item Transformation operations required when transporting
444 data between different grid regions. Transferring data
445 between faces of a cube-sphere grid, for example, involves a rotation
446 of vector components.
447 \item Forward and reverse mode computations. Derivative calculations require
448 tangent linear and adjoint forms of the exchange primitives.
449
450 \end{itemize}
451
452 \item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
453 operation for the pressure inversion phase of the MITgcm algorithm.
454 For certain configurations scaling can be highly sensitive to
455 the performance of the global sum primitive. This operation is a collective
456 operation involving all tiles of the simulated domain. Different forms
457 of the global sum primitive exist for handling
458 \begin{itemize}
459 \item Data type differences. Sixty-four bit and thirty-two bit fields may be handled
460 separately.
\item Bindings to different communication methods.
Global sum primitives select between using shared memory or distributed
memory communication.
\item Forward and reverse mode computations. Derivative calculations require
tangent linear and adjoint forms of the global sum primitive.
466 \end{itemize}
467
468 \item{\bf BARRIER} The WRAPPER provides a global synchronization function
469 called barrier. This is used to synchronize computations over all tiles.
470 The {\bf BARRIER} and {\bf GLOBAL SUM} primitives have much in common and in
471 some cases use the same underlying code.
472 \end{itemize}
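
The fragment below sketches how these primitives typically appear in
numerical code: tile-local computation inside a {\em bi},{\em bj} loop,
followed by a collective call. The call to {\em GLOBAL\_SUM\_R8} and the loop
bounds follow the example of figure \ref{fig:bibj_extract} later in this
chapter; the exchange step is left as a comment because the name of the
exchange routine depends on the field being updated.
\begin{verbatim}
C     Schematic use of the WRAPPER primitives (not a verbatim extract).
      REAL*8  localSum
      INTEGER bi, bj
      localSum = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
C        ... tile-local computation accumulating into localSum ...
       ENDDO
      ENDDO
C     Collective arithmetic/synchronization across all tiles and processes
      CALL GLOBAL_SUM_R8( localSum, myThid )
C     An exchange call for the relevant field would then refresh the
C     overlap regions before the next block of tile-local computation.
\end{verbatim}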
473
474
475 \subsection{Memory architecture}
476
The WRAPPER machine model aims to efficiently target systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
Within a CPU computations are carried out sequentially on each tile
in turn. By reshaping tiles according to the target platform it is
possible to automatically tune code to improve memory performance.
On a vector machine a given domain might be sub-divided into a few
long, thin regions. On a commodity microprocessor based system, however,
the same region could be simulated using many more, smaller
sub-domains.
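
Purely by way of illustration (these numbers do not come from any supplied
configuration), the two parameter choices below tile the same hypothetical
$256 \times 256$ horizontal domain, per process, in the two styles shown in
figure \ref{fig:tiling-strategy}.
\begin{verbatim}
C     Long, thin tiles suited to a vector system with highly
C     pipelined memory:
C       sNx = 256, sNy = 32, nSx = 1, nSy = 8
C     Small, compact tiles suited to a cache-based system with a
C     deep memory hierarchy:
C       sNx =  32, sNy = 32, nSx = 8, nSy = 8
\end{verbatim}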
488
489
490 \begin{figure}
491 \begin{center}
492 \resizebox{5in}{!}{
493 \includegraphics{part4/tiling_detail.eps}
494 }
495 \end{center}
496 \caption{The tiling strategy that the WRAPPER supports allows tiles
497 to be shaped to suit the underlying system memory architecture.
498 Compact tiles that lead to greater memory reuse can be used on cache
based systems (upper half of figure) with deep memory hierarchies; long tiles
500 with large inner loops can be used to exploit vector systems having
501 highly pipelined memory systems.
502 } \label{fig:tiling-strategy}
503 \end{figure}
504
505 \newpage
506 \subsection{Summary}
507 Following the discussion above, the machine model that the WRAPPER
508 presents to an application has the following characteristics
509
510 \begin{itemize}
511 \item The machine consists of one or more logical processors. \vspace{-3mm}
512 \item Each processor operates on tiles that it owns.\vspace{-3mm}
513 \item A processor may own more than one tile.\vspace{-3mm}
514 \item Processors may compute concurrently.\vspace{-3mm}
515 \item Exchange of information between tiles is handled by the
516 machine (WRAPPER) not by the application.
517 \end{itemize}
518 Behind the scenes this allows the WRAPPER to adapt the machine model
519 functions to exploit hardware on which
520 \begin{itemize}
521 \item Processors may be able to communicate very efficiently with each other
522 using shared memory. \vspace{-3mm}
523 \item An alternative communication mechanism based on a relatively
524 simple inter-process communication API may be required.\vspace{-3mm}
525 \item Shared memory may not necessarily obey sequential consistency,
526 however some mechanism will exist for enforcing memory consistency.
527 \vspace{-3mm}
528 \item Memory consistency that is enforced at the hardware level
529 may be expensive. Unnecessary triggering of consistency protocols
530 should be avoided. \vspace{-3mm}
\item Memory access patterns may need to be either repetitive or highly
532 pipelined for optimum hardware performance. \vspace{-3mm}
533 \end{itemize}
534
535 This generic model captures the essential hardware ingredients
536 of almost all successful scientific computer systems designed in the
537 last 50 years.
538
539 \section{Using the WRAPPER}
540
541 In order to support maximum portability the WRAPPER is implemented primarily
542 in sequential Fortran 77. At a practical level the key steps provided by the
543 WRAPPER are
544 \begin{enumerate}
545 \item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
547 \item controlling communication between tiles and between concurrently
548 computing CPU's.
549 \end{enumerate}
550 This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how to specify the way
in which a domain is decomposed (or composed). Section
553 \ref{sect:starting_a_code} describes practical details of running codes
554 in various different parallel modes on contemporary computer systems.
555 Section \ref{sect:controlling_communication} explains the internal information
556 that the WRAPPER uses to control how information is communicated between
557 tiles.
558
559 \subsection{Specifying a domain decomposition}
560 \label{sect:specifying_a_decomposition}
561
562 At its heart much of the WRAPPER works only in terms of a collection of tiles
563 which are interconnected to each other. This is also true of application
564 code operating within the WRAPPER. Application code is written as a series
565 of compute operations, each of which operates on a single tile. If
566 application code needs to perform operations involving data
567 associated with another tile, it uses a WRAPPER function to obtain
568 that data.
The specification of how a global domain is constructed from tiles, or alternatively
how a global domain is decomposed into tiles, is made in the file {\em SIZE.h}.
571 This file defines the following parameters \\
572
573 \fbox{
574 \begin{minipage}{4.75in}
575 Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
576 File: {\em model/inc/SIZE.h}
577 \end{minipage}
578 } \\
579
580 Together these parameters define a tiling decomposition of the style shown in
581 figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
582 the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
maximum size of the overlap extent. This must be set to the maximum width
of the computation stencil that the numerical code's finite-difference operations
require between overlap region updates. The maximum overlap required
by any of the operations in the MITgcm code distributed with this release is three grid
points. This is set by the requirements of the $\nabla^4$ dissipation and
diffusion operator. Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
Setting {\em OLx} and {\em OLy} to too large a value will decrease code
performance (because redundant computations will be performed); however, it will
not cause any other problems.
593
594 \begin{figure}
595 \begin{center}
596 \resizebox{5in}{!}{
597 \includegraphics{part4/size_h.eps}
598 }
599 \end{center}
600 \caption{ The three level domain decomposition hierarchy employed by the
601 WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
602 to a single process. Multiple processes can exist, each with multiple tiles.
603 Tiles within a process can be spread over multiple compute threads.
604 } \label{fig:labelled_tile}
605 \end{figure}
606
607 The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
608 be created within a single process. Each of these tiles will have internal
609 dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
610 allocated to different threads of a process that are then bound to
611 different physical processors ( see the multi-threaded
612 execution discussion in section \ref{sect:starting_the_code} ) then
613 computation will be performed concurrently on each tile. However, it is also
614 possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
616 If the decomposition is run in a single process running multiple threads
617 but attached to a single physical processor, then, in general, the computation
618 for different tiles will be interleaved by system level software.
619 This too is a valid mode of operation.
620
The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
622 numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
623 are used to form the loop ranges for many numerical calculations and
624 to provide dimensions for arrays holding numerical state.
The {\em nSx} and {\em nSy} parameters are used in conjunction with the thread number
626 parameter {\em myThid}. Much of the numerical code operating within the
627 WRAPPER takes the form
\begin{verbatim}
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          a block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO

      communication code to sum a number or maybe update
      tile overlap regions

      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
          :
          another block of computations ranging
          over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
          :
       ENDDO
      ENDDO
\end{verbatim}
650 The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
651 myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
652 schematic. These variables specify the subset of the tiles in
653 the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
654 thread number {\em myThid} owns. The thread number variable {\em myThid}
655 ranges from 1 to the total number of threads requested at execution time.
656 For each value of {\em myThid} the loop scheme above will step sequentially
657 through the tiles owned by that thread. However, different threads will
658 have different ranges of tiles assigned to them, so that separate threads can
659 compute iterations of the {\em bi}, {\em bj} loop concurrently.
660 Within a {\em bi}, {\em bj} loop
661 computation is performed concurrently over as many processes and threads
662 as there are physical processors available to compute.
663
The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
from the two-dimensional implicit elliptic solver. This portion of the
code computes the l2-norm of a vector whose elements are held in
the array {\em cg2d\_r}, writing the final result to the scalar variable
{\em err}. In this case, because the l2-norm requires a global reduction,
671 the {\em bi},{\em bj} loop only contains one statement. This computation
672 phase is then followed by a communication phase in which all threads and
processes must participate. However,
in other areas of the MITgcm code entire subsections of code are within
a single {\em bi},{\em bj} loop. For example the evaluation of all
676 the momentum equation prognostic terms ( see {\em S/R DYNAMICS()})
677 is within a single {\em bi},{\em bj} loop.
678
679 \begin{figure}
\begin{verbatim}
      REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      REAL*8  err
        :
        :
      other computations
        :
        :
      err = 0.
      DO bj=myByLo(myThid),myByHi(myThid)
       DO bi=myBxLo(myThid),myBxHi(myThid)
        DO J=1,sNy
         DO I=1,sNx
          err = err +
     &     cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO

      CALL GLOBAL_SUM_R8( err , myThid )
      err = SQRT(err)

\end{verbatim}
704 \caption{Example of numerical code for calculating
705 the l2-norm of a vector within the WRAPPER. Notice that
706 under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
dimensions. These rightmost indices are tile indexes. Different
threads within a single process operate on different ranges of tile
index, as controlled by the settings of
710 {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
711 } \label{fig:bibj_extract}
712 \end{figure}
713
714 The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
715 are used to indicate to the WRAPPER level how many processes (each with
716 {\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
717 This information is needed during initialization and during I/O phases.
718 However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy}
719 the values of {\em nPx} and {\em nPy} are absent
720 from the core numerical and support code.
721
722 \subsubsection{Examples of {\em SIZE.h} specifications}
723
The following {\em SIZE.h} parameter settings illustrate how to
725 interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
726 and {\em nPy}.
727 \begin{enumerate}
728 \item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  40,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
740 This sets up a single tile with x-dimension of ninety grid points, y-dimension of
741 forty grid points, and x and y overlaps of three grid points each.
742 \item
\begin{verbatim}
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   2)
\end{verbatim}
754 This sets up tiles with x-dimension of forty-five grid points, y-dimension of
755 twenty grid points, and x and y overlaps of three grid points each. There are
756 four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
757 arranged so that the global domain size is again ninety grid points in x and
758 forty grid points in y. In general the formula for global grid size (held in
759 model variables {\em Nx} and {\em Ny}) is
760 \begin{verbatim}
761 Nx = sNx*nSx*nPx
762 Ny = sNy*nSy*nPy
763 \end{verbatim}
764 \item
\begin{verbatim}
      PARAMETER (
     &           sNx =  90,
     &           sNy =  10,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   2,
     &           nPx =   1,
     &           nPy =   2)
\end{verbatim}
776 This sets up tiles with x-dimension of ninety grid points, y-dimension of
ten grid points, and x and y overlaps of three grid points each. There are
four tiles allocated to two separate processes ({\em nPy=2}), each of which
has two separate sub-domains ({\em nSy=2}).
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively, if the code is invoked with multiple threads per process,
the two domains in y may be computed concurrently.
785 \item
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
797 This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
798 thirty-two grid points, and x and y overlaps of three grid points each.
799 There are six tiles allocated to six separate logical processors ({\em nSx=6}).
800 This set of values can be used for a cube sphere calculation.
801 Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
806 \end{enumerate}
807
808
809 \subsection{Starting the code}
810 \label{sect:starting_the_code}
811 When code is started under the WRAPPER, execution begins in a main routine {\em
812 eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
813 to the application through a routine called {\em THE\_MODEL\_MAIN()}
814 once the WRAPPER has initialized correctly and has created the necessary variables
815 to support subsequent calls to communication routines
816 by the application code. The startup calling sequence followed by the
817 WRAPPER is shown in figure \ref{fig:wrapper_startup}.
818
819 \begin{figure}
\begin{verbatim}

MAIN
|
|--EEBOOT :: WRAPPER initialization
|  |
|  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
|  |                     allow basic I/O.
|  |-- EEINTRO_MSG    :: Write startup greeting.
|  |
|  |-- EESET_PARMS    :: Set WRAPPER parameters
|  |
|  |-- EEWRITE_EEENV  :: Print WRAPPER parameter settings
|  |
|  |-- INI_PROCS      :: Associate processes with grid regions.
|  |
|  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
|  |
|  |-- INI_COMMUNICATION_PATTERNS :: Initialize between tile
|                                 :: communication data structures
|
|
|--CHECK_THREADS   :: Validate multiple thread start up.
|
|--THE_MODEL_MAIN  :: Numerical code top-level driver routine


\end{verbatim}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to the application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
851 } \label{fig:wrapper_startup}
852 \end{figure}
853
854 \subsubsection{Multi-threaded execution}
855 \label{sect:multi-threaded-execution}
856 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
857 WRAPPER may cause several coarse grain threads to be initialized. The routine
858 {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
859 stack argument which is the thread number, stored in the
860 variable {\em myThid}. In addition to specifying a decomposition with
861 multiple tiles per process ( see section \ref{sect:specifying_a_decomposition})
862 configuring and starting a code to run using multiple threads requires the following
863 steps.\\
864
865 \paragraph{Compilation}
866 First the code must be compiled with appropriate multi-threading directives
867 active in the file {\em main.F} and with appropriate compiler flags
868 to request multi-threading support. The header files
869 {\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
870 contain directives compatible with compilers for Sun, Compaq, SGI,
871 Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
872 activated by using compile time
873 directives {\em -DTARGET\_SUN},
874 {\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
875 or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
876 for invoking multi-threaded compilation vary from system to system
877 and from compiler to compiler. The options will be described
878 in the individual compiler documentation. For the Fortran compiler
879 from Sun the following options are needed to correctly compile
880 multi-threaded code
881 \begin{verbatim}
882 -stackvar -explicitpar -vpara -noautopar
883 \end{verbatim}
884 These options are specific to the Sun compiler. Other compilers
885 will use different syntax that will be described in their
886 documentation. The effect of these options is as follows
887 \begin{enumerate}
888 \item {\bf -stackvar} Causes all local variables to be allocated in stack
889 storage. This is necessary for local variables to ensure that they are private
890 to their thread. Note, when using this option it may be necessary to override
891 the default limit on stack-size that the operating system assigns to a process.
This can normally be done by changing the settings of the command shell's
893 {\em stack-size} limit variable. However, on some systems changing this limit
894 will require privileged administrator access to modify system parameters.
895
896 \item {\bf -explicitpar} Requests that multiple threads be spawned
897 in response to explicit directives in the application code. These
898 directives are inserted with syntax appropriate to the particular target
899 platform when, for example, the {\em -DTARGET\_SUN} flag is selected.
900
901 \item {\bf -vpara} This causes the compiler to describe the multi-threaded
902 configuration it is creating. This is not required
but it can be useful when troubleshooting.
904
905 \item {\bf -noautopar} This inhibits any automatic multi-threaded
906 parallelization the compiler may otherwise generate.
907
908 \end{enumerate}
909
910
911 An example of valid settings for the {\em eedata} file for a
912 domain with two subdomains in y and running with two threads is shown
913 below
914 \begin{verbatim}
915 nTx=1,nTy=2
916 \end{verbatim}
917 This set of values will cause computations to stay within a single
918 thread when moving across the {\em nSx} sub-domains. In the y-direction,
919 however, sub-domains will be split equally between two threads.
920
921 \paragraph{Multi-threading files and parameters} The following
922 files and variables are used in setting up multi-threaded execution.\\
923
924 \fbox{
925 \begin{minipage}{4.75in}
926 File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
927 File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
928 File: {\em model/src/THE\_MODEL\_MAIN.F}\\
929 File: {\em eesupp/src/MAIN.F}\\
930 File: {\em tools/genmake}\\
931 File: {\em eedata}\\
932 CPP: {\em TARGET\_SUN}\\
933 CPP: {\em TARGET\_DEC}\\
934 CPP: {\em TARGET\_HP }\\
935 CPP: {\em TARGET\_SGI}\\
936 CPP: {\em TARGET\_CRAY\_VECTOR}\\
937 Parameter: {\em nTx}\\
938 Parameter: {\em nTy}
939 \end{minipage}
940 } \\
941
942 \subsubsection{Multi-process execution}
943 \label{sect:multi-process-execution}
944
Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
is that many system libraries are still not ``thread-safe''. This means that, for
example, on some systems it is not safe to call system routines to
do I/O when running in multi-threaded mode, except in a limited set of
circumstances. Another reason is that support for multi-threaded programming
models varies between systems.
952
953 Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration a decomposition
specification ( see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
appropriate compile time and run time steps must be taken.
960
\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable MPI libraries are available
for controlling the start-up of multiple processes. The MPI libraries
are not required for performance critical communication, although
they are usually used for it. However, in order to simplify the task
of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. This is done
by setting the {\em ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI}
flags in the {\em CPP\_EEOPTIONS.h} file.\\
972
973 \fbox{
974 \begin{minipage}{4.75in}
975 File: {\em eesupp/inc/CPP\_EEOPTIONS.h}\\
976 CPP: {\em ALLOW\_USE\_MPI}\\
977 CPP: {\em ALWAYS\_USE\_MPI}\\
978 Parameter: {\em nPx}\\
979 Parameter: {\em nPy}
980 \end{minipage}
981 } \\
982
Additionally, compile time options are required to link in the
MPI libraries and header files. Examples of these options
can be found in the {\em genmake} script that creates makefiles
for compilation. When this script is executed with the {\bf -mpi}
flag it will generate a makefile that includes
search paths for MPI header files and for linking in
MPI libraries. For example the {\bf -mpi} flag on a
Silicon Graphics IRIX system causes a
Makefile with the compilation command
\begin{verbatim}
mpif77 -I/usr/local/mpi/include -DALLOW_USE_MPI -DALWAYS_USE_MPI
\end{verbatim}
to be generated.
996 This is the correct set of options for using the MPICH open-source
997 version of MPI, when it has been installed under the subdirectory
998 /usr/local/mpi.
999 However, on many systems there may be several
1000 versions of MPI installed. For example many systems have both
1001 the open source MPICH set of libraries and a vendor specific native form
1002 of the MPI libraries. The correct setup to use will depend on the
1003 local configuration of your system.\\
1004
1005 \fbox{
1006 \begin{minipage}{4.75in}
1007 File: {\em tools/genmake}
1008 \end{minipage}
1009 } \\
\paragraph{Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
1013 describe how to start a program using that distribution.
1014 For the free, open-source MPICH system the MITgcm program is started
1015 using a command such as
1016 \begin{verbatim}
1017 mpirun -np 64 -machinefile mf ./mitgcmuv
1018 \end{verbatim}
1019 In this example the text {\em -np 64} specifies the number of processes
1020 that will be created. The numeric value {\em 64} must be equal to the
1021 product of the processor grid settings of {\em nPx} and {\em nPy}
1022 in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
1023 called ``mf'' will be read to get a list of processor names on
1024 which the sixty-four processes will execute. The syntax of this file
is specified by the MPI distribution.
1026 \\
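
As a concrete but hypothetical illustration of this consistency requirement,
the sixty-four process run above could correspond to the following
processor-grid settings in {\em SIZE.h} (the tile and overlap sizes shown are
arbitrary):
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   8,
     &           nPy =   8)
\end{verbatim}
Here {\em nPx}$\times${\em nPy} $= 8 \times 8 = 64$ matches the {\em -np 64}
argument given to {\em mpirun}.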
1027
1028 \fbox{
1029 \begin{minipage}{4.75in}
1030 File: {\em SIZE.h}\\
1031 Parameter: {\em nPx}\\
1032 Parameter: {\em nPy}
1033 \end{minipage}
1034 } \\
1035
1036
1037 \paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.
1044
1045 \paragraph{Runtime input parameters}
1046 Finally the file {\em eedata} needs to be configured to indicate
1047 the number of threads to be used in the x and y directions.
1048 The variables {\em nTx} and {\em nTy} in this file are used to
1049 specify the information required. The product of {\em nTx} and
1050 {\em nTy} must be equal to the number of threads spawned i.e.
1051 the setting of the environment variable PARALLEL.
1052 The value of {\em nTx} must subdivide the number of sub-domains
1053 in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
1054 number of sub-domains in y ({\em nSy}) exactly.
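For instance (an illustrative combination, not a supplied configuration), a
decomposition with {\em nSx}=2 and {\em nSy}=4 could be run with eight threads
by setting the PARALLEL environment variable to 8 and using
\begin{verbatim}
 nTx=2,nTy=4
\end{verbatim}
in {\em eedata}; {\em nTx} divides {\em nSx}, {\em nTy} divides {\em nSy}, and
{\em nTx}$\times${\em nTy} matches the eight threads requested.
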
1055 The multiprocess startup of the MITgcm executable {\em mitgcmuv}
1056 is controlled by the routines {\em EEBOOT\_MINIMAL()} and
1057 {\em INI\_PROCS()}. The first routine performs basic steps required
1058 to make sure each process is started and has a textual output
1059 stream associated with it. By default two output files are opened
1060 for each process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.
The {\bf NNNN} part of the name is filled in with the process
1062 number so that process number 0 will create output files
1063 {\bf STDOUT.0000} and {\bf STDERR.0000}, process number 1 will create
1064 output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
1065 are used for reporting status and configuration information and
1066 for reporting error conditions on a process by process basis.
1067 The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
1068 {\em myProcId} and {\em MPI\_COMM\_MODEL}.
These variables are related
to processor identification and are used later in the routine
1071 {\em INI\_PROCS()} to allocate tiles to processes.
1072
Allocation of processes to tiles is controlled by the routine
{\em INI\_PROCS()}. For each process this routine sets
the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
These variables specify (in index space) the coordinates
of the southernmost and westernmost corner of the
southernmost and westernmost tile owned by this process.
1079 The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
1080 are also set in this routine. These are used to identify
1081 processes holding tiles to the west, east, south and north
1082 of this process. These values are stored in global storage
1083 in the header file {\em EESUPPORT.h} for use by
1084 communication routines.
1085 \\
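
For a regular decomposition the index arithmetic involved is straightforward.
The sketch below shows the assumed form of the calculation (a paraphrase, not
the verbatim contents of {\em ini\_procs.F}), where {\em iPx} and {\em jPy}
denote the zero-based column and row of this process within the
{\em nPx}$\times${\em nPy} process grid.
\begin{verbatim}
C     Assumed index arithmetic for a regular process grid (sketch only):
C     iPx, jPy are the zero-based x and y coordinates of this process.
      myXGlobalLo = 1 + iPx*sNx*nSx
      myYGlobalLo = 1 + jPy*sNy*nSy
\end{verbatim}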
1086
1087 \fbox{
1088 \begin{minipage}{4.75in}
1089 File: {\em eesupp/src/eeboot\_minimal.F}\\
1090 File: {\em eesupp/src/ini\_procs.F} \\
1091 File: {\em eesupp/inc/EESUPPORT.h} \\
1092 Parameter: {\em myProcId} \\
1093 Parameter: {\em MPI\_COMM\_MODEL} \\
1094 Parameter: {\em myXGlobalLo} \\
1095 Parameter: {\em myYGlobalLo} \\
1096 Parameter: {\em pidW } \\
1097 Parameter: {\em pidE } \\
1098 Parameter: {\em pidS } \\
1099 Parameter: {\em pidN }
1100 \end{minipage}
1101 } \\
1102
1103
1104 \subsection{Controlling communication}
1105 The WRAPPER maintains internal information that is used for communication
1106 operations and that can be customized for different platforms. This section
1107 describes the information that is held and used.
1108
1109 \begin{enumerate}
\item {\bf Tile-tile connectivity information} For each tile the WRAPPER
records the tile numbers of the tiles to the north, south, east and
west of that tile. Each tile number is unique over all tiles in a
configuration. These numbers are held in the variables {\em tileNo}
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
1116 that specifies the type of communication that is used between tiles.
1117 This information is held in the variables {\em tileCommModeN},
1118 {\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.
1119 This latter set of variables can take one of the following values
1120 {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
1121 A value of {\em COMM\_NONE} is used to indicate that a tile has no
1122 neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
1124 memory communication is required to communicate between
1125 these tile faces ( see section \ref{sect:distributed_memory_communication}).
1126 A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
1127 forms of shared memory communication ( see section
1128 \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
1129 that a CPU should communicate by writing to data structures owned by another
1130 CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
1131 from data structures owned by another CPU. These flags affect the behavior
1132 of the WRAPPER exchange primitive
1133 (see figure \ref{fig:communication_primitives}). The routine
1134 {\em ini\_communication\_patterns()} is responsible for setting the
1135 communication mode values for each tile.
1136 \\
1137
1138 \fbox{
1139 \begin{minipage}{4.75in}
1140 File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1141 File: {\em eesupp/inc/EESUPPORT.h} \\
1142 Parameter: {\em tileNo} \\
1143 Parameter: {\em tileNoE} \\
1144 Parameter: {\em tileNoW} \\
1145 Parameter: {\em tileNoN} \\
1146 Parameter: {\em tileNoS} \\
1147 Parameter: {\em tileCommModeE} \\
1148 Parameter: {\em tileCommModeW} \\
1149 Parameter: {\em tileCommModeN} \\
1150 Parameter: {\em tileCommModeS} \\
1151 \end{minipage}
1152 } \\
1153
1154 \item {\bf MP directives}
1155 The WRAPPER transfers control to numerical application code through
1156 the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
1157 that allows for it to be invoked by several threads. Support for this
1158 is based on using multi-processing (MP) compiler directives.
1159 Most commercially available Fortran compilers support the generation
1160 of code to spawn multiple threads through some form of compiler directives.
1161 As this is generally much more convenient than writing code to interface
1162 to operating system libraries to explicitly spawn threads, and since on some
1163 systems this may be the only method available, the WRAPPER is distributed with
1164 template MP directives for a number of systems.
1165
1166 These directives are inserted into the code just before and after the
1167 transfer of control to numerical algorithm code through the routine
1168 {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows an example of
1169 the code that performs this process for a Silicon Graphics system.
1170 This code is extracted from the files {\em main.F} and
1171 {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads} specifies
1172 how many instances of the routine {\em THE\_MODEL\_MAIN} will
1173 be created. The value of {\em nThreads} is set in the routine
1174 {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
1175 product of the parameters {\em nTx} and {\em nTy} that
1176 are read from the file {\em eedata}. If the value of {\em nThreads}
1177 is inconsistent with the number of threads requested from the
1178 operating system (for example by using an environment
1179 variable as described in section \ref{sect:multi_threaded_execution})
1180 then usually an error will be reported by the routine
1181 {\em CHECK\_THREADS}.\\
1182
1183 \fbox{
1184 \begin{minipage}{4.75in}
1185 File: {\em eesupp/src/ini\_threading\_environment.F}\\
1186 File: {\em eesupp/src/check\_threads.F} \\
1187 File: {\em eesupp/src/main.F} \\
1188 File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1189 File: {\em eedata } \\
1190 Parameter: {\em nThreads} \\
1191 Parameter: {\em nTx} \\
1192 Parameter: {\em nTy} \\
1193 \end{minipage}
1194 }
1195
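
For reference, {\em nTx} and {\em nTy} are ordinary namelist entries in
{\em eedata}. A minimal, purely illustrative fragment requesting a
$2 \times 2$ arrangement of threads (so that {\em nThreads} is four)
might read as follows; the values shown are examples only.
\begin{verbatim}
# Illustrative "eedata" fragment -- example values only.
# nTx : number of threads in the x direction
# nTy : number of threads in the y direction
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}
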
1196 \item {\bf memsync flags}
1197 As discussed in section \ref{sect:memory_consistency}, when using shared memory,
1198 a low-level system function may be needed to force memory consistency.
1199 The routine {\em MEMSYNC()} is used for this purpose. This routine should
1200 not need modifying and the information below is only provided for
1201 completeness. A logical parameter {\em exchNeedsMemSync} set
1202 in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether
1203 the {\em MEMSYNC()} primitive is called. In general this
1204 routine is only used for multi-threaded execution.
1205 The code that goes into the {\em MEMSYNC()}
1206 routine is specific to the compiler and
1207 processor being used for multi-threaded execution and in general
1208 must be written using a short code snippet of assembly language.
1209 For an UltraSPARC system the following code snippet is used
1210 \begin{verbatim}
1211 asm("membar #LoadStore|#StoreStore");
1212 \end{verbatim}
1213 for an Alpha-based system the equivalent code reads
1214 \begin{verbatim}
1215 asm("mb");
1216 \end{verbatim}
1217 while on an x86 system the following code is required
1218 \begin{verbatim}
1219 asm("lock; addl $0,0(%%esp)": : :"memory")
1220 \end{verbatim}
1221
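
Schematically (a sketch of the control flow rather than the exact
source), the shared memory exchange path guards the fence as follows:
\begin{verbatim}
C     Sketch: only issue the memory fence when the platform needs it.
      IF ( exchNeedsMemSync ) CALL MEMSYNC
\end{verbatim}
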
1222 \item {\bf Cache line size}
1223 As discussed in section \ref{sect:cache_effects_and_false_sharing},
1224 multi-threaded codes explicitly avoid penalties associated with excessive
1225 coherence traffic on an SMP system. To do this the shared memory data structures
1226 used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
1227 are padded. The variables that control the padding are set in the
1228 header file {\em EEPARAMS.h}. These variables are called
1229 {\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
1230 {\em lShare8}. The default values should not normally need changing.
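
To illustrate the idea (a sketch only; the array and common block names
here are hypothetical, the parameter values are examples, and in the real
code these quantities come from {\em EEPARAMS.h}), a shared array of
per-thread partial sums can be padded so that each thread's entry
occupies its own cache line:
\begin{verbatim}
C     Illustrative padding sketch. lShare8 counts 8-byte words per
C     cache line, so successive threads update locations a whole
C     cache line apart and never share a line (no false sharing).
      INTEGER MAX_NO_THREADS, lShare8
      PARAMETER ( MAX_NO_THREADS = 32 )
      PARAMETER ( lShare8 = 16 )
      Real*8  exampleSumPhi( lShare8, MAX_NO_THREADS )
      COMMON /EXAMPLE_SUM_R8/ exampleSumPhi
C     Thread myThid only ever touches exampleSumPhi(1,myThid);
C     elements 2..lShare8 are padding.
\end{verbatim}
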
1231 \item {\bf \_BARRIER}
1232 This is a CPP macro that is expanded to a call to a routine
1233 which synchronizes all the logical processors running under the
1234 WRAPPER. Using a macro here preserves flexibility to insert
1235 a specialized call in-line into application code. By default this
1236 resolves to calling the procedure {\em BARRIER()}. The default
1237 setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1238
1239 \item {\bf \_GSUM}
1240 This is a CPP macro that is expanded to a call to a routine
1241 which sums up a floating point number
1242 over all the logical processors running under the
1243 WRAPPER. Using a macro here provides extra flexibility to insert
1244 a specialized call in-line into application code. By default this
1245 resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
1246 64-bit floating point operands)
1247 or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1248 setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
1249 The \_GSUM macro is a performance critical operation, especially for
1250 large processor count, small tile size configurations.
1251 The custom communication example discussed in section \ref{sect:jam_example}
1252 shows how the macro is used to invoke a custom global sum routine
1253 for a specific set of hardware.
1254
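
Schematically, the default bindings in {\em CPP\_EEMACROS.h} have the
following shape; the argument lists shown here are illustrative rather
than the exact interfaces, so consult the header for the real
definitions.
\begin{verbatim}
C     Schematic only -- argument lists are illustrative.
#define _BARRIER             CALL BARRIER( myThid )
#define _GSUM( phi, myThid ) CALL GLOBAL_SUM_R8( phi, myThid )
\end{verbatim}
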
1255 \item {\bf \_EXCH}
1256 The \_EXCH CPP macro is used to update tile overlap regions.
1257 It is qualified by a suffix indicating whether overlap updates are for
1258 two-dimensional ( \_EXCH\_XY ) or three-dimensional ( \_EXCH\_XYZ )
1259 physical fields and whether fields are 32-bit floating point
1260 ( \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point
1261 ( \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined
1262 in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
1263 \_EXCH operation plays a crucial role in scaling to small tile,
1264 large logical and physical processor count configurations.
1265 The example in section \ref{sect:jam_example} discusses defining an
1266 optimized and specialized form of the \_EXCH operation.
1267
1268 The \_EXCH operation is also central to supporting grids such as
1269 the cube-sphere grid. In this class of grid a rotation may be required
1270 between tiles. Aligning the coordinate requiring rotation with the
1271 tile decomposition allows the coordinate transformation to
1272 be embedded within a custom form of the \_EXCH primitive.
1273
1274 \item {\bf Reverse Mode}
1275 The communication primitives \_EXCH and \_GSUM both employ
1276 hand-written adjoint (or reverse mode) forms.
1277 These reverse mode forms can be found in the
1278 source code directory {\em pkg/autodiff}.
1279 For the global sum primitive the reverse mode form
1280 calls are to {\em GLOBAL\_ADSUM\_R4} and
1281 {\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
1282 exchange primitives are found in routines
1283 prefixed {\em ADEXCH}. The exchange routines make calls to
1284 the same low-level communication primitives as the forward mode
1285 operations. However, the routine argument {\em simulationMode}
1286 is set to the value {\em REVERSE\_SIMULATION}. This signifies
1287 to the low-level routines that the adjoint forms of the
1288 appropriate communication operation should be performed.
1289 \item {\bf MAX\_NO\_THREADS}
1290 The variable {\em MAX\_NO\_THREADS} is used to indicate the
1291 maximum number of OS threads that a code will use. This
1292 value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
1293 For single threaded execution it can be reduced to one if required.
1294 The value is largely private to the WRAPPER and application code
1295 will not normally reference the value, except in the following scenario.
1296
1297 For certain physical parametrization schemes it is necessary to have
1298 a substantial number of work arrays. Where these arrays are allocated
1299 in static storage ( for example COMMON blocks ) multi-threaded
1300 execution will require multiple instances of the COMMON block data.
1301 This can be achieved using a Fortran 90 module construct. However,
1302 if modules are unavailable then the work arrays can be extended
1303 with extra dimensions using the tile dimensioning scheme of {\em nSx}
1304 and {\em nSy} ( as described in section
1305 \ref{sect:specifying_a_decomposition}). However, if the configuration
1306 being specified involves many more tiles than OS threads then
1307 it can save memory resources to reduce the variable
1308 {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
1309 will be used and to declare the physical parameterization
1310 work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
1311 An example of this is given in the verification experiment
1312 {\em aim.5l\_cs}. Here the default setting of
1313 {\em MAX\_NO\_THREADS} is altered to
1314 \begin{verbatim}
1315 INTEGER MAX_NO_THREADS
1316 PARAMETER ( MAX_NO_THREADS = 6 )
1317 \end{verbatim}
1318 and several work arrays for storing intermediate calculations are
1319 created with declarations of the form:
1320 \begin{verbatim}
1321 common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1322 \end{verbatim}
1323 This declaration scheme is not used widely, because most global data
1324 is used for permanent, rather than temporary, storage of state information.
1325 In the case of permanent state information this approach cannot be used
1326 because there has to be enough storage allocated for all tiles.
1327 However, the technique can sometimes be a useful scheme for reducing memory
1328 requirements in complex physical parameterizations.
1329 \end{enumerate}
1330
1331 \begin{figure}
1332 \begin{verbatim}
1333 C--
1334 C-- Parallel directives for MIPS Pro Fortran compiler
1335 C--
1336 C Parallel compiler directives for SGI with IRIX
1337 C$PAR PARALLEL DO
1338 C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1339 C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1340 C
1341 DO I=1,nThreads
1342 myThid = I
1343
1344 C-- Invoke nThreads instances of the numerical model
1345 CALL THE_MODEL_MAIN(myThid)
1346
1347 ENDDO
1348 \end{verbatim}
1349 \caption{Prior to transferring control to
1350 the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
1351 MP directives to spawn multiple threads.
1352 } \label{fig:mp_directives}
1353 \end{figure}
1354
1355
1356 \subsubsection{Specializing the Communication Code}
1357
1358 The isolation of performance critical communication primitives and the
1359 sub-division of the simulation domain into tiles is a powerful tool.
1360 Here we show how it can be used to improve application performance
1361 and to adapt to new gridding approaches.
1362
1363 \subsubsection{JAM example}
1364 \label{sect:jam_example}
1365 On some platforms a big performance boost can be obtained by
1366 binding the communication routines {\em \_EXCH} and
1367 {\em \_GSUM} to specialized native libraries ( for example the
1368 shmem library on CRAY T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag
1369 is used as an illustration of a specialized communication configuration
1370 that substitutes for standard, portable forms of {\em \_EXCH} and
1371 {\em \_GSUM}. It affects three source files: {\em eeboot.F},
1372 {\em CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined
1373 it has the following effects.
1374 \begin{itemize}
1375 \item An extra phase is included at boot time to initialize the custom
1376 communications library ( see {\em ini\_jam.F}).
1377 \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
1378 with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F}).
1379 \item A highly specialized form of the exchange operator (optimized
1380 for overlap regions of width one) is substituted into the elliptic
1381 solver routine {\em cg2d.F}.
1382 \end{itemize}
1383 Developing specialized code for other libraries follows a similar
1384 pattern.
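
In outline the change is a compile-time switch of the following kind.
This is a sketch: the routine name {\em GSUM\_JAM} is simply inferred
from the file name {\em gsum\_jam.F}, and the argument lists are
illustrative.
\begin{verbatim}
C     Outline of the compile-time binding -- illustrative only.
#ifdef LETS_MAKE_JAM
#define _GSUM( phi, myThid ) CALL GSUM_JAM( phi, myThid )
#else
#define _GSUM( phi, myThid ) CALL GLOBAL_SUM_R8( phi, myThid )
#endif
\end{verbatim}
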
1385
1386 \subsubsection{Cube sphere communication}
1387 \label{sect:cube_sphere_communication}
1388 Actual {\em \_EXCH} routine code is generated automatically from
1389 a series of template files, for example {\em exch\_rx.template}.
1390 This is done to allow a large number of variations on the exchange
1391 process to be maintained. One set of variations supports the
1392 cube sphere grid. Support for a cube sphere grid in MITgcm is based
1393 on having each face of the cube as a separate tile (or tiles).
1394 The exchange routines are then able to absorb much of the
1395 detailed rotation and reorientation required when moving around the
1396 cube grid. The set of {\em \_EXCH} routines that contain the
1397 word cube in their name perform these transformations.
1398 They are invoked when the run-time logical parameter
1399 {\em useCubedSphereExchange} is set true. To facilitate the
1400 transformations on a staggered C-grid, exchange operations are defined
1401 separately for scalar and vector quantities and for
1402 grid-centered, grid-face and grid-corner quantities.
1403 Three sets of exchange routines are defined. Routines
1404 with names of the form {\em exch\_rx} are used to exchange
1405 cell centered scalar quantities. Routines with names of the form
1406 {\em exch\_uv\_rx} are used to exchange vector quantities located at
1407 the C-grid velocity points. The vector quantities exchanged by the
1408 {\em exch\_uv\_rx} routines can either be signed (for example velocity
1409 components) or un-signed (for example grid-cell separations).
1410 Routines with names of the form {\em exch\_z\_rx} are used to exchange
1411 quantities at the C-grid vorticity point locations.
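
As an illustration of the calling pattern (a sketch: the field names are
hypothetical and the exact generated routine names and argument lists
depend on the templates and precision in use), a model routine might
refresh the overlap regions of a cell-centered scalar and of a C-grid
vector pair as follows:
\begin{verbatim}
C     Sketch of typical exchange calls; phi, uFld and vFld are
C     hypothetical model arrays.
C     Cell-centered scalar field:
      CALL EXCH_XY_RL( phi, myThid )
C     C-grid vector pair; .TRUE. marks a signed (velocity-like) field:
      CALL EXCH_UV_XY_RL( uFld, vFld, .TRUE., myThid )
\end{verbatim}
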
1412
1413
1414
1415
1416 \section{MITgcm execution under WRAPPER}
1417
1418 Fitting together the WRAPPER elements, package elements and
1419 MITgcm core equation elements of the source code produces the calling
1420 sequence shown in section \ref{sect:calling_sequence}.
1421
1422 \subsection{Annotated call tree for MITgcm and WRAPPER}
1423 \label{sect:calling_sequence}
1424
1425 WRAPPER layer.
1426
1427 \begin{verbatim}
1428
1429 MAIN
1430 |
1431 |--EEBOOT :: WRAPPER initialization
1432 | |
1433 | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1434 | | allow basic I/O.
1435 | |-- EEINTRO_MSG :: Write startup greeting.
1436 | |
1437 | |-- EESET_PARMS :: Set WRAPPER parameters
1438 | |
1439 | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1440 | |
1441 | |-- INI_PROCS :: Associate processes with grid regions.
1442 | |
1443 | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1444 | |
1445 | |--INI_COMMUNICATION_PATTERNS :: Initialize between-tile
1446 | :: communication data structures
1447 |
1448 |
1449 |--CHECK_THREADS :: Validate multiple thread start up.
1450 |
1451 |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1452
1453 \end{verbatim}
1454
1455 Core equations plus packages.
1456
1457 \begin{verbatim}
1458 C
1459 C
1460 C Invocation from WRAPPER level...
1461 C :
1462 C :
1463 C |
1464 C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1465 C | :: Called from WRAPPER level numerical
1466 C | :: code invocation routine. On entry
1467 C | :: to THE_MODEL_MAIN separate threads and
1468 C | :: separate processes will have been established.
1469 C | :: Each thread and process will have a unique ID
1470 C | :: but as yet it will not be associated with a
1471 C | :: specific region in decomposed discrete space.
1472 C |
1473 C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1474 C | | :: grid, solver matrices etc..
1475 C | |
1476 C | |-INI_PARMS :: Routine to set kernel model parameters.
1477 C | | :: By default kernel parameters are read from file
1478 C | | :: "data" in directory in which code executes.
1479 C | |
1480 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1481 C | |
1482 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1483 C | | | :: Grid arrays are held and described in GRID.h.
1484 C | | |
1485 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1486 C | | |
1487 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1488 C | | | :: (calculate grid from kernel parameters).
1489 C | | |
1490 C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1491 C | | | :: initialization (calculate grid from
1492 C | | | :: kernel parameters).
1493 C | | |
1494 C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1495 C | | :: grid initializations. ( input from raw
1496 C | | :: grid files, LONC.bin, DXF.bin etc... )
1497 C | |
1498 C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1499 C | |
1500 C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1501 C | | :: land masking for solid-fluid boundaries.
1502 C | |
1503 C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1504 C | |
1505 C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1506 C | | :: sphere options are coded.
1507 C | |
1508 C | |-PACKAGES_BOOT :: Start up the optional package environment.
1509 C | | :: Runtime selection of active packages.
1510 C | |
1511 C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1512 C | | |
1513 C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1514 C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1515 C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1516 C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1517 C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1518 C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1519 C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1520 C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1521 C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1522 C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1523 C | |
1524 C | |-PACKAGES_CHECK
1525 C | | |
1526 C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1527 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1528 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1529 C | |
1530 C | |-PACKAGES_INIT_FIXED
1531 C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1532 C | | |-FLT_INIT :: Floats Package. see pkg/flt
1533 C | |
1534 C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1535 C | |
1536 C | |-INI_CG2D :: 2d con. grad solver initialisation.
1537 C | |
1538 C | |-INI_CG3D :: 3d con. grad solver initialisation.
1539 C | |
1540 C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1541 C | :: Includes annotated table of kernel
1542 C | :: parameter settings.
1543 C |
1544 C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1545 C |
1546 C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1547 C ! :: Automatically generated by TAMC/TAF.
1548 C |
1549 C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1550 C |
1551 C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1552 C |
1553 C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1554 C | |
1555 C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1556 C | | | :: variables
1557 C | | |
1558 C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1559 C | | |
1560 C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1561 C | | | :: sphere options are coded.
1562 C | | |
1563 C | | |-INI_CG2D :: 2d con. grad solver initialisation.
1564 C | | |-INI_CG3D :: 3d con. grad solver initialisation.
1565 C | | |-INI_MIXING :: Initialise diapycnal diffusivity.
1566 C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
1567 C | | | :: fields).
1568 C | | |
1569 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1570 C | | | |-INI_VEL :: Initialize 3D flow field.
1571 C | | | |-INI_THETA :: Set model initial temperature field.
1572 C | | | |-INI_SALT :: Set model initial salinity field.
1573 C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1574 C | | |
1575 C | | |-INI_TR1 :: Set initial tracer 1 distribution.
1576 C | | |
1577 C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1578 C | | | | :: Here applied to move restart conditions
1579 C | | | | :: (saved in mid timestep) to correct level in
1580 C | | | | :: time (only used for pre-c35).
1581 C | | | |
1582 C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1583 C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1584 C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1585 C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1586 C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1587 C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1588 C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1589 C | | | | |-FIND_RHO :: Find adjacent densities.
1590 C | | | | |-CONVECT :: Mix static instability.
1591 C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1592 C | | | |
1593 C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1594 C | | |
1595 C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1596 C | | | | :: Extra time history interactions.
1597 C | | | |
1598 C | | | |-FIND_RHO :: Find adjacent densities.
1599 C | | | |-CONVECT :: Mix static instability.
1600 C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1601 C | | |
1602 C | | |-PACKAGES_INIT_VARIABLES :: Does initialisation of time evolving
1603 C | | | | :: package data.
1604 C | | | |
1605 C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1606 C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1607 C | | | |-KPP_OPEN_DIAGS
1608 C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1609 C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1610 C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1611 C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1612 C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1613 C | | | |-INI_FORCING :: Set model initial forcing fields.
1614 C | | | | :: Either set in-line or from file as shown.
1615 C | | | |-READ_FLD_XY_RS(zonalWindFile)
1616 C | | | |-READ_FLD_XY_RS(meridWindFile)
1617 C | | | |-READ_FLD_XY_RS(surfQFile)
1618 C | | | |-READ_FLD_XY_RS(EmPmRfile)
1619 C | | | |-READ_FLD_XY_RS(thetaClimFile)
1620 C | | | |-READ_FLD_XY_RS(saltClimFile)
1621 C | | | |-READ_FLD_XY_RS(surfQswFile)
1622 C | | |
1623 C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1624 C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1625 C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1626 C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1627 C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1628 C | |
1629 C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1630 C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1631 C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1632 C | |
1633 C | |-MONITOR :: Monitor state ( see pkg/monitor )
1634 C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1635 C====|>|
1636 C====|>| ****************************
1637 C====|>| BEGIN MAIN TIMESTEPPING LOOP
1638 C====|>| ****************************
1639 C====|>|
1640 C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1641 C/\ | | |
1642 C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1643 C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1644 C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1645 C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1646 C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1647 C/\ | | | | :: Simple interpolation between end-points
1648 C/\ | | | | :: for forcing datasets.
1649 C/\ | | | |
1650 C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1651 C/\ | | |
1652 C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1653 C/\ | | | |
1654 C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1655 C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1656 C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-Rhonil] of a slice
1657 C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1658 C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1659 C/\ | | | |
1660 C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1661 C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1662 C/\ | | | | :: forcing terms.
1663 C/\ | | | |
1664 C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1665 C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1666 C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1667 C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1668 C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1669 C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1670 C/\ | | | | advection case) (see pkg/gad).
1671 C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1672 C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1673 C/\ | | | | |
1674 C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1675 C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1676 C/\ | | | |
1677 C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1678 C/\ | | | | |
1679 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1680 C/\ | | | | | :: ( see pkg/gad )
1681 C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1682 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1683 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1684 C/\ | | | |
1685 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1686 C/\ | | | |
1687 C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1688 C/\ | | | | |
1689 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1690 C/\ | | | | | :: ( see pkg/gad )
1691 C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1692 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1693 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1694 C/\ | | | |
1695 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1696 C/\ | | | |
1697 C/\ | | | |-CALC_GTR1 :: Calculate other tracer(s) tendency terms
1698 C/\ | | | | |
1699 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1700 C/\ | | | | | :: ( see pkg/gad )
1701 C/\ | | | | |-EXTERNAL_FORCING_TR:: Problem specific forcing for tracer.
1702 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1703 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1704 C/\ | | | |
1705 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1706 C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1707 C/\ | | | |-FREEZE :: Limit range of temperature.
1708 C/\ | | | |
1709 C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1710 C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1711 C/\ | | | |
1712 C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1713 C/\ | | | |-EXCH :: Update overlaps
1714 C/\ | | |
1715 C/\ | | |-DYNAMICS :: Momentum equations driver.
1716 C/\ | | | |
1717 C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1718 C/\ | | | | Potential anomaly.
1719 C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1720 C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1721 C/\ | | | |
1722 C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1723 C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1724 C/\ | | | | pkg/mom_fluxform ).
1725 C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1726 C/\ | | | | pkg/mom_vecinv ).
1727 C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1728 C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1729 C/\ | | | |
1730 C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1731 C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1732 C/\ | | | |
1733 C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1734 C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1735 C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1736 C/\ | | |
1737 C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1738 C/\ | | |
1739 C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1740 C/\ | | |
1741 C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1742 C/\ | | |
1743 C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1744 C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1745 C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1746 C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1747 C/\ | | |
1748 C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1749 C/\ | | | |
1750 C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1751 C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1752 C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1753 C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1754 C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1755 C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1756 C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1757 C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1758 C/\ | | | | |-CONVECT :: Mix static instability.
1759 C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1760 C/\ | | | |
1761 C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1762 C/\ | | |
1763 C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1764 C/\ | | | |-EXCH
1765 C/\ | | |
1766 C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1767 C/\ | | |
1768 C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1769 C/\ | | |
1770 C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1771 C/\ | | | |-WRITE_STATE :: Core state I/O
1772 C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1773 C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1774 C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1775 C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1776 C/\ | | |
1777 C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1778 C/\ | |
1779 C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1780 C<===|=|
1781 C<===|=| **************************
1782 C<===|=| END MAIN TIMESTEPPING LOOP
1783 C<===|=| **************************
1784 C<===|=|
1785 C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1786 C |
1787 C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1788 C |
1789 C |-TIMER_PRINTALL :: Computational timing summary
1790 C |
1791 C |-COMM_STATS :: Summarise inter-proc and inter-thread communication
1792 C :: events.
1793 C
1794 \end{verbatim}
1795
1796 \subsection{Measuring and Characterizing Performance}
1797
1798 TO BE DONE (CNH)
1799
1800 \subsection{Estimating Resource Requirements}
1801
1802 TO BE DONE (CNH)
1803
1804 \subsubsection{Atlantic 1/6 degree example}
1805 \subsubsection{Dry Run testing}
1806 \subsubsection{Adjoint Resource Requirements}
1807 \subsubsection{State Estimation Environment Resources}
1808
