1 % $Header: /u/gcmpack/manual/s_software/text/sarch.tex,v 1.25 2010/08/27 13:17:59 jmc Exp $
2
3 This chapter focuses on describing the {\bf WRAPPER} environment
4 within which both the core numerics and the pluggable packages
5 operate. The description presented here is intended to be a detailed
6 exposition and contains significant background material, as well as
7 advanced details on working with the WRAPPER. The tutorial sections
8 of this manual (see sections \ref{sec:modelExamples} and
9 \ref{sec:tutorialIII}) contain more succinct, step-by-step
10 instructions on running basic numerical experiments, of various types,
11 both sequentially and in parallel. For many projects simply starting
12 from an example code and adapting it to suit a particular situation
13 will be all that is required. The first part of this chapter
14 discusses the MITgcm architecture at an abstract level. In the second
15 part of the chapter we describe practical details of the MITgcm
16 implementation and of current tools and operating system features that
17 are employed.
18
19 \section{Overall architectural goals}
20 \begin{rawhtml}
21 <!-- CMIREDIR:overall_architectural_goals: -->
22 \end{rawhtml}
23
24 Broadly, the goals of the software architecture employed in MITgcm are
25 three-fold:
26
27 \begin{itemize}
28 \item We wish to be able to study a very broad range of interesting
29 and challenging rotating fluids problems.
30 \item We wish the model code to be readily targeted to a wide range of
31 platforms.
32 \item On any given platform we would like to be able to achieve
33 performance comparable to an implementation developed and
34 specialized specifically for that platform.
35 \end{itemize}
36
37 These points are summarized in figure
38 \ref{fig:mitgcm_architecture_goals} which conveys the goals of the
39 MITgcm design. These goals lead to a software architecture which, at a
40 high level, can be viewed as consisting of:
41
42 \begin{enumerate}
43 \item A core set of numerical and support code. This is discussed in
44 detail in section \ref{chap:discretization}.
45
46 \item A scheme for supporting optional ``pluggable'' {\bf packages}
47 (containing for example mixed-layer schemes, biogeochemical schemes,
48 atmospheric physics). These packages are used both to overlay
49 alternate dynamics and to introduce specialized physical content
50 onto the core numerical code. An overview of the {\bf package}
51 scheme is given at the start of part \ref{chap:packagesI}.
52
53 \item A support framework called {\bf WRAPPER} (Wrappable Application
54 Parallel Programming Environment Resource), within which the core
55 numerics and pluggable packages operate.
56 \end{enumerate}
57
58 This chapter focuses on describing the {\bf WRAPPER} environment under
59 which both the core numerics and the pluggable packages function. The
60 description presented here is intended to be a detailed exposition and
61 contains significant background material, as well as advanced details
62 on working with the WRAPPER. The examples section of this manual
63 (part \ref{chap:getting_started}) contains more succinct, step-by-step
64 instructions on running basic numerical experiments both sequentially
65 and in parallel. For many projects simply starting from an example
66 code and adapting it to suit a particular situation will be all that
67 is required.
68
69
70 \begin{figure}
71 \begin{center}
72 \resizebox{!}{2.5in}{\includegraphics{s_software/figs/mitgcm_goals.eps}}
73 \end{center}
74 \caption{ The MITgcm architecture is designed to allow simulation of a
75 wide range of physical problems on a wide range of hardware. The
76 computational resource requirements of the applications targeted
77 range from around $10^7$ bytes ($\approx 10$ megabytes) of memory to
78 $10^{11}$ bytes ($\approx 100$ gigabytes). Arithmetic operation
79 counts for the applications of interest range from $10^{9}$ floating
80 point operations to more than $10^{17}$ floating point operations.}
81 \label{fig:mitgcm_architecture_goals}
82 \end{figure}
83
84 \section{WRAPPER}
85 \begin{rawhtml}
86 <!-- CMIREDIR:wrapper: -->
87 \end{rawhtml}
88
89 A significant element of the software architecture utilized in MITgcm
90 is a software superstructure and substructure collectively called the
91 WRAPPER (Wrappable Application Parallel Programming Environment
92 Resource). All numerical and support code in MITgcm is written to
93 ``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
94 within the WRAPPER means that coding has to follow certain, relatively
95 straightforward, rules and conventions (these are discussed further in
96 section \ref{sec:specifying_a_decomposition}).
97
98 The approach taken by the WRAPPER is illustrated in figure
99 \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to
100 insulate code that fits within it from architectural differences
101 between hardware platforms and operating systems. This allows
102 numerical code to be easily retargeted.
103
104
105 \begin{figure}
106 \begin{center}
107 \resizebox{!}{4.5in}{\includegraphics{s_software/figs/fit_in_wrapper.eps}}
108 \end{center}
109 \caption{
110 Numerical code is written to fit within a software support
111 infrastructure called WRAPPER. The WRAPPER is portable and
112 can be specialized for a wide range of specific target hardware and
113 programming environments, without impacting numerical code that fits
114 within the WRAPPER. Codes that fit within the WRAPPER can generally be
115 made to run as fast on a particular platform as codes specially
116 optimized for that platform.}
117 \label{fig:fit_in_wrapper}
118 \end{figure}
119
120 \subsection{Target hardware}
121 \label{sec:target_hardware}
122
123 The WRAPPER is designed to target as broad a range of computer
124 systems as possible. The original development of the WRAPPER took place
125 on a multi-processor, CRAY Y-MP system. On that system, numerical code
126 performance and scaling under the WRAPPER exceeded that of an
127 implementation that was tightly bound to the CRAY system's proprietary
128 multi-tasking and micro-tasking approach. Later developments have been
129 carried out on uniprocessor and multi-processor Sun systems with both
130 uniform memory access (UMA) and non-uniform memory access (NUMA)
131 designs. Significant work has also been undertaken on x86 cluster
132 systems, Alpha processor based clustered SMP systems, and on
133 cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
134 systems. The MITgcm code, operating within the WRAPPER, is also
135 routinely used on large scale MPP systems (for example, Cray T3E and
136 IBM SP systems). In all cases numerical code, operating within the
137 WRAPPER, performs and scales very competitively with equivalent
138 numerical code that has been modified to contain native optimizations
139 for a particular system \cite{hoe-hill:99}.
140
141 \subsection{Supporting hardware neutrality}
142
143 The different systems listed in section \ref{sec:target_hardware} can
144 be categorized in many different ways. For example, one common
145 distinction is between shared-memory parallel systems (SMP and PVP)
146 and distributed memory parallel systems (for example x86 clusters and
147 large MPP systems). This is one example of a difference between
148 compute platforms that can impact an application. Another common
149 distinction is between vector processing systems with highly
150 specialized CPUs and memory subsystems and commodity microprocessor
151 based systems. There are numerous other differences, especially in
152 relation to how parallel execution is supported. To capture the
153 essential differences between different platforms the WRAPPER uses a
154 {\it machine model}.
155
156 \subsection{WRAPPER machine model}
157
158 Applications using the WRAPPER are not written to target just one
159 particular machine (for example an IBM SP2) or just one particular
160 family or class of machines (for example Parallel Vector Processor
161 Systems). Instead the WRAPPER provides applications with an abstract
162 {\it machine model}. The machine model is very general; however, it
163 can easily be specialized to fit, in a computationally efficient
164 manner, any computer architecture currently available to the
165 scientific computing community.
166
167 \subsection{Machine model parallelism}
168 \label{sec:domain_decomposition}
169 \begin{rawhtml}
170 <!-- CMIREDIR:domain_decomp: -->
171 \end{rawhtml}
172
173 Codes operating under the WRAPPER target an abstract machine that is
174 assumed to consist of one or more logical processors that can compute
175 concurrently. Computational work is divided among the logical
176 processors by allocating ``ownership'' to each processor of a certain
177 set (or sets) of calculations. Each set of calculations owned by a
178 particular processor is associated with a specific region of the
179 physical space that is being simulated; only one processor will be
180 associated with each such region (domain decomposition).
181
182 In a strict sense the logical processors over which work is divided do
183 not need to correspond to physical processors. It is perfectly
184 possible to execute a configuration decomposed for multiple logical
185 processors on a single physical processor. This helps ensure that
186 numerical code that is written to fit within the WRAPPER will
187 parallelize with no additional effort. It is also useful for
188 debugging purposes. Generally, however, the computational domain will
189 be subdivided over multiple logical processors in order to then bind
190 those logical processors to physical processor resources that can
191 compute in parallel.
192
193 \subsubsection{Tiles}
194
195 Computationally, the data structures (\textit{eg.} arrays, scalar
196 variables, etc.) that hold the simulated state are associated with
197 each region of physical space and are allocated to a particular
198 logical processor. We refer to these data structures as being {\bf
199 owned} by the processor to which their associated region of physical
200 space has been allocated. Individual regions that are allocated to
201 processors are called {\bf tiles}. A processor can own more than one
202 tile. Figure \ref{fig:domaindecomp} shows a physical domain being
203 mapped to a set of logical processors, with each processor owning a
204 single region of the domain (a single tile). Except for periods of
205 communication and coordination, each processor computes autonomously,
206 working only with data from the tile (or tiles) that the processor
207 owns. When multiple tiles are allotted to a single processor, each
208 tile is computed independently of the other tiles, in a sequential
209 fashion.
210
211 \begin{figure}
212 \begin{center}
213 \resizebox{5in}{!}{
214 \includegraphics{s_software/figs/domain_decomp.eps}
215 }
216 \end{center}
217 \caption{ The WRAPPER provides support for one and two dimensional
218 decompositions of grid-point domains. The figure shows a
219 hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
220 hypothetical domain is decomposed in two-dimensions along the
221 $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
222 owned} by different processors. The {\bf owning} processors
223 perform the arithmetic operations associated with a {\bf tile}.
224 Although not illustrated here, a single processor can {\bf own}
225 several {\bf tiles}. Whenever a processor wishes to transfer data
226 between tiles or communicate with other processors it calls a
227 WRAPPER supplied function. } \label{fig:domaindecomp}
228 \end{figure}
229
230
231 \subsubsection{Tile layout}
232
233 Tiles consist of an interior region and an overlap region. The
234 overlap region of a tile corresponds to the interior region of an
235 adjacent tile. In figure \ref{fig:tiledworld} each tile would own the
236 region within the black square and hold duplicate information for
237 overlap regions extending into the tiles to the north, south, east and
238 west. During computational phases a processor will reference data in
239 an overlap region whenever it requires values that lie outside the
240 domain it owns. Periodically processors will make calls to WRAPPER
241 functions to communicate data between tiles, in order to keep the
242 overlap regions up to date (see section
243 \ref{sec:communication_primitives}). The WRAPPER functions can use a
244 variety of different mechanisms to communicate data between tiles.
245
246 \begin{figure}
247 \begin{center}
248 \resizebox{5in}{!}{
249 \includegraphics{s_software/figs/tiled-world.eps}
250 }
251 \end{center}
252 \caption{ A global grid subdivided into tiles.
253 Tiles contain an interior region and an overlap region.
254 Overlap regions are periodically updated from neighboring tiles.
255 } \label{fig:tiledworld}
256 \end{figure}
257
258 \subsection{Communication mechanisms}
259
260 Logical processors are assumed to be able to exchange information
261 between tiles and between each other using at least one of two
262 possible mechanisms.
263
264 \begin{itemize}
265 \item {\bf Shared memory communication}. Under this mode of
266 communication data transfers are assumed to be possible using direct
267 addressing of regions of memory. In this case a CPU is able to read
268 (and write) directly to regions of memory ``owned'' by another CPU
269 using simple programming language level assignment operations of
270 the sort shown in figure \ref{fig:simple_assign}. In this way one
271 CPU (CPU1 in the figure) can communicate information to another CPU
272 (CPU2 in the figure) by assigning a particular value to a particular
273 memory location.
274
275 \item {\bf Distributed memory communication}. Under this mode of
276 communication there is no mechanism, at the application code level,
277 for directly addressing regions of memory owned and visible to
278 another CPU. Instead a communication library must be used as
279 illustrated in figure \ref{fig:comm_msg}. In this case a CPU must
280 call a function in the API of the communication library to
281 communicate data from a tile that it owns to a tile that another CPU
282 owns. By default the WRAPPER binds to the MPI communication library
283 \cite{MPI-std-20} for this style of communication.
284 \end{itemize}
285
286 The WRAPPER assumes that communication will use one of these two styles
287 of communication. The underlying hardware and operating system support
288 for the style used is not specified and can vary from system to system.
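
The sketch below shows what the distributed memory style looks like when
written out with explicit {\em MPI\_SEND} and {\em MPI\_RECV} calls,
mirroring the schematic of figure \ref{fig:comm_msg}. It is purely
illustrative: the program name and structure are not part of the WRAPPER,
which hides calls of this kind behind its own communication primitives
(section \ref{sec:communication_primitives}).
\begin{verbatim}
C     Illustrative sketch only (not WRAPPER source code): the value
C     assigned on rank 0 is transferred to rank 1 with explicit MPI
C     calls.
      PROGRAM COMM_SKETCH
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER ierr, myRank, mpiStatus(MPI_STATUS_SIZE)
      REAL*8 a(3)
      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myRank, ierr )
      IF ( myRank .EQ. 0 ) THEN
C       "CPU1": assign the value and send it explicitly
        a(3) = 8.
        CALL MPI_SEND( a(3), 1, MPI_DOUBLE_PRECISION, 1, 0,
     &                 MPI_COMM_WORLD, ierr )
      ELSEIF ( myRank .EQ. 1 ) THEN
C       "CPU2": receive the value into its own copy of a(3)
        CALL MPI_RECV( a(3), 1, MPI_DOUBLE_PRECISION, 0, 0,
     &                 MPI_COMM_WORLD, mpiStatus, ierr )
      ENDIF
      CALL MPI_FINALIZE( ierr )
      END
\end{verbatim}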
289
290 \begin{figure}
291 \begin{verbatim}
292
293 CPU1 | CPU2
294 ==== | ====
295 |
296 a(3) = 8 | WHILE ( a(3) .NE. 8 )
297 | WAIT
298 | END WHILE
299 |
300 \end{verbatim}
301 \caption{In the WRAPPER shared memory communication model, simple writes to an
302 array can be made to be visible to other CPUs at the application code level.
303 For example, if one CPU (CPU1 in the figure above) writes the value $8$ to
304 element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
305 will be able to see the value $8$ when they read from $a(3)$.
306 This provides a very low latency and high bandwidth communication
307 mechanism.
308 } \label{fig:simple_assign}
309 \end{figure}
310
311 \begin{figure}
312 \begin{verbatim}
313
314 CPU1 | CPU2
315 ==== | ====
316 |
317 a(3) = 8 | WHILE ( a(3) .NE. 8 )
318 CALL SEND( CPU2,a(3) ) | CALL RECV( CPU1, a(3) )
319 | END WHILE
320 |
321 \end{verbatim}
322 \caption{ In the WRAPPER distributed memory communication model
323 data cannot be made directly visible to other CPUs.
324 If one CPU writes the value $8$ to element $3$ of array $a$, then
325 at least one of CPU1 and CPU2 in the figure above will need
326 to call a bespoke communication library in order for the updated
327 value to be communicated between CPUs.
328 } \label{fig:comm_msg}
329 \end{figure}
330
331 \subsection{Shared memory communication}
332 \label{sec:shared_memory_communication}
333
334 Under shared memory communication independent CPUs operate on the
335 exact same global address space at the application level. This means
336 that CPU 1 can directly write into global data structures that CPU 2
337 ``owns'' using a simple assignment at the application level. This
338 model of memory access is supported at the basic system design
339 level in ``shared-memory'' systems such as PVP systems, SMP systems,
340 and on distributed shared memory systems (\textit{eg.} SGI Origin, SGI
341 Altix, and some AMD Opteron systems). On such systems the WRAPPER
342 will generally use simple read and write statements to directly access
343 application data structures when communicating between CPUs.
344
345 In a system where assignment statements, like the one in figure
346 \ref{fig:simple_assign}, map directly to hardware instructions that
347 transport data between CPU and memory banks, this can be a very
348 efficient mechanism for communication. In this case two CPUs, CPU1
349 and CPU2, can communicate simply by reading and writing to an agreed
350 location and following a few basic rules. The latency of this sort of
351 communication is generally not that much higher than the hardware
352 latency of other memory accesses on the system. The bandwidth
353 available between CPUs communicating in this way can be close to the
354 bandwidth of the system's main-memory interconnect. This can make this
355 method of communication very efficient provided it is used
356 appropriately.
357
358 \subsubsection{Memory consistency}
359 \label{sec:memory_consistency}
360
361 When using shared memory communication between multiple processors the
362 WRAPPER level shields user applications from certain counter-intuitive
363 system behaviors. In particular, one issue the WRAPPER layer must
364 deal with is a system's memory model. In general the order of reads
365 and writes expressed by the textual order of an application code may
366 not be the ordering of instructions executed by the processor
367 performing the application. The processor performing the application
368 instructions will always operate so that, for the application
369 instructions the processor is executing, any reordering is not
370 apparent. However, machines are often designed so that
371 reordering of instructions is not hidden from other processors.
372 This means that, in general, even on a shared memory system two
373 processors can observe inconsistent memory values.
374
375 The issue of memory consistency between multiple processors is
376 discussed at length in many computer science papers. From a practical
377 point of view, in order to deal with this issue, shared memory
378 machines all provide some mechanism to enforce memory consistency when
379 it is needed. The exact mechanism employed will vary between systems.
380 For communication using shared memory, the WRAPPER provides a place to
381 invoke the appropriate mechanism to ensure memory consistency for a
382 particular platform.
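
As a concrete illustration of where such a mechanism sits, the fragment
below sketches a producer thread publishing a value through shared memory.
The flag array is hypothetical; {\em MEMSYNC()} is the WRAPPER routine
described further in section \ref{sec:controlling_communication}.
\begin{verbatim}
C     Schematic fragment only. On a weakly ordered shared memory
C     machine the consistency primitive must separate the data write
C     from the "ready" flag write, so that a reader that sees the
C     flag set is guaranteed to also see the data. The flag array
C     name is hypothetical.
      a(3) = 8.
      CALL MEMSYNC
      dataReady(myThid) = .TRUE.
\end{verbatim}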
383
384 \subsubsection{Cache effects and false sharing}
385 \label{sec:cache_effects_and_false_sharing}
386
387 Shared-memory machines often have processor-local memory caches
388 which contain mirrored copies of main memory. Automatic cache-coherence
389 protocols are used to maintain consistency between caches on different
390 processors. These cache-coherence protocols typically enforce consistency
391 between regions of memory with large granularity (typically 128 or 256 byte
392 chunks). The coherency protocols employed can be expensive relative to other
393 memory accesses and so care is taken in the WRAPPER (by padding synchronization
394 structures appropriately) to avoid unnecessary coherence traffic.
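
The fragment below is an illustrative sketch of the padding idea; the
names and the assumed 128 byte coherence granularity are not taken from
the WRAPPER sources. Each thread's synchronization word is placed on its
own cache line so that updates by different threads never contend for the
same coherence unit.
\begin{verbatim}
C     Illustrative sketch of padding to avoid false sharing; names
C     and sizes are hypothetical. With 16 REAL*8 words (128 bytes)
C     reserved per thread, threadFlag(1,myThid) for different values
C     of myThid always falls on a different cache line.
      INTEGER cacheLineWords, maxThreads
      PARAMETER ( cacheLineWords = 16, maxThreads = 32 )
      REAL*8 threadFlag( cacheLineWords, maxThreads )
      COMMON / SKETCH_SYNC / threadFlag
\end{verbatim}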
395
396 \subsubsection{Operating system support for shared memory}
397
398 Applications running under multiple threads within a single process
399 can use shared memory communication. In this case {\it all} the
400 memory locations in an application are potentially visible to all the
401 compute threads. Running multiple threads within a single process is
402 the standard mechanism for supporting shared memory that the WRAPPER
403 utilizes. Configuring and launching code to run in multi-threaded mode
404 on specific platforms is discussed in section
405 \ref{sec:multi_threaded_execution}. However, on many systems,
406 potentially very efficient mechanisms for using shared memory
407 communication between multiple processes (in contrast to multiple
408 threads within a single process) also exist. In most cases this works
409 by making a limited region of memory shared between processes. The
410 MMAP %\ref{magicgarden}
411 and IPC %\ref{magicgarden}
412 facilities in UNIX
413 systems provide this capability as do vendor specific tools like LAPI
414 %\ref{IBMLAPI}
415 and IMC. %\ref{Memorychannel}.
416 Extensions exist for the
417 WRAPPER that allow these mechanisms to be used for shared memory
418 communication. However, these mechanisms are not distributed with the
419 default WRAPPER sources, because of their proprietary nature.
420
421 \subsection{Distributed memory communication}
422 \label{sec:distributed_memory_communication}
423 Many parallel systems are not constructed in a way that makes it
424 possible or practical for an application to use shared memory for
425 communication. For example, cluster systems consist of individual
426 computers connected by a fast network. On such systems there is no
427 notion of shared memory at the system level. For this sort of system
428 the WRAPPER provides support for communication based on a bespoke
429 communication library (see figure \ref{fig:comm_msg}). The default
430 communication library used is MPI \cite{MPI-std-20}. However, it is
431 relatively straightforward to implement bindings to optimized platform
432 specific communication libraries. For example the work described in
433 \cite{hoe-hill:99} replaced standard MPI communication with a highly
434 optimized library.
435
436 \subsection{Communication primitives}
437 \label{sec:communication_primitives}
438
439 \begin{figure}
440 \begin{center}
441 \resizebox{5in}{!}{
442 \includegraphics{s_software/figs/comm-primm.eps}
443 }
444 \end{center}
445 \caption{Three performance critical parallel primitives are provided
446 by the WRAPPER. These primitives are always used to communicate data
447 between tiles. The figure shows four tiles. The curved arrows
448 indicate exchange primitives which transfer data between the overlap
449 regions at tile edges and interior regions for nearest-neighbor
450 tiles. The straight arrows symbolize global sum operations which
451 connect all tiles. The global sum operation provides a key
452 arithmetic primitive and can also serve as a synchronization primitive. A
453 third primitive, the barrier, is also provided; it behaves much like the
454 global sum primitive. } \label{fig:communication_primitives}
455 \end{figure}
456
457
458 Optimized communication support is assumed to be potentially available
459 for a small number of communication operations. It is also assumed
460 that communication performance optimizations can be achieved by
461 optimizing a small number of communication primitives. Three
462 optimizable primitives are provided by the WRAPPER (their schematic use is sketched after this list):
463 \begin{itemize}
464 \item{\bf EXCHANGE} This operation is used to transfer data between
465 interior and overlap regions of neighboring tiles. A number of
466 different forms of this operation are supported. These different
467 forms handle
468 \begin{itemize}
469 \item Data type differences. Sixty-four bit and thirty-two bit
470 fields may be handled separately.
471 \item Bindings to different communication methods. Exchange
472 primitives select between using shared memory or distributed
473 memory communication.
474 \item Transformation operations required when transporting data
475 between different grid regions. Transferring data between faces of
476 a cube-sphere grid, for example, involves a rotation of vector
477 components.
478 \item Forward and reverse mode computations. Derivative calculations
479 require tangent linear and adjoint forms of the exchange
480 primitives.
481 \end{itemize}
482
483 \item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
484 operation for the pressure inversion phase of the MITgcm algorithm.
485 For certain configurations scaling can be highly sensitive to the
486 performance of the global sum primitive. This operation is a
487 collective operation involving all tiles of the simulated domain.
488 Different forms of the global sum primitive exist for handling
489 \begin{itemize}
490 \item Data type differences. Sixty-four bit and thirty-two bit
491 fields may be handled separately.
492 \item Bindings to different communication methods. Global sum
493 primitives select between using shared memory or distributed
494 memory communication.
495 \item Forward and reverse mode computations. Derivative calculations
496 require tangent linear and adjoint forms of the global sum
497 primitives.
498 \end{itemize}
499
500 \item{\bf BARRIER} The WRAPPER provides a global synchronization
501 function called barrier. This is used to synchronize computations
502 over all tiles. The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
503 have much in common and in some cases use the same underlying code.
504 \end{itemize}
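
Schematically, numerical code invokes these primitives through simple
subroutine calls, as in the sketch below. Only {\em GLOBAL\_SUM\_R8} is a
routine name that also appears in the code extract later in this chapter;
the exchange and barrier names here stand in for the WRAPPER-supplied
forms, whose exact names depend on the data type and grid being used.
\begin{verbatim}
C     Schematic only; EXCH_FIELD and the barrier call shown here are
C     placeholders for the actual WRAPPER routine names.
C     Update the overlap regions of the tiled field phi
      CALL EXCH_FIELD( phi, myThid )
C     Collective sum contributed to by every tile of the domain
      CALL GLOBAL_SUM_R8( sumPhi, myThid )
C     All threads and processes wait here before proceeding
      CALL BARRIER( myThid )
\end{verbatim}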
505
506
507 \subsection{Memory architecture}
508
509 The WRAPPER machine model aims to efficiently target systems with
510 highly pipelined memory architectures and systems with deep memory
511 hierarchies that favor memory reuse. This is achieved by supporting a
512 flexible tiling strategy as shown in figure \ref{fig:tiling-strategy}.
513 Within a CPU computations are carried out sequentially on each tile
514 in turn. By reshaping tiles according to the target platform it is
515 possible to automatically tune code to improve memory performance.
516 On a vector machine a given domain might be sub-divided into a few
517 long, thin regions. On a commodity microprocessor based system, however,
518 the same region could be simulated using many more, smaller
519 sub-domains.
520
521
522 \begin{figure}
523 \begin{center}
524 \resizebox{5in}{!}{
525 \includegraphics{s_software/figs/tiling_detail.eps}
526 }
527 \end{center}
528 \caption{The tiling strategy that the WRAPPER supports allows tiles
529 to be shaped to suit the underlying system memory architecture.
530 Compact tiles that lead to greater memory reuse can be used on cache
531 based systems (upper half of figure) with deep memory hierarchies, while long tiles
532 with large inner loops can be used to exploit vector systems having
533 highly pipelined memory systems.
534 } \label{fig:tiling-strategy}
535 \end{figure}
536
537 \newpage
538 \subsection{Summary}
539 Following the discussion above, the machine model that the WRAPPER
540 presents to an application has the following characteristics:
541
542 \begin{itemize}
543 \item The machine consists of one or more logical processors.
544 \item Each processor operates on tiles that it owns.
545 \item A processor may own more than one tile.
546 \item Processors may compute concurrently.
547 \item Exchange of information between tiles is handled by the
548 machine (WRAPPER) not by the application.
549 \end{itemize}
550 Behind the scenes this allows the WRAPPER to adapt the machine model
551 functions to exploit hardware on which:
552 \begin{itemize}
553 \item Processors may be able to communicate very efficiently with each
554 other using shared memory.
555 \item An alternative communication mechanism based on a relatively
556 simple inter-process communication API may be required.
557 \item Shared memory may not necessarily obey sequential consistency,
558 however some mechanism will exist for enforcing memory consistency.
559 \item Memory consistency that is enforced at the hardware level
560 may be expensive. Unnecessary triggering of consistency protocols
561 should be avoided.
562 \item Memory access patterns may need to be either repetitive or highly
563 pipelined for optimum hardware performance.
564 \end{itemize}
565
566 This generic model captures the essential hardware ingredients
567 of almost all successful scientific computer systems designed in the
568 last 50 years.
569
570 \section{Using the WRAPPER}
571 \begin{rawhtml}
572 <!-- CMIREDIR:using_the_wrapper: -->
573 \end{rawhtml}
574
575 In order to support maximum portability the WRAPPER is implemented
576 primarily in sequential Fortran 77. At a practical level the key steps
577 provided by the WRAPPER are:
578 \begin{enumerate}
579 \item specifying how a domain will be decomposed
580 \item starting a code in either sequential or parallel modes of operations
581 \item controlling communication between tiles and between concurrently
582 computing CPUs.
583 \end{enumerate}
584 This section describes the details of each of these operations.
585 Section \ref{sec:specifying_a_decomposition} explains how the way in
586 which a domain is decomposed (or composed) is expressed. Section
587 \ref{sec:starting_the_code} describes practical details of running
588 codes in various different parallel modes on contemporary computer
589 systems. Section \ref{sec:controlling_communication} explains the
590 internal information that the WRAPPER uses to control how information
591 is communicated between tiles.
592
593 \subsection{Specifying a domain decomposition}
594 \label{sec:specifying_a_decomposition}
595
596 At its heart much of the WRAPPER works only in terms of a collection of tiles
597 which are interconnected to each other. This is also true of application
598 code operating within the WRAPPER. Application code is written as a series
599 of compute operations, each of which operates on a single tile. If
600 application code needs to perform operations involving data
601 associated with another tile, it uses a WRAPPER function to obtain
602 that data.
603 The specification of how a global domain is constructed from tiles or alternatively
604 how a global domain is decomposed into tiles is made in the file {\em SIZE.h}.
605 This file defines the following parameters \\
606
607 \fbox{
608 \begin{minipage}{4.75in}
609 Parameters: {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy} \\
610 File: {\em model/inc/SIZE.h}
611 \end{minipage}
612 } \\
613
614 Together these parameters define a tiling decomposition of the style shown in
615 figure \ref{fig:labelled_tile}. The parameters {\em sNx} and {\em sNy} define
616 the size of an individual tile. The parameters {\em OLx} and {\em OLy} define the
617 maximum size of the overlap extent. This must be set to the maximum width
618 of the computational stencil that the numerical code's finite-difference operations
619 require between overlap region updates. The maximum overlap required
620 by any of the operations in the MITgcm code distributed with this release is three grid
621 points. This is set by the requirements of the $\nabla^4$ dissipation and
622 diffusion operator. Code modifications and enhancements that involve adding wide
623 finite-difference stencils may require increasing {\em OLx} and {\em OLy}.
624 Setting {\em OLx} and {\em OLy} to too large a value will decrease code
625 performance (because redundant computations will be performed), however it will
626 not cause any other problems.
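
As an illustration of how the stencil width sets the minimum overlap, the
fragment below sketches a five-point Laplacian evaluated over a tile
interior. It reads only one point into the overlap in each direction, so
this operator alone would be satisfied by {\em OLx}, {\em OLy} $\ge 1$;
the wider $\nabla^4$ operators mentioned above require the full
three-point overlap. The array names are hypothetical.
\begin{verbatim}
C     Sketch only: a width-one stencil referencing the first overlap
C     point on each side of the tile interior.
      DO j=1,sNy
       DO i=1,sNx
        lapF(i,j) = f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1)
     &            - 4.*f(i,j)
       ENDDO
      ENDDO
\end{verbatim}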
627
628 \begin{figure}
629 \begin{center}
630 \resizebox{5in}{!}{
631 \includegraphics{s_software/figs/size_h.eps}
632 }
633 \end{center}
634 \caption{ The three level domain decomposition hierarchy employed by the
635 WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated
636 to a single process. Multiple processes can exist, each with multiple tiles.
637 Tiles within a process can be spread over multiple compute threads.
638 } \label{fig:labelled_tile}
639 \end{figure}
640
641 The parameters {\em nSx} and {\em nSy} specify the number of tiles that will
642 be created within a single process. Each of these tiles will have internal
643 dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
644 allocated to different threads of a process that are then bound to
645 different physical processors ( see the multi-threaded
646 execution discussion in section \ref{sec:starting_the_code} ) then
647 computation will be performed concurrently on each tile. However, it is also
648 possible to run the same decomposition within a process running a single thread on
649 a single processor. In this case the tiles will be computed sequentially.
650 If the decomposition is run in a single process running multiple threads
651 but attached to a single physical processor, then, in general, the computation
652 for different tiles will be interleaved by system level software.
653 This too is a valid mode of operation.
654
655 The parameters {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy} are used extensively by
656 numerical code. The settings of {\em sNx, sNy, OLx} and {\em OLy}
657 are used to form the loop ranges for many numerical calculations and
658 to provide dimensions for arrays holding numerical state.
659 The parameters {\em nSx} and {\em nSy} are used in conjunction with the thread number
660 parameter {\em myThid}. Much of the numerical code operating within the
661 WRAPPER takes the form
662 \begin{verbatim}
663 DO bj=myByLo(myThid),myByHi(myThid)
664 DO bi=myBxLo(myThid),myBxHi(myThid)
665 :
666 a block of computations ranging
667 over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
668 :
669 ENDDO
670 ENDDO
671
672 communication code to sum a number or maybe update
673 tile overlap regions
674
675 DO bj=myByLo(myThid),myByHi(myThid)
676 DO bi=myBxLo(myThid),myBxHi(myThid)
677 :
678 another block of computations ranging
679 over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
680 :
681 ENDDO
682 ENDDO
683 \end{verbatim}
684 The variables {\em myBxLo(myThid), myBxHi(myThid), myByLo(myThid)} and {\em
685 myByHi(myThid)} set the bounds of the loops in {\em bi} and {\em bj} in this
686 schematic. These variables specify the subset of the tiles in
687 the range 1,{\em nSx} and 1,{\em nSy} that the logical processor bound to
688 thread number {\em myThid} owns. The thread number variable {\em myThid}
689 ranges from 1 to the total number of threads requested at execution time.
690 For each value of {\em myThid} the loop scheme above will step sequentially
691 through the tiles owned by that thread. However, different threads will
692 have different ranges of tiles assigned to them, so that separate threads can
693 compute iterations of the {\em bi}, {\em bj} loop concurrently.
694 Within a {\em bi}, {\em bj} loop
695 computation is performed concurrently over as many processes and threads
696 as there are physical processors available to compute.
697
698 An exception to the use of {\em bi} and {\em bj} in loops arises in the
699 exchange routines used when the exch2 package is used with the cubed
700 sphere. In this case {\em bj} is generally set to 1 and the loop runs
701 over {\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
702 which is then used to reference exchange parameters.
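
A schematic of this loop form is shown below; the routine used to look up
the tile number is a placeholder, not the actual exch2 name.
\begin{verbatim}
C     Schematic only; myTileIndex is a placeholder for the exch2
C     lookup of the global tile number.
      bj = 1
      DO bi=1,nSx
        iTile = myTileIndex(bi)
C       ... exchange operations parameterized by iTile ...
      ENDDO
\end{verbatim}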
703
704 The amount of computation that can be embedded within
705 a single loop over {\em bi} and {\em bj} varies for different parts of the
706 MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
707 from the two-dimensional implicit elliptic solver. This portion of the
708 code computes the l2-norm of a vector whose elements are held in
709 the array {\em cg2d\_r}, writing the final result to the scalar variable
710 {\em err}. In this case, because the l2-norm requires a global reduction,
711 the {\em bi},{\em bj} loop only contains one statement. This computation
712 phase is then followed by a communication phase in which all threads and
713 processes must participate. However,
714 in other areas of the MITgcm code entire subsections of code are within
715 a single {\em bi},{\em bj} loop. For example the evaluation of all
716 the momentum equation prognostic terms (see {\em S/R DYNAMICS()})
717 is within a single {\em bi},{\em bj} loop.
718
719 \begin{figure}
720 \begin{verbatim}
721 REAL*8 cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
722 REAL*8 err
723 :
724 :
725 other computations
726 :
727 :
728 err = 0.
729 DO bj=myByLo(myThid),myByHi(myThid)
730 DO bi=myBxLo(myThid),myBxHi(myThid)
731 DO J=1,sNy
732 DO I=1,sNx
733 err = err +
734 & cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
735 ENDDO
736 ENDDO
737 ENDDO
738 ENDDO
739
740 CALL GLOBAL_SUM_R8( err , myThid )
741 err = SQRT(err)
742
743 \end{verbatim}
744 \caption{Example of numerical code for calculating
745 the l2-norm of a vector within the WRAPPER. Notice that
746 under the WRAPPER arrays such as {\em cg2d\_r} have two extra trailing
747 dimensions. These rightmost indices are tile indexes. Different
748 threads within a single process operate on different ranges of tile
749 index, as controlled by the settings of
750 {\em myByLo, myByHi, myBxLo} and {\em myBxHi}.
751 } \label{fig:bibj_extract}
752 \end{figure}
753
754 The final decomposition parameters are {\em nPx} and {\em nPy}. These parameters
755 are used to indicate to the WRAPPER level how many processes (each with
756 {\em nSx}$\times${\em nSy} tiles) will be used for this simulation.
757 This information is needed during initialization and during I/O phases.
758 However, unlike the variables {\em sNx, sNy, OLx, OLy, nSx} and {\em nSy}
759 the values of {\em nPx} and {\em nPy} are absent
760 from the core numerical and support code.
761
762 \subsubsection{Examples of {\em SIZE.h} specifications}
763
764 The following different {\em SIZE.h} parameter settings illustrate how to
765 interpret the values of {\em sNx, sNy, OLx, OLy, nSx, nSy, nPx}
766 and {\em nPy}.
767 \begin{enumerate}
768 \item
769 \begin{verbatim}
770 PARAMETER (
771 & sNx = 90,
772 & sNy = 40,
773 & OLx = 3,
774 & OLy = 3,
775 & nSx = 1,
776 & nSy = 1,
777 & nPx = 1,
778 & nPy = 1)
779 \end{verbatim}
780 This sets up a single tile with x-dimension of ninety grid points, y-dimension of
781 forty grid points, and x and y overlaps of three grid points each.
782 \item
783 \begin{verbatim}
784 PARAMETER (
785 & sNx = 45,
786 & sNy = 20,
787 & OLx = 3,
788 & OLy = 3,
789 & nSx = 1,
790 & nSy = 1,
791 & nPx = 2,
792 & nPy = 2)
793 \end{verbatim}
794 This sets up tiles with x-dimension of forty-five grid points, y-dimension of
795 twenty grid points, and x and y overlaps of three grid points each. There are
796 four tiles allocated to four separate processes ({\em nPx=2,nPy=2}) and
797 arranged so that the global domain size is again ninety grid points in x and
798 forty grid points in y. In general the formula for global grid size (held in
799 model variables {\em Nx} and {\em Ny}) is
800 \begin{verbatim}
801 Nx = sNx*nSx*nPx
802 Ny = sNy*nSy*nPy
803 \end{verbatim}
804 \item
805 \begin{verbatim}
806 PARAMETER (
807 & sNx = 90,
808 & sNy = 10,
809 & OLx = 3,
810 & OLy = 3,
811 & nSx = 1,
812 & nSy = 2,
813 & nPx = 1,
814 & nPy = 2)
815 \end{verbatim}
816 This sets up tiles with x-dimension of ninety grid points, y-dimension of
817 ten grid points, and x and y overlaps of three grid points each. There are
818 four tiles allocated to two separate processes ({\em nPy=2}), each of which
819 has two separate sub-domains ({\em nSy=2}).
820 The global domain size is again ninety grid points in x and
821 forty grid points in y. The two sub-domains in each process will be computed
822 sequentially if they are given to a single thread within a single process.
823 Alternatively if the code is invoked with multiple threads per process
824 the two domains in y may be computed concurrently.
825 \item
826 \begin{verbatim}
827 PARAMETER (
828 & sNx = 32,
829 & sNy = 32,
830 & OLx = 3,
831 & OLy = 3,
832 & nSx = 6,
833 & nSy = 1,
834 & nPx = 1,
835 & nPy = 1)
836 \end{verbatim}
837 This sets up tiles with x-dimension of thirty-two grid points, y-dimension of
838 thirty-two grid points, and x and y overlaps of three grid points each.
839 There are six tiles allocated to six separate logical processors ({\em nSx=6}).
840 This set of values can be used for a cube sphere calculation.
841 Each tile of size $32 \times 32$ represents a face of the
842 cube. Initializing the tile connectivity correctly (see section
843 \ref{sec:cube_sphere_communication}) allows the rotations associated with
844 moving between the six cube faces to be embedded within the
845 tile-tile communication code.
846 \end{enumerate}
847
848
849 \subsection{Starting the code}
850 \label{sec:starting_the_code}
851 When code is started under the WRAPPER, execution begins in a main routine {\em
852 eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
853 to the application through a routine called {\em THE\_MODEL\_MAIN()}
854 once the WRAPPER has initialized correctly and has created the necessary variables
855 to support subsequent calls to communication routines
856 by the application code. The startup calling sequence followed by the
857 WRAPPER is shown in figure \ref{fig:wrapper_startup}.
858
859 \begin{figure}
860 {\footnotesize
861 \begin{verbatim}
862
863 MAIN
864 |
865 |--EEBOOT :: WRAPPER initialization
866 | |
867 | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
868 | | allow basic I/O.
869 | |-- EEINTRO_MSG :: Write startup greeting.
870 | |
871 | |-- EESET_PARMS :: Set WRAPPER parameters
872 | |
873 | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
874 | |
875 | |-- INI_PROCS :: Associate processes with grid regions.
876 | |
877 | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
878 | |
879 | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
880 | :: communication data structures
881 |
882 |
883 |--CHECK_THREADS :: Validate multiple thread start up.
884 |
885 |--THE_MODEL_MAIN :: Numerical code top-level driver routine
886
887
888 \end{verbatim}
889 }
890 \caption{Main stages of the WRAPPER startup procedure.
891 This process precedes the transfer of control to application code, which
892 occurs through the procedure {\em THE\_MODEL\_MAIN()}.
893 } \label{fig:wrapper_startup}
894 \end{figure}
895
896 \subsubsection{Multi-threaded execution}
897 \label{sec:multi_threaded_execution}
898 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
899 WRAPPER may cause several coarse grain threads to be initialized. The routine
900 {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
901 stack argument which is the thread number, stored in the
902 variable {\em myThid}. In addition to specifying a decomposition with
903 multiple tiles per process (see section \ref{sec:specifying_a_decomposition}),
904 configuring and starting a code to run using multiple threads requires the following
905 steps.\\
906
907 \paragraph{Compilation}
908 First the code must be compiled with appropriate multi-threading directives
909 active in the file {\em main.F} and with appropriate compiler flags
910 to request multi-threading support. The header files
911 {\em MAIN\_PDIRECTIVES1.h} and {\em MAIN\_PDIRECTIVES2.h}
912 contain directives compatible with compilers for Sun, Compaq, SGI,
913 Hewlett-Packard SMP systems and CRAY PVP systems. These directives can be
914 activated by using the compile-time
915 flags {\em -DTARGET\_SUN},
916 {\em -DTARGET\_DEC}, {\em -DTARGET\_SGI}, {\em -DTARGET\_HP}
917 or {\em -DTARGET\_CRAY\_VECTOR} respectively. Compiler options
918 for invoking multi-threaded compilation vary from system to system
919 and from compiler to compiler. The options will be described
920 in the individual compiler documentation. For the Fortran compiler
921 from Sun the following options are needed to correctly compile
922 multi-threaded code
923 \begin{verbatim}
924 -stackvar -explicitpar -vpara -noautopar
925 \end{verbatim}
926 These options are specific to the Sun compiler. Other compilers
927 will use different syntax that will be described in their
928 documentation. The effect of these options is as follows
929 \begin{enumerate}
930 \item {\bf -stackvar} Causes all local variables to be allocated in stack
931 storage. This is necessary for local variables to ensure that they are private
932 to their thread. Note, when using this option it may be necessary to override
933 the default limit on stack-size that the operating system assigns to a process.
934 This can normally be done by changing the settings of the command shell's
935 {\em stack-size} limit variable. However, on some systems changing this limit
936 will require privileged administrator access to modify system parameters.
937
938 \item {\bf -explicitpar} Requests that multiple threads be spawned
939 in response to explicit directives in the application code. These
940 directives are inserted with syntax appropriate to the particular target
941 platform when, for example, the {\em -DTARGET\_SUN} flag is selected.
942
943 \item {\bf -vpara} This causes the compiler to describe the multi-threaded
944 configuration it is creating. This is not required
945 but it can be useful when troubleshooting.
946
947 \item {\bf -noautopar} This inhibits any automatic multi-threaded
948 parallelization the compiler may otherwise generate.
949
950 \end{enumerate}
951
952
953 An example of valid settings for the {\em eedata} file for a
954 domain with two subdomains in y and running with two threads is shown
955 below
956 \begin{verbatim}
957 nTx=1,nTy=2
958 \end{verbatim}
959 This set of values will cause computations to stay within a single
960 thread when moving across the {\em nSx} sub-domains. In the y-direction,
961 however, sub-domains will be split equally between two threads.
962
963 \paragraph{Multi-threading files and parameters} The following
964 files and variables are used in setting up multi-threaded execution.\\
965
966 \fbox{
967 \begin{minipage}{4.75in}
968 File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
969 File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
970 File: {\em model/src/THE\_MODEL\_MAIN.F}\\
971 File: {\em eesupp/src/MAIN.F}\\
972 File: {\em tools/genmake2}\\
973 File: {\em eedata}\\
974 CPP: {\em TARGET\_SUN}\\
975 CPP: {\em TARGET\_DEC}\\
976 CPP: {\em TARGET\_HP }\\
977 CPP: {\em TARGET\_SGI}\\
978 CPP: {\em TARGET\_CRAY\_VECTOR}\\
979 Parameter: {\em nTx}\\
980 Parameter: {\em nTy}
981 \end{minipage}
982 } \\
983
984 \subsubsection{Multi-process execution}
985 \label{sec:multi_process_execution}
986
987 Despite its appealing programming model, multi-threaded execution
988 remains less common than multi-process execution. One major reason for
989 this is that many system libraries are still not ``thread-safe''. This
990 means that, for example, on some systems it is not safe to call system
991 routines to perform I/O when running in multi-threaded mode (except,
992 perhaps, in a limited set of circumstances). Another reason is that
993 support for multi-threaded programming models varies between systems.
994
995 Multi-process execution is more ubiquitous. In order to run code in a
996 multi-process configuration a decomposition specification (see section
997 \ref{sec:specifying_a_decomposition}) is given (in which at least
998 one of the parameters {\em nPx} or {\em nPy} will be greater than one)
999 and then, as for multi-threaded operation, appropriate compile time
1000 and run time steps must be taken.
1001
1002 \paragraph{Compilation} Multi-process execution under the WRAPPER
1003 assumes that portable MPI libraries are available for controlling
1004 the start-up of multiple processes. The MPI libraries are not
1005 required for performance critical communication, although they are
1006 usually used for it. However, in order to simplify the task of controlling
1007 and coordinating the start up of a large number (hundreds and possibly
1008 even thousands) of copies of the same program, MPI is used. The calls
1009 to the MPI multi-process startup routines must be activated at compile
1010 time. Currently MPI libraries are invoked by specifying the
1011 appropriate options file with the {\tt-of} flag when running the {\em
1012 genmake2} script, which generates the Makefile for compiling and
1013 linking MITgcm. (Previously this was done by setting the {\em
1014 ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
1015 CPP\_EEOPTIONS.h} file.) More detailed information about the use of
1016 {\em genmake2} for specifying
1017 local compiler flags is located in section \ref{sec:genmake}.\\
1018
1019
1020 \fbox{
1021 \begin{minipage}{4.75in}
1022 Directory: {\em tools/build\_options}\\
1023 File: {\em tools/genmake2}
1024 \end{minipage}
1025 } \\
1026 \paragraph{Execution} The mechanics of starting a program in
1027 multi-process mode under MPI are not standardized. Documentation
1028 associated with the distribution of MPI installed on a system will
1029 describe how to start a program using that distribution. For the
1030 open-source MPICH system, the MITgcm program can be started using a
1031 command such as
1032 \begin{verbatim}
1033 mpirun -np 64 -machinefile mf ./mitgcmuv
1034 \end{verbatim}
1035 In this example the text {\em -np 64} specifies the number of
1036 processes that will be created. The numeric value {\em 64} must be
1037 equal to the product of the processor grid settings of {\em nPx} and
1038 {\em nPy} in the file {\em SIZE.h}. The argument {\em mf} to the
1039 {\em -machinefile} option specifies that a text file called ``mf'' will be
1040 read to get a list of processor names on which the sixty-four processes
1041 will execute. The syntax of this file is specified by the MPI distribution.
1042 \\
1043
1044 \fbox{
1045 \begin{minipage}{4.75in}
1046 File: {\em SIZE.h}\\
1047 Parameter: {\em nPx}\\
1048 Parameter: {\em nPy}
1049 \end{minipage}
1050 } \\
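
For instance, one purely illustrative processor-grid setting consistent
with the {\em -np 64} example above would be an eight by eight process
grid; only the two relevant entries of the {\em SIZE.h} PARAMETER
statement are shown here.
\begin{verbatim}
     & nPx = 8,
     & nPy = 8)
\end{verbatim}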
1051
1052
1053 \paragraph{Environment variables}
1054 On most systems multi-threaded execution also requires the setting of
1055 a special environment variable. On many machines this variable is
1056 called PARALLEL and its value should be set to the number of parallel
1057 threads required. Generally the help or manual pages associated with
1058 the multi-threaded compiler on a machine will explain how to set the
1059 required environment variables.
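
For example (both the variable name and the shell syntax vary from system
to system), a two-thread run might be prepared with:
\begin{verbatim}
setenv PARALLEL 2      # csh / tcsh
export PARALLEL=2      # sh / bash
\end{verbatim}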
1060
1061 \paragraph{Runtime input parameters}
1062 Finally the file {\em eedata} needs to be configured to indicate the
1063 number of threads to be used in the x and y directions. The variables
1064 {\em nTx} and {\em nTy} in this file are used to specify the
1065 information required. The product of {\em nTx} and {\em nTy} must be
1066 equal to the number of threads spawned i.e. the setting of the
1067 environment variable PARALLEL. The value of {\em nTx} must subdivide
1068 the number of sub-domains in x ({\em nSx}) exactly. The value of {\em
1069 nTy} must subdivide the number of sub-domains in y ({\em nSy})
1070 exactly. The multiprocess startup of the MITgcm executable {\em
1071 mitgcmuv} is controlled by the routines {\em EEBOOT\_MINIMAL()} and
1072 {\em INI\_PROCS()}. The first routine performs basic steps required to
1073 make sure each process is started and has a textual output stream
1074 associated with it. By default two output files are opened for each
1075 process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}. The {\bf
1076 NNNN} part of the name is filled in with the process number so that
1077 process number 0 will create output files {\bf STDOUT.0000} and {\bf
1078 STDERR.0000}, process number 1 will create output files {\bf
1079 STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
1080 reporting status and configuration information and for reporting error
1081 conditions on a process by process basis. The {\em EEBOOT\_MINIMAL()}
1082 procedure also sets the variables {\em myProcId} and {\em
1083 MPI\_COMM\_MODEL}. These variables are related to processor
1084 identification and are used later in the routine {\em INI\_PROCS()} to
1085 allocate tiles to processes.
1086
1087 Allocation of processes to tiles is controlled by the routine {\em
1088 INI\_PROCS()}. For each process this routine sets the variables {\em
1089 myXGlobalLo} and {\em myYGlobalLo}. These variables specify, in
1090 index space, the coordinates of the southernmost and westernmost
1091 corner of the southernmost and westernmost tile owned by this process.
1092 The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
1093 also set in this routine. These are used to identify processes holding
1094 tiles to the west, east, south and north of a given process. These
1095 values are stored in global storage in the header file {\em
1096 EESUPPORT.h} for use by communication routines. The above does not
1097 hold when the exch2 package is used. The exch2 package sets its own
1098 parameters to specify the global indices of tiles and their
1099 relationships to each other. See the documentation on the exch2
1100 package (\ref{sec:exch2}) for details.
1101 \\
1102
1103 \fbox{
1104 \begin{minipage}{4.75in}
1105 File: {\em eesupp/src/eeboot\_minimal.F}\\
1106 File: {\em eesupp/src/ini\_procs.F} \\
1107 File: {\em eesupp/inc/EESUPPORT.h} \\
1108 Parameter: {\em myProcId} \\
1109 Parameter: {\em MPI\_COMM\_MODEL} \\
1110 Parameter: {\em myXGlobalLo} \\
1111 Parameter: {\em myYGlobalLo} \\
1112 Parameter: {\em pidW } \\
1113 Parameter: {\em pidE } \\
1114 Parameter: {\em pidS } \\
1115 Parameter: {\em pidN }
1116 \end{minipage}
1117 } \\
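
For a regular decomposition the relationship between the process number
and these corner indices can be sketched as below. This is only an
illustration of the arithmetic involved, not the actual {\em INI\_PROCS()}
code; {\em myPx} and {\em myPy} are local helper variables introduced here.
\begin{verbatim}
C     Illustration only (not the actual INI_PROCS() code): for a
C     regular nPx x nPy process grid, numbered row by row from
C     process 0, the south-west corner indices could be formed as
      INTEGER myPx, myPy
      myPx = MOD( myProcId, nPx )
      myPy = myProcId / nPx
      myXGlobalLo = 1 + myPx*sNx*nSx
      myYGlobalLo = 1 + myPy*sNy*nSy
\end{verbatim}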
1118
1119
1120 \subsection{Controlling communication}
1121 \label{sec:controlling_communication}
1122 The WRAPPER maintains internal information that is used for communication
1123 operations and that can be customized for different platforms. This section
1124 describes the information that is held and used.
1125
1126 \begin{enumerate}
1127 \item {\bf Tile-tile connectivity information}
1128 For each tile the WRAPPER records the numbers of the tiles to
1129 the north, south, east and west of that tile. Each tile number is unique
1130 over all tiles in a configuration. Except when using the cubed
1131 sphere and the exch2 package, these numbers are held in the variables
1132 {\em tileNo} (this holds the tile's own number), {\em tileNoN}, {\em
1133 tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
1134 stored with each tile that specifies the type of communication that
1135 is used between tiles. This information is held in the variables
1136 {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
1137 {\em tileCommModeW}. This latter set of variables can take one of
1138 the following values {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
1139 COMM\_PUT} and {\em COMM\_GET}. A value of {\em COMM\_NONE} is
1140 used to indicate that a tile has no neighbor to communicate with on
1141 a particular face. A value of {\em COMM\_MSG} is used to indicate
1142 that some form of distributed memory communication is required to
1143 communicate between these tile faces (see section
1144 \ref{sec:distributed_memory_communication}). A value of {\em
1145 COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
1146 memory communication (see section
1147 \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value
1148 indicates that a CPU should communicate by writing to data
1149 structures owned by another CPU. A {\em COMM\_GET} value indicates
1150 that a CPU should communicate by reading from data structures owned
1151 by another CPU. These flags affect the behavior of the WRAPPER
1152 exchange primitive (see figure \ref{fig:communication_primitives}).
1153 The routine {\em ini\_communication\_patterns()} is responsible for
1154 setting the communication mode values for each tile.
1155
1156 When using the cubed sphere configuration with the exch2 package,
1157 the relationships between tiles and their communication methods are
1158 set by the exch2 package and stored in different variables. See the
1159 exch2 package documentation (\ref{sec:exch2}) for details.
1160
1161 \fbox{
1162 \begin{minipage}{4.75in}
1163 File: {\em eesupp/src/ini\_communication\_patterns.F}\\
1164 File: {\em eesupp/inc/EESUPPORT.h} \\
1165 Parameter: {\em tileNo} \\
1166 Parameter: {\em tileNoE} \\
1167 Parameter: {\em tileNoW} \\
1168 Parameter: {\em tileNoN} \\
1169 Parameter: {\em tileNoS} \\
1170 Parameter: {\em tileCommModeE} \\
1171 Parameter: {\em tileCommModeW} \\
1172 Parameter: {\em tileCommModeN} \\
1173 Parameter: {\em tileCommModeS} \\
1174 \end{minipage}
1175 } \\
1176
1177 \item {\bf MP directives}
1178 The WRAPPER transfers control to numerical application code through
1179 the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
1180 that allows it to be invoked by several threads. Support for
1181 this is based on either multi-processing (MP) compiler directives or
1182 specific calls to multi-threading libraries (\textit{eg.} POSIX
1183 threads). Most commercially available Fortran compilers support the
1184 generation of code to spawn multiple threads through some form of
1185 compiler directives. Compiler directives are generally more
1186 convenient than writing code to explicitly spawn threads and,
1187 on some systems, they may be the only method
1188 available. The WRAPPER is distributed with template MP directives
1189 for a number of systems.
1190
1191 These directives are inserted into the code just before and after
1192 the transfer of control to numerical algorithm code through the
1193 routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
1194 an example of the code that performs this process for a Silicon
1195 Graphics system. This code is extracted from the files {\em main.F}
1196 and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
1197 specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
1198 will be created. The value of {\em nThreads} is set in the routine
1199 {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
1200 product of the parameters {\em nTx} and {\em nTy} that are read from
1201 the file {\em eedata}; an illustrative {\em eedata} fragment is shown at the end of this list. If the value of {\em nThreads} is
1202 inconsistent with the number of threads requested from the operating
1203 system (for example by using an environment variable as described in
1204 section \ref{sec:multi_threaded_execution}) then usually an error
1205 will be reported by the routine {\em CHECK\_THREADS}.
1206
1207 \fbox{
1208 \begin{minipage}{4.75in}
1209 File: {\em eesupp/src/ini\_threading\_environment.F}\\
1210 File: {\em eesupp/src/check\_threads.F} \\
1211 File: {\em eesupp/src/main.F} \\
1212 File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h} \\
1213 File: {\em eedata } \\
1214 Parameter: {\em nThreads} \\
1215 Parameter: {\em nTx} \\
1216 Parameter: {\em nTy} \\
1217 \end{minipage}
1218 }
1219
1220 \item {\bf memsync flags}
1221 As discussed in section \ref{sec:memory_consistency}, a low-level
1222 system function may be needed to force memory consistency on some
1223 shared memory systems. The routine {\em MEMSYNC()} is used for this
1224 purpose. This routine should not need modifying and the information
1225 below is only provided for completeness. A logical parameter {\em
1226 exchNeedsMemSync} set in the routine {\em
1227 INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
1228 MEMSYNC()} primitive is called. In general this routine is only
1229 used for multi-threaded execution. The code that goes into the {\em
1230 MEMSYNC()} routine is specific to the compiler and processor used.
1231 In some cases, it must be written using a short code snippet of
1232 assembly language. For an UltraSPARC system the following code
1233 snippet is used:
1234 \begin{verbatim}
1235 asm("membar #LoadStore|#StoreStore");
1236 \end{verbatim}
1237 for an Alpha based system the equivalent code reads
1238 \begin{verbatim}
1239 asm("mb");
1240 \end{verbatim}
1241 while on an x86 system the following code is required
1242 \begin{verbatim}
1243 asm("lock; addl $0,0(%%esp)": : :"memory")
1244 \end{verbatim}
1245
1246 \item {\bf Cache line size}
1247 As discussed in section \ref{sec:cache_effects_and_false_sharing},
1248 multi-threaded codes explicitly avoid penalties associated with
1249 excessive coherence traffic on an SMP system. To do this the shared
1250 memory data structures used by the {\em GLOBAL\_SUM}, {\em
1251 GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
1252 that control the padding are set in the header file {\em
1253 EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
1254 lShare1}, {\em lShare4} and {\em lShare8}. The default values
1255 should not normally need changing.
1256
1257 \item {\bf \_BARRIER}
1258 This is a CPP macro that is expanded to a call to a routine which
1259 synchronizes all the logical processors running under the WRAPPER.
1260 Using a macro here preserves flexibility to insert a specialized
1261 call in-line into application code. By default this resolves to
1262 calling the procedure {\em BARRIER()}. The default setting for the
1263 \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
1264
1265 \item {\bf \_GSUM}
1266 This is a CPP macro that is expanded to a call to a routine which
1267 sums up a floating point number over all the logical processors
1268 running under the WRAPPER. Using a macro here provides extra
1269 flexibility to insert a specialized call in-line into application
1270 code. By default this resolves to calling the procedure {\em
1271 GLOBAL\_SUM\_R8()} ( for 64-bit floating point operands) or {\em
1272 GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
1273 default setting for the \_GSUM macro is given in the file {\em
1274 CPP\_EEMACROS.h}. The \_GSUM macro is a performance critical
1275 operation, especially for large processor count, small tile size
1276 configurations. The custom communication example discussed in
1277 section \ref{sec:jam_example} shows how the macro is used to invoke
1278 a custom global sum routine for a specific set of hardware.
1279
1280 \item {\bf \_EXCH}
1281 The \_EXCH CPP macro is used to update tile overlap regions. It is
1282 qualified by a suffix indicating whether overlap updates are for
1283 two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1284 physical fields and whether fields are 32-bit floating point (
1285 \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point (
1286 \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined in
1287 the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
1288 operation plays a crucial role in scaling to small tile, large
1289 logical and physical processor count configurations. The example in
1290 section \ref{sec:jam_example} discusses defining an optimized and
1291 specialized form of the \_EXCH operation.
1292
1293 The \_EXCH operation is also central to supporting grids such as the
1294 cube-sphere grid. In this class of grid a rotation may be required
1295 between tiles. Aligning the coordinate requiring rotation with the
1296 tile decomposition allows the coordinate transformation to be
1297 embedded within a custom form of the \_EXCH primitive. In these
1298 cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
1299 package documentation \ref{sec:exch2}.
1300
1301 \item {\bf Reverse Mode}
1302 The communication primitives \_EXCH and \_GSUM both employ
1303 hand-written adjoint (or reverse mode) forms. These reverse
1304 mode forms can be found in the source code directory {\em
1305 pkg/autodiff}. For the global sum primitive the reverse mode form
1306 calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
1307 The reverse mode forms of the exchange primitives are found in
1308 routines prefixed {\em ADEXCH}. The exchange routines make calls to
1309 the same low-level communication primitives as the forward mode
1310 operations. However, the routine argument {\em simulationMode} is
1311 set to the value {\em REVERSE\_SIMULATION}. This signifies to the
1312 low-level routines that the adjoint forms of the appropriate
1313 communication operation should be performed.
1314
1315 \item {\bf MAX\_NO\_THREADS}
1316 The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
1317 number of OS threads that a code will use. This value defaults to
1318 thirty-two and is set in the file {\em EEPARAMS.h}. For single
1319 threaded execution it can be reduced to one if required. The value
1320 is largely private to the WRAPPER and application code will not
1321 normally reference the value, except in the following scenario.
1322
1323 For certain physical parametrization schemes it is necessary to have
1324 a substantial number of work arrays. Where these arrays are
1325 held in statically allocated storage (for example COMMON blocks) multi-threaded
1326 execution will require multiple instances of the COMMON block data.
1327 This can be achieved using a Fortran 90 module construct. However,
1328 if this mechanism is unavailable then the work arrays can be extended
1329 with dimensions using the tile dimensioning scheme of {\em nSx} and
1330 {\em nSy} (as described in section
1331 \ref{sec:specifying_a_decomposition}). However, if the
1332 configuration being specified involves many more tiles than OS
1333 threads then it can save memory resources to reduce the variable
1334 {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
1335 that will be used and to declare the physical parameterization work
1336 arrays with a single {\em MAX\_NO\_THREADS} extra dimension. An
1337 example of this is given in the verification experiment {\em
1338 aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
1339 altered to
1340 \begin{verbatim}
1341 INTEGER MAX_NO_THREADS
1342 PARAMETER ( MAX_NO_THREADS = 6 )
1343 \end{verbatim}
1344 and several work arrays for storing intermediate calculations are
1345 created with declarations of the form:
1346 \begin{verbatim}
1347 common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1348 \end{verbatim}
1349 This declaration scheme is not widely used, because most global data
1350 is used for permanent, rather than temporary, storage of state information.
1351 In the case of permanent state information this approach cannot be
1352 used because there has to be enough storage allocated for all tiles.
1353 However, the technique can sometimes be a useful scheme for reducing
1354 memory requirements in complex physical parameterizations.
1355 \end{enumerate}
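To make the tile connectivity information above more concrete, the
following is a minimal sketch (not actual WRAPPER source) of how a
routine might branch on the communication mode stored for the western
face of a tile. The subroutine name, argument list, dimensions and the
numerical values of the mode constants are illustrative assumptions;
in the real code these are supplied by the WRAPPER headers such as
{\em EESUPPORT.h}.
\begin{verbatim}
C     Schematic sketch only -- not actual WRAPPER source.  The routine
C     name, argument list, dimensions and mode constant values are
C     illustrative; the WRAPPER supplies them via the eesupp headers.
      SUBROUTINE EXCH_WEST_SKETCH( tileCommModeW, bi, bj, myThid )
      IMPLICIT NONE
      INTEGER nSx, nSy
      PARAMETER ( nSx = 2, nSy = 2 )
      INTEGER COMM_NONE, COMM_MSG, COMM_PUT, COMM_GET
      PARAMETER ( COMM_NONE = 0, COMM_MSG = 1,
     &            COMM_PUT  = 2, COMM_GET  = 3 )
      INTEGER tileCommModeW(nSx,nSy)
      INTEGER bi, bj, myThid

      IF     ( tileCommModeW(bi,bj) .EQ. COMM_NONE ) THEN
C       No neighbor on the western face: nothing to exchange.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_MSG  ) THEN
C       Distributed memory neighbor: communicate by message passing.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT  ) THEN
C       Shared memory neighbor: write into data structures it owns.
      ELSE
C       COMM_GET: read directly from data structures the neighbor owns.
      ENDIF

      RETURN
      END
\end{verbatim}

Similarly, the {\em nTx} and {\em nTy} values that determine {\em
nThreads} are read from the file {\em eedata}. A minimal fragment of
the kind that would request a $2 \times 2$ arrangement of threads
might look as follows; the values are illustrative and other entries
of a real {\em eedata} file are omitted:
\begin{verbatim}
# Fragment of an "eedata" file (illustrative values)
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}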
1356
1357 \begin{figure}
1358 \begin{verbatim}
1359 C--
1360 C-- Parallel directives for MIPS Pro Fortran compiler
1361 C--
1362 C Parallel compiler directives for SGI with IRIX
1363 C$PAR PARALLEL DO
1364 C$PAR& CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
1365 C$PAR& SHARE(nThreads),LOCAL(myThid,I)
1366 C
1367 DO I=1,nThreads
1368 myThid = I
1369
1370 C-- Invoke nThreads instances of the numerical model
1371 CALL THE_MODEL_MAIN(myThid)
1372
1373 ENDDO
1374 \end{verbatim}
1375 \caption{Prior to transferring control to the procedure {\em
1376 THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
1377 multiple threads. } \label{fig:mp_directives}
1378 \end{figure}
1379
1380
1381 \subsubsection{Specializing the Communication Code}
1382
1383 The isolation of performance critical communication primitives and the
1384 sub-division of the simulation domain into tiles are powerful tools.
1385 Here we show how they can be used to improve application performance and
1386 how they can be used to adapt to new gridding approaches.
1387
1388 \subsubsection{JAM example}
1389 \label{sec:jam_example}
1390 On some platforms a big performance boost can be obtained by binding
1391 the communication routines {\em \_EXCH} and {\em \_GSUM} to
1392 specialized native libraries (for example, the shmem library on CRAY
1393 T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
1394 illustration of a specialized communication configuration that
1395 substitutes for standard, portable forms of {\em \_EXCH} and {\em
1396 \_GSUM}. It affects three source files {\em eeboot.F}, {\em
1397 CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined it has
1398 the following effects.
1399 \begin{itemize}
1400 \item An extra phase is included at boot time to initialize the custom
1401 communications library ( see {\em ini\_jam.F}).
1402 \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
1403 with calls to custom routines (see {\em gsum\_jam.F} and {\em
1404 exch\_jam.F}).
1405 \item A highly specialized form of the exchange operator (optimized
1406 for overlap regions of width one) is substituted into the elliptic
1407 solver routine {\em cg2d.F}.
1408 \end{itemize}
1409 Developing specialized code for other libraries follows a similar
1410 pattern.
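As a purely illustrative sketch of this substitution mechanism, the
conditional macro definition below shows the general form such a
mapping could take in {\em CPP\_EEMACROS.h}. The macro argument list
and the name of the specialized routine are assumptions made for the
purpose of illustration, not the actual definitions used with
{\em LETS\_MAKE\_JAM}.
\begin{verbatim}
C     Illustrative sketch only: the argument list and the specialized
C     routine name are assumptions, not the actual JAM definitions.
#ifdef LETS_MAKE_JAM
#define _GSUM(a,myThid)  CALL GSUM_JAM ( a, myThid )
#else
#define _GSUM(a,myThid)  CALL GLOBAL_SUM_R8 ( a, myThid )
#endif
\end{verbatim}
Application code written in terms of {\em \_GSUM} then picks up the
specialized routine at compile time, without any change to the
numerical source.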
1411
1412 \subsubsection{Cube sphere communication}
1413 \label{sec:cube_sphere_communication}
1414 Actual {\em \_EXCH} routine code is generated automatically from a
1415 series of template files, for example {\em exch\_rx.template}. This
1416 is done to allow a large number of variations on the exchange process
1417 to be maintained. One set of variations supports the cube sphere grid.
1418 Support for a cube sphere grid in MITgcm is based on having each face
1419 of the cube as a separate tile or tiles. The exchange routines are
1420 then able to absorb much of the detailed rotation and reorientation
1421 required when moving around the cube grid. The set of {\em \_EXCH}
1422 routines that contain the word cube in their name perform these
1423 transformations. They are invoked when the run-time logical parameter
1424 {\em useCubedSphereExchange} is set true. To facilitate the
1425 transformations on a staggered C-grid, exchange operations are defined
1426 separately for vector and scalar quantities and for grid-centered,
1427 grid-face and grid-corner quantities. Three sets of exchange
1428 routines are defined. Routines with names of the form {\em exch\_rx}
1429 are used to exchange cell centered scalar quantities. Routines with
1430 names of the form {\em exch\_uv\_rx} are used to exchange vector
1431 quantities located at the C-grid velocity points. The vector
1432 quantities exchanged by the {\em exch\_uv\_rx} routines can either be
1433 signed (for example velocity components) or un-signed (for example
1434 grid-cell separations). Routines with names of the form {\em
1435 exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
1436 point locations.
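The following is a minimal sketch of how the three classes of exchange
might be invoked from model code. The routine names follow the
{\em exch\_rx}, {\em exch\_uv\_rx} and {\em exch\_z\_rx} naming pattern
described above (with {\em rx} taken as {\em R8}), but the exact
argument lists shown are assumptions rather than the actual interfaces.
\begin{verbatim}
C     Illustrative calls only; argument lists are schematic.  Here
C     theta is a cell-centered scalar, (uVel,vVel) a C-grid vector
C     pair and zeta a vorticity-point quantity.  The logical argument
C     to the vector exchange is assumed to distinguish signed
C     (.TRUE.) from un-signed (.FALSE.) quantities.
      CALL EXCH_R8   ( theta, myThid )
      CALL EXCH_UV_R8( uVel, vVel, .TRUE., myThid )
      CALL EXCH_Z_R8 ( zeta,  myThid )
\end{verbatim}
When {\em useCubedSphereExchange} is true, the cube-aware forms of
these routines are invoked, so the calling code does not need to
change when switching between grids.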
1437
1438
1439
1440
1441 \section{MITgcm execution under WRAPPER}
1442 \begin{rawhtml}
1443 <!-- CMIREDIR:mitgcm_wrapper: -->
1444 \end{rawhtml}
1445
1446 Fitting together the WRAPPER elements, package elements and
1447 MITgcm core equation elements of the source code produces the calling
1448 sequence shown in section \ref{sec:calling_sequence}.
1449
1450 \subsection{Annotated call tree for MITgcm and WRAPPER}
1451 \label{sec:calling_sequence}
1452
1453 WRAPPER layer:
1454
1455 {\footnotesize
1456 \begin{verbatim}
1457
1458 MAIN
1459 |
1460 |--EEBOOT :: WRAPPER initialization
1461 | |
1462 | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1463 | | allow basic I/O.
1464 | |-- EEINTRO_MSG :: Write startup greeting.
1465 | |
1466 | |-- EESET_PARMS :: Set WRAPPER parameters
1467 | |
1468 | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1469 | |
1470 | |-- INI_PROCS :: Associate processes with grid regions.
1471 | |
1472 | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1473 | |
1474 | |--INI_COMMUNICATION_PATTERNS :: Initialize between-tile
1475 | :: communication data structures
1476 |
1477 |
1478 |--CHECK_THREADS :: Validate multiple thread start up.
1479 |
1480 |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1481
1482 \end{verbatim}
1483 }
1484
1485 Core equations plus packages:
1486
1487 {\footnotesize
1488 \begin{verbatim}
1489 C
1490 C Invocation from WRAPPER level...
1491 C :
1492 C :
1493 C |
1494 C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
1495 C | :: Called from WRAPPER level numerical
1496 C | :: code invocation routine. On entry
1497 C | :: to THE_MODEL_MAIN separate threads and
1498 C | :: separate processes will have been established.
1499 C | :: Each thread and process will have a unique ID
1500 C | :: but as yet it will not be associated with a
1501 C | :: specific region in decomposed discrete space.
1502 C |
1503 C |-INITIALISE_FIXED :: Set fixed model arrays such as topography,
1504 C | | :: grid, solver matrices etc..
1505 C | |
1506 C | |-INI_PARMS :: Routine to set kernel model parameters.
1507 C | | :: By default kernel parameters are read from file
1508 C | | :: "data" in directory in which code executes.
1509 C | |
1510 C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
1511 C | |
1512 C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
1513 C | | | :: Grid arrays are held and described in GRID.h.
1514 C | | |
1515 C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
1516 C | | |
1517 C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
1518 C | | | :: (calculate grid from kernel parameters).
1519 C | | |
1520 C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
1521 C | | | :: initialization (calculate grid from
1522 C | | | :: kernel parameters).
1523 C | | |
1524 C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
1525 C | | :: grid initializations. ( input from raw
1526 C | | :: grid files, LONC.bin, DXF.bin etc... )
1527 C | |
1528 C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
1529 C | |
1530 C | |-INI_MASKS_ETC :: Derive horizontal and vertical cell fractions and
1531 C | | :: land masking for solid-fluid boundaries.
1532 C | |
1533 C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
1534 C | |
1535 C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1536 C | | :: sphere options are coded.
1537 C | |
1538 C | |-PACKAGES_BOOT :: Start up the optional package environment.
1539 C | | :: Runtime selection of active packages.
1540 C | |
1541 C | |-PACKAGES_READPARMS :: Call active package internal parameter load.
1542 C | | |
1543 C | | |-GMREDI_READPARMS :: GM Package. see pkg/gmredi
1544 C | | |-KPP_READPARMS :: KPP Package. see pkg/kpp
1545 C | | |-SHAP_FILT_READPARMS :: Shapiro filter package. see pkg/shap_filt
1546 C | | |-OBCS_READPARMS :: Open bndy package. see pkg/obcs
1547 C | | |-AIM_READPARMS :: Intermediate Atmos. package. see pkg/aim
1548 C | | |-COST_READPARMS :: Cost function package. see pkg/cost
1549 C | | |-CTRL_INIT :: Control vector support package. see pkg/ctrl
1550 C | | |-OPTIM_READPARMS :: Optimisation support package. see pkg/ctrl
1551 C | | |-GRDCHK_READPARMS :: Gradient check package. see pkg/grdchk
1552 C | | |-ECCO_READPARMS :: ECCO Support Package. see pkg/ecco
1553 C | | |-PTRACERS_READPARMS :: multiple tracer package, see pkg/ptracers
1554 C | | |-GCHEM_READPARMS :: tracer interface package, see pkg/gchem
1555 C | |
1556 C | |-PACKAGES_CHECK
1557 C | | |
1558 C | | |-KPP_CHECK :: KPP Package. pkg/kpp
1559 C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
1560 C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
1561 C | |
1562 C | |-PACKAGES_INIT_FIXED
1563 C | | |-OBCS_INIT_FIXED :: Open bndy Package. see pkg/obcs
1564 C | | |-FLT_INIT :: Floats Package. see pkg/flt
1565 C | | |-GCHEM_INIT_FIXED :: tracer interface package, see pkg/gchem
1566 C | |
1567 C | |-ZONAL_FILT_INIT :: FFT filter Package. see pkg/zonal_filt
1568 C | |
1569 C | |-INI_CG2D :: 2d con. grad solver initialization.
1570 C | |
1571 C | |-INI_CG3D :: 3d con. grad solver initialization.
1572 C | |
1573 C | |-CONFIG_SUMMARY :: Provide synopsis of kernel setup.
1574 C | :: Includes annotated table of kernel
1575 C | :: parameter settings.
1576 C |
1577 C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
1578 C |
1579 C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
1580 C ! :: Automatically generated by TAMC/TAF.
1581 C |
1582 C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
1583 C |
1584 C |-GRDCHK_MAIN :: Gradient check package. see pkg/grdchk
1585 C |
1586 C |-THE_MAIN_LOOP :: Main timestepping loop routine.
1587 C | |
1588 C | |-INITIALISE_VARIA :: Set the initial conditions for time evolving
1589 C | | | :: variables
1590 C | | |
1591 C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
1592 C | | |
1593 C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
1594 C | | | :: sphere options are coded.
1595 C | | |
1596 C | | |-INI_CG2D :: 2d con. grad solver initialization.
1597 C | | |-INI_CG3D :: 3d con. grad solver initialization.
1598 C | | |-INI_MIXING :: Initialize diapycnal diffusivity.
1599 C | | |-INI_DYNVARS :: Initialize to zero all DYNVARS.h arrays (dynamical
1600 C | | | :: fields).
1601 C | | |
1602 C | | |-INI_FIELDS :: Control initializing model fields to non-zero
1603 C | | | |-INI_VEL :: Initialize 3D flow field.
1604 C | | | |-INI_THETA :: Set model initial temperature field.
1605 C | | | |-INI_SALT :: Set model initial salinity field.
1606 C | | | |-INI_PSURF :: Set model initial free-surface height/pressure.
1607 C | | | |-INI_PRESSURE :: Compute model initial hydrostatic pressure
1608 C | | | |-READ_CHECKPOINT :: Read the checkpoint
1609 C | | |
1610 C | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1611 C | | | | :: Here applied to move restart conditions
1612 C | | | | :: (saved in mid timestep) to correct level in
1613 C | | | | :: time (only used for pre-c35).
1614 C | | | |
1615 C | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1616 C | | | |-CORRECTION_STEP :: Pressure correction to momentum
1617 C | | | |-CYCLE_TRACER :: Move tracers forward in time.
1618 C | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1619 C | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1620 C | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1621 C | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1622 C | | | | |-FIND_RHO :: Find adjacent densities.
1623 C | | | | |-CONVECT :: Mix static instability.
1624 C | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1625 C | | | |
1626 C | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1627 C | | |
1628 C | | |-CONVECTIVE_ADJUSTMENT_INI :: Control static instability mixing
1629 C | | | | :: Extra time history interactions.
1630 C | | | |
1631 C | | | |-FIND_RHO :: Find adjacent densities.
1632 C | | | |-CONVECT :: Mix static instability.
1633 C | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1634 C | | |
1635 C | | |-PACKAGES_INIT_VARIABLES :: Does initialization of time evolving
1636 C | | | | :: package data.
1637 C | | | |
1638 C | | | |-GMREDI_INIT :: GM package. ( see pkg/gmredi )
1639 C | | | |-KPP_INIT :: KPP package. ( see pkg/kpp )
1640 C | | | |-KPP_OPEN_DIAGS
1641 C | | | |-OBCS_INIT_VARIABLES :: Open bndy. package. ( see pkg/obcs )
1642 C | | | |-PTRACERS_INIT :: multi. tracer package,(see pkg/ptracers)
1643 C | | | |-GCHEM_INIT :: tracer interface pkg (see pkg/gchem)
1644 C | | | |-AIM_INIT :: Interm. atmos package. ( see pkg/aim )
1645 C | | | |-CTRL_MAP_INI :: Control vector package.( see pkg/ctrl )
1646 C | | | |-COST_INIT :: Cost function package. ( see pkg/cost )
1647 C | | | |-ECCO_INIT :: ECCO support package. ( see pkg/ecco )
1648 C | | | |-INI_FORCING :: Set model initial forcing fields.
1649 C | | | | :: Either set in-line or from file as shown.
1650 C | | | |-READ_FLD_XY_RS(zonalWindFile)
1651 C | | | |-READ_FLD_XY_RS(meridWindFile)
1652 C | | | |-READ_FLD_XY_RS(surfQFile)
1653 C | | | |-READ_FLD_XY_RS(EmPmRfile)
1654 C | | | |-READ_FLD_XY_RS(thetaClimFile)
1655 C | | | |-READ_FLD_XY_RS(saltClimFile)
1656 C | | | |-READ_FLD_XY_RS(surfQswFile)
1657 C | | |
1658 C | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1659 C | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1660 C | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1661 C | | |-STATE_SUMMARY :: Summarize model prognostic variables.
1662 C | | |-TIMEAVE_STATVARS :: Time averaging package ( see pkg/timeave ).
1663 C | |
1664 C | |-WRITE_STATE :: Controlling routine for IO to dump model state.
1665 C | | |-WRITE_REC_XYZ_RL :: Single file I/O
1666 C | | |-WRITE_FLD_XYZ_RL :: Multi-file I/O
1667 C | |
1668 C | |-MONITOR :: Monitor state ( see pkg/monitor )
1669 C | |-CTRL_MAP_FORCING :: Control vector support package. ( see pkg/ctrl )
1670 C====|>|
1671 C====|>| ****************************
1672 C====|>| BEGIN MAIN TIMESTEPPING LOOP
1673 C====|>| ****************************
1674 C====|>|
1675 C/\ | |-FORWARD_STEP :: Step forward a time-step ( AT LAST !!! )
1676 C/\ | | |
1677 C/\ | | |-DUMMY_IN_STEPPING :: autodiff package ( pkg/autodiff ).
1678 C/\ | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1679 C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
1680 C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
1681 C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
1682 C/\ | | | | :: Simple interpolation between end-points
1683 C/\ | | | | :: for forcing datasets.
1684 C/\ | | | |
1685 C/\ | | | |-EXCH :: Sync forcing. in overlap regions.
1686 C/\ | | |-SEAICE_MODEL :: Compute sea-ice terms. ( pkg/seaice )
1687 C/\ | | |-FREEZE :: Limit surface temperature.
1688 C/\ | | |-GCHEM_FIELD_LOAD :: load tracer forcing fields (pkg/gchem)
1689 C/\ | | |
1690 C/\ | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
1691 C/\ | | | |
1692 C/\ | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
1693 C/\ | | | |-OBCS_APPLY_W :: Open bndy. package ( see pkg/obcs ).
1694 C/\ | | | |-FIND_RHO :: Calculates [rho(S,T,z)-RhoConst] of a slice
1695 C/\ | | | |-GRAD_SIGMA :: Calculate isoneutral gradients
1696 C/\ | | | |-CALC_IVDC :: Set Implicit Vertical Diffusivity for Convection
1697 C/\ | | | |
1698 C/\ | | | |-OBCS_CALC :: Open bndy. package ( see pkg/obcs ).
1699 C/\ | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
1700 C/\ | | | | | :: forcing terms.
1701 C/\ | | | | |-PTRACERS_FORCING_SURF :: Tracer package ( see pkg/ptracers ).
1702 C/\ | | | |
1703 C/\ | | | |-GMREDI_CALC_TENSOR :: GM package ( see pkg/gmredi ).
1704 C/\ | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
1705 C/\ | | | |-KPP_CALC :: KPP package ( see pkg/kpp ).
1706 C/\ | | | |-KPP_CALC_DUMMY :: KPP package ( see pkg/kpp ).
1707 C/\ | | | |-AIM_DO_ATMOS_PHYSICS :: Intermed. atmos package ( see pkg/aim ).
1708 C/\ | | | |-GAD_ADVECTION :: Generalised advection driver (multi-dim
1709 C/\ | | | | advection case) (see pkg/gad).
1710 C/\ | | | |-CALC_COMMON_FACTORS :: Calculate common data (such as volume flux)
1711 C/\ | | | |-CALC_DIFFUSIVITY :: Calculate net vertical diffusivity
1712 C/\ | | | | |
1713 C/\ | | | | |-GMREDI_CALC_DIFF :: GM package ( see pkg/gmredi ).
1714 C/\ | | | | |-KPP_CALC_DIFF :: KPP package ( see pkg/kpp ).
1715 C/\ | | | |
1716 C/\ | | | |-CALC_GT :: Calculate the temperature tendency terms
1717 C/\ | | | | |
1718 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1719 C/\ | | | | | | :: ( see pkg/gad )
1720 C/\ | | | | | |-KPP_TRANSPORT_T :: KPP non-local transport ( see pkg/kpp ).
1721 C/\ | | | | |
1722 C/\ | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
1723 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1724 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
1725 C/\ | | | |
1726 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1727 C/\ | | | |
1728 C/\ | | | |-CALC_GS :: Calculate the salinity tendency terms
1729 C/\ | | | | |
1730 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1731 C/\ | | | | | | :: ( see pkg/gad )
1732 C/\ | | | | | |-KPP_TRANSPORT_S :: KPP non-local transport ( see pkg/kpp ).
1733 C/\ | | | | |
1734 C/\ | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
1735 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1736 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1737 C/\ | | | |
1738 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1739 C/\ | | | |
1740 C/\ | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1741 C/\ | | | |
1742 C/\ | | | |-PTRACERS_INTEGRATE :: Integrate other tracer(s) (see pkg/ptracers).
1743 C/\ | | | | |
1744 C/\ | | | | |-GAD_CALC_RHS :: Generalised advection package
1745 C/\ | | | | | | :: ( see pkg/gad )
1746 C/\ | | | | | |-KPP_TRANSPORT_PTR:: KPP non-local transport ( see pkg/kpp ).
1747 C/\ | | | | |
1748 C/\ | | | | |-PTRACERS_FORCING :: Problem specific forcing for tracer.
1749 C/\ | | | | |-GCHEM_FORCING_INT :: tracer forcing for gchem pkg (if all
1750 C/\ | | | | | tendency terms calculated together)
1751 C/\ | | | | |-ADAMS_BASHFORTH2 :: Extrapolate tendencies forward in time.
1752 C/\ | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
1753 C/\ | | | | |-TIMESTEP_TRACER :: Step tracer field forward in time
1754 C/\ | | | |
1755 C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1756 C/\ | | | |
1757 C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1758 C/\ | | | |-OBCS_APPLY_TS :: Open bndy. package (see pkg/obcs ).
1759 C/\ | | | |
1760 C/\ | | | |-AIM_AIM2DYN_EXCHANGES :: Intermed. atmos (see pkg/aim).
1761 C/\ | | | |-EXCH :: Update overlaps
1762 C/\ | | |
1763 C/\ | | |-DYNAMICS :: Momentum equations driver.
1764 C/\ | | | |
1765 C/\ | | | |-CALC_GRAD_PHI_SURF :: Calculate the gradient of the surface
1766 C/\ | | | | Potential anomaly.
1767 C/\ | | | |-CALC_VISCOSITY :: Calculate net vertical viscosity
1768 C/\ | | | | |-KPP_CALC_VISC :: KPP package ( see pkg/kpp ).
1769 C/\ | | | |
1770 C/\ | | | |-CALC_PHI_HYD :: Integrate the hydrostatic relation.
1771 C/\ | | | |-MOM_FLUXFORM :: Flux form mom eqn. package ( see
1772 C/\ | | | | pkg/mom_fluxform ).
1773 C/\ | | | |-MOM_VECINV :: Vector invariant form mom eqn. package ( see
1774 C/\ | | | | pkg/mom_vecinv ).
1775 C/\ | | | |-TIMESTEP :: Step momentum fields forward in time
1776 C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1777 C/\ | | | |
1778 C/\ | | | |-IMPLDIFF :: Solve vertical implicit diffusion equation.
1779 C/\ | | | |-OBCS_APPLY_UV :: Open bndy. package (see pkg/obcs ).
1780 C/\ | | | |
1781 C/\ | | | |-TIMEAVE_CUMUL_1T :: Time averaging package ( see pkg/timeave ).
1782 C/\ | | | |-TIMEAVE_CUMULATE :: Time averaging package ( see pkg/timeave ).
1783 C/\ | | | |-DEBUG_STATS_RL :: Quick debug package ( see pkg/debug ).
1784 C/\ | | |
1785 C/\ | | |-CALC_GW :: vert. momentum tendency terms ( NH, QH only ).
1786 C/\ | | |
1787 C/\ | | |-UPDATE_SURF_DR :: Update the surface-level thickness fraction.
1788 C/\ | | |
1789 C/\ | | |-UPDATE_CG2D :: Update 2d conjugate grad. for Free-Surf.
1790 C/\ | | |
1791 C/\ | | |-SOLVE_FOR_PRESSURE :: Find surface pressure.
1792 C/\ | | | |-CALC_DIV_GHAT :: Form the RHS of the surface pressure eqn.
1793 C/\ | | | |-CG2D :: Two-dim pre-con. conjugate-gradient.
1794 C/\ | | | |-CG3D :: Three-dim pre-con. conjugate-gradient solver.
1795 C/\ | | |
1796 C/\ | | |-THE_CORRECTION_STEP :: Step forward to next time step.
1797 C/\ | | | |
1798 C/\ | | | |-CALC_GRAD_PHI_SURF :: Return DDx and DDy of surface pressure
1799 C/\ | | | |-CORRECTION_STEP :: Pressure correction to momentum
1800 C/\ | | | |-CYCLE_TRACER :: Move tracers forward in time.
1801 C/\ | | | |-OBCS_APPLY :: Open bndy package. see pkg/obcs
1802 C/\ | | | |-SHAP_FILT_APPLY :: Shapiro filter package. see pkg/shap_filt
1803 C/\ | | | |-ZONAL_FILT_APPLY :: FFT filter package. see pkg/zonal_filt
1804 C/\ | | | |-CONVECTIVE_ADJUSTMENT :: Control static instability mixing.
1805 C/\ | | | | |-FIND_RHO :: Find adjacent densities.
1806 C/\ | | | | |-CONVECT :: Mix static instability.
1807 C/\ | | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
1808 C/\ | | | |
1809 C/\ | | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
1810 C/\ | | |
1811 C/\ | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
1812 C/\ | | | |-EXCH
1813 C/\ | | |
1814 C/\ | | |-GCHEM_FORCING_SEP :: tracer forcing for gchem pkg (if
1815 C/\ | | | tracer dependent tendencies calculated
1816 C/\ | | | separately)
1817 C/\ | | |
1818 C/\ | | |-FLT_MAIN :: Float package ( pkg/flt ).
1819 C/\ | | |
1820 C/\ | | |-MONITOR :: Monitor package ( pkg/monitor ).
1821 C/\ | | |
1822 C/\ | | |-DO_THE_MODEL_IO :: Standard diagnostic I/O.
1823 C/\ | | | |-WRITE_STATE :: Core state I/O
1824 C/\ | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
1825 C/\ | | | |-AIM_WRITE_DIAGS :: Intermed. atmos diags. see pkg/aim
1826 C/\ | | | |-GMREDI_DIAGS :: GM diags. see pkg/gmredi
1827 C/\ | | | |-KPP_DO_DIAGS :: KPP diags. see pkg/kpp
1828 C/\ | | | |-SBO_CALC :: SBO diags. see pkg/sbo
1829 C/\ | | | |-SBO_DIAGS :: SBO diags. see pkg/sbo
1830 C/\ | | | |-SEAICE_DO_DIAGS :: SEAICE diags. see pkg/seaice
1831 C/\ | | | |-GCHEM_DIAGS :: gchem diags. see pkg/gchem
1832 C/\ | | |
1833 C/\ | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
1834 C/\ | |
1835 C/\ | |-COST_TILE :: Cost function package. ( see pkg/cost )
1836 C<===|=|
1837 C<===|=| **************************
1838 C<===|=| END MAIN TIMESTEPPING LOOP
1839 C<===|=| **************************
1840 C<===|=|
1841 C | |-COST_FINAL :: Cost function package. ( see pkg/cost )
1842 C |
1843 C |-WRITE_CHECKPOINT :: Final state storage, for restart.
1844 C |
1845 C |-TIMER_PRINTALL :: Computational timing summary
1846 C |
1847 C |-COMM_STATS :: Summarize inter-proc and inter-thread communication
1848 C :: events.
1849 C
1850 \end{verbatim}
1851 }
1852
1853 \subsection{Measuring and Characterizing Performance}
1854
1855 TO BE DONE (CNH)
1856
1857 \subsection{Estimating Resource Requirements}
1858
1859 TO BE DONE (CNH)
1860
1861 \subsubsection{Atlantic 1/6 degree example}
1862 \subsubsection{Dry Run testing}
1863 \subsubsection{Adjoint Resource Requirements}
1864 \subsubsection{State Estimation Environment Resources}
1865
