| 1 |
% $Header$ |
% $Header$ |
| 2 |
|
|
| 3 |
In this chapter we describe the software architecture and |
This chapter focuses on describing the {\bf WRAPPER} environment within which |
| 4 |
implementation strategy for the MITgcm code. The first part of this |
both the core numerics and the pluggable packages operate. The description |
| 5 |
chapter discusses the MITgcm architecture at an abstract level. In the second |
presented here is intended to be a detailed exposition and contains significant |
| 6 |
part of the chapter we described practical details of the MITgcm implementation |
background material, as well as advanced details on working with the WRAPPER. |
| 7 |
and of current tools and operating system features that are employed. |
The tutorial sections of this manual (see sections |
| 8 |
|
\ref{sect:tutorials} and \ref{sect:tutorialIII}) |
| 9 |
|
contain more succinct, step-by-step instructions on running basic numerical |
| 10 |
|
experiments of various types, both sequentially and in parallel. For many |
| 11 |
|
projects simply starting from an example code and adapting it to suit a |
| 12 |
|
particular situation |
| 13 |
|
will be all that is required. |
| 14 |
|
The first part of this chapter discusses the MITgcm architecture at an |
| 15 |
|
abstract level. In the second part of the chapter we describe practical |
| 16 |
|
details of the MITgcm implementation and of current tools and operating system |
| 17 |
|
features that are employed. |
| 18 |
|
|
| 19 |
\section{Overall architectural goals} |
\section{Overall architectural goals} |
| 20 |
|
|
| 38 |
|
|
| 39 |
\begin{enumerate} |
\begin{enumerate} |
| 40 |
\item A core set of numerical and support code. This is discussed in detail in |
\item A core set of numerical and support code. This is discussed in detail in |
| 41 |
section \ref{sec:partII}. |
section \ref{sect:partII}. |
| 42 |
\item A scheme for supporting optional "pluggable" {\bf packages} (containing |
\item A scheme for supporting optional "pluggable" {\bf packages} (containing |
| 43 |
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics). |
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics). |
| 44 |
These packages are used both to overlay alternate dynamics and to introduce |
These packages are used both to overlay alternate dynamics and to introduce |
| 51 |
|
|
| 52 |
This chapter focuses on describing the {\bf WRAPPER} environment under which |
This chapter focuses on describing the {\bf WRAPPER} environment under which |
| 53 |
both the core numerics and the pluggable packages function. The description |
both the core numerics and the pluggable packages function. The description |
| 54 |
presented here is intended to be a detailed exposistion and contains significant |
presented here is intended to be a detailed exposition and contains significant |
| 55 |
background material, as well as advanced details on working with the WRAPPER. |
background material, as well as advanced details on working with the WRAPPER. |
| 56 |
The examples section of this manual (part \ref{part:example}) contains more |
The examples section of this manual (part \ref{part:example}) contains more |
| 57 |
succinct, step-by-step instructions on running basic numerical |
succinct, step-by-step instructions on running basic numerical |
| 76 |
\end{figure} |
\end{figure} |
| 77 |
|
|
| 78 |
\section{WRAPPER} |
\section{WRAPPER} |
| 79 |
|
\begin{rawhtml} |
| 80 |
|
<!-- CMIREDIR:wrapper: --> |
| 81 |
|
\end{rawhtml} |
| 82 |
|
|
| 83 |
A significant element of the software architecture utilized in |
A significant element of the software architecture utilized in |
| 84 |
MITgcm is a software superstructure and substructure collectively |
MITgcm is a software superstructure and substructure collectively |
| 87 |
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within |
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within |
| 88 |
the WRAPPER means that coding has to follow certain, relatively |
the WRAPPER means that coding has to follow certain, relatively |
| 89 |
straightforward, rules and conventions ( these are discussed further in |
straightforward, rules and conventions ( these are discussed further in |
| 90 |
section \ref{sec:specifying_a_decomposition} ). |
section \ref{sect:specifying_a_decomposition} ). |
| 91 |
|
|
| 92 |
The approach taken by the WRAPPER is illustrated in figure |
The approach taken by the WRAPPER is illustrated in figure |
| 93 |
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code |
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code |
| 100 |
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}} |
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}} |
| 101 |
\end{center} |
\end{center} |
| 102 |
\caption{ |
\caption{ |
| 103 |
Numerical code is written too fit within a software support |
Numerical code is written to fit within a software support |
| 104 |
infrastructure called WRAPPER. The WRAPPER is portable and |
infrastructure called WRAPPER. The WRAPPER is portable and |
| 105 |
can be sepcialized for a wide range of specific target hardware and |
can be specialized for a wide range of specific target hardware and |
| 106 |
programming environments, without impacting numerical code that fits |
programming environments, without impacting numerical code that fits |
| 107 |
within the WRAPPER. Codes that fit within the WRAPPER can generally be |
within the WRAPPER. Codes that fit within the WRAPPER can generally be |
| 108 |
made to run as fast on a particular platform as codes specially |
made to run as fast on a particular platform as codes specially |
| 111 |
\end{figure} |
\end{figure} |
| 112 |
|
|
| 113 |
\subsection{Target hardware} |
\subsection{Target hardware} |
| 114 |
\label{sec:target_hardware} |
\label{sect:target_hardware} |
| 115 |
|
|
| 116 |
The WRAPPER is designed to target as broad as possible a range of computer |
The WRAPPER is designed to target as broad as possible a range of computer |
| 117 |
systems. The original development of the WRAPPER took place on a |
systems. The original development of the WRAPPER took place on a |
| 123 |
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also |
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also |
| 124 |
been undertaken on x86 cluster systems, Alpha processor based clustered SMP |
been undertaken on x86 cluster systems, Alpha processor based clustered SMP |
| 125 |
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics. |
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics. |
| 126 |
The MITgcm code, operating within the WRAPPER, is also used routinely used on |
The MITgcm code, operating within the WRAPPER, is also routinely used on |
| 127 |
large scale MPP systems (for example T3E systems and IBM SP systems). In all |
large scale MPP systems (for example T3E systems and IBM SP systems). In all |
| 128 |
cases numerical code, operating within the WRAPPER, performs and scales very |
cases numerical code, operating within the WRAPPER, performs and scales very |
| 129 |
competitively with equivalent numerical code that has been modified to contain |
competitively with equivalent numerical code that has been modified to contain |
| 131 |
|
|
| 132 |
\subsection{Supporting hardware neutrality} |
\subsection{Supporting hardware neutrality} |
| 133 |
|
|
| 134 |
The different systems listed in section \ref{sec:target_hardware} can be |
The different systems listed in section \ref{sect:target_hardware} can be |
| 135 |
categorized in many different ways. For example, one common distinction is |
categorized in many different ways. For example, one common distinction is |
| 136 |
between shared-memory parallel systems (SMP's, PVP's) and distributed memory |
between shared-memory parallel systems (SMP's, PVP's) and distributed memory |
| 137 |
parallel systems (for example x86 clusters and large MPP systems). This is one |
parallel systems (for example x86 clusters and large MPP systems). This is one |
| 149 |
class of machines (for example Parallel Vector Processor Systems). Instead the |
class of machines (for example Parallel Vector Processor Systems). Instead the |
| 150 |
WRAPPER provides applications with an |
WRAPPER provides applications with an |
| 151 |
abstract {\it machine model}. The machine model is very general, however, it can |
abstract {\it machine model}. The machine model is very general; however, it can |
| 152 |
easily be specialized to fit, in a computationally effificent manner, any |
easily be specialized to fit, in a computationally efficient manner, any |
| 153 |
computer architecture currently available to the scientific computing community. |
computer architecture currently available to the scientific computing community. |
| 154 |
|
|
| 155 |
\subsection{Machine model parallelism} |
\subsection{Machine model parallelism} |
| 156 |
|
\begin{rawhtml} |
| 157 |
|
<!-- CMIREDIR:domain_decomp: --> |
| 158 |
|
\end{rawhtml} |
| 159 |
|
|
| 160 |
Codes operating under the WRAPPER target an abstract machine that is assumed to |
Codes operating under the WRAPPER target an abstract machine that is assumed to |
| 161 |
consist of one or more logical processors that can compute concurrently. |
consist of one or more logical processors that can compute concurrently. |
| 162 |
Computational work is divided amongst the logical |
Computational work is divided among the logical |
| 163 |
processors by allocating ``ownership'' to |
processors by allocating ``ownership'' to |
| 164 |
each processor of a certain set (or sets) of calculations. Each set of |
each processor of a certain set (or sets) of calculations. Each set of |
| 165 |
calculations owned by a particular processor is associated with a specific |
calculations owned by a particular processor is associated with a specific |
| 182 |
space allocated to a particular logical processor, there will be data |
space allocated to a particular logical processor, there will be data |
| 183 |
structures (arrays, scalar variables etc...) that hold the simulated state of |
structures (arrays, scalar variables etc...) that hold the simulated state of |
| 184 |
that region. We refer to these data structures as being {\bf owned} by the |
that region. We refer to these data structures as being {\bf owned} by the |
| 185 |
pprocessor to which their |
processor to which their |
| 186 |
associated region of physical space has been allocated. Individual |
associated region of physical space has been allocated. Individual |
| 187 |
regions that are allocated to processors are called {\bf tiles}. A |
regions that are allocated to processors are called {\bf tiles}. A |
| 188 |
processor can own more |
processor can own more |
| 227 |
whenever it requires values that outside the domain it owns. Periodically |
whenever it requires values that lie outside the domain it owns. Periodically |
| 228 |
processors will make calls to WRAPPER functions to communicate data between |
processors will make calls to WRAPPER functions to communicate data between |
| 229 |
tiles, in order to keep the overlap regions up to date (see section |
tiles, in order to keep the overlap regions up to date (see section |
| 230 |
\ref{sec:communication_primitives}). The WRAPPER functions can use a |
\ref{sect:communication_primitives}). The WRAPPER functions can use a |
| 231 |
variety of different mechanisms to communicate data between tiles. |
variety of different mechanisms to communicate data between tiles. |
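The idea of keeping overlap regions up to date can be illustrated with a minimal Python sketch. It is not WRAPPER code: the names {\em make\_tile}, {\em exchange} and the overlap width {\em OLX = 1} are assumptions chosen for illustration, and the two tiles are one-dimensional lists held in a single address space.

```python
# Illustrative only: two 1-D tiles, each with a one-point overlap (halo)
# on either side of its interior. All names here are hypothetical.
OLX = 1  # assumed overlap width


def make_tile(interior):
    """Return a tile: empty halo cells (None) around the owned interior."""
    return [None] * OLX + list(interior) + [None] * OLX


def exchange(west, east):
    """Copy edge interior values of each tile into the neighbor's halo."""
    # East halo of the west tile <- first interior points of the east tile.
    west[-OLX:] = east[OLX:2 * OLX]
    # West halo of the east tile <- last interior points of the west tile.
    east[:OLX] = west[-2 * OLX:-OLX]


west_tile = make_tile([1.0, 2.0, 3.0])
east_tile = make_tile([4.0, 5.0, 6.0])
exchange(west_tile, east_tile)
```

After the call, each tile can read its neighbor's edge value from its own overlap cell without touching data it does not own. The outer overlap cells stay empty because this sketch has no further neighbors.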
| 232 |
|
|
| 233 |
\begin{figure} |
\begin{figure} |
| 314 |
\end{figure} |
\end{figure} |
| 315 |
|
|
| 316 |
\subsection{Shared memory communication} |
\subsection{Shared memory communication} |
| 317 |
\label{sec:shared_memory_communication} |
\label{sect:shared_memory_communication} |
| 318 |
|
|
| 319 |
Under shared communication independent CPU's are operating |
Under shared communication independent CPU's are operating |
| 320 |
on the exact same global address space at the application level. |
on the exact same global address space at the application level. |
| 340 |
communication very efficient provided it is used appropriately. |
communication very efficient provided it is used appropriately. |
| 341 |
|
|
| 342 |
\subsubsection{Memory consistency} |
\subsubsection{Memory consistency} |
| 343 |
\label{sec:memory_consistency} |
\label{sect:memory_consistency} |
| 344 |
|
|
| 345 |
When using shared memory communication between |
When using shared memory communication between |
| 346 |
multiple processors the WRAPPER level shields user applications from |
multiple processors the WRAPPER level shields user applications from |
| 364 |
ensure memory consistency for a particular platform. |
ensure memory consistency for a particular platform. |
| 365 |
|
|
| 366 |
\subsubsection{Cache effects and false sharing} |
\subsubsection{Cache effects and false sharing} |
| 367 |
\label{sec:cache_effects_and_false_sharing} |
\label{sect:cache_effects_and_false_sharing} |
| 368 |
|
|
| 369 |
Shared-memory machines often have local to processor memory caches |
Shared-memory machines often have local to processor memory caches |
| 370 |
which contain mirrored copies of main memory. Automatic cache-coherence |
which contain mirrored copies of main memory. Automatic cache-coherence |
| 383 |
threads operating within a single process is the standard mechanism for |
threads operating within a single process is the standard mechanism for |
| 384 |
supporting shared memory that the WRAPPER utilizes. Configuring and launching |
supporting shared memory that the WRAPPER utilizes. Configuring and launching |
| 385 |
code to run in multi-threaded mode on specific platforms is discussed in |
code to run in multi-threaded mode on specific platforms is discussed in |
| 386 |
section \ref{sec:running_with_threads}. However, on many systems, potentially |
section \ref{sect:running_with_threads}. However, on many systems, potentially |
| 387 |
very efficient mechanisms for using shared memory communication between |
very efficient mechanisms for using shared memory communication between |
| 388 |
multiple processes (in contrast to multiple threads within a single |
multiple processes (in contrast to multiple threads within a single |
| 389 |
process) also exist. In most cases this works by making a limited region of |
process) also exist. In most cases this works by making a limited region of |
| 396 |
nature. |
nature. |
| 397 |
|
|
| 398 |
\subsection{Distributed memory communication} |
\subsection{Distributed memory communication} |
| 399 |
\label{sec:distributed_memory_communication} |
\label{sect:distributed_memory_communication} |
| 400 |
Many parallel systems are not constructed in a way where it is |
Many parallel systems are not constructed in a way where it is |
| 401 |
possible or practical for an application to use shared memory |
possible or practical for an application to use shared memory |
| 402 |
for communication. For example cluster systems consist of individual computers |
for communication. For example cluster systems consist of individual computers |
| 410 |
highly optimized library. |
highly optimized library. |
| 411 |
|
|
| 412 |
\subsection{Communication primitives} |
\subsection{Communication primitives} |
| 413 |
\label{sec:communication_primitives} |
\label{sect:communication_primitives} |
| 414 |
|
|
| 415 |
\begin{figure} |
\begin{figure} |
| 416 |
\begin{center} |
\begin{center} |
| 418 |
\includegraphics{part4/comm-primm.eps} |
\includegraphics{part4/comm-primm.eps} |
| 419 |
} |
} |
| 420 |
\end{center} |
\end{center} |
| 421 |
\caption{Three performance critical parallel primititives are provided |
\caption{Three performance critical parallel primitives are provided |
| 422 |
by the WRAPPER. These primititives are always used to communicate data |
by the WRAPPER. These primitives are always used to communicate data |
| 423 |
between tiles. The figure shows four tiles. The curved arrows indicate |
between tiles. The figure shows four tiles. The curved arrows indicate |
| 424 |
exchange primitives which transfer data between the overlap regions at tile |
exchange primitives which transfer data between the overlap regions at tile |
| 425 |
edges and interior regions for nearest-neighbor tiles. |
edges and interior regions for nearest-neighbor tiles. |
| 554 |
computing CPU's. |
computing CPU's. |
| 555 |
\end{enumerate} |
\end{enumerate} |
| 556 |
This section describes the details of each of these operations. |
This section describes the details of each of these operations. |
| 557 |
Section \ref{sec:specifying_a_decomposition} explains how the way in which |
Section \ref{sect:specifying_a_decomposition} explains how the way in which |
| 558 |
a domain is decomposed (or composed) is expressed. Section |
a domain is decomposed (or composed) is expressed. Section |
| 559 |
\ref{sec:starting_a_code} describes practical details of running codes |
\ref{sect:starting_a_code} describes practical details of running codes |
| 560 |
in various different parallel modes on contemporary computer systems. |
in various different parallel modes on contemporary computer systems. |
| 561 |
Section \ref{sec:controlling_communication} explains the internal information |
Section \ref{sect:controlling_communication} explains the internal information |
| 562 |
that the WRAPPER uses to control how information is communicated between |
that the WRAPPER uses to control how information is communicated between |
| 563 |
tiles. |
tiles. |
| 564 |
|
|
| 565 |
\subsection{Specifying a domain decomposition} |
\subsection{Specifying a domain decomposition} |
| 566 |
\label{sec:specifying_a_decomposition} |
\label{sect:specifying_a_decomposition} |
| 567 |
|
|
| 568 |
At its heart much of the WRAPPER works only in terms of a collection of tiles |
At its heart much of the WRAPPER works only in terms of a collection of tiles |
| 569 |
which are interconnected to each other. This is also true of application |
which are interconnected to each other. This is also true of application |
| 615 |
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are |
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are |
| 616 |
allocated to different threads of a process that are then bound to |
allocated to different threads of a process that are then bound to |
| 617 |
different physical processors ( see the multi-threaded |
different physical processors ( see the multi-threaded |
| 618 |
execution discussion in section \ref{sec:starting_the_code} ) then |
execution discussion in section \ref{sect:starting_the_code} ) then |
| 619 |
computation will be performed concurrently on each tile. However, it is also |
computation will be performed concurrently on each tile. However, it is also |
| 620 |
possible to run the same decomposition within a process running a single thread on |
possible to run the same decomposition within a process running a single thread on |
| 621 |
a single processor. In this case the tiles will be computed over sequentially. |
a single processor. In this case the tiles will be computed over sequentially. |
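The sequential case can be sketched in a few lines of Python. This is only an illustration under assumed values ({\em nSx = nSy = 2}); {\em compute\_tile} is a hypothetical stand-in for the work done on one tile and is not an MITgcm routine.

```python
# Illustrative only: one thread visiting every tile it owns, in order.
nSx, nSy = 2, 2  # tiles per process in x and y (example values)


def compute_tile(bi, bj):
    """Hypothetical stand-in for the computation on tile (bi, bj)."""
    return (bi, bj)


visited = []
for bj in range(1, nSy + 1):      # tile index in y, 1-based as in the text
    for bi in range(1, nSx + 1):  # tile index in x
        visited.append(compute_tile(bi, bj))
```

With several threads, the same loop bounds would instead be split so that each thread visits only its own subset of tiles.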
| 667 |
computation is performed concurrently over as many processes and threads |
computation is performed concurrently over as many processes and threads |
| 668 |
as there are physical processors available to compute. |
as there are physical processors available to compute. |
| 669 |
|
|
| 670 |
|
An exception to the use of {\em bi} and {\em bj} in loops arises in the |
| 671 |
|
exchange routines used when the exch2 package is used with the cubed |
| 672 |
|
sphere. In this case {\em bj} is generally set to 1 and the loop runs from |
| 673 |
|
1,{\em bi}. Within the loop {\em bi} is used to retrieve the tile number, |
| 674 |
|
which is then used to reference exchange parameters. |
| 675 |
|
|
| 676 |
The amount of computation that can be embedded |
The amount of computation that can be embedded |
| 677 |
a single loop over {\em bi} and {\em bj} varies for different parts of the |
a single loop over {\em bi} and {\em bj} varies for different parts of the |
| 678 |
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract |
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract |
| 793 |
forty grid points in y. The two sub-domains in each process will be computed |
forty grid points in y. The two sub-domains in each process will be computed |
| 794 |
sequentially if they are given to a single thread within a single process. |
sequentially if they are given to a single thread within a single process. |
| 795 |
Alternatively if the code is invoked with multiple threads per process |
Alternatively if the code is invoked with multiple threads per process |
| 796 |
the two domains in y may be computed on concurrently. |
the two domains in y may be computed concurrently. |
| 797 |
\item |
\item |
| 798 |
\begin{verbatim} |
\begin{verbatim} |
| 799 |
PARAMETER ( |
PARAMETER ( |
| 811 |
There are six tiles allocated to six separate logical processors ({\em nSx=6}). |
There are six tiles allocated to six separate logical processors ({\em nSx=6}). |
| 812 |
This set of values can be used for a cube sphere calculation. |
This set of values can be used for a cubed sphere calculation. |
| 813 |
Each tile of size $32 \times 32$ represents a face of the |
Each tile of size $32 \times 32$ represents a face of the |
| 814 |
cube. Initialising the tile connectivity correctly ( see section |
cube. Initializing the tile connectivity correctly ( see section |
| 815 |
\ref{sec:cube_sphere_communication}. allows the rotations associated with |
\ref{sect:cube_sphere_communication} ) allows the rotations associated with |
| 816 |
moving between the six cube faces to be embedded within the |
moving between the six cube faces to be embedded within the |
| 817 |
tile-tile communication code. |
tile-tile communication code. |
| 818 |
\end{enumerate} |
\end{enumerate} |
| 819 |
|
|
| 820 |
|
|
| 821 |
\subsection{Starting the code} |
\subsection{Starting the code} |
| 822 |
\label{sec:starting_the_code} |
\label{sect:starting_the_code} |
| 823 |
When code is started under the WRAPPER, execution begins in a main routine {\em |
When code is started under the WRAPPER, execution begins in a main routine {\em |
| 824 |
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred |
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred |
| 825 |
to the application through a routine called {\em THE\_MODEL\_MAIN()} |
to the application through a routine called {\em THE\_MODEL\_MAIN()} |
| 829 |
WRAPPER is shown in figure \ref{fig:wrapper_startup}. |
WRAPPER is shown in figure \ref{fig:wrapper_startup}. |
| 830 |
|
|
| 831 |
\begin{figure} |
\begin{figure} |
| 832 |
|
{\footnotesize |
| 833 |
\begin{verbatim} |
\begin{verbatim} |
| 834 |
|
|
| 835 |
MAIN |
MAIN |
| 858 |
|
|
| 859 |
|
|
| 860 |
\end{verbatim} |
\end{verbatim} |
| 861 |
|
} |
| 862 |
\caption{Main stages of the WRAPPER startup procedure. |
\caption{Main stages of the WRAPPER startup procedure. |
| 863 |
This process proceeds transfer of control to application code, which |
This process precedes transfer of control to application code, which |
| 864 |
occurs through the procedure {\em THE\_MODEL\_MAIN()}. |
occurs through the procedure {\em THE\_MODEL\_MAIN()}. |
| 866 |
\end{figure} |
\end{figure} |
| 867 |
|
|
| 868 |
\subsubsection{Multi-threaded execution} |
\subsubsection{Multi-threaded execution} |
| 869 |
\label{sec:multi-threaded-execution} |
\label{sect:multi-threaded-execution} |
| 870 |
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the |
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the |
| 871 |
WRAPPER may cause several coarse grain threads to be initialized. The routine |
WRAPPER may cause several coarse grain threads to be initialized. The routine |
| 872 |
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single |
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single |
| 873 |
stack argument which is the thread number, stored in the |
stack argument which is the thread number, stored in the |
| 874 |
variable {\em myThid}. In addition to specifying a decomposition with |
variable {\em myThid}. In addition to specifying a decomposition with |
| 875 |
multiple tiles per process ( see section \ref{sec:specifying_a_decomposition}) |
multiple tiles per process ( see section \ref{sect:specifying_a_decomposition}) |
| 876 |
configuring and starting a code to run using multiple threads requires the following |
configuring and starting a code to run using multiple threads requires the following |
| 877 |
steps.\\ |
steps.\\ |
| 878 |
|
|
| 941 |
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\ |
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\ |
| 942 |
File: {\em model/src/THE\_MODEL\_MAIN.F}\\ |
File: {\em model/src/THE\_MODEL\_MAIN.F}\\ |
| 943 |
File: {\em eesupp/src/MAIN.F}\\ |
File: {\em eesupp/src/MAIN.F}\\ |
| 944 |
File: {\em tools/genmake}\\ |
File: {\em tools/genmake2}\\ |
| 945 |
File: {\em eedata}\\ |
File: {\em eedata}\\ |
| 946 |
CPP: {\em TARGET\_SUN}\\ |
CPP: {\em TARGET\_SUN}\\ |
| 947 |
CPP: {\em TARGET\_DEC}\\ |
CPP: {\em TARGET\_DEC}\\ |
| 954 |
} \\ |
} \\ |
| 955 |
|
|
| 956 |
\subsubsection{Multi-process execution} |
\subsubsection{Multi-process execution} |
| 957 |
\label{sec:multi-process-execution} |
\label{sect:multi-process-execution} |
| 958 |
|
|
| 959 |
Despite its appealing programming model, multi-threaded execution remains |
Despite its appealing programming model, multi-threaded execution remains |
| 960 |
less common then multi-process execution. One major reason for this |
less common than multi-process execution. One major reason for this |
| 966 |
|
|
| 967 |
Multi-process execution is more ubiquitous. |
Multi-process execution is more ubiquitous. |
| 968 |
In order to run code in a multi-process configuration a decomposition |
In order to run code in a multi-process configuration a decomposition |
| 969 |
specification ( see section \ref{sec:specifying_a_decomposition}) |
specification ( see section \ref{sect:specifying_a_decomposition}) |
| 970 |
is given ( in which the at least one of the |
is given ( in which at least one of the |
| 971 |
parameters {\em nPx} or {\em nPy} will be greater than one) |
parameters {\em nPx} or {\em nPy} will be greater than one) |
| 972 |
and then, as for multi-threaded operation, |
and then, as for multi-threaded operation, |
| 980 |
of controlling and coordinating the start up of a large number |
of controlling and coordinating the start up of a large number |
| 981 |
(hundreds and possibly even thousands) of copies of the same |
(hundreds and possibly even thousands) of copies of the same |
| 982 |
program, MPI is used. The calls to the MPI multi-process startup |
program, MPI is used. The calls to the MPI multi-process startup |
| 983 |
routines must be activated at compile time. This is done |
routines must be activated at compile time. Currently MPI libraries are |
| 984 |
by setting the {\em ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} |
invoked by |
| 985 |
flags in the {\em CPP\_EEOPTIONS.h} file.\\ |
specifying the appropriate options file with the |
| 986 |
|
{\tt-of} flag when running the {\em genmake2} |
| 987 |
\fbox{ |
script, which generates the Makefile for compiling and linking MITgcm. |
| 988 |
\begin{minipage}{4.75in} |
(Previously this was done by setting the {\em ALLOW\_USE\_MPI} and |
| 989 |
File: {\em eesupp/inc/CPP\_EEOPTIONS.h}\\ |
{\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More |
| 990 |
CPP: {\em ALLOW\_USE\_MPI}\\ |
detailed information about the use of {\em genmake2} for specifying |
| 991 |
CPP: {\em ALWAYS\_USE\_MPI}\\ |
local compiler flags is located in section \ref{sect:genmake}.\\ |
|
Parameter: {\em nPx}\\ |
|
|
Parameter: {\em nPy} |
|
|
\end{minipage} |
|
|
} \\ |
|
| 992 |
|
|
|
Additionally, compile time options are required to link in the |
|
|
MPI libraries and header files. Examples of these options |
|
|
can be found in the {\em genmake} script that creates makefiles |
|
|
for compilation. When this script is executed with the {\bf -mpi} |
|
|
flag it will generate a makefile that includes |
|
|
paths for search for MPI head files and for linking in |
|
|
MPI libraries. For example the {\bf -mpi} flag on a |
|
|
Silicon Graphics IRIX system causes a |
|
|
Makefile with the compilation command |
|
|
Graphics IRIX system \begin{verbatim} |
|
|
mpif77 -I/usr/local/mpi/include -DALLOW_USE_MPI -DALWAYS_USE_MPI |
|
|
\end{verbatim} |
|
|
to be generated. |
|
|
This is the correct set of options for using the MPICH open-source |
|
|
version of MPI, when it has been installed under the subdirectory |
|
|
/usr/local/mpi. |
|
|
However, on many systems there may be several |
|
|
versions of MPI installed. For example many systems have both |
|
|
the open source MPICH set of libraries and a vendor specific native form |
|
|
of the MPI libraries. The correct setup to use will depend on the |
|
|
local configuration of your system.\\ |
|
| 993 |
|
|
| 994 |
\fbox{ |
\fbox{ |
| 995 |
\begin{minipage}{4.75in} |
\begin{minipage}{4.75in} |
| 996 |
File: {\em tools/genmake} |
Directory: {\em tools/build\_options}\\ |
| 997 |
|
File: {\em tools/genmake2} |
| 998 |
\end{minipage} |
\end{minipage} |
| 999 |
} \\ |
} \\ |
| 1000 |
\paragraph{\bf Execution} The mechanics of starting a program in |
\paragraph{\bf Execution} The mechanics of starting a program in |
| 1006 |
\begin{verbatim} |
\begin{verbatim} |
| 1007 |
mpirun -np 64 -machinefile mf ./mitgcmuv |
mpirun -np 64 -machinefile mf ./mitgcmuv |
| 1008 |
\end{verbatim} |
\end{verbatim} |
| 1009 |
In this example the text {\em -np 64} specifices the number of processes |
In this example the text {\em -np 64} specifies the number of processes |
| 1010 |
that will be created. The numeric value {\em 64} must be equal to the |
that will be created. The numeric value {\em 64} must be equal to the |
| 1011 |
product of the processor grid settings of {\em nPx} and {\em nPy} |
product of the processor grid settings of {\em nPx} and {\em nPy} |
| 1012 |
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file |
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file |
| 1013 |
called ``mf'' will be read to get a list of processor names on |
called ``mf'' will be read to get a list of processor names on |
| 1014 |
which the sixty-four processes will execute. The syntax of this file |
which the sixty-four processes will execute. The syntax of this file |
| 1015 |
is specified by the MPI distribution |
is specified by the MPI distribution. |
| 1016 |
\\ |
\\ |
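The constraint between the {\em -np} value and the processor grid can be checked before launching. The following Python sketch is an assumption-laden illustration, not part of MITgcm: the helper names are hypothetical, and the values {\em nPx = 8}, {\em nPy = 8} are example settings that happen to give sixty-four processes.

```python
# Illustrative only: verify that the requested process count matches the
# process grid (nPx, nPy) set in SIZE.h. Helper names are hypothetical.
def required_process_count(nPx, nPy):
    """Number of MPI processes implied by the process grid."""
    return nPx * nPy


def check_launch(np_requested, nPx, nPy):
    """Return the expected count, or raise ValueError on a mismatch."""
    expected = required_process_count(nPx, nPy)
    if np_requested != expected:
        raise ValueError(
            f"-np {np_requested} given, but nPx*nPy = {expected}"
        )
    return expected


launch_count = check_launch(64, nPx=8, nPy=8)
```

A mismatch (for example {\em -np 32} with the same grid) raises an error in this sketch. How a mismatched run actually behaves depends on the MPI distribution and the model startup code.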
| 1017 |
|
|
| 1018 |
\fbox{ |
\fbox{ |
| 1063 |
Allocation of processes to tiles in controlled by the routine |
Allocation of processes to tiles is controlled by the routine |
| 1064 |
{\em INI\_PROCS()}. For each process this routine sets |
{\em INI\_PROCS()}. For each process this routine sets |
| 1065 |
the variables {\em myXGlobalLo} and {\em myYGlobalLo}. |
the variables {\em myXGlobalLo} and {\em myYGlobalLo}. |
| 1066 |
These variables specify (in index space) the coordinate |
These variables specify in index space the coordinates |
| 1067 |
of the southern most and western most corner of the |
of the southernmost and westernmost corner of the |
| 1068 |
southern most and western most tile owned by this process. |
southernmost and westernmost tile owned by this process. |
| 1069 |
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} |
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} |
| 1070 |
are also set in this routine. These are used to identify |
are also set in this routine. These are used to identify |
| 1071 |
processes holding tiles to the west, east, south and north |
processes holding tiles to the west, east, south and north |
| 1072 |
of this process. These values are stored in global storage |
of this process. These values are stored in global storage |
| 1073 |
in the header file {\em EESUPPORT.h} for use by |
in the header file {\em EESUPPORT.h} for use by |
| 1074 |
communication routines. |
communication routines. The above does not hold when the |
| 1075 |
|
exch2 package is used -- exch2 sets its own parameters to |
| 1076 |
|
specify the global indices of tiles and their relationships |
| 1077 |
|
to each other. See the documentation on the exch2 package |
| 1078 |
|
(\ref{sec:exch2}) for |
| 1079 |
|
details. |
| 1080 |
\\ |
\\ |
| 1081 |
|
|
| 1082 |
\fbox{ |
\fbox{ |
| 1102 |
describes the information that is held and used. |
describes the information that is held and used. |
| 1103 |
|
|
| 1104 |
\begin{enumerate} |
\begin{enumerate} |
| 1105 |
\item {\bf Tile-tile connectivity information} For each tile the WRAPPER |
\item {\bf Tile-tile connectivity information} |
| 1106 |
sets a flag that sets the tile number to the north, south, east and |
For each tile the WRAPPER |
| 1107 |
|
sets a flag that sets the tile number to the north, |
| 1108 |
|
south, east and |
| 1109 |
west of that tile. This number is unique over all tiles in a |
west of that tile. This number is unique over all tiles in a |
| 1110 |
configuration. The number is held in the variables {\em tileNo} |
configuration. Except when using the cubed sphere and the exch2 package, |
| 1111 |
|
the number is held in the variables {\em tileNo} |
| 1112 |
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS}, |
(this holds the tile's own number), {\em tileNoN}, {\em tileNoS}, |
| 1113 |
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile |
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile |
| 1114 |
that specifies the type of communication that is used between tiles. |
that specifies the type of communication that is used between tiles. |
| 1117 |
This latter set of variables can take one of the following values |
This latter set of variables can take one of the following values |
| 1118 |
{\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}. |
{\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}. |
| 1119 |
A value of {\em COMM\_NONE} is used to indicate that a tile has no |
A value of {\em COMM\_NONE} is used to indicate that a tile has no |
| 1120 |
neighbor to cummnicate with on a particular face. A value |
neighbor to communicate with on a particular face. A value |
| 1121 |
of {\em COMM\_MSG} is used to indicate that some form of distributed |
of {\em COMM\_MSG} is used to indicate that some form of distributed |
| 1122 |
memory communication is required to communicate between |
memory communication is required to communicate between |
| 1123 |
these tile faces ( see section \ref{sec:distributed_memory_communication}). |
these tile faces ( see section \ref{sect:distributed_memory_communication}). |
| 1124 |
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate |
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate |
| 1125 |
forms of shared memory communication ( see section |
forms of shared memory communication ( see section |
| 1126 |
\ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value indicates |
\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates |
| 1127 |
that a CPU should communicate by writing to data structures owned by another |
that a CPU should communicate by writing to data structures owned by another |
| 1128 |
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading |
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading |
| 1129 |
from data structures owned by another CPU. These flags affect the behavior |
from data structures owned by another CPU. These flags affect the behavior |
| 1131 |
(see figure \ref{fig:communication_primitives}). The routine |
(see figure \ref{fig:communication_primitives}). The routine |
| 1132 |
{\em ini\_communication\_patterns()} is responsible for setting the |
{\em ini\_communication\_patterns()} is responsible for setting the |
| 1133 |
communication mode values for each tile. |
communication mode values for each tile. |
| 1134 |
\\ |
|
| 1135 |
|
When using the cubed sphere configuration with the exch2 package, the |
| 1136 |
|
relationships between tiles and their communication methods are set |
| 1137 |
|
by the package in other variables. See the exch2 package documentation |
| 1138 |
|
(\ref{sec:exch2}) for details. |
| 1139 |
|
|
| 1140 |
|
|
| 1141 |
|
|
| 1142 |
\fbox{ |
\fbox{ |
| 1143 |
\begin{minipage}{4.75in} |
\begin{minipage}{4.75in} |
| 1180 |
are read from the file {\em eedata}. If the value of {\em nThreads} |
are read from the file {\em eedata}. If the value of {\em nThreads} |
| 1181 |
is inconsistent with the number of threads requested from the |
is inconsistent with the number of threads requested from the |
| 1182 |
operating system (for example by using an environment |
operating system (for example by using an environment |
| 1183 |
varialble as described in section \ref{sec:multi_threaded_execution}) |
variable as described in section \ref{sect:multi_threaded_execution}) |
| 1184 |
then usually an error will be reported by the routine |
then usually an error will be reported by the routine |
| 1185 |
{\em CHECK\_THREADS}.\\ |
{\em CHECK\_THREADS}.\\ |
| 1186 |
|
|
| 1198 |
} |
} |
| 1199 |
|
|
| 1200 |
\item {\bf memsync flags} |
\item {\bf memsync flags} |
| 1201 |
As discussed in section \ref{sec:memory_consistency}, when using shared memory, |
As discussed in section \ref{sect:memory_consistency}, when using shared memory, |
| 1202 |
a low-level system function may be needed to force memory consistency. |
a low-level system function may be needed to force memory consistency. |
| 1203 |
The routine {\em MEMSYNC()} is used for this purpose. This routine should |
The routine {\em MEMSYNC()} is used for this purpose. This routine should |
| 1204 |
not need modifying and the information below is only provided for |
not need modifying and the information below is only provided for |
| 1214 |
\begin{verbatim} |
\begin{verbatim} |
| 1215 |
asm("membar #LoadStore|#StoreStore"); |
asm("membar #LoadStore|#StoreStore"); |
| 1216 |
\end{verbatim} |
\end{verbatim} |
| 1217 |
for an Alpha based sytem the euivalent code reads |
for an Alpha based system the equivalent code reads |
| 1218 |
\begin{verbatim} |
\begin{verbatim} |
| 1219 |
asm("mb"); |
asm("mb"); |
| 1220 |
\end{verbatim} |
\end{verbatim} |
| 1224 |
\end{verbatim} |
\end{verbatim} |
| 1225 |
|
|
| 1226 |
\item {\bf Cache line size} |
\item {\bf Cache line size} |
| 1227 |
As discussed in section \ref{sec:cache_effects_and_false_sharing}, |
As discussed in section \ref{sect:cache_effects_and_false_sharing}, |
| 1228 |
multi-threaded codes explicitly avoid penalties associated with excessive |
multi-threaded codes explicitly avoid penalties associated with excessive |
| 1229 |
coherence traffic on an SMP system. To do this the sgared memory data structures |
coherence traffic on an SMP system. To do this the shared memory data structures |
| 1230 |
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines |
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines |
| 1231 |
are padded. The variables that control the padding are set in the |
are padded. The variables that control the padding are set in the |
| 1232 |
header file {\em EEPARAMS.h}. These variables are called |
header file {\em EEPARAMS.h}. These variables are called |
| 1234 |
{\em lShare8}. The default values should not normally need changing. |
{\em lShare8}. The default values should not normally need changing. |
| 1235 |
\item {\bf \_BARRIER} |
\item {\bf \_BARRIER} |
| 1236 |
This is a CPP macro that is expanded to a call to a routine |
This is a CPP macro that is expanded to a call to a routine |
| 1237 |
which synchronises all the logical processors running under the |
which synchronizes all the logical processors running under the |
| 1238 |
WRAPPER. Using a macro here preserves flexibility to insert |
WRAPPER. Using a macro here preserves flexibility to insert |
| 1239 |
a specialized call in-line into application code. By default this |
a specialized call in-line into application code. By default this |
| 1240 |
resolves to calling the procedure {\em BARRIER()}. The default |
resolves to calling the procedure {\em BARRIER()}. The default |
| 1242 |
|
|
| 1243 |
\item {\bf \_GSUM} |
\item {\bf \_GSUM} |
| 1244 |
This is a CPP macro that is expanded to a call to a routine |
This is a CPP macro that is expanded to a call to a routine |
| 1245 |
which sums up a floating point numner |
which sums up a floating point number |
| 1246 |
over all the logical processors running under the |
over all the logical processors running under the |
| 1247 |
WRAPPER. Using a macro here provides extra flexibility to insert |
WRAPPER. Using a macro here provides extra flexibility to insert |
| 1248 |
a specialized call in-line into application code. By default this |
a specialized call in-line into application code. By default this |
| 1249 |
resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for |
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for |
| 1250 |
84=bit floating point operands) |
64-bit floating point operands) |
| 1251 |
or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default |
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default |
| 1252 |
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}. |
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}. |
| 1253 |
The \_GSUM macro is a performance critical operation, especially for |
The \_GSUM macro is a performance critical operation, especially for |
| 1254 |
large processor count, small tile size configurations. |
large processor count, small tile size configurations. |
| 1255 |
The custom communication example discussed in section \ref{sec:jam_example} |
The custom communication example discussed in section \ref{sect:jam_example} |
| 1256 |
shows how the macro is used to invoke a custom global sum routine |
shows how the macro is used to invoke a custom global sum routine |
| 1257 |
for a specific set of hardware. |
for a specific set of hardware. |
| 1258 |
|
|
| 1266 |
in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the |
in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the |
| 1267 |
\_EXCH operation plays a crucial role in scaling to small tile, |
\_EXCH operation plays a crucial role in scaling to small tile, |
| 1268 |
large logical and physical processor count configurations. |
large logical and physical processor count configurations. |
| 1269 |
The example in section \ref{sec:jam_example} discusses defining an |
The example in section \ref{sect:jam_example} discusses defining an |
| 1270 |
optimised and specialized form of the \_EXCH operation. |
optimized and specialized form of the \_EXCH operation. |
| 1271 |
|
|
| 1272 |
The \_EXCH operation is also central to supporting grids such as |
The \_EXCH operation is also central to supporting grids such as |
| 1273 |
the cube-sphere grid. In this class of grid a rotation may be required |
the cube-sphere grid. In this class of grid a rotation may be required |
| 1274 |
between tiles. Aligning the coordinate requiring rotation with the |
between tiles. Aligning the coordinate requiring rotation with the |
| 1275 |
tile decomposistion, allows the coordinate transformation to |
tile decomposition, allows the coordinate transformation to |
| 1276 |
be embedded within a custom form of the \_EXCH primitive. |
be embedded within a custom form of the \_EXCH primitive. In these |
| 1277 |
|
cases \_EXCH is mapped to exch2 routines, as detailed in the exch2 |
| 1278 |
|
package documentation \ref{sec:exch2}. |
| 1279 |
|
|
| 1280 |
\item {\bf Reverse Mode} |
\item {\bf Reverse Mode} |
| 1281 |
The communication primitives \_EXCH and \_GSUM both employ |
The communication primitives \_EXCH and \_GSUM both employ |
| 1282 |
hand-written adjoint (or reverse mode) forms. |
hand-written adjoint (or reverse mode) forms. |
| 1283 |
These reverse mode forms can be found in the |
These reverse mode forms can be found in the |
| 1284 |
sourc code directory {\em pkg/autodiff}. |
source code directory {\em pkg/autodiff}. |
| 1285 |
For the global sum primitive the reverse mode form |
For the global sum primitive the reverse mode form |
| 1286 |
calls are to {\em GLOBAL\_ADSUM\_R4} and |
calls are to {\em GLOBAL\_ADSUM\_R4} and |
| 1287 |
{\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the |
{\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the |
| 1288 |
exchamge primitives are found in routines |
exchange primitives are found in routines |
| 1289 |
prefixed {\em ADEXCH}. The exchange routines make calls to |
prefixed {\em ADEXCH}. The exchange routines make calls to |
| 1290 |
the same low-level communication primitives as the forward mode |
the same low-level communication primitives as the forward mode |
| 1291 |
operations. However, the routine argument {\em simulationMode} |
operations. However, the routine argument {\em simulationMode} |
| 1292 |
is set to the value {\em REVERSE\_SIMULATION}. This signifies |
is set to the value {\em REVERSE\_SIMULATION}. This signifies |
| 1293 |
to the low-level routines that the adjoint forms of the |
to the low-level routines that the adjoint forms of the |
| 1294 |
appropriate communication operation should be performed. |
appropriate communication operation should be performed. |
| 1295 |
|
|
| 1296 |
\item {\bf MAX\_NO\_THREADS} |
\item {\bf MAX\_NO\_THREADS} |
| 1297 |
The variable {\em MAX\_NO\_THREADS} is used to indicate the |
The variable {\em MAX\_NO\_THREADS} is used to indicate the |
| 1298 |
maximum number of OS threads that a code will use. This |
maximum number of OS threads that a code will use. This |
| 1299 |
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}. |
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}. |
| 1300 |
For single threaded execution it can be reduced to one if required. |
For single threaded execution it can be reduced to one if required. |
| 1301 |
The va;lue is largely private to the WRAPPER and application code |
The value is largely private to the WRAPPER and application code |
| 1302 |
will not normally reference the value, except in the following scenario. |
will not normally reference the value, except in the following scenario. |
| 1303 |
|
|
| 1304 |
For certain physical parametrization schemes it is necessary to have |
For certain physical parametrization schemes it is necessary to have |
| 1309 |
if this might be unavailable then the work arrays can be extended |
if this might be unavailable then the work arrays can be extended |
| 1310 |
with dimensions that use the tile dimensioning scheme of {\em nSx} |
with dimensions that use the tile dimensioning scheme of {\em nSx} |
| 1311 |
and {\em nSy} ( as described in section |
and {\em nSy} ( as described in section |
| 1312 |
\ref{sec:specifying_a_decomposition}). However, if the configuration |
\ref{sect:specifying_a_decomposition}). However, if the configuration |
| 1313 |
being specified involves many more tiles than OS threads then |
being specified involves many more tiles than OS threads then |
| 1314 |
it can save memory resources to reduce the variable |
it can save memory resources to reduce the variable |
| 1315 |
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that |
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that |
| 1316 |
will be used and to declare the physical parameterisation |
will be used and to declare the physical parameterization |
| 1317 |
work arrays with a sinble {\em MAX\_NO\_THREADS} extra dimension. |
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension. |
| 1318 |
An example of this is given in the verification experiment |
An example of this is given in the verification experiment |
| 1319 |
{\em aim.5l\_cs}. Here the default setting of |
{\em aim.5l\_cs}. Here the default setting of |
| 1320 |
{\em MAX\_NO\_THREADS} is altered to |
{\em MAX\_NO\_THREADS} is altered to |
| 1327 |
\begin{verbatim} |
\begin{verbatim} |
| 1328 |
common /FORCIN/ sst1(ngp,MAX_NO_THREADS) |
common /FORCIN/ sst1(ngp,MAX_NO_THREADS) |
| 1329 |
\end{verbatim} |
\end{verbatim} |
| 1330 |
This declaration scheme is not used widely, becuase most global data |
This declaration scheme is not used widely, because most global data |
| 1331 |
is used for permanent not temporary storage of state information. |
is used for permanent not temporary storage of state information. |
| 1332 |
In the case of permanent state information this approach cannot be used |
In the case of permanent state information this approach cannot be used |
| 1333 |
because there has to be enough storage allocated for all tiles. |
because there has to be enough storage allocated for all tiles. |
| 1334 |
However, the technique can sometimes be a useful scheme for reducing memory |
However, the technique can sometimes be a useful scheme for reducing memory |
| 1335 |
requirements in complex physical paramterisations. |
requirements in complex physical parameterizations. |
| 1336 |
\end{enumerate} |
\end{enumerate} |
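The per-thread work array scheme described under {\em MAX\_NO\_THREADS}
can be sketched as follows. This is an illustrative fragment only: the
routine name {\em FILL\_SST}, the thread number argument {\em myThid} and
the values chosen for {\em ngp} and {\em MAX\_NO\_THREADS} are assumptions
made for the example. In the model itself {\em MAX\_NO\_THREADS} would be
taken from {\em EEPARAMS.h} rather than declared locally.
\begin{verbatim}
C     Hypothetical sketch: per-thread scratch storage in a common block.
C     ngp = 4 and MAX_NO_THREADS = 2 are illustrative values only.
      SUBROUTINE FILL_SST( myThid )
      IMPLICIT NONE
      INTEGER ngp, MAX_NO_THREADS
      PARAMETER ( ngp = 4, MAX_NO_THREADS = 2 )
      REAL*8 sst1(ngp,MAX_NO_THREADS)
      COMMON /FORCIN/ sst1
      INTEGER myThid
      INTEGER j
C     Each thread writes only to the column selected by its own
C     thread number, so threads do not share scratch space.
      DO j = 1, ngp
        sst1(j,myThid) = 0.D0
      ENDDO
      RETURN
      END
\end{verbatim}
Because the second dimension is sized by the number of threads rather than
by {\em nSx} and {\em nSy}, the storage cost does not grow with the number
of tiles. The array can therefore only hold temporary values.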
| 1337 |
|
|
| 1338 |
\begin{figure} |
\begin{figure} |
| 1365 |
The isolation of performance critical communication primitives and the |
The isolation of performance critical communication primitives and the |
| 1366 |
sub-division of the simulation domain into tiles is a powerful tool. |
sub-division of the simulation domain into tiles is a powerful tool. |
| 1367 |
Here we show how it can be used to improve application performance and |
Here we show how it can be used to improve application performance and |
| 1368 |
how it can be used to adapt to new gridding approaches. |
how it can be used to adapt to new gridding approaches. |
| 1369 |
|
|
| 1370 |
\subsubsection{JAM example} |
\subsubsection{JAM example} |
| 1371 |
\label{sec:jam_example} |
\label{sect:jam_example} |
| 1372 |
On some platforms a big performance boost can be obtained by |
On some platforms a big performance boost can be obtained by |
| 1373 |
binding the communication routines {\em \_EXCH} and |
binding the communication routines {\em \_EXCH} and |
| 1374 |
{\em \_GSUM} to specialized native libraries (for example the |
{\em \_GSUM} to specialized native libraries (for example the |
| 1384 |
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced |
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced |
| 1385 |
with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F}) |
with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F}) |
| 1386 |
\item a highly specialized form of the exchange operator (optimized |
\item a highly specialized form of the exchange operator (optimized |
| 1387 |
for overlap regions of width one) is substitued into the elliptic |
for overlap regions of width one) is substituted into the elliptic |
| 1388 |
solver routine {\em cg2d.F}. |
solver routine {\em cg2d.F}. |
| 1389 |
\end{itemize} |
\end{itemize} |
| 1390 |
Developing specialized code for other libraries follows a similar |
Developing specialized code for other libraries follows a similar |
| 1391 |
pattern. |
pattern. |
| 1392 |
|
|
| 1393 |
\subsubsection{Cube sphere communication} |
\subsubsection{Cube sphere communication} |
| 1394 |
\label{sec:cube_sphere_communication} |
\label{sect:cube_sphere_communication} |
| 1395 |
Actual {\em \_EXCH} routine code is generated automatically from |
Actual {\em \_EXCH} routine code is generated automatically from |
| 1396 |
a series of template files, for example {\em exch\_rx.template}. |
a series of template files, for example {\em exch\_rx.template}. |
| 1397 |
This is done to allow a large number of variations on the exchange |
This is done to allow a large number of variations on the exchange |
| 1398 |
process to be maintained. One set of variations supports the |
process to be maintained. One set of variations supports the |
| 1399 |
cube sphere grid. Support for a cube sphere gris in MITgcm is based |
cube sphere grid. Support for a cube sphere grid in MITgcm is based |
| 1400 |
on having each face of the cube as a separate tile (or tiles). |
on having each face of the cube as a separate tile or tiles. |
| 1401 |
The exchage routines are then able to absorb much of the |
The exchange routines are then able to absorb much of the |
| 1402 |
detailed rotation and reorientation required when moving around the |
detailed rotation and reorientation required when moving around the |
| 1403 |
cube grid. The set of {\em \_EXCH} routines that contain the |
cube grid. The set of {\em \_EXCH} routines that contain the |
| 1404 |
word cube in their name perform these transformations. |
word cube in their name perform these transformations. |
| 1405 |
They are invoked when the run-time logical parameter |
They are invoked when the run-time logical parameter |
| 1406 |
{\em useCubedSphereExchange} is set true. To facilitate the |
{\em useCubedSphereExchange} is set true. To facilitate the |
| 1407 |
transformations on a staggered C-grid, exchange operations are defined |
transformations on a staggered C-grid, exchange operations are defined |
| 1408 |
separately for both vector and scalar quantitities and for |
separately for both vector and scalar quantities and for |
| 1409 |
grid-centered and for grid-face and corner quantities. |
grid-centered and for grid-face and corner quantities. |
| 1410 |
Three sets of exchange routines are defined. Routines |
Three sets of exchange routines are defined. Routines |
| 1411 |
with names of the form {\em exch\_rx} are used to exchange |
with names of the form {\em exch\_rx} are used to exchange |
| 1421 |
|
|
| 1422 |
|
|
| 1423 |
\section{MITgcm execution under WRAPPER} |
\section{MITgcm execution under WRAPPER} |
| 1424 |
|
\begin{rawhtml} |
| 1425 |
|
<!-- CMIREDIR:mitgcm_wrapper: --> |
| 1426 |
|
\end{rawhtml} |
| 1427 |
|
|
| 1428 |
Fitting together the WRAPPER elements, package elements and |
Fitting together the WRAPPER elements, package elements and |
| 1429 |
MITgcm core equation elements of the source code produces the calling |
MITgcm core equation elements of the source code produces the calling |
| 1430 |
sequence shown in section \ref{sec:calling_sequence}. |
sequence shown in section \ref{sect:calling_sequence}. |
| 1431 |
|
|
| 1432 |
\subsection{Annotated call tree for MITgcm and WRAPPER} |
\subsection{Annotated call tree for MITgcm and WRAPPER} |
| 1433 |
\label{sec:calling_sequence} |
\label{sect:calling_sequence} |
| 1434 |
|
|
| 1435 |
WRAPPER layer. |
WRAPPER layer. |
| 1436 |
|
|
| 1437 |
|
{\footnotesize |
| 1438 |
\begin{verbatim} |
\begin{verbatim} |
| 1439 |
|
|
| 1440 |
MAIN |
MAIN |
| 1462 |
|--THE_MODEL_MAIN :: Numerical code top-level driver routine |
|--THE_MODEL_MAIN :: Numerical code top-level driver routine |
| 1463 |
|
|
| 1464 |
\end{verbatim} |
\end{verbatim} |
| 1465 |
|
} |
| 1466 |
|
|
| 1467 |
Core equations plus packages. |
Core equations plus packages. |
| 1468 |
|
|
| 1469 |
|
{\footnotesize |
| 1470 |
\begin{verbatim} |
\begin{verbatim} |
| 1471 |
C |
C |
| 1472 |
C |
C |
| 1476 |
C | |
C | |
| 1477 |
C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm |
C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm |
| 1478 |
C | :: Called from WRAPPER level numerical |
C | :: Called from WRAPPER level numerical |
| 1479 |
C | :: code innvocation routine. On entry |
C | :: code invocation routine. On entry |
| 1480 |
C | :: to THE_MODEL_MAIN separate thread and |
C | :: to THE_MODEL_MAIN separate thread and |
| 1481 |
C | :: separate processes will have been established. |
C | :: separate processes will have been established. |
| 1482 |
C | :: Each thread and process will have a unique ID |
C | :: Each thread and process will have a unique ID |
| 1490 |
C | | :: By default kernel parameters are read from file |
C | | :: By default kernel parameters are read from file |
| 1491 |
C | | :: "data" in directory in which code executes. |
C | | :: "data" in directory in which code executes. |
| 1492 |
C | | |
C | | |
| 1493 |
C | |-MON_INIT :: Initialises monitor pacakge ( see pkg/monitor ) |
C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor ) |
| 1494 |
C | | |
C | | |
| 1495 |
C | |-INI_GRID :: Control grid array (vert. and hori.) initialisation. |
C | |-INI_GRID :: Control grid array (vert. and hori.) initialization. |
| 1496 |
C | | | :: Grid arrays are held and described in GRID.h. |
C | | | :: Grid arrays are held and described in GRID.h. |
| 1497 |
C | | | |
C | | | |
| 1498 |
C | | |-INI_VERTICAL_GRID :: Initialise vertical grid arrays. |
C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays. |
| 1499 |
C | | | |
C | | | |
| 1500 |
C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialisation |
C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization |
| 1501 |
C | | | :: (calculate grid from kernel parameters). |
C | | | :: (calculate grid from kernel parameters). |
| 1502 |
C | | | |
C | | | |
| 1503 |
C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid |
C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid |
| 1504 |
C | | | :: initialisation (calculate grid from |
C | | | :: initialization (calculate grid from |
| 1505 |
C | | | :: kernel parameters). |
C | | | :: kernel parameters). |
| 1506 |
C | | | |
C | | | |
| 1507 |
C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz. |
C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz. |
| 1508 |
C | | :: grid initialisations. ( input from raw |
C | | :: grid initializations. ( input from raw |
| 1509 |
C | | :: grid files, LONC.bin, DXF.bin etc... ) |
C | | :: grid files, LONC.bin, DXF.bin etc... ) |
| 1510 |
C | | |
C | | |
| 1511 |
C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography. |
C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography. |
| 1516 |
C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf |
C | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf |
| 1517 |
C | | |
C | | |
| 1518 |
C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane, |
C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane, |
| 1519 |
C | | :: sphere optins are coded. |
C | | :: sphere options are coded. |
| 1520 |
C | | |
C | | |
| 1521 |
C | |-PACKAGES_BOOT :: Start up the optional package environment. |
C | |-PACKAGES_BOOT :: Start up the optional package environment. |
| 1522 |
C | | :: Runtime selection of active packages. |
C | | :: Runtime selection of active packages. |
| 1537 |
C | |-PACKAGES_CHECK |
C | |-PACKAGES_CHECK |
| 1538 |
C | | | |
C | | | |
| 1539 |
C | | |-KPP_CHECK :: KPP Package. pkg/kpp |
C | | |-KPP_CHECK :: KPP Package. pkg/kpp |
| 1540 |
C | | |-OBCS_CHECK :: Open bndy Pacakge. pkg/obcs |
C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs |
| 1541 |
C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi |
C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi |
| 1542 |
C | | |
C | | |
| 1543 |
C | |-PACKAGES_INIT_FIXED |
C | |-PACKAGES_INIT_FIXED |
| 1557 |
C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl |
C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl |
| 1558 |
C | |
C | |
| 1559 |
C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop |
C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop |
| 1560 |
C ! :: Auotmatically gerenrated by TAMC/TAF. |
C ! :: Automatically generated by TAMC/TAF. |
| 1561 |
C | |
C | |
| 1562 |
C |-CTRL_PACK :: Control vector support package. see pkg/ctrl |
C |-CTRL_PACK :: Control vector support package. see pkg/ctrl |
| 1563 |
C | |
C | |
| 1571 |
C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf |
C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf |
| 1572 |
C | | | |
C | | | |
| 1573 |
C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane, |
C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane, |
| 1574 |
C | | | :: sphere optins are coded. |
C | | | :: sphere options are coded. |
| 1575 |
C | | | |
C | | | |
| 1576 |
C | | |-INI_CG2D :: 2d con. grad solver initialisation. |
C | | |-INI_CG2D :: 2d con. grad solver initialisation. |
| 1577 |
C | | |-INI_CG3D :: 3d con. grad solver initialisation. |
C | | |-INI_CG3D :: 3d con. grad solver initialisation. |
| 1579 |
C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical |
C | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical |
| 1580 |
C | | | :: fields). |
C | | | :: fields). |
| 1581 |
C | | | |
C | | | |
| 1582 |
C | | |-INI_FIELDS :: Control initialising model fields to non-zero |
C | | |-INI_FIELDS :: Control initializing model fields to non-zero |
| 1583 |
C | | | |-INI_VEL :: Initialize 3D flow field. |
C | | | |-INI_VEL :: Initialize 3D flow field. |
| 1584 |
C | | | |-INI_THETA :: Set model initial temperature field. |
C | | | |-INI_THETA :: Set model initial temperature field. |
| 1585 |
C | | | |-INI_SALT :: Set model initial salinity field. |
C | | | |-INI_SALT :: Set model initial salinity field. |
| 1657 |
C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness. |
C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness. |
| 1658 |
C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf ) |
C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf ) |
| 1659 |
C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data. |
C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data. |
| 1660 |
C/\ | | | | :: Simple interpolcation between end-points |
C/\ | | | | :: Simple interpolation between end-points |
| 1661 |
C/\ | | | | :: for forcing datasets. |
C/\ | | | | :: for forcing datasets. |
| 1662 |
C/\ | | | | |
C/\ | | | | |
| 1663 |
C/\ | | | |-EXCH :: Sync forcing. in overlap regions. |
C/\ | | | |-EXCH :: Sync forcing. in overlap regions. |
| 1805 |
C :: events. |
C :: events. |
| 1806 |
C |
C |
| 1807 |
\end{verbatim} |
\end{verbatim} |
| 1808 |
|
} |
| 1809 |
|
|
| 1810 |
\subsection{Measuring and Characterizing Performance} |
\subsection{Measuring and Characterizing Performance} |
| 1811 |
|
|