1 |
|
% $Header$ |
2 |
|
|
3 |
In this chapter we describe the software architecture and |
In this chapter we describe the software architecture and |
4 |
implementation strategy for the MITgcm code. The first part of this |
implementation strategy for the MITgcm code. The first part of this |
12 |
three-fold |
three-fold |
13 |
|
|
14 |
\begin{itemize} |
\begin{itemize} |
|
|
|
15 |
\item We wish to be able to study a very broad range |
\item We wish to be able to study a very broad range |
16 |
of interesting and challenging rotating fluids problems. |
of interesting and challenging rotating fluids problems. |
|
|
|
17 |
\item We wish the model code to be readily targeted to |
\item We wish the model code to be readily targeted to |
18 |
a wide range of platforms |
a wide range of platforms |
|
|
|
19 |
\item On any given platform we would like to be |
\item On any given platform we would like to be |
20 |
able to achieve performance comparable to an implementation |
able to achieve performance comparable to an implementation |
21 |
developed and specialized specifically for that platform. |
developed and specialized specifically for that platform. |
|
|
|
22 |
\end{itemize} |
\end{itemize} |
23 |
|
|
24 |
These points are summarized in figure \ref{fig:mitgcm_architecture_goals} |
These points are summarized in figure \ref{fig:mitgcm_architecture_goals} |
27 |
of |
of |
28 |
|
|
29 |
\begin{enumerate} |
\begin{enumerate} |
|
|
|
30 |
\item A core set of numerical and support code. This is discussed in detail in |
\item A core set of numerical and support code. This is discussed in detail in |
31 |
section \ref{sec:partII}. |
section \ref{sec:partII}. |
|
|
|
32 |
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
33 |
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics). |
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics). |
34 |
These packages are used both to overlay alternate dynamics and to introduce |
These packages are used both to overlay alternate dynamics and to introduce |
35 |
specialized physical content onto the core numerical code. An overview of |
specialized physical content onto the core numerical code. An overview of |
36 |
the {\bf package} scheme is given at the start of part \ref{part:packages}. |
the {\bf package} scheme is given at the start of part \ref{part:packages}. |
|
|
|
|
|
|
37 |
\item A support framework called {\bf WRAPPER} (Wrappable Application Parallel |
\item A support framework called {\bf WRAPPER} (Wrappable Application Parallel |
38 |
Programming Environment Resource), within which the core numerics and pluggable |
Programming Environment Resource), within which the core numerics and pluggable |
39 |
packages operate. |
packages operate. |
|
|
|
40 |
\end{enumerate} |
\end{enumerate} |
41 |
|
|
42 |
This chapter focuses on describing the {\bf WRAPPER} environment under which |
This chapter focuses on describing the {\bf WRAPPER} environment under which |
43 |
both the core numerics and the pluggable packages function. The description |
both the core numerics and the pluggable packages function. The description |
44 |
presented here is intended to be a detailed exposition and contains significant
45 |
background material, as well as advanced details on working with the WRAPPER. |
background material, as well as advanced details on working with the WRAPPER. |
46 |
The examples section of this manual (part \ref{part:example}) contains more |
The examples section of this manual (part \ref{part:example}) contains more |
47 |
succinct, step-by-step instructions on running basic numerical |
succinct, step-by-step instructions on running basic numerical |
49 |
starting from an example code and adapting it to suit a particular situation |
starting from an example code and adapting it to suit a particular situation |
50 |
will be all that is required. |
will be all that is required. |
51 |
|
|
52 |
|
|
53 |
\begin{figure} |
\begin{figure} |
54 |
\begin{center} |
\begin{center} |
55 |
\resizebox{!}{2.5in}{ |
\resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}} |
|
\includegraphics*[1.5in,2.4in][9.5in,6.3in]{part4/mitgcm_goals.eps} |
|
|
} |
|
56 |
\end{center} |
\end{center} |
57 |
\caption{The MITgcm architecture is designed to allow simulation of a wide |
\caption{ |
58 |
|
The MITgcm architecture is designed to allow simulation of a wide |
59 |
range of physical problems on a wide range of hardware. The computational |
range of physical problems on a wide range of hardware. The computational |
60 |
resource requirements of the applications targeted range from around |
resource requirements of the applications targeted range from around |
61 |
$10^7$ bytes ( $\approx 10$ megabytes ) of memory to $10^{11}$ bytes |
$10^7$ bytes ( $\approx 10$ megabytes ) of memory to $10^{11}$ bytes |
62 |
( $\approx 100$ gigabytes). Arithmetic operation counts for the applications of |
( $\approx 100$ gigabytes). Arithmetic operation counts for the applications of |
63 |
interest range from $10^{9}$ floating point operations to more than $10^{17}$ |
interest range from $10^{9}$ floating point operations to more than $10^{17}$ |
64 |
floating point operations.} \label{fig:mitgcm_architecture_goals} |
floating point operations.} |
65 |
|
\label{fig:mitgcm_architecture_goals} |
66 |
\end{figure} |
\end{figure} |
67 |
|
|
68 |
\section{WRAPPER} |
\section{WRAPPER} |
80 |
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code |
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code |
81 |
that fits within it from architectural differences between hardware platforms |
that fits within it from architectural differences between hardware platforms |
82 |
and operating systems. This allows numerical code to be easily retargeted.
83 |
|
|
84 |
|
|
85 |
\begin{figure} |
\begin{figure} |
86 |
\begin{center} |
\begin{center} |
87 |
\resizebox{6in}{4.5in}{ |
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}} |
|
\includegraphics*[0.6in,0.7in][9.0in,8.5in]{part4/fit_in_wrapper.eps} |
|
|
} |
|
88 |
\end{center} |
\end{center} |
89 |
\caption{ Numerical code is written to fit within a software support
\caption{ |
90 |
|
Numerical code is written to fit within a software support
91 |
infrastructure called WRAPPER. The WRAPPER is portable and |
infrastructure called WRAPPER. The WRAPPER is portable and |
92 |
can be specialized for a wide range of specific target hardware and
93 |
programming environments, without impacting numerical code that fits |
programming environments, without impacting numerical code that fits |
94 |
within the WRAPPER. Codes that fit within the WRAPPER can generally be |
within the WRAPPER. Codes that fit within the WRAPPER can generally be |
95 |
made to run as fast on a particular platform as codes specially |
made to run as fast on a particular platform as codes specially |
96 |
optimized for that platform. |
optimized for that platform.} |
97 |
} \label{fig:fit_in_wrapper} |
\label{fig:fit_in_wrapper} |
98 |
\end{figure} |
\end{figure} |
99 |
|
|
100 |
\subsection{Target hardware} |
\subsection{Target hardware} |
136 |
class of machines (for example Parallel Vector Processor Systems). Instead the |
class of machines (for example Parallel Vector Processor Systems). Instead the |
137 |
WRAPPER provides applications with an |
WRAPPER provides applications with an |
138 |
abstract {\it machine model}. The machine model is very general; however, it can
139 |
easily be specialized to fit, in a computationally efficient manner, any
140 |
computer architecture currently available to the scientific computing community. |
computer architecture currently available to the scientific computing community. |
141 |
|
|
142 |
\subsection{Machine model parallelism} |
\subsection{Machine model parallelism} |
143 |
|
|
144 |
Codes operating under the WRAPPER target an abstract machine that is assumed to |
Codes operating under the WRAPPER target an abstract machine that is assumed to |
145 |
consist of one or more logical processors that can compute concurrently. |
consist of one or more logical processors that can compute concurrently. |
146 |
Computational work is divided amongst the logical |
Computational work is divided among the logical |
147 |
processors by allocating ``ownership'' to |
processors by allocating ``ownership'' to |
148 |
each processor of a certain set (or sets) of calculations. Each set of |
each processor of a certain set (or sets) of calculations. Each set of |
149 |
calculations owned by a particular processor is associated with a specific |
calculations owned by a particular processor is associated with a specific |
166 |
space allocated to a particular logical processor, there will be data |
space allocated to a particular logical processor, there will be data |
167 |
structures (arrays, scalar variables, etc.) that hold the simulated state of
168 |
that region. We refer to these data structures as being {\bf owned} by the |
that region. We refer to these data structures as being {\bf owned} by the |
169 |
processor to which their
170 |
associated region of physical space has been allocated. Individual |
associated region of physical space has been allocated. Individual |
171 |
regions that are allocated to processors are called {\bf tiles}. A |
regions that are allocated to processors are called {\bf tiles}. A |
172 |
processor can own more |
processor can own more |
180 |
|
|
181 |
\begin{figure} |
\begin{figure} |
182 |
\begin{center} |
\begin{center} |
183 |
\resizebox{7in}{3in}{ |
\resizebox{5in}{!}{ |
184 |
\includegraphics*[0.5in,2.7in][12.5in,6.4in]{part4/domain_decomp.eps} |
\includegraphics{part4/domain_decomp.eps} |
185 |
} |
} |
186 |
\end{center} |
\end{center} |
187 |
\caption{ The WRAPPER provides support for one and two dimensional |
\caption{ The WRAPPER provides support for one and two dimensional |
216 |
|
|
217 |
\begin{figure} |
\begin{figure} |
218 |
\begin{center} |
\begin{center} |
219 |
\resizebox{7in}{3in}{ |
\resizebox{5in}{!}{ |
220 |
\includegraphics*[4.5in,3.7in][12.5in,6.7in]{part4/tiled-world.eps} |
\includegraphics{part4/tiled-world.eps} |
221 |
} |
} |
222 |
\end{center} |
\end{center} |
223 |
\caption{ A global grid subdivided into tiles. |
\caption{ A global grid subdivided into tiles. |
398 |
|
|
399 |
\begin{figure} |
\begin{figure} |
400 |
\begin{center} |
\begin{center} |
401 |
\resizebox{5in}{3in}{ |
\resizebox{5in}{!}{ |
402 |
\includegraphics*[1.5in,0.7in][7.9in,4.4in]{part4/comm-primm.eps} |
\includegraphics{part4/comm-primm.eps} |
403 |
} |
} |
404 |
\end{center} |
\end{center} |
405 |
\caption{Three performance critical parallel primitives are provided
\caption{Three performance critical parallel primitives are provided |
406 |
by the WRAPPER. These primitives are always used to communicate data
407 |
between tiles. The figure shows four tiles. The curved arrows indicate |
between tiles. The figure shows four tiles. The curved arrows indicate |
408 |
exchange primitives which transfer data between the overlap regions at tile |
exchange primitives which transfer data between the overlap regions at tile |
409 |
edges and interior regions for nearest-neighbor tiles. |
edges and interior regions for nearest-neighbor tiles. |
479 |
|
|
480 |
\begin{figure} |
\begin{figure} |
481 |
\begin{center} |
\begin{center} |
482 |
\resizebox{5in}{3in}{ |
\resizebox{5in}{!}{ |
483 |
\includegraphics*[0.5in,1.3in][7.9in,5.7in]{part4/tiling_detail.eps} |
\includegraphics{part4/tiling_detail.eps} |
484 |
} |
} |
485 |
\end{center} |
\end{center} |
486 |
\caption{The tiling strategy that the WRAPPER supports allows tiles |
\caption{The tiling strategy that the WRAPPER supports allows tiles |
583 |
|
|
584 |
\begin{figure} |
\begin{figure} |
585 |
\begin{center} |
\begin{center} |
586 |
\resizebox{5in}{7in}{ |
\resizebox{5in}{!}{ |
587 |
\includegraphics*[0.5in,0.3in][7.9in,10.7in]{part4/size_h.eps} |
\includegraphics{part4/size_h.eps} |
588 |
} |
} |
589 |
\end{center} |
\end{center} |
590 |
\caption{ The three level domain decomposition hierarchy employed by the |
\caption{ The three level domain decomposition hierarchy employed by the |
789 |
There are six tiles allocated to six separate logical processors ({\em nSx=6}). |
There are six tiles allocated to six separate logical processors ({\em nSx=6}). |
790 |
This set of values can be used for a cube sphere calculation. |
This set of values can be used for a cube sphere calculation. |
791 |
Each tile of size $32 \times 32$ represents a face of the |
Each tile of size $32 \times 32$ represents a face of the |
792 |
cube. Initializing the tile connectivity correctly (see section
\ref{sec:cube_sphere_communication}) allows the rotations associated with
794 |
moving between the six cube faces to be embedded within the |
moving between the six cube faces to be embedded within the |
795 |
tile-tile communication code. |
tile-tile communication code. |
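
The cube-sphere layout described above can be summarized with a few lines of
arithmetic. The following is a minimal sketch in Python and is not part of
MITgcm. The tile size ($32 \times 32$) and {\em nSx=6} are taken from the text;
the function name and the values {\em nSy=1}, {\em nPx=1} and {\em nPy=1} are
assumptions made only for illustration.

```python
def cube_sphere_layout(sNx=32, sNy=32, nSx=6, nSy=1, nPx=1, nPy=1):
    """Summarize a tile layout in which each tile is one face of the cube.

    sNx, sNy and nSx follow the text; nSy = nPx = nPy = 1 are assumed
    here (a single process holding all six faces).
    """
    n_tiles = nSx * nSy * nPx * nPy        # total number of tiles
    points_per_tile = sNx * sNy            # interior grid points per tile
    return {
        "tiles": n_tiles,
        "points_per_tile": points_per_tile,
        "total_points": n_tiles * points_per_tile,
    }


layout = cube_sphere_layout()
print(layout)
```

With these assumed values the sketch reports six tiles, one per cube face.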
806 |
by the application code. The startup calling sequence followed by the |
by the application code. The startup calling sequence followed by the |
807 |
WRAPPER is shown in figure \ref{fig:wrapper_startup}. |
WRAPPER is shown in figure \ref{fig:wrapper_startup}. |
808 |
|
|
|
|
|
809 |
\begin{figure} |
\begin{figure} |
810 |
\begin{verbatim} |
\begin{verbatim} |
811 |
|
|
842 |
\end{figure} |
\end{figure} |
843 |
|
|
844 |
\subsubsection{Multi-threaded execution} |
\subsubsection{Multi-threaded execution} |
845 |
|
\label{sec:multi-threaded-execution} |
846 |
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the |
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the |
847 |
WRAPPER may cause several coarse grain threads to be initialized. The routine |
WRAPPER may cause several coarse grain threads to be initialized. The routine |
848 |
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single |
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single |
898 |
\end{enumerate} |
\end{enumerate} |
899 |
|
|
900 |
|
|
|
\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate
the number of threads to be used in the x and y directions.
The variables {\em nTx} and {\em nTy} in this file are used to
specify the information required. The product of {\em nTx} and
{\em nTy} must be equal to the number of threads spawned, i.e.
the setting of the environment variable PARALLEL.
The value of {\em nTx} must subdivide the number of sub-domains
in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
number of sub-domains in y ({\em nSy}) exactly.
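
The three constraints on {\em nTx} and {\em nTy} stated above can be checked
mechanically before a run. The following is a minimal sketch in Python and is
not part of MITgcm; the function name {\tt check\_thread\_layout} and its
argument {\tt n\_threads} are hypothetical and simply mirror the quantities
named in the text.

```python
def check_thread_layout(nTx, nTy, nSx, nSy, n_threads):
    """Return a list of violated constraints (empty if the layout is valid).

    nTx, nTy  : threads in x and y, as set in eedata
    nSx, nSy  : sub-domains per process in x and y
    n_threads : number of threads requested from the system, for example
                the value of the PARALLEL environment variable
    """
    if nTx < 1 or nTy < 1:
        return ["nTx and nTy must be positive"]
    problems = []
    if nTx * nTy != n_threads:
        problems.append("nTx*nTy must equal the number of threads spawned")
    if nSx % nTx != 0:
        problems.append("nTx must divide nSx exactly")
    if nSy % nTy != 0:
        problems.append("nTy must divide nSy exactly")
    return problems


# Two sub-domains in y run with two threads; nSx = 1 is an assumed value.
print(check_thread_layout(nTx=1, nTy=2, nSx=1, nSy=2, n_threads=2))
```

An empty list means that all three conditions hold.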
|
|
|
|
901 |
An example of valid settings for the {\em eedata} file for a |
An example of valid settings for the {\em eedata} file for a |
902 |
domain with two subdomains in y and running with two threads is shown |
domain with two subdomains in y and running with two threads is shown |
903 |
below |
below |
930 |
} \\ |
} \\ |
931 |
|
|
932 |
\subsubsection{Multi-process execution} |
\subsubsection{Multi-process execution} |
933 |
|
\label{sec:multi-process-execution} |
934 |
|
|
935 |
Despite its appealing programming model, multi-threaded execution remains |
Despite its appealing programming model, multi-threaded execution remains |
936 |
less common than multi-process execution. One major reason for this
942 |
|
|
943 |
Multi-process execution is more ubiquitous. |
Multi-process execution is more ubiquitous. |
944 |
In order to run code in a multi-process configuration a decomposition |
In order to run code in a multi-process configuration a decomposition |
945 |
specification (see section \ref{sec:specifying_a_decomposition})
is given (in which at least one of the
947 |
parameters {\em nPx} or {\em nPy} will be greater than one) |
parameters {\em nPx} or {\em nPy} will be greater than one) |
948 |
and then, as for multi-threaded operation, |
and then, as for multi-threaded operation, |
949 |
appropriate compile time and run time steps must be taken. |
appropriate compile time and run time steps must be taken. |
1006 |
\begin{verbatim} |
\begin{verbatim} |
1007 |
mpirun -np 64 -machinefile mf ./mitgcmuv |
mpirun -np 64 -machinefile mf ./mitgcmuv |
1008 |
\end{verbatim} |
\end{verbatim} |
1009 |
In this example the text {\em -np 64} specifies the number of processes
1010 |
that will be created. The numeric value {\em 64} must be equal to the |
that will be created. The numeric value {\em 64} must be equal to the |
1011 |
product of the processor grid settings of {\em nPx} and {\em nPy} |
product of the processor grid settings of {\em nPx} and {\em nPy} |
1012 |
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file |
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file |
1023 |
\end{minipage} |
\end{minipage} |
1024 |
} \\ |
} \\ |
1025 |
|
|
1026 |
|
|
1045 |
The multiprocess startup of the MITgcm executable {\em mitgcmuv} |
The multiprocess startup of the MITgcm executable {\em mitgcmuv} |
1046 |
is controlled by the routines {\em EEBOOT\_MINIMAL()} and |
is controlled by the routines {\em EEBOOT\_MINIMAL()} and |
1047 |
{\em INI\_PROCS()}. The first routine performs basic steps required |
{\em INI\_PROCS()}. The first routine performs basic steps required |
1054 |
output files {\bf STDOUT.0001} and {\bf STDERR.0001}, etc. These files
1055 |
are used for reporting status and configuration information and |
are used for reporting status and configuration information and |
1056 |
for reporting error conditions on a process by process basis. |
for reporting error conditions on a process by process basis. |
1057 |
The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
1058 |
{\em myProcId} and {\em MPI\_COMM\_MODEL}. |
{\em myProcId} and {\em MPI\_COMM\_MODEL}. |
1059 |
These variables are related |
These variables are related |
1060 |
to processor identification and are used later in the routine
1095 |
The WRAPPER maintains internal information that is used for communication |
The WRAPPER maintains internal information that is used for communication |
1096 |
operations and that can be customized for different platforms. This section |
operations and that can be customized for different platforms. This section |
1097 |
describes the information that is held and used. |
describes the information that is held and used. |
1098 |
|
|
1099 |
\begin{enumerate} |
\begin{enumerate} |
1100 |
\item {\bf Tile-tile connectivity information} For each tile the WRAPPER |
\item {\bf Tile-tile connectivity information} For each tile the WRAPPER |
1101 |
sets a flag that sets the tile number to the north, south, east and |
sets a flag that sets the tile number to the north, south, east and |
1109 |
This latter set of variables can take one of the following values |
This latter set of variables can take one of the following values |
1110 |
{\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}. |
{\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}. |
1111 |
A value of {\em COMM\_NONE} is used to indicate that a tile has no |
A value of {\em COMM\_NONE} is used to indicate that a tile has no |
1112 |
neighbor to communicate with on a particular face. A value
1113 |
of {\em COMM\_MSG} is used to indicate that some form of distributed
1114 |
memory communication is required to communicate between |
memory communication is required to communicate between |
1115 |
these tile faces ( see section \ref{sec:distributed_memory_communication}). |
these tile faces ( see section \ref{sec:distributed_memory_communication}). |
1166 |
are read from the file {\em eedata}. If the value of {\em nThreads} |
are read from the file {\em eedata}. If the value of {\em nThreads} |
1167 |
is inconsistent with the number of threads requested from the |
is inconsistent with the number of threads requested from the |
1168 |
operating system (for example by using an environment |
operating system (for example by using an environment |
1169 |
variable as described in section \ref{sec:multi-threaded-execution})
1170 |
then usually an error will be reported by the routine |
then usually an error will be reported by the routine |
1171 |
{\em CHECK\_THREADS}.\\ |
{\em CHECK\_THREADS}.\\ |
1172 |
|
|
1183 |
\end{minipage} |
\end{minipage} |
1184 |
} |
} |
1185 |
|
|
|
\begin{figure}
\begin{verbatim}
C--
C--  Parallel directives for MIPS Pro Fortran compiler
C--
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
\caption{Prior to transferring control to
the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
MP directives to spawn multiple threads.
} \label{fig:mp_directives}
\end{figure}
|
|
|
|
|
|
|
1186 |
\item {\bf memsync flags} |
\item {\bf memsync flags} |
1187 |
As discussed in section \ref{sec:memory_consistency}, when using shared memory, |
As discussed in section \ref{sec:memory_consistency}, when using shared memory, |
1188 |
a low-level system function may be needed to force memory consistency.
1200 |
\begin{verbatim} |
\begin{verbatim} |
1201 |
asm("membar #LoadStore|#StoreStore"); |
asm("membar #LoadStore|#StoreStore"); |
1202 |
\end{verbatim} |
\end{verbatim} |
1203 |
for an Alpha based system the equivalent code reads
1204 |
\begin{verbatim} |
\begin{verbatim} |
1205 |
asm("mb"); |
asm("mb"); |
1206 |
\end{verbatim} |
\end{verbatim} |
1212 |
\item {\bf Cache line size} |
\item {\bf Cache line size} |
1213 |
As discussed in section \ref{sec:cache_effects_and_false_sharing}, |
As discussed in section \ref{sec:cache_effects_and_false_sharing}, |
1214 |
multi-threaded codes explicitly avoid penalties associated with excessive
1215 |
coherence traffic on an SMP system. To do this the shared memory data structures
1216 |
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines |
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines |
1217 |
are padded. The variables that control the padding are set in the |
are padded. The variables that control the padding are set in the |
1218 |
header file {\em EEPARAMS.h}. These variables are called |
header file {\em EEPARAMS.h}. These variables are called |
1220 |
{\em lShare8}. The default values should not normally need changing. |
{\em lShare8}. The default values should not normally need changing. |
1221 |
\item {\bf \_BARRIER} |
\item {\bf \_BARRIER} |
1222 |
This is a CPP macro that is expanded to a call to a routine |
This is a CPP macro that is expanded to a call to a routine |
1223 |
which synchronizes all the logical processors running under the
1224 |
WRAPPER. Using a macro here preserves flexibility to insert |
WRAPPER. Using a macro here preserves flexibility to insert |
1225 |
a specialized call in-line into application code. By default this |
a specialized call in-line into application code. By default this |
1226 |
resolves to calling the procedure {\em BARRIER()}. The default |
resolves to calling the procedure {\em BARRIER()}. The default |
1228 |
|
|
1229 |
\item {\bf \_GSUM} |
\item {\bf \_GSUM} |
1230 |
This is a CPP macro that is expanded to a call to a routine |
This is a CPP macro that is expanded to a call to a routine |
1231 |
which sums up a floating point number
1232 |
over all the logical processors running under the |
over all the logical processors running under the |
1233 |
WRAPPER. Using a macro here provides extra flexibility to insert |
WRAPPER. Using a macro here provides extra flexibility to insert |
1234 |
a specialized call in-line into application code. By default this |
a specialized call in-line into application code. By default this |
1235 |
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} (for
64-bit floating point operands)
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1238 |
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}. |
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}. |
1239 |
The \_GSUM macro is a performance critical operation, especially for |
The \_GSUM macro is a performance critical operation, especially for |
1240 |
large processor count, small tile size configurations. |
large processor count, small tile size configurations. |
1253 |
\_EXCH operation plays a crucial role in scaling to small tile, |
\_EXCH operation plays a crucial role in scaling to small tile, |
1254 |
large logical and physical processor count configurations. |
large logical and physical processor count configurations. |
1255 |
The example in section \ref{sec:jam_example} discusses defining an |
The example in section \ref{sec:jam_example} discusses defining an |
1256 |
optimized and specialized form of the \_EXCH operation.
1257 |
|
|
1258 |
The \_EXCH operation is also central to supporting grids such as |
The \_EXCH operation is also central to supporting grids such as |
1259 |
the cube-sphere grid. In this class of grid a rotation may be required |
the cube-sphere grid. In this class of grid a rotation may be required |
1260 |
between tiles. Aligning the coordinate requiring rotation with the |
between tiles. Aligning the coordinate requiring rotation with the |
1261 |
tile decomposition allows the coordinate transformation to
1262 |
be embedded within a custom form of the \_EXCH primitive. |
be embedded within a custom form of the \_EXCH primitive. |
1263 |
|
|
1264 |
\item {\bf Reverse Mode} |
\item {\bf Reverse Mode} |
1265 |
The communication primitives \_EXCH and \_GSUM both employ |
The communication primitives \_EXCH and \_GSUM both employ |
1266 |
hand-written adjoint (or reverse mode) forms.
1267 |
These reverse mode forms can be found in the |
These reverse mode forms can be found in the |
1268 |
source code directory {\em pkg/autodiff}.
1269 |
For the global sum primitive the reverse mode form |
For the global sum primitive the reverse mode form |
1270 |
calls are to {\em GLOBAL\_ADSUM\_R4} and |
calls are to {\em GLOBAL\_ADSUM\_R4} and |
1271 |
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
1273 |
prefixed {\em ADEXCH}. The exchange routines make calls to |
prefixed {\em ADEXCH}. The exchange routines make calls to |
1274 |
the same low-level communication primitives as the forward mode |
the same low-level communication primitives as the forward mode |
1275 |
operations. However, the routine argument {\em simulationMode} |
operations. However, the routine argument {\em simulationMode} |
1281 |
maximum number of OS threads that a code will use. This |
maximum number of OS threads that a code will use. This |
1282 |
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}. |
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}. |
1283 |
For single threaded execution it can be reduced to one if required. |
For single threaded execution it can be reduced to one if required. |
1284 |
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.
1286 |
|
|
1287 |
For certain physical parametrization schemes it is necessary to have |
For certain physical parametrization schemes it is necessary to have |
being specified involves many more tiles than OS threads then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
will be used and to declare the physical parameterization
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
An example of this is given in the verification experiment
{\em aim.5l\_cs}. Here the default setting of
{\em MAX\_NO\_THREADS} is altered to
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent, not temporary, storage of state information.
In the case of permanent state information this approach cannot be used
because there has to be enough storage allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterizations.
\end{enumerate}
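
The per-thread work array technique described above can be sketched as
follows. This is a minimal illustration under stated assumptions, not code
taken from {\em aim.5l\_cs}: the routine name {\em ZERO\_SST1}, the values
chosen for {\em ngp} and {\em MAX\_NO\_THREADS}, and the {\em REAL*8} type of
{\em sst1} are all hypothetical. In the model {\em MAX\_NO\_THREADS} would come
from {\em EEPARAMS.h} rather than from a local {\em PARAMETER} statement.
\begin{verbatim}
      SUBROUTINE ZERO_SST1( myThid )
      IMPLICIT NONE
C     myThid :: thread number of the calling thread (1..MAX_NO_THREADS)
      INTEGER myThid
C     Hypothetical sizes, declared locally so the sketch stands alone.
      INTEGER MAX_NO_THREADS, ngp
      PARAMETER ( MAX_NO_THREADS = 2, ngp = 1024 )
C     Work array with a single extra per-thread dimension
C     (REAL*8 is an assumed type).
      REAL*8 sst1
      COMMON /FORCIN/ sst1(ngp,MAX_NO_THREADS)
      INTEGER j
C     Each thread touches only the slice selected by its own myThid,
C     so storage scales with the thread count, not the tile count.
      DO j = 1, ngp
        sst1(j,myThid) = 0.D0
      ENDDO
      RETURN
      END
\end{verbatim}
Because the second index is the thread number rather than a tile index,
the array holds only scratch data that a thread reuses as it moves from
one of its tiles to the next.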

\begin{figure}
\begin{verbatim}
C--
C--  Parallel directives for MIPS Pro Fortran compiler
C--
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
\caption{Prior to transferring control to
the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
MP directives to spawn multiple threads.
} \label{fig:mp_directives}
\end{figure}
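
For compilers that support OpenMP rather than the SGI directives of
figure \ref{fig:mp_directives}, the same spawning loop might be written
along the following lines. This is a sketch under assumptions, not the
WRAPPER's own code: the program name, the thread count of two, and the stub
version of {\em THE\_MODEL\_MAIN} are placeholders, and the choice of
{\em SCHEDULE(STATIC,1)} as the counterpart of the interleaved, chunk-size-one
SGI schedule is an interpretation.
\begin{verbatim}
      PROGRAM OMP_SPAWN_SKETCH
      IMPLICIT NONE
      INTEGER nThreads, myThid, I
C     Hypothetical thread count for this sketch.
      nThreads = 2
C     One loop iteration per thread; myThid and I are private.
C$OMP PARALLEL DO SCHEDULE(STATIC,1) NUM_THREADS(nThreads)
C$OMP&  SHARED(nThreads) PRIVATE(myThid,I)
      DO I=1,nThreads
        myThid = I
C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)
      ENDDO
C$OMP END PARALLEL DO
      END

      SUBROUTINE THE_MODEL_MAIN( myThid )
C     Stand-in stub; the real routine is the model driver.
      IMPLICIT NONE
      INTEGER myThid
      WRITE(*,*) 'model instance started, myThid = ', myThid
      RETURN
      END
\end{verbatim}
A production version would also need to confirm that the runtime really
delivers {\em nThreads} threads, since instances that synchronize with one
another cannot make progress if some iterations are run one after another on
the same thread.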

\subsubsection{Specializing the Communication Code}

The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sec:jam_example}
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines (see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
for overlap regions of width one) is substituted into the elliptic
solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
a series of template files, for example {\em exch\_rx.template}.
This is done to allow a large number of variations on the exchange
process to be maintained. One set of variations supports the
cube sphere grid. Support for a cube sphere grid in MITgcm is based
on having each face of the cube as a separate tile (or tiles).
The exchange routines are then able to absorb much of the
detailed rotation and reorientation required when moving around the
cube grid. The set of {\em \_EXCH} routines that contain the
word cube in their name perform these transformations.
They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for
grid-centered and for grid-face and corner quantities.
Three sets of exchange routines are defined. Routines
with names of the form {\em exch\_rx} are used to exchange
C |
C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
C |                :: Called from WRAPPER level numerical
C |                :: code invocation routine. On entry
C |                :: to THE_MODEL_MAIN separate thread and
C |                :: separate processes will have been established.
C |                :: Each thread and process will have a unique ID
C | |          :: By default kernel parameters are read from file
C | |          :: "data" in directory in which code executes.
C | |
C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
C | |
C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
C | | |        :: Grid arrays are held and described in GRID.h.
C | | |
C | | |-INI_VERTICAL_GRID        :: Initialize vertical grid arrays.
C | | |
C | | |-INI_CARTESIAN_GRID       :: Cartesian horiz. grid initialization
C | | |                          :: (calculate grid from kernel parameters).
C | | |
C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
C | | |                          :: initialization (calculate grid from
C | | |                          :: kernel parameters).
C | | |
C | | |-INI_CURVILINEAR_GRID     :: General orthogonal, structured horiz.
C | |                            :: grid initializations. ( input from raw
C | |                            :: grid files, LONC.bin, DXF.bin etc... )
C | |
C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
C | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
C | |
C | |-INI_CORI      :: Set coriolis term. zero, f-plane, beta-plane,
C | |               :: sphere options are coded.
C | |
C | |-PACKAGES_BOOT :: Start up the optional package environment.
C | |               :: Runtime selection of active packages.
C | |-PACKAGES_CHECK
C | | |
C | | |-KPP_CHECK    :: KPP Package. pkg/kpp
C | | |-OBCS_CHECK   :: Open bndy Package. pkg/obcs
C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
C | |
C | |-PACKAGES_INIT_FIXED
C |-CTRL_UNPACK     :: Control vector support package. see pkg/ctrl
C |
C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
C |                 :: Automatically generated by TAMC/TAF.
C |
C |-CTRL_PACK       :: Control vector support package. see pkg/ctrl
C |
C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
C | | |
C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
C | | |          :: sphere options are coded.
C | | |
C | | |-INI_CG2D :: 2d con. grad solver initialization.
C | | |-INI_CG3D :: 3d con. grad solver initialization.
C | | |-INI_DYNVARS :: Initialize to zero all DYNVARS.h arrays (dynamical
C | | |             :: fields).
C | | |
C | | |-INI_FIELDS  :: Control initializing model fields to non-zero
C | | | |-INI_VEL   :: Initialize 3D flow field.
C | | | |-INI_THETA :: Set model initial temperature field.
C | | | |-INI_SALT  :: Set model initial salinity field.
C/\ | | |-CALC_SURF_DR         :: Calculate the new surface level thickness.
C/\ | | |-EXF_GETFORCING       :: External forcing package. ( pkg/exf )
C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
C/\ | | | |                    :: Simple interpolation between end-points
C/\ | | | |                    :: for forcing datasets.
C/\ | | | |
C/\ | | | |-EXCH               :: Sync forcing in overlap regions.