136 |
class of machines (for example Parallel Vector Processor Systems). Instead the |
class of machines (for example Parallel Vector Processor Systems). Instead the |
137 |
WRAPPER provides applications with an |
WRAPPER provides applications with an |
138 |
abstract {\it machine model}. The machine model is very general, however, it can |
abstract {\it machine model}. The machine model is very general, however, it can |
139 |
easily be specialized to fit, in a computationally effificent manner, any |
easily be specialized to fit, in a computationally efficient manner, any |
140 |
computer architecture currently available to the scientific computing community. |
computer architecture currently available to the scientific computing community. |
141 |
|
|
142 |
\subsection{Machine model parallelism} |
\subsection{Machine model parallelism} |
143 |
|
|
144 |
Codes operating under the WRAPPER target an abstract machine that is assumed to |
Codes operating under the WRAPPER target an abstract machine that is assumed to |
145 |
consist of one or more logical processors that can compute concurrently. |
consist of one or more logical processors that can compute concurrently. |
146 |
Computational work is divided amongst the logical |
Computational work is divided among the logical |
147 |
processors by allocating ``ownership'' to |
processors by allocating ``ownership'' to |
148 |
each processor of a certain set (or sets) of calculations. Each set of |
each processor of a certain set (or sets) of calculations. Each set of |
149 |
calculations owned by a particular processor is associated with a specific |
calculations owned by a particular processor is associated with a specific |
402 |
\includegraphics{part4/comm-primm.eps} |
\includegraphics{part4/comm-primm.eps} |
403 |
} |
} |
404 |
\end{center} |
\end{center} |
405 |
\caption{Three performance critical parallel primititives are provided |
\caption{Three performance critical parallel primitives are provided |
406 |
by the WRAPPER. These primititives are always used to communicate data |
by the WRAPPER. These primitives are always used to communicate data |
407 |
between tiles. The figure shows four tiles. The curved arrows indicate |
between tiles. The figure shows four tiles. The curved arrows indicate |
408 |
exchange primitives which transfer data between the overlap regions at tile |
exchange primitives which transfer data between the overlap regions at tile |
409 |
edges and interior regions for nearest-neighbor tiles. |
edges and interior regions for nearest-neighbor tiles. |
1006 |
\begin{verbatim} |
\begin{verbatim} |
1007 |
mpirun -np 64 -machinefile mf ./mitgcmuv |
mpirun -np 64 -machinefile mf ./mitgcmuv |
1008 |
\end{verbatim} |
\end{verbatim} |
1009 |
In this example the text {\em -np 64} specifices the number of processes |
In this example the text {\em -np 64} specifies the number of processes |
1010 |
that will be created. The numeric value {\em 64} must be equal to the |
that will be created. The numeric value {\em 64} must be equal to the |
1011 |
product of the processor grid settings of {\em nPx} and {\em nPy} |
product of the processor grid settings of {\em nPx} and {\em nPy} |
1012 |
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file |
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file |
1212 |
\item {\bf Cache line size} |
\item {\bf Cache line size} |
1213 |
As discussed in section \ref{sec:cache_effects_and_false_sharing}, |
As discussed in section \ref{sec:cache_effects_and_false_sharing}, |
1214 |
milti-threaded codes explicitly avoid penalties associated with excessive |
milti-threaded codes explicitly avoid penalties associated with excessive |
1215 |
coherence traffic on an SMP system. To do this the sgared memory data structures |
coherence traffic on an SMP system. To do this the shared memory data structures |
1216 |
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines |
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines |
1217 |
are padded. The variables that control the padding are set in the |
are padded. The variables that control the padding are set in the |
1218 |
header file {\em EEPARAMS.h}. These variables are called |
header file {\em EEPARAMS.h}. These variables are called |
1220 |
{\em lShare8}. The default values should not normally need changing. |
{\em lShare8}. The default values should not normally need changing. |
1221 |
\item {\bf \_BARRIER} |
\item {\bf \_BARRIER} |
1222 |
This is a CPP macro that is expanded to a call to a routine |
This is a CPP macro that is expanded to a call to a routine |
1223 |
which synchronises all the logical processors running under the |
which synchronizes all the logical processors running under the |
1224 |
WRAPPER. Using a macro here preserves flexibility to insert |
WRAPPER. Using a macro here preserves flexibility to insert |
1225 |
a specialized call in-line into application code. By default this |
a specialized call in-line into application code. By default this |
1226 |
resolves to calling the procedure {\em BARRIER()}. The default |
resolves to calling the procedure {\em BARRIER()}. The default |
1228 |
|
|
1229 |
\item {\bf \_GSUM} |
\item {\bf \_GSUM} |
1230 |
This is a CPP macro that is expanded to a call to a routine |
This is a CPP macro that is expanded to a call to a routine |
1231 |
which sums up a floating point numner |
which sums up a floating point number |
1232 |
over all the logical processors running under the |
over all the logical processors running under the |
1233 |
WRAPPER. Using a macro here provides extra flexibility to insert |
WRAPPER. Using a macro here provides extra flexibility to insert |
1234 |
a specialized call in-line into application code. By default this |
a specialized call in-line into application code. By default this |
1235 |
resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for |
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for |
1236 |
84=bit floating point operands) |
64-bit floating point operands) |
1237 |
or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default |
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default |
1238 |
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}. |
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}. |
1239 |
The \_GSUM macro is a performance critical operation, especially for |
The \_GSUM macro is a performance critical operation, especially for |
1240 |
large processor count, small tile size configurations. |
large processor count, small tile size configurations. |
1253 |
\_EXCH operation plays a crucial role in scaling to small tile, |
\_EXCH operation plays a crucial role in scaling to small tile, |
1254 |
large logical and physical processor count configurations. |
large logical and physical processor count configurations. |
1255 |
The example in section \ref{sec:jam_example} discusses defining an |
The example in section \ref{sec:jam_example} discusses defining an |
1256 |
optimised and specialized form on the \_EXCH operation. |
optimized and specialized form on the \_EXCH operation. |
1257 |
|
|
1258 |
The \_EXCH operation is also central to supporting grids such as |
The \_EXCH operation is also central to supporting grids such as |
1259 |
the cube-sphere grid. In this class of grid a rotation may be required |
the cube-sphere grid. In this class of grid a rotation may be required |
1260 |
between tiles. Aligning the coordinate requiring rotation with the |
between tiles. Aligning the coordinate requiring rotation with the |
1261 |
tile decomposistion, allows the coordinate transformation to |
tile decomposition, allows the coordinate transformation to |
1262 |
be embedded within a custom form of the \_EXCH primitive. |
be embedded within a custom form of the \_EXCH primitive. |
1263 |
|
|
1264 |
\item {\bf Reverse Mode} |
\item {\bf Reverse Mode} |
1265 |
The communication primitives \_EXCH and \_GSUM both employ |
The communication primitives \_EXCH and \_GSUM both employ |
1266 |
hand-written adjoint forms (or reverse mode) forms. |
hand-written adjoint forms (or reverse mode) forms. |
1267 |
These reverse mode forms can be found in the |
These reverse mode forms can be found in the |
1268 |
sourc code directory {\em pkg/autodiff}. |
source code directory {\em pkg/autodiff}. |
1269 |
For the global sum primitive the reverse mode form |
For the global sum primitive the reverse mode form |
1270 |
calls are to {\em GLOBAL\_ADSUM\_R4} and |
calls are to {\em GLOBAL\_ADSUM\_R4} and |
1271 |
{\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the |
{\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the |
1272 |
exchamge primitives are found in routines |
exchange primitives are found in routines |
1273 |
prefixed {\em ADEXCH}. The exchange routines make calls to |
prefixed {\em ADEXCH}. The exchange routines make calls to |
1274 |
the same low-level communication primitives as the forward mode |
the same low-level communication primitives as the forward mode |
1275 |
operations. However, the routine argument {\em simulationMode} |
operations. However, the routine argument {\em simulationMode} |
1281 |
maximum number of OS threads that a code will use. This |
maximum number of OS threads that a code will use. This |
1282 |
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}. |
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}. |
1283 |
For single threaded execution it can be reduced to one if required. |
For single threaded execution it can be reduced to one if required. |
1284 |
The va;lue is largely private to the WRAPPER and application code |
The value; is largely private to the WRAPPER and application code |
1285 |
will nor normally reference the value, except in the following scenario. |
will nor normally reference the value, except in the following scenario. |
1286 |
|
|
1287 |
For certain physical parametrization schemes it is necessary to have |
For certain physical parametrization schemes it is necessary to have |
1296 |
being specified involves many more tiles than OS threads then |
being specified involves many more tiles than OS threads then |
1297 |
it can save memory resources to reduce the variable |
it can save memory resources to reduce the variable |
1298 |
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that |
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that |
1299 |
will be used and to declare the physical parameterisation |
will be used and to declare the physical parameterization |
1300 |
work arrays with a sinble {\em MAX\_NO\_THREADS} extra dimension. |
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension. |
1301 |
An example of this is given in the verification experiment |
An example of this is given in the verification experiment |
1302 |
{\em aim.5l\_cs}. Here the default setting of |
{\em aim.5l\_cs}. Here the default setting of |
1303 |
{\em MAX\_NO\_THREADS} is altered to |
{\em MAX\_NO\_THREADS} is altered to |
1310 |
\begin{verbatim} |
\begin{verbatim} |
1311 |
common /FORCIN/ sst1(ngp,MAX_NO_THREADS) |
common /FORCIN/ sst1(ngp,MAX_NO_THREADS) |
1312 |
\end{verbatim} |
\end{verbatim} |
1313 |
This declaration scheme is not used widely, becuase most global data |
This declaration scheme is not used widely, because most global data |
1314 |
is used for permanent not temporary storage of state information. |
is used for permanent not temporary storage of state information. |
1315 |
In the case of permanent state information this approach cannot be used |
In the case of permanent state information this approach cannot be used |
1316 |
because there has to be enough storage allocated for all tiles. |
because there has to be enough storage allocated for all tiles. |
1317 |
However, the technique can sometimes be a useful scheme for reducing memory |
However, the technique can sometimes be a useful scheme for reducing memory |
1318 |
requirements in complex physical paramterisations. |
requirements in complex physical parameterizations. |
1319 |
\end{enumerate} |
\end{enumerate} |
1320 |
|
|
1321 |
\begin{figure} |
\begin{figure} |
1348 |
The isolation of performance critical communication primitives and the |
The isolation of performance critical communication primitives and the |
1349 |
sub-division of the simulation domain into tiles is a powerful tool. |
sub-division of the simulation domain into tiles is a powerful tool. |
1350 |
Here we show how it can be used to improve application performance and |
Here we show how it can be used to improve application performance and |
1351 |
how it can be used to adapt to new gridding approaches. |
how it can be used to adapt to new griding approaches. |
1352 |
|
|
1353 |
\subsubsection{JAM example} |
\subsubsection{JAM example} |
1354 |
\label{sec:jam_example} |
\label{sec:jam_example} |