% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects, simply starting from an example code and adapting it to suit a
particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating system
features that are employed.

\section{Overall architectural goals}

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in detail in
section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to insulate code
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also routinely used on
large scale MPP systems (for example T3E systems and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMP's, PVP's) and distributed memory
parallel systems (for example x86 clusters and large MPP systems). This is one
class of machines (for example Parallel Vector Processor Systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.

\subsection{Machine model parallelism}

Codes operating under the WRAPPER target an abstract machine that is assumed to
consist of one or more logical processors that can compute concurrently.
Computational work is divided among the logical
processors by allocating ``ownership'' to
each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific
whenever it requires values that lie outside the domain it owns. Periodically
processors will make calls to WRAPPER functions to communicate data between
tiles, in order to keep the overlap regions up to date (see section
\ref{sect:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
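The overlap-update cycle just described can be sketched in miniature. The
following Python stand-in is illustrative only: the halo width, tile sizes,
and {\em exchange} routine here are hypothetical, and merely play the roles
that the overlap regions and the WRAPPER communication functions play in the
text.

```python
# Illustrative sketch of tiles with overlap (halo) regions, as described
# in the text. Names and sizes are hypothetical, not the WRAPPER API.

OLx = 1          # overlap width: one halo point on each side of a tile
sNx = 4          # interior points per tile
nTiles = 3

def make_tiles(global_field):
    """Split a 1-D field into tiles, each padded with OLx halo points."""
    return [[0.0] * OLx + global_field[t * sNx:(t + 1) * sNx] + [0.0] * OLx
            for t in range(nTiles)]

def exchange(tiles):
    """Copy interior edge values into neighbors' overlap regions
    (periodic connectivity: tile 0 and tile nTiles-1 are neighbors)."""
    for t in range(nTiles):
        left = tiles[(t - 1) % nTiles]
        right = tiles[(t + 1) % nTiles]
        tiles[t][0] = left[-2]    # last interior point of the left neighbor
        tiles[t][-1] = right[1]   # first interior point of the right neighbor

field = [float(i) for i in range(nTiles * sNx)]
tiles = make_tiles(field)
exchange(tiles)
# Each tile can now apply a stencil over its interior without further
# communication until the halo values go stale again.
```

After the exchange, tile 0 holds [11.0, 0.0, 1.0, 2.0, 3.0, 4.0]: its two
halo points carry the edge values of its neighbors.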

\begin{figure}
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication, independent CPU's operate
on the same global address space at the application level.
communication very efficient provided it is used appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between
multiple processors the WRAPPER level shields user applications from
ensure memory consistency for a particular platform.

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have memory caches local to each processor
which contain mirrored copies of main memory. Automatic cache-coherence
threads operating within a single process is the standard mechanism for
supporting shared memory that the WRAPPER utilizes. Configuring and launching
code to run in multi-threaded mode on specific platforms is discussed in
section \ref{sect:running_with_threads}. However, on many systems, potentially
very efficient mechanisms for using shared memory communication between
multiple processes (in contrast to multiple threads within a single
process) also exist. In most cases this works by making a limited region of
nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example cluster systems consist of individual computers
highly optimized library.

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
\includegraphics{part4/comm-primm.eps}
}
\end{center}
\caption{Three performance critical parallel primitives are provided
by the WRAPPER. These primitives are always used to communicate data
between tiles. The figure shows four tiles. The curved arrows indicate
exchange primitives which transfer data between the overlap regions at tile
edges and interior regions for nearest-neighbor tiles.
computing CPU's.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in which
a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_a_code} describes practical details of running codes
in various different parallel modes on contemporary computer systems.
Section \ref{sect:controlling_communication} explains the internal information
that the WRAPPER uses to control how information is communicated between
tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
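The decomposition parameters obey a simple covering rule, sketched below.
The per-process tile counts are written {\em nSx} and {\em nSy} here,
following the {\em SIZE.h} naming convention; the numerical values are
illustrative only.

```python
# Sketch of the covering rule relating the decomposition parameters: tiles
# of sNx x sNy points, nSx x nSy tiles per process, nPx x nPy processes.
# Values below are illustrative; the real parameters live in SIZE.h.

def check_decomposition(Nx, Ny, sNx, sNy, nSx, nSy, nPx, nPy):
    """Verify the tiling exactly covers a global Nx x Ny grid and
    return the total number of tiles."""
    assert sNx * nSx * nPx == Nx, "tiles do not cover the x extent"
    assert sNy * nSy * nPy == Ny, "tiles do not cover the y extent"
    return nSx * nSy * nPx * nPy

# Six 32 x 32 tiles within a single process, one tile per cube-sphere face:
nTiles = check_decomposition(Nx=192, Ny=32,
                             sNx=32, sNy=32,
                             nSx=6, nSy=1,
                             nPx=1, nPy=1)
# nTiles is 6
```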
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs from
1,{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.
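The standard per-tile work loop described above can be sketched as follows.
This is a Python stand-in for the Fortran loop structure; the tile contents
and the per-tile kernel are hypothetical.

```python
# Sketch of the per-tile work loop: work on each tile owned by a process
# is wrapped in a double loop over the tile indices bi, bj.

nSx, nSy = 2, 2          # tiles per process in x and y
sNx, sNy = 4, 4          # interior points per tile

def tile_mean(tile):
    # stand-in for a real per-tile numerical kernel
    flat = [v for row in tile for v in row]
    return sum(flat) / len(flat)

# one small array of constant values per tile, keyed by (bi, bj)
tiles = {(bi, bj): [[float(bi + bj)] * sNx for _ in range(sNy)]
         for bj in range(1, nSy + 1)
         for bi in range(1, nSx + 1)}

results = {}
for bj in range(1, nSy + 1):      # DO bj=1,nSy
    for bi in range(1, nSx + 1):  # DO bi=1,nSx
        results[(bi, bj)] = tile_mean(tiles[(bi, bj)])
# Each (bi, bj) iteration touches only tile (bi, bj), so the iterations
# are independent and can be handed to different threads.
```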

The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two domains in y may be computed concurrently.
\item
\begin{verbatim}
     PARAMETER (
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}


\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

MAIN


\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the
following steps.\\

File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP: {\em TARGET\_SUN}\\
CPP: {\em TARGET\_DEC}\\
} \\
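The thread startup pattern described in this subsection, in which
{\em THE\_MODEL\_MAIN()} is called once per thread with the thread number
{\em myThid} as its only argument, can be sketched with a Python stand-in.
The entry point below is a trivial stub, not the real model routine.

```python
# Sketch of WRAPPER-style thread startup: the driver spawns one
# coarse-grain thread per worker and calls the model entry point once
# per thread, passing only the 1-based thread number (myThid).

import threading

nThreads = 4
results = [None] * nThreads

def the_model_main(myThid):
    # A real implementation would select the tiles owned by this thread
    # from myThid and compute over them; here we just record the call.
    results[myThid - 1] = "thread %d ran" % myThid

threads = [threading.Thread(target=the_model_main, args=(myThid,))
           for myThid in range(1, nThreads + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every entry of `results` is now filled in, one per thread.
```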

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this

Multi-process execution is more widespread.
In order to run code in a multi-process configuration a decomposition
specification (see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,

Additionally, compile time options are required to link in the
MPI libraries and header files. Examples of these options
can be found in the {\em genmake2} script that creates makefiles
for compilation. When this script is executed with the {\bf -mpi}
flag it will generate a makefile that includes
paths for searching for MPI header files and for linking in

\fbox{
\begin{minipage}{4.75in}
File: {\em tools/genmake2}
\end{minipage}
} \\
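For a multi-process run the number of processes launched must match the
decomposition: it is the product of {\em nPx} and {\em nPy} from
{\em SIZE.h}. A minimal sketch of that rule follows; the particular
factorization used here ($8 \times 8 = 64$) is illustrative only.

```python
# Sketch of the process-count rule for multi-process runs: the number of
# processes handed to the launcher must equal nPx * nPy from SIZE.h.

def mpirun_np(nPx, nPy):
    """The -np value that a launcher such as mpirun must be given."""
    return nPx * nPy

np_value = mpirun_np(nPx=8, nPy=8)   # one possible 64-process layout
command = "mpirun -np %d -machinefile mf ./mitgcmuv" % np_value
```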
\paragraph{\bf Execution} The mechanics of starting a program in
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of processes
that will be created. The numeric value {\em 64} must be equal to the
product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
memory communication is required to communicate between
these tile faces (see section \ref{sect:distributed_memory_communication}).
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
forms of shared memory communication (see section
\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
that a CPU should communicate by writing to data structures owned by another
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
from data structures owned by another CPU. These flags affect the behavior
are read from the file {\em eedata}. If the value of {\em nThreads}
is inconsistent with the number of threads requested from the
operating system (for example by using an environment
variable as described in section \ref{sect:multi-threaded-execution})
then usually an error will be reported by the routine
{\em CHECK\_THREADS}.\\

}

\item {\bf memsync flags}
As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
The routine {\em MEMSYNC()} is used for this purpose. This routine should
not need modifying and the information below is only provided for
\end{verbatim}

\item {\bf Cache line size}
As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
coherence traffic on an SMP system. To do this the shared memory data structures
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
are padded. The variables that control the padding are set in the
header file {\em EEPARAMS.h}. These variables are called
{\em lShare8}. The default values should not normally need changing.
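The padding idea can be sketched numerically: per-thread slots in a shared
array are spaced a cache line apart so that no two threads ever write to
the same line. The value used below ({\em lShare8} = 8, i.e. a 64-byte
cache line holding eight 8-byte reals) is an assumption for illustration;
the actual defaults live in {\em EEPARAMS.h}.

```python
# Sketch of cache-line padding to avoid false sharing: each thread's
# accumulator is lShare8 elements from its neighbor's, so (assuming a
# 64-byte line of eight 8-byte reals) each slot has its own cache line.

lShare8 = 8                      # 8-byte words per cache line (assumed)
nThreads = 4

# one padded slot per thread; thread t only ever touches index
# (t - 1) * lShare8, so no two threads share a cache line
partial = [0.0] * (nThreads * lShare8)

def accumulate(myThid, values):
    for v in values:
        partial[(myThid - 1) * lShare8] += v

for myThid in range(1, nThreads + 1):
    accumulate(myThid, [float(myThid)] * 10)

total = sum(partial[(t - 1) * lShare8] for t in range(1, nThreads + 1))
# total is 10*(1+2+3+4) = 100.0
```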
\item {\bf \_BARRIER}
This is a CPP macro that is expanded to a call to a routine
which synchronizes all the logical processors running under the
WRAPPER. Using a macro here preserves flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em BARRIER()}. The default

\item {\bf \_GSUM}
This is a CPP macro that is expanded to a call to a routine
which sums up a floating point number
over all the logical processors running under the
WRAPPER. Using a macro here provides extra flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} (for
64-bit floating point operands)
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
The \_GSUM macro is a performance critical operation, especially for
large processor count, small tile size configurations.
The custom communication example discussed in section \ref{sect:jam_example}
shows how the macro is used to invoke a custom global sum routine
for a specific set of hardware.
1262 |
|
|
1270 |
in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
\_EXCH operation plays a crucial role in scaling to small tile,
large logical and physical processor count configurations.
The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.

The \_EXCH operation is also central to supporting grids such as
the cube-sphere grid. In this class of grid a rotation may be required
between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
be embedded within a custom form of the \_EXCH primitive.

\item {\bf Reverse Mode}
The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
These reverse mode forms can be found in the
source code directory {\em pkg/autodiff}.
For the global sum primitive the reverse mode form
calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
prefixed {\em ADEXCH}. The exchange routines make calls to
the same low-level communication primitives as the forward mode
operations. However, the routine argument {\em simulationMode}
\item {\bf MAX\_NO\_THREADS}
maximum number of OS threads that a code will use. This
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.

For certain physical parametrization schemes it is necessary to have
if this might be unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} (as described in section
\ref{sect:specifying_a_decomposition}). However, if the configuration
being specified involves many more tiles than OS threads then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
will be used and to declare the physical parameterization
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
An example of this is given in the verification experiment
{\em aim.5l\_cs}. Here the default setting of
{\em MAX\_NO\_THREADS} is altered to
\begin{verbatim}
common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent, not temporary, storage of state information.
In the case of permanent state information this approach cannot be used
because there has to be enough storage allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterizations.
\end{enumerate}
The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms a big performance boost can be obtained by
binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example, the
pattern.
\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from
a series of template files, for example {\em exch\_rx.template}.
This is done to allow a large number of variations on the exchange

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sect:calling_sequence}
|
|
WRAPPER layer.

{\footnotesize
\begin{verbatim}

MAIN
|--THE_MODEL_MAIN :: Numerical code top-level driver routine

\end{verbatim}
}

Core equations plus packages.

{\footnotesize
\begin{verbatim}
C
C
C :: events.
C
\end{verbatim}
}

\subsection{Measuring and Characterizing Performance}
