operate. The description presented here is intended to be a detailed
exposition and contains significant background material, as well as
advanced details on working with the WRAPPER. The tutorial sections
of this manual (see sections \ref{sec:modelExamples} and
\ref{sec:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
both sequentially and in parallel. For many projects simply starting
from an example code and adapting it to suit a particular situation

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{s_software/figs/mitgcm_goals.eps}}
\end{center}
\caption{ The MITgcm architecture is designed to allow simulation of a
wide range of physical problems on a wide range of hardware. The
``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
within the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sec:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{s_software/figs/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
\end{figure}

\subsection{Target hardware}
\label{sec:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place
IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \cite{hoe-hill:99}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sec:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
scientific computing community.

\subsection{Machine model parallelism}
\label{sec:domain_decomposition}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}
\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{s_software/figs/domain_decomp.eps}
}
\end{center}
\caption{ The WRAPPER provides support for one and two dimensional
domain it owns. Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sec:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
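As a purely illustrative sketch of what an overlap update involves, the fragment below splits a one-dimensional periodic domain into tiles carrying a one-point halo, then copies neighbouring edge values into each halo. The function names and layout are hypothetical and do not correspond to actual WRAPPER routines.

```python
# Hypothetical sketch of a tile overlap ("halo") update, not WRAPPER code.
# A 1-D periodic global domain is split into tiles; each tile stores its
# interior cells plus one overlap cell on each side.

def split_into_tiles(global_data, tile_size, halo=1):
    """Split a 1-D list into tiles, each padded with `halo` overlap cells."""
    tiles = []
    for start in range(0, len(global_data), tile_size):
        interior = global_data[start:start + tile_size]
        tiles.append([0] * halo + interior + [0] * halo)
    return tiles

def exchange_overlaps(tiles, halo=1):
    """Copy each tile's edge interior cells into its neighbours' halos."""
    n = len(tiles)
    for i, tile in enumerate(tiles):
        left, right = tiles[(i - 1) % n], tiles[(i + 1) % n]
        tile[:halo] = left[-2 * halo:-halo]   # left halo <- left neighbour's edge
        tile[-halo:] = right[halo:2 * halo]   # right halo <- right neighbour's edge

tiles = split_into_tiles(list(range(8)), tile_size=4)  # two tiles of 4 cells
exchange_overlaps(tiles)
print(tiles[0])  # [7, 0, 1, 2, 3, 4]
print(tiles[1])  # [3, 4, 5, 6, 7, 0]
```

After the exchange, each tile's interior is unchanged and its halos hold the neighbouring tile's edge values, which is exactly the state the periodic WRAPPER update calls maintain.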

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{s_software/figs/tiled-world.eps}
}
\end{center}
\caption{ A global grid subdivided into tiles.
call a function in the API of the communication library to
communicate data from a tile that it owns to a tile that another CPU
owns. By default the WRAPPER binds to the MPI communication library
\cite{MPI-std-20} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
\end{figure}

\subsection{Shared memory communication}
\label{sec:shared_memory_communication}

Under shared memory communication, independent CPUs operate on the
same global address space at the application level. This means
appropriately.
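The essence of this style of communication can be sketched in Python (an illustration only, not WRAPPER code): threads sharing one address space can communicate by writing directly into data structures that logically belong to a neighbouring thread, with a synchronization point to mark when the writes are complete.

```python
# Hypothetical sketch of "put" style shared memory communication between
# threads that share one address space. Not WRAPPER code.
import threading

halo = {}                                 # shared structure visible to every thread
data = {0: [1, 2, 3], 1: [4, 5, 6]}
barrier = threading.Barrier(2)

def worker_put(thid):
    # "put" style: each thread WRITES its edge value directly into the
    # other thread's halo slot, then everyone synchronizes.
    other = 1 - thid
    halo[other] = data[thid][-1]
    barrier.wait()                        # ensure both puts complete before proceeding

threads = [threading.Thread(target=worker_put, args=(t,)) for t in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(halo.items()))  # [(0, 6), (1, 3)]
```

A "get" style variant would instead have each thread read the neighbour's edge value out of the neighbour's own data structure after the synchronization point.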

\subsubsection{Memory consistency}
\label{sec:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
particular platform.
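The hazard being shielded against is that, without an explicit synchronization point, a reader cannot assume a writer's updates have become visible in program order. The Python analogue below is illustrative only: an explicit event supplies the ordering guarantee that, at the Fortran level, a routine such as {\em MEMSYNC()} supplies on platforms that need it.

```python
# Illustrative analogue (not WRAPPER code): the reader must not touch the
# shared buffer until the writer has published it through an explicit
# synchronization point, here a threading.Event.
import threading

payload = []
ready = threading.Event()

def writer():
    payload.extend([10, 20, 30])   # fill the shared buffer...
    ready.set()                    # ...THEN publish: the synchronization point

def reader(result):
    ready.wait()                   # do not read until the writer has published
    result.append(sum(payload))

result = []
t_writer = threading.Thread(target=writer)
t_reader = threading.Thread(target=reader, args=(result,))
t_reader.start(); t_writer.start()
t_writer.join(); t_reader.join()
print(result[0])  # 60
```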

\subsubsection{Cache effects and false sharing}
\label{sec:cache_effects_and_false_sharing}

Shared-memory machines often have processor-local memory caches
that contain mirrored copies of main memory. Automatic cache-coherence
the standard mechanism for supporting shared memory that the WRAPPER
utilizes. Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sec:multi_threaded_execution}. However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP and IPC facilities in UNIX
systems provide this capability, as do vendor specific tools like LAPI
and IMC. Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sec:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory for
communication. For example, cluster systems consist of individual
communication library used is MPI \cite{MPI-std-20}. However, it is
relatively straightforward to implement bindings to optimized platform
specific communication libraries. For example the work described in
\cite{hoe-hill:99} replaced standard MPI communication with a highly
optimized library.
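In message-passing style, each process explicitly sends its edge data and receives its neighbour's; nothing is shared. The sketch below uses Python threads and queues merely to stand in for MPI send and receive calls; it is an illustration of the style, not the WRAPPER's MPI binding.

```python
# Sketch of message-style tile exchange. Python queues stand in here for
# MPI send/receive; this is an illustration, not the WRAPPER's MPI binding.
import threading, queue

inbox = {0: queue.Queue(), 1: queue.Queue()}   # one receive queue per "process"
data = {0: [1, 2, 3], 1: [4, 5, 6]}
halos = {}

def process(rank):
    other = 1 - rank
    inbox[other].put(data[rank][-1])   # analogue of MPI_Send: edge value to neighbour
    halos[rank] = inbox[rank].get()    # analogue of MPI_Recv: neighbour's edge value

threads = [threading.Thread(target=process, args=(r,)) for r in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(halos[0], halos[1])  # 6 3
```

The key contrast with the shared memory style above is that the receiving side obtains data only through an explicit receive operation, never by reading another process's memory directly.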

\subsection{Communication primitives}
\label{sec:communication_primitives}

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{s_software/figs/comm-primm.eps}
}
\end{center}
\caption{Three performance critical parallel primitives are provided
\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{s_software/figs/tiling_detail.eps}
}
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sec:specifying_a_decomposition} explains how the way in
which a domain is decomposed (or composed) is expressed. Section
\ref{sec:starting_the_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems. Section \ref{sec:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sec:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{s_software/figs/size_h.eps}
}
\end{center}
\caption{ The three level domain decomposition hierarchy employed by the
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sec:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sec:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}
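The arithmetic behind such a decomposition specification is simply the product of the three levels of the hierarchy: tile size ({\em sNx}, {\em sNy}), tiles per process ({\em nSx}, {\em nSy}), and process count ({\em nPx}, {\em nPy}). The sketch below checks this for a cube-sphere-like layout of six $32 \times 32$ tiles; the particular split of the six tiles between one process is an assumed example, not something the WRAPPER prescribes.

```python
# Global grid size as the product of the three decomposition levels.
# Values below mirror the cube sphere example: six 32x32 tiles, here
# assumed (for illustration) to live in a single process.
sNx, sNy = 32, 32   # interior cells per tile
nSx, nSy = 6, 1     # tiles per process in x and y
nPx, nPy = 1, 1     # processes in x and y

Nx = sNx * nSx * nPx            # global cells in x
Ny = sNy * nSy * nPy            # global cells in y
tiles_total = nSx * nSy * nPx * nPy

print(Nx, Ny, tiles_total)  # 192 32 6
```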


\subsection{Starting the code}
\label{sec:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sec:multi_threaded_execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse-grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sec:specifying_a_decomposition})
configuring and starting a code to run using multiple threads requires the following
steps.\\

} \\
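The calling pattern described above, one entry per thread with the thread number as the sole argument, can be sketched as follows. This is a hypothetical Python analogue (the real {\em myThid} is a Fortran thread index, and the round-robin tile assignment here is an assumed example, not the WRAPPER's actual scheduling).

```python
# Hypothetical analogue of per-thread tile assignment: each thread is
# handed its thread number (cf. myThid) and loops over the tiles it owns.
import threading

nThreads, nTiles = 2, 6
tile_results = [None] * nTiles

def the_model_main(myThid):
    # assumed static round-robin assignment of tiles to threads
    for tile in range(myThid, nTiles, nThreads):
        tile_results[tile] = tile * tile   # stand-in for real tile computation

threads = [threading.Thread(target=the_model_main, args=(t,))
           for t in range(nThreads)]
for t in threads: t.start()
for t in threads: t.join()
print(tile_results)  # [0, 1, 4, 9, 16, 25]
```

Because each thread touches a disjoint set of tiles, no locking is needed for the tile computations themselves; only the overlap updates require coordination.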

\subsubsection{Multi-process execution}
\label{sec:multi_process_execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for

Multi-process execution is far more common. In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sec:specifying_a_decomposition}) is given (in which at least
one of the parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation, appropriate compile time
and run time steps must be taken.
ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
CPP\_EEOPTIONS.h} file.) More detailed information about the use of
{\em genmake2} for specifying
local compiler flags is located in section \ref{sec:genmake}.\\


\fbox{


\subsection{Controlling communication}
\label{sec:controlling_communication}
The WRAPPER maintains internal information that is used for communication
operations and that can be customized for different platforms. This section
describes the information that is held and used.
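One piece of that internal information is, for every tile face, which communication method reaches the neighbouring tile. A table-like sketch of this bookkeeping (hypothetical Python, not the WRAPPER's actual data structures, which are Fortran arrays) might look like:

```python
# Sketch (not WRAPPER data structures) of per-edge communication methods:
# each tile face records how the neighbouring tile is reached.
COMM_MSG, COMM_PUT, COMM_GET = "msg", "put", "get"

# tile -> {face: method}; here tiles 0 and 1 share memory, tile 2 is remote
comm_method = {
    0: {"east": COMM_PUT, "west": COMM_MSG},
    1: {"east": COMM_MSG, "west": COMM_GET},
    2: {"east": COMM_MSG, "west": COMM_MSG},
}

def needs_mpi(tile, face):
    """True when this edge requires distributed memory communication."""
    return comm_method[tile][face] == COMM_MSG

print(needs_mpi(0, "east"), needs_mpi(2, "west"))  # False True
```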
a particular face. A value of {\em COMM\_MSG} is used to indicate
that some form of distributed memory communication is required to
communicate between these tile faces (see section
\ref{sec:distributed_memory_communication}). A value of {\em
COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
memory communication (see section
\ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value
indicates that a CPU should communicate by writing to data
structures owned by another CPU. A {\em COMM\_GET} value indicates
that a CPU should communicate by reading from data structures owned
the file {\em eedata}. If the value of {\em nThreads} is
inconsistent with the number of threads requested from the operating
system (for example by using an environment variable as described in
section \ref{sec:multi_threaded_execution}) then usually an error
will be reported by the routine {\em CHECK\_THREADS}.

\fbox{
}

\item {\bf memsync flags}
As discussed in section \ref{sec:memory_consistency}, a low-level
system function may be needed to force memory consistency on some
shared memory systems. The routine {\em MEMSYNC()} is used for this
purpose. This routine should not need modifying and the information
\end{verbatim}

\item {\bf Cache line size}
As discussed in section \ref{sec:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with
excessive coherence traffic on an SMP system. To do this the shared
memory data structures used by the {\em GLOBAL\_SUM}, {\em
CPP\_EEMACROS.h}. The \_GSUM macro is a performance critical
operation, especially for large processor count, small tile size
configurations. The custom communication example discussed in
section \ref{sec:jam_example} shows how the macro is used to invoke
a custom global sum routine for a specific set of hardware.

\item {\bf \_EXCH}
the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
operation plays a crucial role in scaling to small tile, large
logical and physical processor count configurations. The example in
section \ref{sec:jam_example} discusses defining an optimized and
specialized form of the \_EXCH operation.

The \_EXCH operation is also central to supporting grids such as the
if this mechanism is unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx} and
{\em nSy} (as described in section
\ref{sec:specifying_a_decomposition}). However, if the
configuration being specified involves many more tiles than OS
threads then it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sec:jam_example}
On some platforms a big performance boost can be obtained by binding
the communication routines {\em \_EXCH} and {\em \_GSUM} to
specialized native libraries (for example, the shmem library on CRAY
pattern.

\subsubsection{Cube sphere communication}
\label{sec:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from a
series of template files, for example {\em exch\_rx.template}. This
is done to allow a large number of variations on the exchange process

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sec:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sec:calling_sequence}

WRAPPER layer.
