% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections \ref{sect:tutorials}
and \ref{sect:tutorialIII}) contain more succinct, step-by-step instructions
on running basic numerical experiments, of various types, both sequentially
and in parallel. For many projects simply starting from an example code and
adapting it to suit a particular situation will be all that is required.
The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating
system features that are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold

\begin{itemize}
\item We wish to be able to study a very broad range of interesting
  and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to a wide range
  of platforms.
\item On any given platform we would like to be able to achieve
  performance comparable to an implementation developed and
  specialized specifically for that platform.
\end{itemize}

These goals lead to
a software architecture which at the highest level can be viewed as consisting
of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in
  detail in section \ref{chap:discretization}.

\item A scheme for supporting optional ``pluggable'' {\bf packages}
  (containing for example mixed-layer schemes, biogeochemical schemes,
  atmospheric physics).  These packages are used both to overlay
  alternate dynamics and to introduce specialized physical content
  onto the core numerical code. An overview of the {\bf package}
  scheme is given at the start of part \ref{chap:packagesI}.

\item A support framework called {\bf WRAPPER} (Wrappable Application
  Parallel Programming Environment Resource), within which the core
  numerics and pluggable packages operate.
\end{enumerate}
\begin{figure}
%% figure summarizing the MITgcm architectural goals (graphic and
%% caption omitted)
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in
MITgcm is a software superstructure and substructure collectively
called the WRAPPER (Wrappable Application Parallel Programming
Environment Resource). All numerical and support code in MITgcm is written
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate
code that fits within it from architectural differences between hardware
platforms and operating systems. This allows numerical code to be easily
retargeted.

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad as possible a range of
computer systems.  The original development of the WRAPPER took place
on a multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system's proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multi-processor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs.  Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems.  The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and
IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \ref{ref hoe and hill, ecmwf}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
large MPP systems). This is one example of a difference between
compute platforms that can impact an application. Another common
distinction is between vector processing systems with highly
specialized CPUs and memory subsystems and commodity microprocessor
based systems. There are numerous other differences, especially in
relation to how parallel execution is supported. To capture the
essential differences between different platforms the WRAPPER uses a
{\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
can easily be specialized to fit, in a computationally efficient
manner, any computer architecture currently available to the
scientific computing community.

\subsection{Machine model parallelism}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently.  Computational work is divided among the logical
processors by allocating ``ownership'' to each processor of a certain
set (or sets) of calculations. Each set of calculations owned by a
particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors.  It is perfectly
possible to execute a configuration decomposed for multiple logical
processors on a single physical processor.  This helps ensure that
numerical code that is written to fit within the WRAPPER will
parallelize with no additional effort.  It is also useful for
debugging purposes.  Generally, however, the computational domain will
be subdivided over multiple logical processors in order to then bind
those logical processors to physical processor resources that can
compute in parallel.

\subsubsection{Tiles}

Computationally, the data structures (\textit{eg.} arrays, scalar
variables, etc.) that hold the simulated state are associated with
each region of physical space and are allocated to a particular
logical processor.  We refer to these data structures as being {\bf
  owned} by the processor to which their associated region of physical
space has been allocated.  Individual regions that are allocated to
processors are called {\bf tiles}.  A processor can own more than one
tile.  Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
single region of the domain (a single tile).  Except for periods of
communication and coordination, each processor computes autonomously,
working only with data from the tile (or tiles) that the processor
owns.  When multiple tiles are allotted to a single processor, each
tile is computed independently of the other tiles, in a sequential
fashion.
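
In source code terms, tile ownership shows up as two trailing tile
indices on the arrays that hold model state.  The declaration below is
a minimal sketch of this convention (the field name {\em theta} and the
vertical extent {\em Nr} are illustrative; the definitive declarations
are in the model and eesupp header files):

{\footnotesize
\begin{verbatim}
C     A field dimensioned following the WRAPPER tile convention:
C     interior points 1..sNx, 1..sNy plus an overlap (halo) region
C     of width OLx, OLy, replicated over the per-process tiles
C     bi=1..nSx, bj=1..nSy.
      _RL theta(1-OLx:sNx+OLx, 1-OLy:sNy+OLy, Nr, nSx, nSy)
\end{verbatim}
}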

\begin{figure}
\begin{center}
  \includegraphics{part4/domain_decomp.eps}
\end{center}
\caption{The WRAPPER provides support for one and two dimensional
  decompositions of grid-point domains. The figure shows a
  hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
  hypothetical domain is decomposed in two-dimensions along the
  $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
    owned} by different processors. The {\bf owning} processors
  perform the arithmetic operations associated with a {\bf tile}.
  Although not illustrated here, a single processor can {\bf own}
  several {\bf tiles}.  Whenever a processor wishes to transfer data
  between tiles or communicate with other processors it calls a
  WRAPPER supplied function.  } \label{fig:domaindecomp}
\end{figure}

\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region.  The
overlap region of a tile corresponds to the interior region of an
adjacent tile.  In figure \ref{fig:tiledworld} each tile would own the
region within the black square and hold duplicate information for
overlap regions extending into the tiles to the north, south, east and
west.  During computational phases a processor will reference data in
an overlap region whenever it requires values that lie outside the
domain it owns.  Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sect:communication_primitives}).  The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
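
The overlap region is what lets purely local computation reference
neighboring tiles' values.  The fragment below is a minimal sketch of
this pattern (the names {\em phi}, {\em phiNew} and {\em dt} are
illustrative only):

{\footnotesize
\begin{verbatim}
C     Five-point update of a tile interior.  References to i-1,
C     i+1, j-1 and j+1 may fall in the overlap region, which holds
C     copies of the adjacent tiles' interior values.
      DO j = 1, sNy
       DO i = 1, sNx
        phiNew(i,j) = phi(i,j)
     &   + dt*( phi(i-1,j) + phi(i+1,j)
     &        + phi(i,j-1) + phi(i,j+1)
     &        - 4.*phi(i,j) )
       ENDDO
      ENDDO
\end{verbatim}
}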

\begin{figure}
%% graphic and caption omitted
\label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two
possible mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}.  Under this mode of
  communication data transfers are assumed to be possible using direct
  addressing of regions of memory.  In this case a CPU is able to read
  (and write) directly to regions of memory ``owned'' by another CPU
  using simple programming language level assignment operations of the
  sort shown in figure \ref{fig:simple_assign}.  In this way one CPU
  (CPU1 in the figure) can communicate information to another CPU
  (CPU2 in the figure) by assigning a particular value to a particular
  memory location.

\item {\bf Distributed memory communication}.  Under this mode of
  communication there is no mechanism, at the application code level,
  for directly addressing regions of memory owned and visible to
  another CPU.  Instead a communication library must be used as
  illustrated in figure \ref{fig:comm_msg}.  In this case CPUs must
  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another CPU
  owns.  By default the WRAPPER binds to the MPI communication library
  \ref{MPI} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication.  The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

             CPU1                 |        CPU2
             ====                 |        ====
                                  |
        a(3) = 8                  |        WHILE ( a(3) .NE. 8 )
                                  |         WAIT
                                  |        END WHILE
                                  |
\end{verbatim}
\caption{In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level.
So that, for example, if one CPU (CPU1 in the figure above) writes the value
$8$ to element $3$ of array $a$, then other CPUs (for example CPU2 in the
figure above) will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

             CPU1                 |        CPU2
             ====                 |        ====
                                  |
        a(3) = 8                  |
        CALL SEND( CPU2, a(3) )   |        CALL RECV( CPU1, a(3) )
                                  |
\end{verbatim}
\caption{In the WRAPPER distributed memory communication model
data cannot be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and/or CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.
} \label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication independent CPUs operate on the
exact same global address space at the application level.  This means
that CPU 1 can directly write into global data structures that CPU 2
``owns'' using a simple assignment at the application level.  This
model of memory access is supported at the basic system design level
in ``shared-memory'' systems such as PVP systems, SMP systems, and on
distributed shared memory systems (\textit{eg.} SGI Origin, SGI
Altix, and some AMD Opteron systems).  On such systems the WRAPPER
will generally use simple read and write statements to directly access
application data structures when communicating between CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication.  In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules.  The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system.  The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect.  This can make this
method of communication very efficient provided it is used
appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors.  In particular, one issue the WRAPPER layer must
deal with is a system's memory model.  In general the order of reads
and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor
performing the application.  The processor performing the application
instructions will always operate so that, for the application
instructions the processor is executing, any reordering is not
apparent.  However, in general machines are often designed so that
reordering of instructions is not hidden from other processors.  This
means that, in general, even on a shared memory system two processors
can observe inconsistent memory values.

The issue of memory consistency between multiple processors is
discussed at length in many computer science papers.  From a practical
point of view, in order to deal with this issue, shared memory
machines all provide some mechanism to enforce memory consistency when
it is needed.  The exact mechanism employed will vary between systems.
For communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.
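
To make the hazard concrete, consider the exchange from figure
\ref{fig:simple_assign} extended with a ``ready'' flag.  The sketch
below marks where an ordering mechanism must be invoked; {\em MEMSYNC}
is a stand-in name for whatever platform-specific mechanism the
WRAPPER hooks into at this point, and the variable names are
illustrative:

{\footnotesize
\begin{verbatim}
C     Producer side of a shared memory handshake.  Without the
C     fence between the two writes, another processor may observe
C     ready=1 before it observes a(3)=8.
      a(3)  = 8
      CALL MEMSYNC
      ready = 1
\end{verbatim}
}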

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have processor-local memory caches which
contain mirrored copies of main memory.  Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors.  These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks).  The coherency protocols employed can be expensive relative to other
memory accesses and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.
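
The padding idiom itself is simple: per-thread synchronization words
are spaced out so that no two threads' words share a coherence
granule.  The declaration below sketches this for a 128 byte granule
(sixteen 8-byte words); the common block and variable names are
illustrative, and {\em MAX\_NO\_THREADS} stands for the WRAPPER's
thread-count limit parameter:

{\footnotesize
\begin{verbatim}
C     Per-thread counters padded to one per cache line.  Thread
C     myThid only ever touches nDone(1,myThid); elements 2..16
C     are padding that keeps neighboring threads' counters out
C     of the same coherence granule.
      INTEGER nDone(16, MAX_NO_THREADS)
      COMMON /SYNC_PAD/ nDone
\end{verbatim}
}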

\subsubsection{Operating system support for shared memory}

Applications running under multiple threads within a single process
can use shared memory communication.  In this case {\it all} the
memory locations in an application are potentially visible to all the
compute threads.  Multiple threads operating within a single process is
the standard mechanism for supporting shared memory that the WRAPPER
utilizes.  Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sect:multi-threaded-execution}.  However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist.  In most cases this works
by making a limited region of memory shared between processes.  The
MMAP \ref{magicgarden} and IPC \ref{magicgarden} facilities in UNIX
systems provide this capability as do vendor specific tools like LAPI
\ref{IBMLAPI} and IMC \ref{Memorychannel}.  Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication.  However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication.  For example cluster systems consist of individual
computers connected by a fast network.  On such systems there is no
notion of shared memory at the system level.  For this sort of system
the WRAPPER provides support for communication based on a bespoke
communication library (see figure \ref{fig:comm_msg}).  The default
communication library used is MPI \ref{MPI}.  However, it is relatively
straightforward to implement bindings to optimized platform specific
communication libraries.  For example the work described in
\ref{hoe-hill:99} substituted standard MPI communication for a
highly optimized library.

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
  \includegraphics{part4/comm-primm.eps}
\end{center}
\caption{Three performance critical parallel primitives are provided
  by the WRAPPER. These primitives are always used to communicate data
  between tiles. The figure shows four tiles. The curved arrows
  indicate exchange primitives which transfer data between the overlap
  regions at tile edges and interior regions for nearest-neighbor
  tiles.  The straight arrows symbolize global sum operations which
  connect all tiles.  The global sum operation provides both a key
  arithmetic primitive and can serve as a synchronization primitive. A
  third barrier primitive is also provided; it behaves much like the
  global sum primitive.  } \label{fig:communication_primitives}
\end{figure}


Optimized communication support is assumed to be potentially available
for a small number of communication operations.  It is also assumed
that communication performance optimizations can be achieved by
optimizing a small number of communication primitives.  Three
optimizable primitives are provided by the WRAPPER; a schematic
calling sequence is sketched after the list.

\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between
  interior and overlap regions of neighboring tiles. A number of
  different forms of this operation are supported. These different
  forms handle
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Exchange
    primitives select between using shared memory or distributed
    memory communication.
  \item Transformation operations required when transporting data
    between different grid regions. Transferring data between faces of
    a cube-sphere grid, for example, involves a rotation of vector
    components.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.
  \end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
  operation for the pressure inversion phase of the MITgcm algorithm.
  For certain configurations scaling can be highly sensitive to the
  performance of the global sum primitive. This operation is a
  collective operation involving all tiles of the simulated domain.
  Different forms of the global sum primitive exist for handling
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Global sum
    primitives select between using shared memory or distributed
    memory communication.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the global sum
    primitives.
  \end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization
  function called barrier. This is used to synchronize computations
  over all tiles.  The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
  have much in common and in some cases use the same underlying code.
\end{itemize}
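
The sketch below illustrates how application-level code invokes the
three primitives.  The entry point names shown follow the conventions
of the eesupp sources, but the exact names and argument lists vary
with data type and grid, so treat this as a pattern rather than a
definitive interface:

{\footnotesize
\begin{verbatim}
C     Update the overlap regions of field phi on every tile.
      CALL EXCH_XY_RL( phi, myThid )
C     Combine per-tile partial sums into a domain-wide total.
      CALL GLOBAL_SUM_R8( sumPhi, myThid )
C     Hold every thread/process here until all have arrived.
      CALL BARRIER( myThid )
\end{verbatim}
}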


\subsection{Summary}
Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics

\begin{itemize}
\item The machine consists of one or more logical processors.
\item Each processor operates on tiles that it owns.
\item A processor may own more than one tile.
\item Processors may compute concurrently.
\item Exchange of information between tiles is handled by the
  machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each
  other using shared memory.
\item An alternative communication mechanism based on a relatively
  simple inter-process communication API may be required.
\item Shared memory may not necessarily obey sequential consistency;
  however some mechanism will exist for enforcing memory consistency.
\item Memory consistency that is enforced at the hardware level
  may be expensive. Unnecessary triggering of consistency protocols
  should be avoided.
\item Memory access patterns may need to be either repetitive or highly
  pipelined for optimum hardware performance.
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:using_the_wrapper: -->
\end{rawhtml}

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operations
\item controlling communication between tiles and between concurrently
  computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in
which a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_the_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems.  Section \ref{sect:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
dimensions of {\em sNx} and {\em sNy}.  If, when the code is executed,
these tiles are allocated to different threads of a process that are
then bound to different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile.  However, it
is also possible to run the same decomposition within a process running
a single thread on a single processor.  In this case the tiles will be
computed sequentially.

Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere.  In this case {\em bj} is generally set to 1 and the loop runs from
1,{\em bi}.  Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded within a single loop
over {\em bi} and {\em bj} varies for different parts of the MITgcm
algorithm. Figure \ref{fig:bibj_extract} shows a code extract.
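
The canonical form of the driving loop is sketched below (the field
names {\em phi} and {\em gPhi} and the step {\em dt} are illustrative;
{\em myBxLo}, {\em myBxHi}, {\em myByLo} and {\em myByHi} delimit the
range of tiles assigned to thread {\em myThid}):

{\footnotesize
\begin{verbatim}
C     Loop over the tiles this thread is responsible for; each
C     tile interior is then updated independently of the others.
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j = 1, sNy
         DO i = 1, sNx
          phi(i,j,bi,bj) = phi(i,j,bi,bj) + dt*gPhi(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
\end{verbatim}
}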

\begin{enumerate}
\item
The global domain size is again ninety grid points in x and
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two domains in y may be computed concurrently.
\item
\begin{verbatim}
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   2,
     &           OLy =   2,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1)
\end{verbatim}
This sets up tiles with x-dimension of thirty-two grid points,
y-dimension of thirty-two grid points, and x and y overlap regions of
two grid points each.
There are six tiles allocated to six separate logical processors ({\em nSx=6}).
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}
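
In each of the examples above the global domain extent follows
directly from the tile size and the tile counts.  The standard
{\em SIZE.h} idiom encodes this relationship as sketched below (see
any experiment's {\em SIZE.h} for the authoritative form):

{\footnotesize
\begin{verbatim}
C     Global domain size implied by the decomposition parameters.
      PARAMETER ( Nx = sNx*nSx*nPx,
     &            Ny = sNy*nSy*nPy )
\end{verbatim}
}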
\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()} once the
WRAPPER has initialized correctly and has created the necessary variables
to support subsequent calls to communication routines
by the application code. The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--- (startup calling tree not reproduced here)


\end{verbatim}
}
884  \caption{Main stages of the WRAPPER startup procedure.  \caption{Main stages of the WRAPPER startup procedure.
885  This process proceeds transfer of control to application code, which  This process proceeds transfer of control to application code, which
886  occurs through the procedure {\em THE\_MODEL\_MAIN()}.  occurs through the procedure {\em THE\_MODEL\_MAIN()}.
} \label{fig:wrapper_startup}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse-grained threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires a
number of additional compile time and run time steps. The files and
parameters involved are summarized in the box below.\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
Parameter:  {\em nTx}\\
Parameter:  {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for
this is that many system libraries are still not ``thread-safe''. This
means that, for example, on some systems it is not safe to call system
routines to perform I/O when running in multi-threaded mode (except,
perhaps, in a limited set of circumstances).  Another reason is that
support for multi-threaded programming models varies between systems.

Multi-process execution is more ubiquitous.  In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sect:specifying_a_decomposition}) is given, in which at least one
of the parameters {\em nPx} or {\em nPy} will be greater than one.
Then, as for multi-threaded operation, appropriate compile time and
run time steps must be taken.
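
As an illustration, the ninety by forty grid point domain of the
earlier examples could be decomposed over four processes with settings
such as the following (an illustrative sketch, not a configuration
taken from a distributed experiment):
\begin{verbatim}
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   2,
     &           nPy =   2)
\end{verbatim}
Here each of the four processes owns a single $45 \times 20$ tile, and
$45 \times 2 = 90$, $20 \times 2 = 40$ recovers the global domain size.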

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable MPI libraries are available for controlling
the start-up of multiple processes. The MPI libraries are not strictly
required for the performance critical communication itself, although
they are usually used for it. However, in order to simplify the task
of controlling and coordinating the start-up of a large number
(hundreds and possibly even thousands) of copies of the same program,
MPI is used. The calls to the MPI multi-process startup routines must
be activated at compile time.  Currently MPI libraries are invoked by
specifying the appropriate options file with the {\tt -of} flag when
running the {\em genmake2} script, which generates the Makefile for
compiling and linking MITgcm.  (Previously this was done by setting
the {\em ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
CPP\_EEOPTIONS.h} file.)  More detailed information about the use of
{\em genmake2} for specifying local compiler flags is located in
section \ref{sect:genmake}.\\

\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
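
A typical build sequence using an MPI-enabled options file might
therefore look like the following sketch (the paths and the options
file name are hypothetical; see the {\em tools/build\_options}
directory for the files actually distributed):
\begin{verbatim}
../../tools/genmake2 -mods=../code -of=../../tools/build_options/my_platform+mpi
make depend
make
\end{verbatim}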
\paragraph{\bf Execution} The mechanics of starting a program in
multi-process mode under MPI are not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.  For the
open-source MPICH system, the MITgcm program can be started using a
command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of
processes that will be created. The numeric value {\em 64} must be
equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The parameter {\em mf} specifies
that a text file called ``mf'' will be read to get a list of processor
names on which the sixty-four processes will execute. The syntax of
this file is specified by the MPI distribution.
\\
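The machine file itself is simply a list of host names, one per line.
A hypothetical four-node example might read
\begin{verbatim}
node001
node002
node003
node004
\end{verbatim}
with host names repeated, or additional hosts listed, until enough
slots exist for all of the requested processes.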

\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\

\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting of
a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with
the multi-threaded compiler on a machine will explain how to set the
required environment variables.
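
For example, under the C shell on a platform whose compiler runtime
recognizes the PARALLEL variable, four threads might be requested with
\begin{verbatim}
setenv PARALLEL 4
\end{verbatim}
(the exact variable name and syntax vary from platform to platform).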

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate the
number of threads to be used in the x and y directions.  The variables
{\em nTx} and {\em nTy} in this file are used to specify the
information required. The product of {\em nTx} and {\em nTy} must be
equal to the number of threads spawned, i.e.\ the setting of the
environment variable PARALLEL.  The value of {\em nTx} must subdivide
the number of sub-domains in x ({\em nSx}) exactly, and the value of
{\em nTy} must subdivide the number of sub-domains in y ({\em nSy})
exactly.
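
As an illustrative sketch, an {\em eedata} fragment requesting one
thread in x and two in y (consistent with an {\em nSy}=2
decomposition) might contain
\begin{verbatim}
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}
(the exact namelist layout is best copied from the {\em eedata} file
of a distributed verification experiment).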

The multi-process startup of the MITgcm executable {\em mitgcmuv} is
controlled by the routines {\em EEBOOT\_MINIMAL()} and {\em
INI\_PROCS()}. The first routine performs basic steps required to
make sure each process is started and has a textual output stream
associated with it. By default two output files are opened for each
process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.  The {\bf
NNNN} part of the name is filled in with the process number so that
process number 0 will create output files {\bf STDOUT.0000} and {\bf
STDERR.0000}, process number 1 will create output files {\bf
STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process by process basis.  The {\em EEBOOT\_MINIMAL()}
procedure also sets the variables {\em myProcId} and {\em
MPI\_COMM\_MODEL}.  These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()} to
allocate tiles to processes.

Allocation of processes to tiles is controlled by the routine {\em
INI\_PROCS()}. For each process this routine sets the variables {\em
myXGlobalLo} and {\em myYGlobalLo}.  These variables specify, in
index space, the coordinates of the southernmost and westernmost
corner of the southernmost and westernmost tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
also set in this routine. These are used to identify processes holding
tiles to the west, east, south and north of a given process. These
values are stored in global storage in the header file {\em
EESUPPORT.h} for use by communication routines.  The above does not
hold when the exch2 package is used: the exch2 package sets its own
parameters to specify the global indices of tiles and their
relationships to each other.  See the documentation on the exch2
package (\ref{sec:exch2}) for details.
\\
1097  \fbox{  \fbox{
# Line 1094  operations and that can be customized fo Line 1117  operations and that can be customized fo
1117  describes the information that is held and used.  describes the information that is held and used.
1118    
1119  \begin{enumerate}  \begin{enumerate}
1120  \item {\bf Tile-tile connectivity information} For each tile the WRAPPER  \item {\bf Tile-tile connectivity information}
1121  sets a flag that sets the tile number to the north, south, east and    For each tile the WRAPPER sets a flag that sets the tile number to
1122  west of that tile. This number is unique over all tiles in a    the north, south, east and west of that tile. This number is unique
1123  configuration. The number is held in the variables {\em tileNo}    over all tiles in a configuration. Except when using the cubed
1124  ( this holds the tiles own number), {\em tileNoN}, {\em tileNoS},    sphere and the exch2 package, the number is held in the variables
1125  {\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile    {\em tileNo} ( this holds the tiles own number), {\em tileNoN}, {\em
1126  that specifies the type of communication that is used between tiles.      tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
1127  This information is held in the variables {\em tileCommModeN},    stored with each tile that specifies the type of communication that
1128  {\em tileCommModeS}, {\em tileCommModeE} and {\em tileCommModeW}.    is used between tiles.  This information is held in the variables
1129  This latter set of variables can take one of the following values    {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
1130  {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.    {\em tileCommModeW}.  This latter set of variables can take one of
1131  A value of {\em COMM\_NONE} is used to indicate that a tile has no    the following values {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
1132  neighbor to cummnicate with on a particular face. A value      COMM\_PUT} and {\em COMM\_GET}.  A value of {\em COMM\_NONE} is
1133  of {\em COMM\_MSG} is used to indicated that some form of distributed    used to indicate that a tile has no neighbor to communicate with on
1134  memory communication is required to communicate between    a particular face. A value of {\em COMM\_MSG} is used to indicate
1135  these tile faces ( see section \ref{sec:distributed_memory_communication}).    that some form of distributed memory communication is required to
1136  A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate    communicate between these tile faces (see section
1137  forms of shared memory communication ( see section    \ref{sect:distributed_memory_communication}).  A value of {\em
1138  \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value indicates      COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
1139  that a CPU should communicate by writing to data structures owned by another    memory communication (see section
1140  CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading    \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value
1141  from data structures owned by another CPU. These flags affect the behavior    indicates that a CPU should communicate by writing to data
1142  of the WRAPPER exchange primitive    structures owned by another CPU. A {\em COMM\_GET} value indicates
1143  (see figure \ref{fig:communication_primitives}). The routine    that a CPU should communicate by reading from data structures owned
1144  {\em ini\_communication\_patterns()} is responsible for setting the    by another CPU. These flags affect the behavior of the WRAPPER
1145  communication mode values for each tile.    exchange primitive (see figure \ref{fig:communication_primitives}).
1146  \\    The routine {\em ini\_communication\_patterns()} is responsible for
1147      setting the communication mode values for each tile.
1148    
1149      When using the cubed sphere configuration with the exch2 package,
1150      the relationships between tiles and their communication methods are
1151      set by the exch2 package and stored in different variables.  See the
1152      exch2 package documentation (\ref{sec:exch2} for details.
1153    
\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/src/ini\_communication\_patterns.F}\\
File: {\em eesupp/inc/EESUPPORT.h}\\
Parameter: {\em tileNo} \\
Parameter: {\em tileNoE} \\
Parameter: {\em tileNoW} \\
Parameter: {\em tileNoN} \\
Parameter: {\em tileNoS} \\
Parameter: {\em tileCommModeE} \\
Parameter: {\em tileCommModeW} \\
Parameter: {\em tileCommModeN} \\
Parameter: {\em tileCommModeS} \\
\end{minipage}
} \\

\item {\bf MP directives}
  The WRAPPER transfers control to numerical application code through
  the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
  that allows for it to be invoked by several threads. Support for
  this is based on either multi-processing (MP) compiler directives or
  specific calls to multi-threading libraries (\textit{e.g.} POSIX
  threads).  Most commercially available Fortran compilers support the
  generation of code to spawn multiple threads through some form of
  compiler directives.  Compiler directives are generally more
  convenient than writing code to explicitly spawn threads.  On some
  systems, compiler directives may be the only method available.  The
  WRAPPER is distributed with template MP directives for a number of
  systems.

  These directives are inserted into the code just before and after
  the transfer of control to numerical algorithm code through the
  routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
  an example of the code that performs this process for a Silicon
  Graphics system.  This code is extracted from the files {\em main.F}
  and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
  specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
  will be created. The value of {\em nThreads} is set in the routine
  {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
  product of the parameters {\em nTx} and {\em nTy} that are read from
  the file {\em eedata}. If the value of {\em nThreads} is
  inconsistent with the number of threads requested from the operating
  system (for example by using an environment variable as described in
  section \ref{sect:multi_threaded_execution}) then usually an error
  will be reported by the routine {\em CHECK\_THREADS}.

1201  \begin{minipage}{4.75in}  \begin{minipage}{4.75in}
# Line 1181  Parameter: {\em nTy} \\ Line 1211  Parameter: {\em nTy} \\
1211  }  }
1212    
1213  \item {\bf memsync flags}  \item {\bf memsync flags}
1214  As discussed in section \ref{sec:memory_consistency}, when using shared memory,    As discussed in section \ref{sect:memory_consistency}, a low-level
1215  a low-level system function may be need to force memory consistency.    system function may be need to force memory consistency on some
1216  The routine {\em MEMSYNC()} is used for this purpose. This routine should    shared memory systems.  The routine {\em MEMSYNC()} is used for this
1217  not need modifying and the information below is only provided for    purpose. This routine should not need modifying and the information
1218  completeness. A logical parameter {\em exchNeedsMemSync} set    below is only provided for completeness. A logical parameter {\em
1219  in the routine {\em INI\_COMMUNICATION\_PATTERNS()} controls whether      exchNeedsMemSync} set in the routine {\em
1220  the {\em MEMSYNC()} primitive is called. In general this      INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
1221  routine is only used for multi-threaded execution.      MEMSYNC()} primitive is called. In general this routine is only
1222  The code that goes into the {\em MEMSYNC()}    used for multi-threaded execution.  The code that goes into the {\em
1223   routine is specific to the compiler and      MEMSYNC()} routine is specific to the compiler and processor used.
1224  processor being used for multi-threaded execution and in general    In some cases, it must be written using a short code snippet of
1225  must be written using a short code snippet of assembly language.    assembly language.  For an Ultra Sparc system the following code
1226  For an Ultra Sparc system the following code snippet is used    snippet is used
1227  \begin{verbatim}  \begin{verbatim}
1228  asm("membar #LoadStore|#StoreStore");  asm("membar #LoadStore|#StoreStore");
1229  \end{verbatim}  \end{verbatim}
1230  for an Alpha based sytem the euivalent code reads  for an Alpha based system the equivalent code reads
1231  \begin{verbatim}  \begin{verbatim}
1232  asm("mb");  asm("mb");
1233  \end{verbatim}  \end{verbatim}
  and for an x86 based system the following code is used
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory");
\end{verbatim}

\item {\bf Cache line size}
  As discussed in section \ref{sect:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with
  excessive coherence traffic on an SMP system. To do this the shared
  memory data structures used by the {\em GLOBAL\_SUM}, {\em
    GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
  that control the padding are set in the header file {\em
    EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
    lShare1}, {\em lShare4} and {\em lShare8}. The default values
  should not normally need changing.

\item {\bf \_BARRIER}
  This is a CPP macro that is expanded to a call to a routine which
  synchronizes all the logical processors running under the WRAPPER.
  Using a macro here preserves flexibility to insert a specialized
  call in-line into application code. By default this resolves to
  calling the procedure {\em BARRIER()}. The default setting for the
  \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.

1259  This is a CPP macro that is expanded to a call to a routine    This is a CPP macro that is expanded to a call to a routine which
1260  which sums up a floating point numner    sums up a floating point number over all the logical processors
1261  over all the logical processors running under the    running under the WRAPPER. Using a macro here provides extra
1262  WRAPPER. Using a macro here provides extra flexibility to insert    flexibility to insert a specialized call in-line into application
1263  a specialized call in-line into application code. By default this    code. By default this resolves to calling the procedure {\em
1264  resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for      GLOBAL\_SUM\_R8()} ( for 64-bit floating point operands) or {\em
1265  84=bit floating point operands)      GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
1266  or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default    default setting for the \_GSUM macro is given in the file {\em
1267  setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.      CPP\_EEMACROS.h}.  The \_GSUM macro is a performance critical
1268  The \_GSUM macro is a performance critical operation, especially for    operation, especially for large processor count, small tile size
1269  large processor count, small tile size configurations.    configurations.  The custom communication example discussed in
1270  The custom communication example discussed in section \ref{sec:jam_example}    section \ref{sect:jam_example} shows how the macro is used to invoke
1271  shows how the macro is used to invoke a custom global sum routine    a custom global sum routine for a specific set of hardware.
 for a specific set of hardware.  
1272    
1273  \item {\bf \_EXCH}  \item {\bf \_EXCH}
1274  The \_EXCH CPP macro is used to update tile overlap regions.    The \_EXCH CPP macro is used to update tile overlap regions.  It is
1275  It is qualified by a suffix indicating whether overlap updates are for    qualified by a suffix indicating whether overlap updates are for
1276  two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )    two-dimensional ( \_EXCH\_XY ) or three dimensional ( \_EXCH\_XYZ )
1277  physical fields and whether fields are 32-bit floating point    physical fields and whether fields are 32-bit floating point (
1278  ( \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point    \_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4 ) or 64-bit floating point (
1279  ( \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined    \_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8 ). The macro mappings are defined in
1280  in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the    the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
1281  \_EXCH operation plays a crucial role in scaling to small tile,    operation plays a crucial role in scaling to small tile, large
1282  large logical and physical processor count configurations.    logical and physical processor count configurations.  The example in
1283  The example in section \ref{sec:jam_example} discusses defining an    section \ref{sect:jam_example} discusses defining an optimized and
1284  optimised and specialized form on the \_EXCH operation.    specialized form on the \_EXCH operation.
1285    
1286  The \_EXCH operation is also central to supporting grids such as    The \_EXCH operation is also central to supporting grids such as the
1287  the cube-sphere grid. In this class of grid a rotation may be required    cube-sphere grid. In this class of grid a rotation may be required
1288  between tiles. Aligning the coordinate requiring rotation with the    between tiles. Aligning the coordinate requiring rotation with the
1289  tile decomposistion, allows the coordinate transformation to    tile decomposition, allows the coordinate transformation to be
1290  be embedded within a custom form of the \_EXCH primitive.    embedded within a custom form of the \_EXCH primitive.  In these
1291      cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
1292      package documentation \ref{sec:exch2}.
1293    
\item {\bf Reverse Mode}
  The communication primitives \_EXCH and \_GSUM both employ
  hand-written adjoint (or reverse mode) forms.  These reverse
  mode forms can be found in the source code directory {\em
    pkg/autodiff}.  For the global sum primitive the reverse mode form
  calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
  The reverse mode forms of the exchange primitives are found in
  routines prefixed {\em ADEXCH}. The exchange routines make calls to
  the same low-level communication primitives as the forward mode
  operations. However, the routine argument {\em simulationMode} is
  set to the value {\em REVERSE\_SIMULATION}. This signifies to the
  low-level routines that the adjoint forms of the appropriate
  communication operation should be performed.

\item {\bf MAX\_NO\_THREADS}
  The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
  number of OS threads that a code will use. This value defaults to
  thirty-two and is set in the file {\em EEPARAMS.h}.  For single
  threaded execution it can be reduced to one if required.  The value
  is largely private to the WRAPPER and application code will not
  normally reference the value, except in the following scenario.

  For certain physical parametrization schemes it is necessary to have
  a substantial number of work arrays. Where these arrays are
  allocated in heap storage (for example COMMON blocks) multi-threaded
  execution will require multiple instances of the COMMON block data.
  This can be achieved using a Fortran 90 module construct.  However,
  if this mechanism is unavailable then the work arrays can be extended
  with dimensions using the tile dimensioning scheme of {\em nSx} and
  {\em nSy} (as described in section
  \ref{sect:specifying_a_decomposition}). However, if the
  configuration being specified involves many more tiles than OS
  threads then it can save memory resources to reduce the variable
  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
  that will be used and to declare the physical parameterization work
  arrays with a single {\em MAX\_NO\_THREADS} extra dimension.  An
  example of this is given in the verification experiment {\em
    aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
  altered to
\begin{verbatim}
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS =    6 )
\end{verbatim}
  and several work arrays for storing intermediate calculations are
  created with declarations of the form:
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
  This declaration scheme is not used widely, because most global data
  is used for permanent rather than temporary storage of state
  information.  In the case of permanent state information this
  approach cannot be used because there has to be enough storage
  allocated for all tiles.  However, the technique can sometimes be a
  useful scheme for reducing memory requirements in complex physical
  parameterizations.
\end{enumerate}
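
As mentioned in the first item above, the following sketch (purely
illustrative pseudo-Fortran, not code taken from the WRAPPER source)
indicates how an exchange routine might dispatch on the communication
mode recorded for, say, the west face of a tile:
\begin{verbatim}
C     Illustrative dispatch on the tile communication mode. The
C     variable and constant names follow the text above; the branch
C     bodies are schematic comments only.
      IF     ( tileCommModeW(bi,bj) .EQ. COMM_MSG ) THEN
C       Distributed memory: exchange messages for the west edge.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT ) THEN
C       Shared memory: write edge data into the neighbor's buffer.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_GET ) THEN
C       Shared memory: read edge data from the neighbor's storage.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_NONE ) THEN
C       No neighbor on this face; nothing to communicate.
      ENDIF
\end{verbatim}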

\begin{figure}
\begin{verbatim}
C--
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
\caption{Prior to transferring control to the procedure {\em
    THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
  multiple threads.  } \label{fig:mp_directives}
\end{figure}


\subsection{Specializing the Communication Code}
The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms a big performance boost can be obtained by binding
the communication routines {\em \_EXCH} and {\em \_GSUM} to
specialized native libraries (for example, the shmem library on CRAY
T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
illustration of a specialized communication configuration that
substitutes for standard, portable forms of {\em \_EXCH} and {\em
  \_GSUM}. It affects three source files: {\em eeboot.F}, {\em
  CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined it has
the following effects.
\begin{itemize}
\item An extra phase is included at boot time to initialize the custom
  communications library (see {\em ini\_jam.F}).
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
  with calls to custom routines (see {\em gsum\_jam.F} and {\em
    exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
  for overlap regions of width one) is substituted into the elliptic
  solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern.

\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from a
series of template files, for example {\em exch\_rx.template}.  This
is done to allow a large number of variations on the exchange process
to be maintained. One set of variations supports the cube sphere grid.
Support for a cube sphere grid in MITgcm is based on having each face
of the cube as a separate tile or tiles.  The exchange routines are
then able to absorb much of the detailed rotation and reorientation
required when moving around the cube grid. The set of {\em \_EXCH}
routines that contain the word cube in their name perform these
transformations.  They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for grid-centered
and for grid-face and grid-corner quantities.  Three sets of exchange
routines are defined. Routines with names of the form {\em exch\_rx}
are used to exchange cell centered scalar quantities. Routines with
names of the form {\em exch\_uv\_rx} are used to exchange vector
quantities located at the C-grid velocity points. The vector
quantities exchanged by the {\em exch\_uv\_rx} routines can either be
signed (for example velocity components) or un-signed (for example
grid-cell separations).  Routines with names of the form {\em
  exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
point locations.
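
As an illustration of this naming scheme, updating the overlap regions
of a 64-bit, cell centered scalar field might involve a call of the
following general shape (schematic only; the argument lists of the
generated exchange routines should be checked before use):
\begin{verbatim}
      CALL EXCH_XY_R8( theta, myThid )
\end{verbatim}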

\section{MITgcm execution under WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:mitgcm_wrapper: -->
\end{rawhtml}

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sect:calling_sequence}

WRAPPER layer.

{\footnotesize
\begin{verbatim}

       MAIN  
       |
       |--EEBOOT             :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
       |  |                  :: allow basic I/O.
       |  |
       |  |-- EEINTRO_MSG    :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS    :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV  :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS      :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |      |
       |      |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                    :: communication data structures
       |
       |--CHECK_THREADS      :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN   :: Numerical code top-level driver routine

\end{verbatim}
}

Core equations plus packages.

{\footnotesize
\begin{verbatim}
C
C
C  Invocation from WRAPPER level...
C  :
C  :
C  |
C  |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
C    |              :: Called from WRAPPER level numerical
C    |              :: code invocation routine. On entry
C    |              :: to THE_MODEL_MAIN separate thread and
C    |              :: separate processes will have been established.
C    |              :: Each thread and process will have a unique ID
C    | |-INI_PARMS :: Routine to set kernel model parameters.
C    | |           :: By default kernel parameters are read from file
C    | |           :: "data" in directory in which code executes.
C    | |
C    | |-MON_INIT :: Initializes monitor package (see pkg/monitor)
C    | |
C    | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
C    | | |        :: Grid arrays are held and described in GRID.h.
C    | | |
C    | | |-INI_VERTICAL_GRID        :: Initialize vertical grid arrays.
C    | | |
C    | | |-INI_CARTESIAN_GRID       :: Cartesian horiz. grid initialization
C    | | |                          :: (calculate grid from kernel parameters).
C    | | |
C    | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
C    | | |                          :: initialization (calculate grid from
C    | | |                          :: kernel parameters).
C    | | |
C    | | |-INI_CURVILINEAR_GRID     :: General orthogonal, structured horiz.
C    | |                            :: grid initializations. ( input from raw
C    | |                            :: grid files, LONC.bin, DXF.bin etc... )
C    | |
C    | |-INI_DEPTHS    :: Read (from "bathyFile") or set bathymetry/orography.
C    | |
C    | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
C    | |
C    | |-INI_CORI          :: Set coriolis term. zero, f-plane, beta-plane,
C    | |                   :: sphere options are coded.
C    | |
C    | |-PACKAGES_BOOT     :: Start up the optional package environment.
C    | |                   :: Runtime selection of active packages.
C    | |
C    | |-PACKAGES_CHECK
C    | | |
C    | | |-KPP_CHECK           :: KPP Package. pkg/kpp
C    | | |-OBCS_CHECK          :: Open bndy Package. pkg/obcs
C    | | |-GMREDI_CHECK        :: GM Package. pkg/gmredi
C    | |
C    | |-PACKAGES_INIT_FIXED
C    |
C    |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
C    |
C    |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
C    |                 :: Automatically generated by TAMC/TAF.
C    |
C    |-CTRL_PACK   :: Control vector support package. see pkg/ctrl
C    |
C    | | |
C    | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
C    | | |
C    | | |-INI_CORI     :: Set coriolis term. zero, f-plane, beta-plane,
C    | | |              :: sphere options are coded.
C    | | |
C    | | |-INI_CG2D     :: 2d con. grad solver initialization.
C    | | |-INI_CG3D     :: 3d con. grad solver initialization.
C    | | |-INI_MIXING   :: Initialize diapycnal diffusivity.
C    | | |-INI_DYNVARS  :: Initialize to zero all DYNVARS.h arrays (dynamical
C    | | |              :: fields).
C    | | |
C    | | |-INI_FIELDS   :: Control initializing model fields to non-zero
C    | | | |-INI_VEL    :: Initialize 3D flow field.
C    | | | |-INI_THETA  :: Set model initial temperature field.
C    | | | |-INI_SALT   :: Set model initial salinity field.
C/\  | | |
C/\  | | |-CALC_EXACT_ETA :: Change SSH to flow divergence.
C/\  | | |-CALC_SURF_DR   :: Calculate the new surface level thickness.
C/\  | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
C/\  | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
C/\  | | | |                    :: Simple interpolation between end-points
C/\  | | | |                    :: for forcing datasets.
C/\  | | | |
C/\  | | | |-EXCH :: Sync forcing in overlap regions.
C    :
C    :
C    |-COMM_STATS     :: Summarise inter-proc and inter-thread communication
C                     :: events.
C
\end{verbatim}
}

\subsection{Measuring and Characterizing Performance}

