% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment
within which both the core numerics and the pluggable packages
operate. The description presented here is intended to be a detailed
exposition and contains significant background material, as well as
advanced details on working with the WRAPPER.  The tutorial sections
of this manual (see sections \ref{sec:modelExamples} and
\ref{sec:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
both sequentially and in parallel. For many projects simply starting
from an example code and adapting it to suit a particular situation
will be all that is required.  The first part of this chapter
discusses the MITgcm architecture at an abstract level. In the second
part of the chapter we describe practical details of the MITgcm
implementation and of current tools and operating system features that
are employed.

\section{Overall architectural goals}
\begin{rawhtml}
<!-- CMIREDIR:overall_architectural_goals: -->
\end{rawhtml}

Broadly, the goals of the software architecture employed in MITgcm are
three-fold

\begin{itemize}
\item We wish to be able to study a very broad range of interesting
  and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to a wide range of
  platforms.
\item On any given platform we would like to be able to achieve
  performance comparable to an implementation developed and
  specialized specifically for that platform.
\end{itemize}

These points are summarized in figure
\ref{fig:mitgcm_architecture_goals} which conveys the goals of the
MITgcm design. The goals lead to a software architecture which, at a
high level, can be viewed as consisting of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in
  detail in section \ref{chap:discretization}.

\item A scheme for supporting optional ``pluggable'' {\bf packages}
  (containing for example mixed-layer schemes, biogeochemical schemes,
  atmospheric physics).  These packages are used both to overlay
  alternate dynamics and to introduce specialized physical content
  onto the core numerical code. An overview of the {\bf package}
  scheme is given at the start of part \ref{chap:packagesI}.

\item A support framework called {\bf WRAPPER} (Wrappable Application
  Parallel Programming Environment Resource), within which the core
  numerics and pluggable packages operate.
\end{enumerate}
\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{s_software/figs/mitgcm_goals.eps}}
\end{center}
\caption{The MITgcm architecture is designed to allow simulation of a
  wide range of physical problems on a wide range of hardware. The
  computational resource requirements of the applications targeted
  range from around $10^7$ bytes ($\approx 10$ megabytes) of memory to
  $10^{11}$ bytes ($\approx 100$ gigabytes). Arithmetic operation
  counts for the applications of interest range from $10^{9}$ floating
  point operations to more than $10^{17}$ floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in MITgcm
is a software superstructure and substructure collectively called the
WRAPPER (Wrappable Application Parallel Programming Environment
Resource). All numerical and support code in MITgcm is written to
``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
within the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sec:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to
insulate code that fits within it from architectural differences
between hardware platforms and operating systems. This allows
numerical code to be easily retargeted.

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{s_software/figs/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sec:target_hardware}

The WRAPPER is designed to target as broad as possible a range of
computer systems.  The original development of the WRAPPER took place
on a multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system's proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multi-processor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs.  Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems.  The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and
IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \cite{hoe-hill:99}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sec:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed memory parallel systems (for example x86 clusters and
large MPP systems). This is one example of a difference between
compute platforms that can impact an application. Another common
distinction is between vector processing systems with highly
specialized CPUs and memory subsystems and commodity microprocessor
based systems. There are numerous other differences, especially in
relation to how parallel execution is supported. To capture the
essential differences between different platforms the WRAPPER uses a
{\it machine model}.

\subsection{WRAPPER machine model}

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
{\it machine model}. The machine model is very general; however, it
can easily be specialized to fit, in a computationally efficient
manner, any computer architecture currently available to the
scientific computing community.

\subsection{Machine model parallelism}
\label{sec:domain_decomposition}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently.  Computational work is divided among the logical
processors by allocating ``ownership'' to each processor of a certain
set (or sets) of calculations. Each set of calculations owned by a
particular processor is associated with a specific region of the
physical space that is being simulated; only one processor will be
associated with each such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors.  It is perfectly
possible to execute a configuration decomposed for multiple logical
processors on a single physical processor.  This helps ensure that
numerical code that is written to fit within the WRAPPER will
parallelize with no additional effort.  It is also useful for
debugging purposes.  Generally, however, the computational domain will
be subdivided over multiple logical processors in order to then bind
those logical processors to physical processor resources that can
compute in parallel.

\subsubsection{Tiles}

Computationally, the data structures (\textit{e.g.}, arrays, scalar
variables, etc.) that hold the simulated state are associated with
each region of physical space and are allocated to a particular
logical processor.  We refer to these data structures as being {\bf
owned} by the processor to which their associated region of physical
space has been allocated.  Individual regions that are allocated to
processors are called {\bf tiles}.  A processor can own more than one
tile.  Figure \ref{fig:domaindecomp} shows a physical domain being
mapped to a set of logical processors, with each processor owning a
single region of the domain (a single tile).  Except for periods of
communication and coordination, each processor computes autonomously,
working only with data from the tile (or tiles) that the processor
owns.  When multiple tiles are allotted to a single processor, each
tile is computed independently of the other tiles, in a sequential
fashion.

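As a concrete illustration, the sketch below shows how a per-tile
model field might be declared. The parameter names follow the
conventions of the MITgcm {\em SIZE.h} header ({\em sNx}, {\em sNy}
for the tile interior extents, {\em OLx}, {\em OLy} for the overlap
widths, {\em nSx}, {\em nSy} for the number of tiles per process, and
{\em Nr} for the number of vertical levels); the field name {\em phi}
is purely illustrative.

\begin{verbatim}
C     Illustrative per-tile field declaration. Each tile carries an
C     sNx x sNy interior plus OLx- and OLy-wide overlap regions, and
C     a single process holds nSx x nSy such tiles.
      _RL phi( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, Nr, nSx, nSy )
\end{verbatim}
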
\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/domain_decomp.eps}
 }
\end{center}
\caption{The WRAPPER provides support for one and two dimensional
  decompositions of grid-point domains. The figure shows a
  hypothetical domain of total size $N_{x}N_{y}N_{z}$. This
  hypothetical domain is decomposed in two dimensions along the
  $N_{x}$ and $N_{y}$ directions. The resulting {\bf tiles} are {\bf
  owned} by different processors. The {\bf owning} processors
  perform the arithmetic operations associated with a {\bf tile}.
  Although not illustrated here, a single processor can {\bf own}
  several {\bf tiles}.  Whenever a processor wishes to transfer data
  between tiles or communicate with other processors it calls a
  WRAPPER supplied function.}
\label{fig:domaindecomp}
\end{figure}

\subsubsection{Tile layout}

Tiles consist of an interior region and an overlap region.  The
overlap region of a tile corresponds to the interior region of an
adjacent tile.  In figure \ref{fig:tiledworld} each tile would own the
region within the black square and hold duplicate information for
overlap regions extending into the tiles to the north, south, east and
west.  During computational phases a processor will reference data in
an overlap region whenever it requires values that lie outside the
domain it owns.  Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sec:communication_primitives}).  The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
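
The shape of this compute/communicate cycle is sketched below. The
exchange routine name is illustrative of the WRAPPER-supplied
functions discussed in section \ref{sec:communication_primitives};
{\em phi}, {\em gPhi} and {\em deltaT} are hypothetical names.

\begin{verbatim}
C     Sketch of a compute/communicate cycle on one tile.
C     1. Compute on the interior points that this tile owns.
      DO j=1,sNy
       DO i=1,sNx
        phi(i,j) = phi(i,j) + deltaT*gPhi(i,j)
       ENDDO
      ENDDO
C     2. Ask the WRAPPER to refresh the overlap regions with
C        values from the neighboring tiles.
      CALL EXCH_XY_RL( phi, myThid )
\end{verbatim}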

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiled-world.eps}
 }
\end{center}
\caption{A global grid subdivided into tiles. Tiles contain an
  interior region and an overlap region. Overlap regions are
  periodically updated from neighboring tiles.}
\label{fig:tiledworld}
\end{figure}

\subsection{Communication mechanisms}

Logical processors are assumed to be able to exchange information
between tiles and between each other using at least one of two
possible mechanisms.

\begin{itemize}
\item {\bf Shared memory communication}.  Under this mode of
  communication data transfers are assumed to be possible using direct
  addressing of regions of memory.  In this case a CPU is able to read
  (and write) directly to regions of memory ``owned'' by another CPU
  using simple programming language level assignment operations of
  the sort shown in figure \ref{fig:simple_assign}.  In this way one
  CPU (CPU1 in the figure) can communicate information to another CPU
  (CPU2 in the figure) by assigning a particular value to a particular
  memory location.

\item {\bf Distributed memory communication}.  Under this mode of
  communication there is no mechanism, at the application code level,
  for directly addressing regions of memory owned by and visible to
  another CPU. Instead a communication library must be used as
  illustrated in figure \ref{fig:comm_msg}. In this case CPUs must
  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another CPU
  owns. By default the WRAPPER binds to the MPI communication library
  \cite{MPI-std-20} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
of communication.  The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

\begin{figure}
\begin{verbatim}

             CPU1                    |        CPU2
             ====                    |        ====
                                     |
            a(3) = 8                 |        WHILE ( a(3) .NE. 8 )
                                     |         WAIT
                                     |        END WHILE
                                     |
\end{verbatim}
\caption{In the WRAPPER shared memory communication model, simple writes to an
array can be made to be visible to other CPUs at the application code level,
so that, for example, if one CPU (CPU1 in the figure above) writes the value $8$ to
element $3$ of array $a$, then other CPUs (for example CPU2 in the figure above)
will be able to see the value $8$ when they read from $a(3)$.
This provides a very low latency and high bandwidth communication
mechanism.
} \label{fig:simple_assign}
\end{figure}

\begin{figure}
\begin{verbatim}

             CPU1                    |        CPU2
             ====                    |        ====
                                     |
            a(3) = 8                 |
            CALL SEND( CPU2, a(3) )  |        CALL RECV( CPU1, a(3) )
                                     |
\end{verbatim}
\caption{In the WRAPPER distributed memory communication model
data can not be made directly visible to other CPUs.
If one CPU writes the value $8$ to element $3$ of array $a$, then
at least one of CPU1 and/or CPU2 in the figure above will need
to call a bespoke communication library in order for the updated
value to be communicated between CPUs.}
\label{fig:comm_msg}
\end{figure}

\subsection{Shared memory communication}
\label{sec:shared_memory_communication}

Under shared memory communication independent CPUs are operating on
the exact same global address space at the application level.  This
means that CPU 1 can directly write into global data structures that
CPU 2 ``owns'' using a simple assignment at the application level.
This is the model of memory access that is supported at the basic
system design level in ``shared-memory'' systems such as PVP systems,
SMP systems, and on distributed shared memory systems (\textit{e.g.},
SGI Origin, SGI Altix, and some AMD Opteron systems).  On such systems
the WRAPPER will generally use simple read and write statements to
access application data structures directly when communicating between
CPUs.

In a system where assignment statements, like the one in figure
\ref{fig:simple_assign}, map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication.  In this case two CPUs, CPU1
and CPU2, can communicate simply by reading and writing to an agreed
location and following a few basic rules.  The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth
available between CPUs communicating in this way can be close to the
bandwidth of the system's main-memory interconnect.  This can make
this method of communication very efficient provided it is used
appropriately.

\subsubsection{Memory consistency}
\label{sec:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors.  In particular, one issue the WRAPPER layer must
deal with is a system's memory model.  In general the order of reads
and writes expressed by the textual order of an application code may
not be the ordering of instructions executed by the processor
performing the application.  The processor performing the application
instructions will always operate so that, for the application
instructions the processor is executing, any reordering is not
apparent.  However, in general machines are often designed so that
reordering of instructions is not hidden from other processors.  This
means that, in general, even on a shared memory system two processors
can observe inconsistent memory values.

The issue of memory consistency between multiple processors is
discussed at length in many computer science papers.  From a practical
point of view, in order to deal with this issue, shared memory
machines all provide some mechanism to enforce memory consistency when
it is needed.  The exact mechanism employed will vary between systems.
For communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.
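
For example, a writer of the sort shown in figure
\ref{fig:simple_assign} would, within the WRAPPER, bracket its update
with a call to such a mechanism. In the sketch below {\em MEMSYNC} is
a hypothetical stand-in for whatever platform-specific operation (a
memory fence or cache flush, for instance) the WRAPPER invokes.

\begin{verbatim}
C     Hypothetical sketch: publishing a value through shared memory.
C     MEMSYNC stands in for the platform-specific memory-consistency
C     mechanism invoked by the WRAPPER.
      a(3) = 8
      CALL MEMSYNC
C     Only after the consistency call is a(3) guaranteed to be
C     observed as 8 by other CPUs that honor the same protocol.
\end{verbatim}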

\subsubsection{Cache effects and false sharing}
\label{sec:cache_effects_and_false_sharing}

Shared-memory machines often have local to processor memory caches
which contain mirrored copies of main memory.  Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors.  These cache-coherence protocols typically enforce consistency
between regions of memory with large granularity (typically 128 or 256 byte
chunks).  The coherency protocols employed can be expensive relative to other
memory accesses and so care is taken in the WRAPPER (by padding synchronization
structures appropriately) to avoid unnecessary coherence traffic.
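
The padding idea can be sketched as follows: each thread's
synchronization word is given its own coherence-granularity chunk, so
that updates by one thread do not invalidate cache lines used by
others. The names and the assumed 128-byte granularity below are
illustrative, not the WRAPPER's actual declarations.

\begin{verbatim}
C     Illustrative padding to avoid false sharing. With a 128-byte
C     coherence granularity and 4-byte INTEGERs, 32 integers span one
C     chunk, so each thread's counter sits in its own chunk.
      INTEGER chunkInts, MAX_NO_THREADS
      PARAMETER ( chunkInts = 32, MAX_NO_THREADS = 64 )
      INTEGER syncCount( chunkInts, MAX_NO_THREADS )
      COMMON / SYNC_VARS / syncCount
C     Threads touch only syncCount(1,myThid); elements 2..chunkInts
C     are never referenced and exist purely as padding.
\end{verbatim}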

\subsubsection{Operating system support for shared memory}

Applications running under multiple threads within a single process
can use shared memory communication.  In this case {\it all} the
memory locations in an application are potentially visible to all the
compute threads. Multiple threads operating within a single process is
the standard mechanism for supporting shared memory that the WRAPPER
utilizes. Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sec:multi_threaded_execution}.  However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP and IPC facilities in UNIX systems provide this capability, as do
vendor specific tools like LAPI and IMC.  Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sec:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory for
communication. For example, cluster systems consist of individual
computers connected by a fast network. On such systems there is no
notion of shared memory at the system level. For this sort of system
the WRAPPER provides support for communication based on a bespoke
communication library (see figure \ref{fig:comm_msg}).  The default
communication library used is MPI \cite{MPI-std-20}. However, it is
relatively straightforward to implement bindings to optimized platform
specific communication libraries. For example the work described in
\cite{hoe-hill:99} substituted standard MPI communication for a highly
optimized library.
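
To make the distributed memory mode concrete, the sketch below updates
one edge of a tile's overlap region with a single MPI call. This is
purely illustrative of what happens beneath the WRAPPER exchange
primitives; the buffer and neighbor names are hypothetical, and
application code never issues such calls directly.

\begin{verbatim}
C     Illustrative overlap-region update using MPI.
      INCLUDE 'mpif.h'
      INTEGER mpiRC, mpiStatus(MPI_STATUS_SIZE)
      REAL*8  eastEdge(sNy), westHalo(sNy)
C     Send my east interior edge to the east neighbor while
C     receiving my west overlap region from the west neighbor.
      CALL MPI_SENDRECV(
     &     eastEdge, sNy, MPI_DOUBLE_PRECISION, eastProc, 1,
     &     westHalo, sNy, MPI_DOUBLE_PRECISION, westProc, 1,
     &     MPI_COMM_WORLD, mpiStatus, mpiRC )
\end{verbatim}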

\subsection{Communication primitives}
\label{sec:communication_primitives}

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/comm-primm.eps}
 }
\end{center}
\caption{Three performance critical parallel primitives are provided
  by the WRAPPER. These primitives are always used to communicate data
  between tiles. The figure shows four tiles. The curved arrows
  indicate exchange primitives which transfer data between the overlap
  regions at tile edges and interior regions for nearest-neighbor
  tiles.  The straight arrows symbolize global sum operations which
  connect all tiles.  The global sum operation provides both a key
  arithmetic primitive and can serve as a synchronization primitive. A
  third barrier primitive is also provided; it behaves much like the
  global sum primitive.}
\label{fig:communication_primitives}
\end{figure}

Optimized communication support is assumed to be potentially available
for a small number of communication operations.  It is also assumed
that communication performance optimizations can be achieved by
optimizing a small number of communication primitives.  Three
optimizable primitives are provided by the WRAPPER.

\begin{itemize}
\item{\bf EXCHANGE} This operation is used to transfer data between
  interior and overlap regions of neighboring tiles. A number of
  different forms of this operation are supported. These different
  forms handle
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Exchange
    primitives select between using shared memory or distributed
    memory communication.
  \item Transformation operations required when transporting data
    between different grid regions. Transferring data between faces of
    a cube-sphere grid, for example, involves a rotation of vector
    components.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.
  \end{itemize}

\item{\bf GLOBAL SUM} The global sum operation is a central arithmetic
  operation for the pressure inversion phase of the MITgcm algorithm.
  For certain configurations scaling can be highly sensitive to the
  performance of the global sum primitive. This operation is a
  collective operation involving all tiles of the simulated domain.
  Different forms of the global sum primitive exist for handling
  \begin{itemize}
  \item Data type differences. Sixty-four bit and thirty-two bit
    fields may be handled separately.
  \item Bindings to different communication methods.  Global sum
    primitives select between using shared memory or distributed
    memory communication.
  \item Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the global sum
    primitives.
  \end{itemize}

\item{\bf BARRIER} The WRAPPER provides a global synchronization
  function called barrier. This is used to synchronize computations
  over all tiles.  The {\bf BARRIER} and {\bf GLOBAL SUM} primitives
  have much in common and in some cases use the same underlying code.
  A sketch of how these primitives appear in application-level code
  follows this list.
\end{itemize}
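
The sketch below shows how the global sum and barrier primitives might
appear in application-level code. The routine and variable names are
schematic (modeled on MITgcm conventions) rather than exact
interfaces.

\begin{verbatim}
C     Each thread accumulates a partial sum over the tile points
C     it owns ...
      sumPhi = 0.0D0
      DO j=1,sNy
       DO i=1,sNx
        sumPhi = sumPhi + phi(i,j)
       ENDDO
      ENDDO
C     ... the WRAPPER combines the partial sums over all tiles ...
      CALL GLOBAL_SUM_RL( sumPhi, myThid )
C     ... and all threads synchronize before the next phase.
      CALL BARRIER( myThid )
\end{verbatim}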

% [...]

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiling_detail.eps}
 }
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
  % [...]
}
\end{figure}

% [...]

Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics

\begin{itemize}
\item The machine consists of one or more logical processors.
\item Each processor operates on tiles that it owns.
\item A processor may own more than one tile.
\item Processors may compute concurrently.
\item Exchange of information between tiles is handled by the
  machine (WRAPPER) not by the application.
\end{itemize}
Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which
\begin{itemize}
\item Processors may be able to communicate very efficiently with each
  other using shared memory.
\item An alternative communication mechanism based on a relatively
  simple inter-process communication API may be required.
\item Shared memory may not necessarily obey sequential consistency;
  however some mechanism will exist for enforcing memory consistency.
\item Memory consistency that is enforced at the hardware level
  may be expensive. Unnecessary triggering of consistency protocols
  should be avoided.
\item Memory access patterns may need to be either repetitive or highly
  pipelined for optimum hardware performance.
\end{itemize}

This generic model captures the essential hardware ingredients
of almost all successful scientific computer systems designed in the
last 50 years.

\section{Using the WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:using_the_wrapper: -->
\end{rawhtml}

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are
\begin{enumerate}
\item specifying how a domain will be decomposed
\item starting a code in either sequential or parallel modes of operation
\item controlling communication between tiles and between concurrently
  computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sec:specifying_a_decomposition} explains how to express
the way in which a domain is decomposed (or composed). Section
\ref{sec:starting_the_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems.  Section \ref{sec:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sec:specifying_a_decomposition}

% [...]

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/size_h.eps}
 }
\end{center}
\caption{The three level domain decomposition hierarchy employed by the
  WRAPPER.
  % [...]
}
\end{figure}

% [...]

Within a {\em bi}, {\em bj} loop
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.
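
The generic shape of such a loop is sketched below; {\em myBxLo},
{\em myBxHi}, {\em myByLo} and {\em myByHi} delimit the range of tiles
assigned to thread {\em myThid}, while {\em phi}, {\em gPhi} and
{\em deltaT} are hypothetical names.

\begin{verbatim}
C     Generic shape of a tile loop. Each thread loops over the tiles
C     it has been assigned; different threads (and processes) work
C     on different (bi,bj) ranges concurrently.
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j=1,sNy
         DO i=1,sNx
          phi(i,j,bi,bj) = phi(i,j,bi,bj) + deltaT*gPhi(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
\end{verbatim}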
697    
698    An exception to the the use of {\em bi} and {\em bj} in loops arises in the
699    exchange routines used when the exch2 package is used with the cubed
700    sphere.  In this case {\em bj} is generally set to 1 and the loop runs from
701    1,{\em bi}.  Within the loop {\em bi} is used to retrieve the tile number,
702    which is then used to reference exchange parameters.
703    
The amount of computation that can be embedded within a single loop
over {\em bi} and {\em bj} varies for different parts of the MITgcm
algorithm. Figure \ref{fig:bibj_extract} shows a code extract

% [...]

The global domain size is again ninety grid points in x and
821  forty grid points in y. The two sub-domains in each process will be computed  forty grid points in y. The two sub-domains in each process will be computed
822  sequentially if they are given to a single thread within a single process.  sequentially if they are given to a single thread within a single process.
823  Alternatively if the code is invoked with multiple threads per process  Alternatively if the code is invoked with multiple threads per process
824  the two domains in y may be computed on concurrently.  the two domains in y may be computed concurrently.
The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT_MINIMAL    :: Minimal startup. Just enough to
       |                    :: allow basic I/O.
       |
       |--INI_PROCS         :: Associate processes with grid regions.
       |
       |--INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |       |
       |       |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                     :: communication data structures
       |
       |--CHECK_THREADS     :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN    :: Numerical code top-level driver routine

\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes the transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.
\label{fig:wrapper_startup}}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sec:multi_threaded_execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the variable
{\em myThid}.

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP:  {\em TARGET\_SUN}\\
CPP:  {\em TARGET\_DEC}\\
Parameter:  {\em nTx}\\
Parameter:  {\em nTy}
\end{minipage}
} \\

\subsubsection{Multi-process execution}
\label{sec:multi_process_execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for
this is that many system libraries are still not ``thread-safe''. This
means that, for example, on some systems it is not safe to call system
routines to perform I/O when running in multi-threaded mode (except,
perhaps, in a limited set of circumstances).  Another reason is that
support for multi-threaded programming models varies between systems.

Multi-process execution is more ubiquitous.  In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sec:specifying_a_decomposition}) is given (in which at least
one of the parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation, appropriate compile time
and run time steps must be taken.

\paragraph{Compilation} Multi-process execution under the WRAPPER
assumes that the portable, MPI libraries are available for controlling
the start-up of multiple processes. The MPI libraries are not
required, although they are usually used, for performance critical
communication. However, in order to simplify the task of controlling
and coordinating the start up of a large number (hundreds and possibly
even thousands) of copies of the same program, MPI is used. The calls
to the MPI multi-process startup routines must be activated at compile
time.  Currently MPI libraries are invoked by specifying the
appropriate options file with the {\tt -of} flag when running the {\em
genmake2} script, which generates the Makefile for compiling and
linking MITgcm.  (Previously this was done by setting the {\em
ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
CPP\_EEOPTIONS.h} file.)  More detailed information about the use of
{\em genmake2} for specifying local compiler flags is located in
section \ref{sec:genmake}.\\
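As an illustration, a hypothetical build from within an experiment's
build directory on a Linux/gfortran system might look like the
following (the options file name and directory layout here are
illustrative; choose the file in {\em tools/build\_options} that
matches your platform):
\begin{verbatim}
% ../../../tools/genmake2 -mods=../code -mpi \
     -of=../../../tools/build_options/linux_amd64_gfortran
% make depend
% make
\end{verbatim}
The {\tt -mpi} flag activates the MPI start-up calls in the generated
Makefile.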
\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
\paragraph{Execution} The mechanics of starting a program in
multi-process mode under MPI is not standardized. Documentation
associated with the distribution of MPI installed on a system will
describe how to start a program using that distribution.  For the
open-source MPICH system, the MITgcm program can be started using a
command such as
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of
processes that will be created. The numeric value {\em 64} must be
equal to the product of the processor grid settings of {\em nPx} and
{\em nPy} in the file {\em SIZE.h}. The parameter {\em mf} specifies
that a text file called ``mf'' will be read to get a list of processor
names on which the sixty-four processes will execute. The syntax of
this file is specified by the MPI distribution.
\\

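For reference, a decomposition consistent with the {\tt mpirun}
command above could use an eight by eight process grid. The tile
sizes and overlap widths below are placeholders, not a recommendation;
only the {\em nPx} and {\em nPy} settings matter for this example:
\begin{verbatim}
      PARAMETER (
     &           sNx =  30,
     &           sNy =  30,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =   8,
     &           nPy =   8)
\end{verbatim}
Here {\em nPx} times {\em nPy} is sixty-four, matching the
{\tt -np 64} argument.
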
\fbox{
\begin{minipage}{4.75in}
File: {\em SIZE.h}\\
Parameter: {\em nPx}\\
Parameter: {\em nPy}
\end{minipage}
} \\


\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting of
a special environment variable. On many machines this variable is
called PARALLEL and its value should be set to the number of parallel
threads required. Generally the help or manual pages associated with
the multi-threaded compiler on a machine will explain how to set the
required environment variables.
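For example, under a csh-style shell one might request four threads
with a command of the following form (the variable name is
system-dependent; consult the compiler documentation):
\begin{verbatim}
setenv PARALLEL 4
\end{verbatim}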
\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate the
number of threads to be used in the x and y directions.  The variables
{\em nTx} and {\em nTy} in this file are used to specify the
information required. The product of {\em nTx} and {\em nTy} must be
equal to the number of threads spawned i.e. the setting of the
environment variable PARALLEL.  The value of {\em nTx} must subdivide
the number of sub-domains in x ({\em nSx}) exactly. The value of {\em
nTy} must subdivide the number of sub-domains in y ({\em nSy})
exactly.
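A short sketch of an {\em eedata} file requesting four threads,
arranged two in x by two in y (this assumes {\em nSx} and {\em nSy}
are both divisible by two):
\begin{verbatim}
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}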
The multiprocess startup of the MITgcm executable {\em mitgcmuv} is
controlled by the routines {\em EEBOOT\_MINIMAL()} and {\em
INI\_PROCS()}. The first routine performs basic steps required to
make sure each process is started and has a textual output stream
associated with it. By default two output files are opened for each
process with names {\bf STDOUT.NNNN} and {\bf STDERR.NNNN}.  The {\bf
NNNN} part of the name is filled in with the process number so that
process number 0 will create output files {\bf STDOUT.0000} and {\bf
STDERR.0000}, process number 1 will create output files {\bf
STDOUT.0001} and {\bf STDERR.0001}, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process by process basis.  The {\em EEBOOT\_MINIMAL()}
procedure also sets the variables {\em myProcId} and {\em
MPI\_COMM\_MODEL}.  These variables are related to processor
identification and are used later in the routine {\em INI\_PROCS()}
to allocate tiles to processes.

Allocation of processes to tiles is controlled by the routine {\em
INI\_PROCS()}. For each process this routine sets the variables {\em
myXGlobalLo} and {\em myYGlobalLo}.  These variables specify, in
index space, the coordinates of the southernmost and westernmost
corner of the southernmost and westernmost tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN} are
also set in this routine. These are used to identify processes holding
tiles to the west, east, south and north of a given process. These
values are stored in global storage in the header file {\em
EESUPPORT.h} for use by communication routines.  The above does not
hold when the exch2 package is used.  The exch2 package sets its own
parameters to specify the global indices of tiles and their
relationships to each other.  See the documentation on the exch2
package (section \ref{sec:exch2}) for details.
\\

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/src/eeboot\_minimal.F}\\
File: {\em eesupp/src/ini\_procs.F}\\
File: {\em eesupp/inc/EESUPPORT.h}\\
Parameter: {\em myProcId}\\
Parameter: {\em MPI\_COMM\_MODEL}\\
Parameter: {\em myXGlobalLo}\\
Parameter: {\em myYGlobalLo}\\
Parameter: {\em pidW}\\
Parameter: {\em pidE}\\
Parameter: {\em pidS}\\
Parameter: {\em pidN}
\end{minipage}
} \\

\subsection{Controlling communication}
\label{sec:controlling_communication}
The WRAPPER maintains internal information that is used for communication
operations and that can be customized for different platforms. This section
describes the information that is held and used.

\begin{enumerate}
\item {\bf Tile-tile connectivity information}
  For each tile the WRAPPER sets a flag that records the tile number to
  the north, south, east and west of that tile. This number is unique
  over all tiles in a configuration. Except when using the cubed
  sphere and the exch2 package, the number is held in the variables
  {\em tileNo} (this holds the tile's own number), {\em tileNoN}, {\em
    tileNoS}, {\em tileNoE} and {\em tileNoW}. A parameter is also
  stored with each tile that specifies the type of communication that
  is used between tiles.  This information is held in the variables
  {\em tileCommModeN}, {\em tileCommModeS}, {\em tileCommModeE} and
  {\em tileCommModeW}.  This latter set of variables can take one of
  the following values: {\em COMM\_NONE}, {\em COMM\_MSG}, {\em
    COMM\_PUT} and {\em COMM\_GET}.  A value of {\em COMM\_NONE} is
  used to indicate that a tile has no neighbor to communicate with on
  a particular face. A value of {\em COMM\_MSG} is used to indicate
  that some form of distributed memory communication is required to
  communicate between these tile faces (see section
  \ref{sec:distributed_memory_communication}).  A value of {\em
    COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
  memory communication (see section
  \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value
  indicates that a CPU should communicate by writing to data
  structures owned by another CPU. A {\em COMM\_GET} value indicates
  that a CPU should communicate by reading from data structures owned
  by another CPU. These flags affect the behavior of the WRAPPER
  exchange primitive (see figure \ref{fig:communication_primitives}).
  The routine {\em ini\_communication\_patterns()} is responsible for
  setting the communication mode values for each tile.

  When using the cubed sphere configuration with the exch2 package,
  the relationships between tiles and their communication methods are
  set by the exch2 package and stored in different variables.  See the
  exch2 package documentation (section \ref{sec:exch2}) for details.

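  As an illustration (a sketch only, not actual WRAPPER source), an
  exchange routine could use these flags to select the transport
  mechanism for the west face of tile ({\em bi},{\em bj}) as follows:
\begin{verbatim}
C     Select the communication mechanism for the west face of a tile.
      IF     ( tileCommModeW(bi,bj) .EQ. COMM_NONE ) THEN
C       No neighbor on this face: nothing to exchange.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_MSG  ) THEN
C       Distributed memory: post message-passing send/receives.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT  ) THEN
C       Shared memory: write into data owned by the neighboring CPU.
      ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_GET  ) THEN
C       Shared memory: read from data owned by the neighboring CPU.
      ENDIF
\end{verbatim}
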
\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/src/ini\_communication\_patterns.F}\\
File: {\em eesupp/inc/EESUPPORT.h}\\
Parameter: {\em tileNo}\\
Parameter: {\em tileNoE}\\
Parameter: {\em tileNoW}\\
Parameter: {\em tileNoN}\\
Parameter: {\em tileNoS}\\
Parameter: {\em tileCommModeE}\\
Parameter: {\em tileCommModeW}\\
Parameter: {\em tileCommModeN}\\
Parameter: {\em tileCommModeS}
\end{minipage}
} \\

\item {\bf MP directives}
  The WRAPPER transfers control to numerical application code through
  the routine {\em THE\_MODEL\_MAIN}. This routine is called in a way
  that allows for it to be invoked by several threads. Support for
  this is based on either multi-processing (MP) compiler directives or
  specific calls to multi-threading libraries (\textit{e.g.}, POSIX
  threads).  Most commercially available Fortran compilers support the
  generation of code to spawn multiple threads through some form of
  compiler directives.  Compiler directives are generally more
  convenient than writing code that explicitly spawns threads and,
  on some systems, they may be the only method available.  The WRAPPER
  is distributed with template MP directives for a number of systems.

  These directives are inserted into the code just before and after
  the transfer of control to numerical algorithm code through the
  routine {\em THE\_MODEL\_MAIN}. Figure \ref{fig:mp_directives} shows
  an example of the code that performs this process for a Silicon
  Graphics system.  This code is extracted from the files {\em main.F}
  and {\em MAIN\_PDIRECTIVES1.h}. The variable {\em nThreads}
  specifies how many instances of the routine {\em THE\_MODEL\_MAIN}
  will be created. The value of {\em nThreads} is set in the routine
  {\em INI\_THREADING\_ENVIRONMENT}. The value is set equal to the
  product of the parameters {\em nTx} and {\em nTy} that are read from
  the file {\em eedata}. If the value of {\em nThreads} is
  inconsistent with the number of threads requested from the operating
  system (for example by using an environment variable as described in
  section \ref{sec:multi_threaded_execution}) then usually an error
  will be reported by the routine {\em CHECK\_THREADS}.

\fbox{
\begin{minipage}{4.75in}
File: {\em eesupp/inc/MAIN\_PDIRECTIVES1.h}\\
File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em eedata}\\
Parameter: {\em nThreads}\\
Parameter: {\em nTx}\\
Parameter: {\em nTy}
\end{minipage}
}

\item {\bf memsync flags}
  As discussed in section \ref{sec:memory_consistency}, a low-level
  system function may be needed to force memory consistency on some
  shared memory systems.  The routine {\em MEMSYNC()} is used for this
  purpose. This routine should not need modifying and the information
  below is only provided for completeness. A logical parameter {\em
    exchNeedsMemSync} set in the routine {\em
    INI\_COMMUNICATION\_PATTERNS()} controls whether the {\em
    MEMSYNC()} primitive is called. In general this routine is only
  used for multi-threaded execution.  The code that goes into the {\em
    MEMSYNC()} routine is specific to the compiler and processor used.
  In some cases, it must be written using a short code snippet of
  assembly language.  For an Ultra Sparc system the following code
  snippet is used
\begin{verbatim}
asm("membar #LoadStore|#StoreStore");
\end{verbatim}
while on an x86 system the following code snippet is used
\begin{verbatim}
asm("lock; addl $0,0(%%esp)": : :"memory")
\end{verbatim}

\item {\bf Cache line size}
  As discussed in section \ref{sec:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with
  excessive coherence traffic on an SMP system. To do this the shared
  memory data structures used by the {\em GLOBAL\_SUM}, {\em
    GLOBAL\_MAX} and {\em BARRIER} routines are padded. The variables
  that control the padding are set in the header file {\em
    EEPARAMS.h}. These variables are called {\em cacheLineSize}, {\em
    lShare1}, {\em lShare4} and {\em lShare8}. The default values
  should not normally need changing.

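  As a sketch of the padding idea (the array and COMMON block names
  here are illustrative, not the actual WRAPPER declarations), a
  shared per-thread accumulator can be dimensioned so that each
  thread's slot occupies its own cache line:
\begin{verbatim}
C     lShare8 64-bit words span one cache line, so thread myThid
C     writing sumPhi(1,myThid) never shares a line with another
C     thread and no false sharing occurs.
      Real*8 sumPhi( lShare8, MAX_NO_THREADS )
      COMMON /SUM_COMMON/ sumPhi
\end{verbatim}
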
\item {\bf \_BARRIER}
  This is a CPP macro that is expanded to a call to a routine which
  synchronizes all the logical processors running under the WRAPPER.
  Using a macro here preserves flexibility to insert a specialized
  call in-line into application code. By default this resolves to
  calling the procedure {\em BARRIER()}. The default setting for the
  \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.

1265  \item {\bf \_GSUM}  \item {\bf \_GSUM}
1266  This is a CPP macro that is expanded to a call to a routine    This is a CPP macro that is expanded to a call to a routine which
1267  which sums up a floating point numner    sums up a floating point number over all the logical processors
1268  over all the logical processors running under the    running under the WRAPPER. Using a macro here provides extra
1269  WRAPPER. Using a macro here provides extra flexibility to insert    flexibility to insert a specialized call in-line into application
1270  a specialized call in-line into application code. By default this    code. By default this resolves to calling the procedure {\em
1271  resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for      GLOBAL\_SUM\_R8()} ( for 64-bit floating point operands) or {\em
1272  84=bit floating point operands)      GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The
1273  or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default    default setting for the \_GSUM macro is given in the file {\em
1274  setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.      CPP\_EEMACROS.h}.  The \_GSUM macro is a performance critical
1275  The \_GSUM macro is a performance critical operation, especially for    operation, especially for large processor count, small tile size
1276  large processor count, small tile size configurations.    configurations.  The custom communication example discussed in
1277  The custom communication example discussed in section \ref{sec:jam_example}    section \ref{sec:jam_example} shows how the macro is used to invoke
1278  shows how the macro is used to invoke a custom global sum routine    a custom global sum routine for a specific set of hardware.
 for a specific set of hardware.  
1279    
\item {\bf \_EXCH}
  The \_EXCH CPP macro is used to update tile overlap regions.  It is
  qualified by a suffix indicating whether overlap updates are for
  two-dimensional (\_EXCH\_XY) or three dimensional (\_EXCH\_XYZ)
  physical fields and whether fields are 32-bit floating point
  (\_EXCH\_XY\_R4, \_EXCH\_XYZ\_R4) or 64-bit floating point
  (\_EXCH\_XY\_R8, \_EXCH\_XYZ\_R8). The macro mappings are defined in
  the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
  operation plays a crucial role in scaling to small tile, large
  logical and physical processor count configurations.  The example in
  section \ref{sec:jam_example} discusses defining an optimized and
  specialized form of the \_EXCH operation.

  The \_EXCH operation is also central to supporting grids such as the
  cube-sphere grid. In this class of grid a rotation may be required
  between tiles. Aligning the coordinate requiring rotation with the
  tile decomposition allows the coordinate transformation to be
  embedded within a custom form of the \_EXCH primitive.  In these
  cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
  package documentation (section \ref{sec:exch2}).
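  For example, after updating the interior points of a
  two-dimensional, 64-bit field one might refresh its overlap regions
  with a call of the following form (the field name {\em phi} is
  illustrative):
\begin{verbatim}
C     Update overlap regions of a 2d, 64-bit field.
      _EXCH_XY_R8( phi, myThid )
\end{verbatim}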

\item {\bf Reverse Mode}
  The communication primitives \_EXCH and \_GSUM both employ
  hand-written adjoint (or reverse mode) forms.  These reverse
  mode forms can be found in the source code directory {\em
    pkg/autodiff}.  For the global sum primitive the reverse mode form
  calls are to {\em GLOBAL\_ADSUM\_R4} and {\em GLOBAL\_ADSUM\_R8}.
  The reverse mode forms of the exchange primitives are found in
  routines prefixed {\em ADEXCH}. The exchange routines make calls to
  the same low-level communication primitives as the forward mode
  operations. However, the routine argument {\em simulationMode} is
  set to the value {\em REVERSE\_SIMULATION}. This signifies to the
  low-level routines that the adjoint forms of the appropriate
  communication operation should be performed.

\item {\bf MAX\_NO\_THREADS}
  The variable {\em MAX\_NO\_THREADS} is used to indicate the maximum
  number of OS threads that a code will use. This value defaults to
  thirty-two and is set in the file {\em EEPARAMS.h}.  For single
  threaded execution it can be reduced to one if required.  The value
  is largely private to the WRAPPER and application code will not
  normally reference the value, except in the following scenario.

  For certain physical parameterization schemes it is necessary to have
  a substantial number of work arrays. Where these arrays are
  allocated in heap storage (for example COMMON blocks) multi-threaded
  execution will require multiple instances of the COMMON block data.
  This can be achieved using a Fortran 90 module construct.  However,
  if this mechanism is unavailable then the work arrays can be extended
  with dimensions using the tile dimensioning scheme of {\em nSx} and
  {\em nSy} (as described in section
  \ref{sec:specifying_a_decomposition}). However, if the
  configuration being specified involves many more tiles than OS
  threads then it can save memory resources to reduce the variable
  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
  that will be used and to declare the physical parameterization work
  arrays with a single {\em MAX\_NO\_THREADS} extra dimension.  An
  example of this is given in the verification experiment {\em
    aim.5l\_cs}. Here the default setting of {\em MAX\_NO\_THREADS} is
  altered to
\begin{verbatim}
      INTEGER MAX_NO_THREADS
      PARAMETER ( MAX_NO_THREADS =    6 )
\end{verbatim}
  and several work arrays for storing intermediate calculations are
  created with declarations of the form:
\begin{verbatim}
      common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
  This declaration scheme is not used widely, because most global data
  is used for permanent not temporary storage of state information.
  In the case of permanent state information this approach cannot be
  used because there has to be enough storage allocated for all tiles.
  However, the technique can sometimes be a useful scheme for reducing
  memory requirements in complex physical parameterizations.
\end{enumerate}

\begin{figure}
{\footnotesize
\begin{verbatim}
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
}
\caption{Prior to transferring control to the procedure {\em
THE\_MODEL\_MAIN()} the WRAPPER may use MP directives to spawn
multiple threads.} \label{fig:mp_directives}
\end{figure}


\subsection{Specializing the communication code}
The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sec:jam_example}
On some platforms a big performance boost can be obtained by binding
the communication routines {\em \_EXCH} and {\em \_GSUM} to
specialized native libraries (for example, the shmem library on CRAY
T3E systems). The {\em LETS\_MAKE\_JAM} CPP flag is used as an
illustration of a specialized communication configuration that
substitutes for standard, portable forms of {\em \_EXCH} and {\em
  \_GSUM}. It affects three source files: {\em eeboot.F}, {\em
  CPP\_EEMACROS.h} and {\em cg2d.F}. When the flag is defined it has
the following effects.
\begin{itemize}
\item An extra phase is included at boot time to initialize the custom
  communications library (see {\em ini\_jam.F}).
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
  with calls to custom routines (see {\em gsum\_jam.F} and {\em
    exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
  for overlap regions of width one) is substituted into the elliptic
  solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern.
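A sketch of the kind of macro remapping involved (the custom routine
name below is purely illustrative):
\begin{verbatim}
#ifdef LETS_MAKE_JAM
#define _EXCH_XY_R8(a,b) CALL EXCH_XY_R8_JAM ( a, b )
#endif
\end{verbatim}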

\subsubsection{Cube sphere communication}
\label{sec:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from a
series of template files, for example {\em exch\_rx.template}.  This
is done to allow a large number of variations on the exchange process
to be maintained. One set of variations supports the cube sphere grid.
Support for a cube sphere grid in MITgcm is based on having each face
of the cube as a separate tile or tiles.  The exchange routines are
then able to absorb much of the detailed rotation and reorientation
required when moving around the cube grid. The set of {\em \_EXCH}
routines that contain the word cube in their name perform these
transformations.  They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for grid-centered
and for grid-face and grid-corner quantities.  Three sets of exchange
routines are defined. Routines with names of the form {\em exch\_rx}
are used to exchange cell centered scalar quantities. Routines with
names of the form {\em exch\_uv\_rx} are used to exchange vector
quantities located at the C-grid velocity points. The vector
quantities exchanged by the {\em exch\_uv\_rx} routines can either be
signed (for example velocity components) or un-signed (for example
grid-cell separations).  Routines with names of the form {\em
  exch\_z\_rx} are used to exchange quantities at the C-grid vorticity
point locations.
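The cubed sphere exchanges are selected at run time; a minimal {\em
eedata} sketch enabling them would contain:
\begin{verbatim}
 &EEPARMS
 useCubedSphereExchange=.TRUE.,
 &
\end{verbatim}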


\section{MITgcm execution under WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:mitgcm_wrapper: -->
\end{rawhtml}

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sec:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sec:calling_sequence}

WRAPPER layer.

{\footnotesize
\begin{verbatim}

       MAIN
       |
       |--EEBOOT_MINIMAL    :: Minimal startup. Just enough to
       |                    :: allow basic I/O.
       |
       |--INI_PROCS         :: Associate processes with grid regions.
       |
       |--INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |       |
       |       |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |                                     :: communication data structures
       |
       |--CHECK_THREADS     :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN    :: Numerical code top-level driver routine

\end{verbatim}
}

Core equations plus packages.

{\footnotesize
\begin{verbatim}
C
C Invocation from WRAPPER level...
C  :
C  :
C    | | |-CTRL_INIT            :: Control vector support package. see pkg/ctrl
C    | | |-OPTIM_READPARMS      :: Optimisation support package. see pkg/ctrl
C    | | |-GRDCHK_READPARMS     :: Gradient check package. see pkg/grdchk
C    | | |-ECCO_READPARMS       :: ECCO Support Package. see pkg/ecco
C    | | |-PTRACERS_READPARMS   :: multiple tracer package, see pkg/ptracers
C    | | |-GCHEM_READPARMS      :: tracer interface package, see pkg/gchem
C    | |
C    | |-PACKAGES_CHECK
C    | | |
C    | | |-KPP_CHECK            :: KPP Package. pkg/kpp
C    | | |-OBCS_CHECK           :: Open bndy Package. pkg/obcs
C    | | |-GMREDI_CHECK         :: GM Package. pkg/gmredi
C    | |
C    | |-PACKAGES_INIT_FIXED
C    | | |-OBCS_INIT_FIXED      :: Open bndy Package. see pkg/obcs
C    | | |-FLT_INIT             :: Floats Package. see pkg/flt
C    | | |-GCHEM_INIT_FIXED     :: tracer interface package, see pkg/gchem
C    | |
C    | |-ZONAL_FILT_INIT        :: FFT filter Package. see pkg/zonal_filt
C    | |
C    | |-INI_CG2D               :: 2d con. grad solver initialization.
C    | |
C    | |-INI_CG3D               :: 3d con. grad solver initialization.
C    | |
C    | |-CONFIG_SUMMARY         :: Provide synopsis of kernel setup.
C    |                          :: Includes annotated table of kernel
C    | | |
C    | | |-INI_CORI     :: Set coriolis term. zero, f-plane, beta-plane,
C    | | |              :: sphere options are coded.
C    | | |
C    | | |-INI_CG2D     :: 2d con. grad solver initialization.
C    | | |-INI_CG3D     :: 3d con. grad solver initialization.
C    | | |-INI_MIXING   :: Initialize diapycnal diffusivity.
C    | | |-INI_DYNVARS  :: Initialize to zero all DYNVARS.h arrays (dynamical
C    | | |              :: fields).
C    | | |
C    | | |-INI_FIELDS   :: Control initializing model fields to non-zero
C    | | | |-INI_VEL    :: Initialize 3D flow field.
C    | | | |-INI_THETA  :: Set model initial temperature field.
C    | | | |-INI_SALT   :: Set model initial salinity field.
C    | | | |-INI_PSURF  :: Set model initial free-surface height/pressure.
C    | | | |-INI_PRESSURE :: Compute model initial hydrostatic pressure
C    | | | |-READ_CHECKPOINT :: Read the checkpoint
C    | | |
C    | | |-THE_CORRECTION_STEP :: Step forward to next time step.
C    | | | |                   :: Here applied to move restart conditions
C    | | | |-FIND_RHO  :: Find adjacent densities.
C    | | | |-CONVECT   :: Mix static instability.
C    | | | |-TIMEAVE_CUMULATE :: Update convection statistics.
C    | | |
C    | | |-PACKAGES_INIT_VARIABLES :: Does initialization of time evolving
C    | | | |                       :: package data.
C    | | | |
C    | | | |-GMREDI_INIT          :: GM package. ( see pkg/gmredi )
C    | | | |-KPP_INIT             :: KPP package. ( see pkg/kpp )
C    | | | |-KPP_OPEN_DIAGS
C    | | | |-OBCS_INIT_VARIABLES  :: Open bndy. package. ( see pkg/obcs )
C    | | | |-PTRACERS_INIT        :: multi. tracer package. ( see pkg/ptracers )
C    | | | |-GCHEM_INIT           :: tracer interface pkg. ( see pkg/gchem )
C    | | | |-AIM_INIT             :: Interm. atmos package. ( see pkg/aim )
C    | | | |-CTRL_MAP_INI         :: Control vector package. ( see pkg/ctrl )
C    | | | |-COST_INIT            :: Cost function package. ( see pkg/cost )
C/\  | | | |                    :: Simple interpolation procedure
C/\  | | | |                    :: for forcing datasets.
C/\  | | | |
C/\  | | | |-EXCH :: Sync forcing. in overlap regions.
C/\  | | |-SEAICE_MODEL   :: Compute sea-ice terms. ( pkg/seaice )
C/\  | | |-FREEZE         :: Limit surface temperature.
C/\  | | |-GCHEM_FIELD_LOAD :: load tracer forcing fields ( pkg/gchem )
C/\  | | |
C/\  | | |-THERMODYNAMICS :: theta, salt + tracer equations driver.
C/\  | | | |
C/\  | | | |-INTEGRATE_FOR_W :: Integrate for vertical velocity.
C/\  | | | |-OBCS_APPLY_W    :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |-FIND_RHO        :: Calculates [rho(S,T,z)-RhoConst] of a slice
C/\  | | | |-GRAD_SIGMA      :: Calculate isoneutral gradients
C/\  | | | |-CALC_IVDC       :: Set Implicit Vertical Diffusivity for Convection
C/\  | | | |
C/\  | | | |-OBCS_CALC            :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |-EXTERNAL_FORCING_SURF:: Accumulates appropriately dimensioned
C/\  | | | | |                    :: forcing terms.
C/\  | | | | |-PTRACERS_FORCING_SURF :: Tracer package ( see pkg/ptracers ).
C/\  | | | |
C/\  | | | |-GMREDI_CALC_TENSOR   :: GM package ( see pkg/gmredi ).
C/\  | | | |-GMREDI_CALC_TENSOR_DUMMY :: GM package ( see pkg/gmredi ).
C/\  | | | |
C/\  | | | |-CALC_GT              :: Calculate the temperature tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_T  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_T :: Problem specific forcing for temperature.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gt for free-surface height.
C/\  | | | |
C/\  | | | |-TIMESTEP_TRACER      :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-CALC_GS              :: Calculate the salinity tendency terms
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_S  :: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-EXTERNAL_FORCING_S :: Problem specific forcing for salt.
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | |
C/\  | | | |-TIMESTEP_TRACER      :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-PTRACERS_INTEGRATE   :: Integrate other tracer(s) ( see pkg/ptracers ).
C/\  | | | | |
C/\  | | | | |-GAD_CALC_RHS       :: Generalised advection package
C/\  | | | | | |                  :: ( see pkg/gad )
C/\  | | | | | |-KPP_TRANSPORT_PTR:: KPP non-local transport ( see pkg/kpp ).
C/\  | | | | |
C/\  | | | | |-PTRACERS_FORCING   :: Problem specific forcing for tracer.
C/\  | | | | |-GCHEM_FORCING_INT  :: tracer forcing for gchem pkg (if all
C/\  | | | | |                       tendency terms calculated together)
C/\  | | | | |-ADAMS_BASHFORTH2   :: Extrapolate tendencies forward in time.
C/\  | | | | |-FREESURF_RESCALE_G :: Re-scale Gs for free-surface height.
C/\  | | | | |-TIMESTEP_TRACER    :: Step tracer field forward in time
C/\  | | | |
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package ( see pkg/obcs ).
C/\  | | | |
C/\  | | | |-IMPLDIFF             :: Solve vertical implicit diffusion equation.
C/\  | | | |-OBCS_APPLY_TS        :: Open bndy. package ( see pkg/obcs ).
C/\  | | |
C/\  | | |-DO_FIELDS_BLOCKING_EXCHANGES :: Sync up overlap regions.
C/\  | | | |-EXCH
C/\  | | |
C/\  | | |-GCHEM_FORCING_SEP :: tracer forcing for gchem pkg (if
C/\  | | |                      tracer dependent tendencies calculated
C/\  | | |                      separately)
C/\  | | |
C/\  | | |-FLT_MAIN         :: Float package ( pkg/flt ).
C/\  | | |
C/\  | | |-MONITOR          :: Monitor package ( pkg/monitor ).
C/\  | | | |-TIMEAVE_STATV_WRITE :: Time averages. see pkg/timeave
C/\  | | | |-AIM_WRITE_DIAGS     :: Intermed. atmos diags. see pkg/aim
C/\  | | | |-GMREDI_DIAGS        :: GM diags. see pkg/gmredi
C/\  | | | |-KPP_DO_DIAGS        :: KPP diags. see pkg/kpp
C/\  | | | |-SBO_CALC            :: SBO diags. see pkg/sbo
C/\  | | | |-SBO_DIAGS           :: SBO diags. see pkg/sbo
C/\  | | | |-SEAICE_DO_DIAGS     :: SEAICE diags. see pkg/seaice
C/\  | | | |-GCHEM_DIAGS         :: gchem diags. see pkg/gchem
C/\  | | |
C/\  | | |-WRITE_CHECKPOINT :: Do I/O for restart files.
C/\  | |
C    |-TIMER_PRINTALL :: Computational timing summary
C    |
C    |-COMM_STATS     :: Summarise inter-proc and inter-thread communication
C                     :: events.
C
\end{verbatim}
}
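
The \texttt{ADAMS\_BASHFORTH2} entries that appear under each of the
tendency routines above (\texttt{CALC\_GT}, \texttt{CALC\_GS},
\texttt{PTRACERS\_INTEGRATE}) all perform the same operation: before a
field is stepped forward by \texttt{TIMESTEP\_TRACER}, its tendency $G$
is extrapolated in time with the quasi-second-order Adams-Bashforth
scheme,
\[
G^{(n+1/2)} = \left( \frac{3}{2}+\epsilon_{AB} \right) G^{(n)}
            - \left( \frac{1}{2}+\epsilon_{AB} \right) G^{(n-1)} ,
\]
where the small offset $\epsilon_{AB}$ (the run-time parameter
\texttt{abEps}) is included to stabilize the scheme.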
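
Most of the routines flagged ``( see pkg/\ldots )'' in the tree are
optional and are reached through a two-level guard: the call is
compiled in only when the package's CPP flag is defined, and executed
only when the corresponding run-time flag is set in \texttt{data.pkg}.
The sketch below illustrates the pattern for the \texttt{gchem} hook
shown above; the argument list is indicative rather than definitive.

{\footnotesize
\begin{verbatim}
C     Guarded call to an optional package routine, as it would
C     appear in FORWARD_STEP. ALLOW_GCHEM is the compile-time CPP
C     flag for pkg/gchem; useGCHEM is its run-time switch.
#ifdef ALLOW_GCHEM
      IF ( useGCHEM ) THEN
        CALL GCHEM_FORCING_SEP( myTime, myIter, myThid )
      ENDIF
#endif
\end{verbatim}
}
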
\subsection{Measuring and Characterizing Performance}
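
The \texttt{TIMER\_PRINTALL} summary at the end of the call tree above
is assembled from timer calls placed around the major code sections.
A minimal sketch of how a section is instrumented follows; the label
string is arbitrary (by convention it names both the timed section and
the enclosing routine), and the routine timed here is purely
illustrative.

{\footnotesize
\begin{verbatim}
C     Accumulate elapsed time for one code section under a named
C     counter; totals for every label are reported by TIMER_PRINTALL
C     when the run completes.
      CALL TIMER_START( 'THERMODYNAMICS     [FORWARD_STEP]', myThid )
      CALL THERMODYNAMICS( myTime, myIter, myThid )
      CALL TIMER_STOP ( 'THERMODYNAMICS     [FORWARD_STEP]', myThid )
\end{verbatim}
}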
