Diff of /manual/s_software/text/sarch.tex


-revision 1.4 by cnh,
Thu Oct 25 18:36:55 2001 UTC
+revision 1.6 by adcroft,
Tue Nov 13 20:13:55 2001 UTC
 Line 28 
 of
  \begin{enumerate}
  \item A core set of numerical and support code. This is discussed in detail in
- section \ref{sec:partII}.
+ section \ref{sect:partII}.
  \item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
  for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
  These packages are used both to overlay alternate dynamics and to introduce
 Line 74 
 Environment Resource). All numerical and
  to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
  the WRAPPER means that coding has to follow certain, relatively
  straightforward, rules and conventions ( these are discussed further in
- section \ref{sec:specifying_a_decomposition} ).
+ section \ref{sect:specifying_a_decomposition} ).
  The approach taken by the WRAPPER is illustrated in figure
  \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
 Line 98 
 optimized for that platform.}
  \end{figure}
  \subsection{Target hardware}
- \label{sec:target_hardware}
+ \label{sect:target_hardware}
  The WRAPPER is designed to target as broad as possible a range of computer
  systems. The original development of the WRAPPER took place on a
 Line 118 
 native optimizations for a particular sy
  \subsection{Supporting hardware neutrality}
- The different systems listed in section \ref{sec:target_hardware} can be
+ The different systems listed in section \ref{sect:target_hardware} can be
  categorized in many different ways. For example, one common distinction is
  between shared-memory parallel systems (SMP's, PVP's) and distributed memory
  parallel systems (for example x86 clusters and large MPP systems). This is one
 Line 136 
 particular machine (for example an IBM S
  class of machines (for example Parallel Vector Processor Systems). Instead the
  WRAPPER provides applications with an
  abstract {\it machine model}. The machine model is very general; however, it can
- easily be specialized to fit, in a computationally effificent manner, any
+ easily be specialized to fit, in a computationally efficient manner, any
  computer architecture currently available to the scientific computing community.
  \subsection{Machine model parallelism}
   Codes operating under the WRAPPER target an abstract machine that is assumed to
  consist of one or more logical processors that can compute concurrently.
- Computational work is divided amongst the logical
+ Computational work is divided among the logical
  processors by allocating ``ownership'' to
  each processor of a certain set (or sets) of calculations. Each set of
  calculations owned by a particular processor is associated with a specific
 Line 211 
 computational phases a processor will re
  whenever it requires values that lie outside the domain it owns. Periodically
  processors will make calls to WRAPPER functions to communicate data between
  tiles, in order to keep the overlap regions up to date (see section
- \ref{sec:communication_primitives}). The WRAPPER functions can use a
+ \ref{sect:communication_primitives}). The WRAPPER functions can use a
  variety of different mechanisms to communicate data between tiles.
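
The overlap update described here can be pictured with a minimal one-dimensional sketch. It is an illustration under stated assumptions, not WRAPPER code: the names `make_tile`, `exchange` and `OLX`, the overlap width of one point, and the use of Python lists are all hypothetical choices made for this example.

```python
# Hypothetical 1-D illustration of overlap (halo) updates between two tiles.
# make_tile, exchange and OLX are illustrative names, not WRAPPER routines.

OLX = 1  # overlap width in points (assumed value)

def make_tile(interior):
    """Return a tile: interior values padded with OLX overlap points per side."""
    return [0.0] * OLX + list(interior) + [0.0] * OLX

def exchange(west, east):
    """Copy edge interior values into the neighbouring tile's overlap region."""
    n_w = len(west) - 2 * OLX  # interior length of the west tile
    # east tile's western overlap <- west tile's easternmost interior points
    east[:OLX] = west[n_w:n_w + OLX]
    # west tile's eastern overlap <- east tile's westernmost interior points
    west[OLX + n_w:] = east[OLX:2 * OLX]

west = make_tile([1.0, 2.0, 3.0])
east = make_tile([4.0, 5.0, 6.0])
exchange(west, east)
print(west)  # [0.0, 1.0, 2.0, 3.0, 4.0]
print(east)  # [3.0, 4.0, 5.0, 6.0, 0.0]
```

After the call each tile can read its neighbour's edge value from its own overlap region; the outermost overlap points stay at zero because this sketch gives them no neighbour.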
  \begin{figure}
 Line 298 
 value to be communicated between CPU's.
  \end{figure}
  \subsection{Shared memory communication}
- \label{sec:shared_memory_communication}
+ \label{sect:shared_memory_communication}
  Under shared memory communication independent CPU's are operating
  on the exact same global address space at the application level.
 Line 324 
 the systems main-memory interconnect. Th
  communication very efficient provided it is used appropriately.
  \subsubsection{Memory consistency}
- \label{sec:memory_consistency}
+ \label{sect:memory_consistency}
  When using shared memory communication between
  multiple processors the WRAPPER level shields user applications from
 Line 348 
 memory, the WRAPPER provides a place to
  ensure memory consistency for a particular platform.
  \subsubsection{Cache effects and false sharing}
- \label{sec:cache_effects_and_false_sharing}
+ \label{sect:cache_effects_and_false_sharing}
  Shared-memory machines often have processor-local memory caches
  which contain mirrored copies of main memory. Automatic cache-coherence
 Line 367 
 in an application are potentially visibl
  threads operating within a single process is the standard mechanism for
  supporting shared memory that the WRAPPER utilizes. Configuring and launching
  code to run in multi-threaded mode on specific platforms is discussed in
- section \ref{sec:running_with_threads}.  However, on many systems, potentially
+ section \ref{sect:running_with_threads}.  However, on many systems, potentially
  very efficient mechanisms for using shared memory communication between
  multiple processes (in contrast to multiple threads within a single
  process) also exist. In most cases this works by making a limited region of
 Line 380 
 distributed with the default WRAPPER sou
  nature.
  \subsection{Distributed memory communication}
- \label{sec:distributed_memory_communication}
+ \label{sect:distributed_memory_communication}
  Many parallel systems are not constructed in a way where it is
  possible or practical for an application to use shared memory
  for communication. For example cluster systems consist of individual computers
 Line 394 
 described in \ref{hoe-hill:99} substitut
  highly optimized library.
  \subsection{Communication primitives}
- \label{sec:communication_primitives}
+ \label{sect:communication_primitives}
  \begin{figure}
  \begin{center}
 Line 402 
 highly optimized library.
    \includegraphics{part4/comm-primm.eps}
   }
  \end{center}
- \caption{Three performance critical parallel primititives are provided
+ \caption{Three performance critical parallel primitives are provided
- by the WRAPPER. These primititives are always used to communicate data
+ by the WRAPPER. These primitives are always used to communicate data
  between tiles. The figure shows four tiles. The curved arrows indicate
  exchange primitives which transfer data between the overlap regions at tile
  edges and interior regions for nearest-neighbor tiles.
 Line 538 
 WRAPPER are
  computing CPU's.
  \end{enumerate}
  This section describes the details of each of these operations.
- Section \ref{sec:specifying_a_decomposition} explains how the way in which
+ Section \ref{sect:specifying_a_decomposition} explains how the way in which
  a domain is decomposed (or composed) is expressed. Section
- \ref{sec:starting_the_code} describes practical details of running codes
+ \ref{sect:starting_the_code} describes practical details of running codes
  in various different parallel modes on contemporary computer systems.
- Section \ref{sec:controlling_communication} explains the internal information
+ Section \ref{sect:controlling_communication} explains the internal information
  that the WRAPPER uses to control how information is communicated between
  tiles.
  \subsection{Specifying a domain decomposition}
- \label{sec:specifying_a_decomposition}
+ \label{sect:specifying_a_decomposition}
  At its heart much of the WRAPPER works only in terms of a collection of tiles
  which are interconnected to each other. This is also true of application
 Line 599 
 be created within a single process. Each
  dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
  allocated to different threads of a process that are then bound to
  different physical processors ( see the multi-threaded
- execution discussion in section \ref{sec:starting_the_code} ) then
+ execution discussion in section \ref{sect:starting_the_code} ) then
  computation will be performed concurrently on each tile. However, it is also
  possible to run the same decomposition within a process running a single thread on
  a single processor. In this case the tiles will be computed sequentially.
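
The point that one decomposition can be computed either concurrently or sequentially can be sketched as follows. This is a hedged illustration only: the values of `sNx`, `sNy`, `nSx` and `nSy`, the tile indices `bi` and `bj`, and the helper functions are assumptions made for the example, and Python threads stand in for the WRAPPER's threads without offering true parallel speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

sNx, sNy = 4, 3   # tile extent in x and y (assumed values)
nSx, nSy = 2, 2   # tiles per process in x and y (assumed values)

def tile_bounds(bi, bj):
    """Global index range covered by tile (bi, bj), as half-open ranges."""
    return (bi * sNx, (bi + 1) * sNx), (bj * sNy, (bj + 1) * sNy)

def compute_tile(tile):
    """Stand-in for per-tile work: count the points the tile owns."""
    (x0, x1), (y0, y1) = tile_bounds(*tile)
    return (x1 - x0) * (y1 - y0)

tiles = [(bi, bj) for bj in range(nSy) for bi in range(nSx)]

# One thread: tiles are visited one after another.
sequential = [compute_tile(t) for t in tiles]

# Several threads: the same tiles, handed to a pool of workers.
with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
    threaded = list(pool.map(compute_tile, tiles))

assert sequential == threaded
print(sum(sequential))  # 48 = (sNx*nSx) * (sNy*nSy)
```

The per-tile routine is identical in both cases; only the way tiles are handed to threads changes, which is the property the text relies on.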
 Line 790 
 There are six tiles allocated to six sep
  This set of values can be used for a cube sphere calculation.
  Each tile of size $32 \times 32$ represents a face of the
  cube. Initializing the tile connectivity correctly ( see section
- \ref{sec:cube_sphere_communication}) allows the rotations associated with
+ \ref{sect:cube_sphere_communication}) allows the rotations associated with
  moving between the six cube faces to be embedded within the
  tile-tile communication code.
  \end{enumerate}
  \subsection{Starting the code}
- \label{sec:starting_the_code}
+ \label{sect:starting_the_code}
  When code is started under the WRAPPER, execution begins in a main routine {\em
  eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
  to the application through a routine called {\em THE\_MODEL\_MAIN()}
 Line 842 
 occurs through the procedure {\em THE\_M
  \end{figure}
  \subsubsection{Multi-threaded execution}
- \label{sec:multi-threaded-execution}
+ \label{sect:multi-threaded-execution}
  Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
  WRAPPER may cause several coarse grain threads to be initialized. The routine
  {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
  stack argument which is the thread number, stored in the
  variable {\em myThid}. In addition to specifying a decomposition with
- multiple tiles per process ( see section \ref{sec:specifying_a_decomposition})
+ multiple tiles per process ( see section \ref{sect:specifying_a_decomposition})
  configuring and starting a code to run using multiple threads requires the following
  steps.\\
 Line 930 
 Parameter:  {\em nTy}
  } \\
  \subsubsection{Multi-process execution}
- \label{sec:multi-process-execution}
+ \label{sect:multi-process-execution}
  Despite its appealing programming model, multi-threaded execution remains
  less common than multi-process execution. One major reason for this
 Line 942 
 models varies between systems.
  Multi-process execution is more ubiquitous.
  In order to run code in a multi-process configuration a decomposition
- specification ( see section \ref{sec:specifying_a_decomposition})
+ specification ( see section \ref{sect:specifying_a_decomposition})
  is given ( in which at least one of the
  parameters {\em nPx} or {\em nPy} will be greater than one)
  and then, as for multi-threaded operation,
 Line 1006 
 using a command such as
  \begin{verbatim}
  mpirun -np 64 -machinefile mf ./mitgcmuv
  \end{verbatim}
- In this example the text {\em -np 64} specifices the number of processes
+ In this example the text {\em -np 64} specifies the number of processes
  that will be created. The numeric value {\em 64} must be equal to the
  product of the processor grid settings of {\em nPx} and {\em nPy}
  in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
 Line 1112 
 A value of {\em COMM\_NONE} is used to i
  neighbor to communicate with on a particular face. A value
  of {\em COMM\_MSG} is used to indicate that some form of distributed
  memory communication is required to communicate between
- these tile faces ( see section \ref{sec:distributed_memory_communication}).
+ these tile faces ( see section \ref{sect:distributed_memory_communication}).
  A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
  forms of shared memory communication ( see section
- \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value indicates
+ \ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
  that a CPU should communicate by writing to data structures owned by another
  CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
  from data structures owned by another CPU. These flags affect the behavior
 Line 1166 
 the product of the parameters {\em nTx}
  are read from the file {\em eedata}. If the value of {\em nThreads}
  is inconsistent with the number of threads requested from the
  operating system (for example by using an environment
- variable as described in section \ref{sec:multi-threaded-execution})
+ variable as described in section \ref{sect:multi-threaded-execution})
  then usually an error will be reported by the routine
  {\em CHECK\_THREADS}.\\
 Line 1184 
 Parameter: {\em nTy} \\
  }
  \item {\bf memsync flags}
- As discussed in section \ref{sec:memory_consistency}, when using shared memory,
+ As discussed in section \ref{sect:memory_consistency}, when using shared memory,
  a low-level system function may be needed to force memory consistency.
  The routine {\em MEMSYNC()} is used for this purpose. This routine should
  not need modifying and the information below is only provided for
 Line 1210 
 asm("lock; addl $0,0(%%esp)": : :"memory
  \end{verbatim}
  \item {\bf Cache line size}
- As discussed in section \ref{sec:cache_effects_and_false_sharing},
+ As discussed in section \ref{sect:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with excessive
- coherence traffic on an SMP system. To do this the sgared memory data structures
+ coherence traffic on an SMP system. To do this the shared memory data structures
  used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
  are padded. The variables that control the padding are set in the
  header file {\em EEPARAMS.h}. These variables are called
 Line 1220 
 header file {\em EEPARAMS.h}. These vari
  {\em lShare8}. The default values should not normally need changing.
  \item {\bf \_BARRIER}
  This is a CPP macro that is expanded to a call to a routine
- which synchronises all the logical processors running under the
+ which synchronizes all the logical processors running under the
  WRAPPER. Using a macro here preserves flexibility to insert
  a specialized call in-line into application code. By default this
  resolves to calling the procedure {\em BARRIER()}. The default
 Line 1228 
 setting for the \_BARRIER macro is given
  \item {\bf \_GSUM}
  This is a CPP macro that is expanded to a call to a routine
- which sums up a floating point numner
+ which sums up a floating point number
  over all the logical processors running under the
  WRAPPER. Using a macro here provides extra flexibility to insert
  a specialized call in-line into application code. By default this
- resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for
+ resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
- 64=bit floating point operands)
+ 64-bit floating point operands)
- or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default
+ or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
  setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
  The \_GSUM macro is a performance critical operation, especially for
  large processor count, small tile size configurations.
- The custom communication example discussed in section \ref{sec:jam_example}
+ The custom communication example discussed in section \ref{sect:jam_example}
  shows how the macro is used to invoke a custom global sum routine
  for a specific set of hardware.
 Line 1252 
 physical fields and whether fields are 3
  in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
  \_EXCH operation plays a crucial role in scaling to small tile,
  large logical and physical processor count configurations.
- The example in section \ref{sec:jam_example} discusses defining an
+ The example in section \ref{sect:jam_example} discusses defining an
- optimised and specialized form of the \_EXCH operation.
+ optimized and specialized form of the \_EXCH operation.
  The \_EXCH operation is also central to supporting grids such as
  the cube-sphere grid. In this class of grid a rotation may be required
  between tiles. Aligning the coordinate requiring rotation with the
- tile decomposistion allows the coordinate transformation to
+ tile decomposition allows the coordinate transformation to
  be embedded within a custom form of the \_EXCH primitive.
  \item {\bf Reverse Mode}
  The communication primitives \_EXCH and \_GSUM both employ
  hand-written adjoint (or reverse mode) forms.
  These reverse mode forms can be found in the
- sourc code directory {\em pkg/autodiff}.
+ source code directory {\em pkg/autodiff}.
  For the global sum primitive the reverse mode form
  calls are to {\em GLOBAL\_ADSUM\_R4} and
  {\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
- exchamge primitives are found in routines
+ exchange primitives are found in routines
  prefixed {\em ADEXCH}. The exchange routines make calls to
  the same low-level communication primitives as the forward mode
  operations. However, the routine argument {\em simulationMode}
 Line 1281 
 The variable {\em MAX\_NO\_THREADS} is u
  maximum number of OS threads that a code will use. This
  value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
  For single threaded execution it can be reduced to one if required.
- The va;lue is largely private to the WRAPPER and application code
+ The value is largely private to the WRAPPER and application code
  will not normally reference the value, except in the following scenario.
  For certain physical parametrization schemes it is necessary to have
 Line 1292 
 This can be achieved using a Fortran 90
  if this might be unavailable then the work arrays can be extended
  with dimensions that use the tile dimensioning scheme of {\em nSx}
  and {\em nSy} ( as described in section
- \ref{sec:specifying_a_decomposition}). However, if the configuration
+ \ref{sect:specifying_a_decomposition}). However, if the configuration
  being specified involves many more tiles than OS threads then
  it can save memory resources to reduce the variable
  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
- will be used and to declare the physical parameterisation
+ will be used and to declare the physical parameterization
- work arrays with a sinble {\em MAX\_NO\_THREADS} extra dimension.
+ work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
  An example of this is given in the verification experiment
  {\em aim.5l\_cs}. Here the default setting of
  {\em MAX\_NO\_THREADS} is altered to
 Line 1310 
 created with declarations of the form.
  \begin{verbatim}
        common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
  \end{verbatim}
- This declaration scheme is not used widely, becuase most global data
+ This declaration scheme is not used widely, because most global data
  is used for permanent, not temporary, storage of state information.
  In the case of permanent state information this approach cannot be used
  because there has to be enough storage allocated for all tiles.
  However, the technique can sometimes be a useful scheme for reducing memory
- requirements in complex physical paramterisations.
+ requirements in complex physical parameterizations.
  \end{enumerate}
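
The memory trade-off between per-tile and per-thread work arrays can be sketched as follows. This is a serial walk-through under stated assumptions: the sizes `ngp`, `nSx`, `nSy`, the reduced `MAX_NO_THREADS` value, the 1-based thread number, and the assignment of tiles to threads are all hypothetical and chosen only for illustration.

```python
ngp = 1000            # points per scratch array (assumed value)
nSx, nSy = 6, 4       # tiles per process (assumed values)
MAX_NO_THREADS = 2    # reduced to the number of threads actually used

# Per-tile scratch storage: one slot for every tile.
per_tile = [[0.0] * ngp for _ in range(nSx * nSy)]

# Per-thread scratch storage: one slot per thread, reused tile after tile.
per_thread = [[0.0] * ngp for _ in range(MAX_NO_THREADS)]

def work_on_tile(tile_id, my_thid):
    """Fill the calling thread's scratch slot, then reduce it to one number."""
    scratch = per_thread[my_thid - 1]      # thread number taken as 1-based (assumption)
    for i in range(ngp):
        scratch[i] = float(tile_id)
    return sum(scratch)

# Thread 1 handles even tiles, thread 2 odd tiles (hypothetical assignment).
results = [work_on_tile(t, 1 + t % MAX_NO_THREADS) for t in range(nSx * nSy)]

print(len(per_tile) * ngp)    # 24000 values
print(len(per_thread) * ngp)  # 2000 values
```

Because each scratch slot is overwritten for every tile a thread visits, this layout only suits temporary storage, which matches the restriction stated in the text for permanent state information.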
  \begin{figure}
 Line 1348 
 MP directives to spawn multiple threads.
  The isolation of performance critical communication primitives and the
  sub-division of the simulation domain into tiles is a powerful tool.
  Here we show how it can be used to improve application performance and
  how it can be used to adapt to new gridding approaches.
  \subsubsection{JAM example}
- \label{sec:jam_example}
+ \label{sect:jam_example}
  On some platforms a big performance boost can be obtained by
  binding the communication routines {\em \_EXCH} and
  {\em \_GSUM} to specialized native libraries ( for example the
 Line 1374 
 Developing specialized code for other li
  pattern.
  \subsubsection{Cube sphere communication}
- \label{sec:cube_sphere_communication}
+ \label{sect:cube_sphere_communication}
  Actual {\em \_EXCH} routine code is generated automatically from
  a series of template files, for example {\em exch\_rx.template}.
  This is done to allow a large number of variations on the exchange
 Line 1407 
 quantities at the C-grid vorticity point
  Fitting together the WRAPPER elements, package elements and
  MITgcm core equation elements of the source code produces the calling
- sequence shown in section \ref{sec:calling_sequence}.
+ sequence shown in section \ref{sect:calling_sequence}.
  \subsection{Annotated call tree for MITgcm and WRAPPER}
- \label{sec:calling_sequence}
+ \label{sect:calling_sequence}
  WRAPPER layer.

 Legend:
-Removed from v.1.4
 changed lines
+Added in v.1.6