% $Header$

This chapter focuses on describing the {\bf WRAPPER} environment within which
both the core numerics and the pluggable packages operate. The description
presented here is intended to be a detailed exposition and contains significant
background material, as well as advanced details on working with the WRAPPER.
The tutorial sections of this manual (see sections
\ref{sect:tutorials} and \ref{sect:tutorialIII})
contain more succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For many
projects simply starting from an example code and adapting it to suit a
particular situation
will be all that is required.

The first part of this chapter discusses the MITgcm architecture at an
abstract level. In the second part of the chapter we describe practical
details of the MITgcm implementation and of current tools and operating system
features that are employed.

\section{Overall architectural goals}

three-fold

\begin{itemize}
\item We wish to be able to study a very broad range
of interesting and challenging rotating fluids problems.
\item We wish the model code to be readily targeted to
a wide range of platforms.
\item On any given platform we would like to be
able to achieve performance comparable to an implementation
developed and specialized specifically for that platform.
\end{itemize}

These points are summarized in figure \ref{fig:mitgcm_architecture_goals}
of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in detail in
section \ref{sect:partII}.
\item A scheme for supporting optional "pluggable" {\bf packages} (containing
for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
specialized physical content onto the core numerical code. An overview of
the {\bf package} scheme is given at the start of part \ref{part:packages}.
\item A support framework called {\bf WRAPPER} (Wrappable Application Parallel
Programming Environment Resource), within which the core numerics and pluggable
packages operate.
\end{enumerate}

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
\end{center}
\caption{
The MITgcm architecture is designed to allow simulation of a wide
range of physical problems on a wide range of hardware. The computational
resource requirements of the applications targeted range from around
$10^7$ bytes ($\approx 10$ megabytes) of memory to $10^{11}$ bytes
($\approx 100$ gigabytes). Arithmetic operation counts for the applications of
interest range from $10^{9}$ floating point operations to more than $10^{17}$
floating point operations.}
\label{fig:mitgcm_architecture_goals}
\end{figure}

\section{WRAPPER}
\begin{rawhtml}
<!-- CMIREDIR:wrapper: -->
\end{rawhtml}

A significant element of the software architecture utilized in
MITgcm is a software superstructure and substructure collectively
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
that fits within it from architectural differences between hardware platforms
and operating systems. This allows numerical code to be easily retargeted.

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
infrastructure called WRAPPER. The WRAPPER is portable and
can be specialized for a wide range of specific target hardware and
programming environments, without impacting numerical code that fits
within the WRAPPER. Codes that fit within the WRAPPER can generally be
made to run as fast on a particular platform as codes specially
optimized for that platform.}
\label{fig:fit_in_wrapper}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad as possible a range of computer
systems. The original development of the WRAPPER took place on a
(UMA) and non-uniform memory access (NUMA) designs. Significant work has also
been undertaken on x86 cluster systems, Alpha processor based clustered SMP
systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
The MITgcm code, operating within the WRAPPER, is also routinely used on
large scale MPP systems (for example T3E systems and IBM SP systems). In all
cases numerical code, operating within the WRAPPER, performs and scales very
competitively with equivalent numerical code that has been modified to contain

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMP's, PVP's) and distributed memory
parallel systems (for example x86 clusters and large MPP systems). This is one
class of machines (for example Parallel Vector Processor Systems). Instead the
WRAPPER provides applications with an
abstract {\it machine model}. The machine model is very general, however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing community.

\subsection{Machine model parallelism}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}

Codes operating under the WRAPPER target an abstract machine that is assumed to
consist of one or more logical processors that can compute concurrently.
Computational work is divided among the logical
processors by allocating ``ownership'' to
each processor of a certain set (or sets) of calculations. Each set of
calculations owned by a particular processor is associated with a specific
space allocated to a particular logical processor, there will be data
structures (arrays, scalar variables etc...) that hold the simulated state of
that region. We refer to these data structures as being {\bf owned} by the
processor to which their
associated region of physical space has been allocated. Individual
regions that are allocated to processors are called {\bf tiles}. A
processor can own more

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/domain_decomp.eps}
}
\end{center}
\caption{ The WRAPPER provides support for one and two dimensional
}
\end{figure}

whenever it requires values that lie outside the domain it owns. Periodically
processors will make calls to WRAPPER functions to communicate data between
tiles, in order to keep the overlap regions up to date (see section
\ref{sect:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
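
The overlap update performed by these functions can be sketched in a few
lines. The sketch below is illustrative only (Python; the two-tile layout,
the sizes, and the function names are assumptions, not the WRAPPER API):
each tile stores its interior points plus overlap points, and an exchange
copies interior edge values into the neighboring tile's overlap.

```python
# Illustrative sketch (not WRAPPER code): two tiles side by side in x,
# each holding an interior of sNx points plus a one-point overlap
# ("halo") at either end. exchange() mimics an exchange primitive by
# copying interior edge values into the neighboring tile's overlap.
sNx, OLx = 4, 1                     # interior width and overlap width (assumed)

def make_tile():
    return [0.0] * (sNx + 2 * OLx)  # [overlap | interior ... | overlap]

def exchange(left, right):
    right[0] = left[sNx]            # left tile's last interior point
    left[sNx + 1] = right[1]        # right tile's first interior point

tileA, tileB = make_tile(), make_tile()
for i in range(1, sNx + 1):         # fill interiors with tile-local values
    tileA[i] = 10.0 + i
    tileB[i] = 20.0 + i
exchange(tileA, tileB)
print(tileB[0], tileA[sNx + 1])     # overlap points now mirror the neighbors
```

After such an exchange each tile can compute across its whole interior using
only locally held data, which is the property the WRAPPER exchange
primitives provide.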

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/tiled-world.eps}
}
\end{center}
\caption{ A global grid subdivided into tiles.}
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication independent CPU's are operating
on the exact same global address space at the application level.
communication very efficient provided it is used appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between
multiple processors the WRAPPER level shields user applications from
ensure memory consistency for a particular platform.

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have processor-local memory caches
which contain mirrored copies of main memory. Automatic cache-coherence
threads operating within a single process is the standard mechanism for
supporting shared memory that the WRAPPER utilizes. Configuring and launching
code to run in multi-threaded mode on specific platforms is discussed in
section \ref{sect:running_with_threads}. However, on many systems, potentially
very efficient mechanisms for using shared memory communication between
multiple processes (in contrast to multiple threads within a single
process) also exist. In most cases this works by making a limited region of
nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example cluster systems consist of individual computers
highly optimized library.
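
On such systems data moves between tiles only as explicit messages. The
sketch below is purely schematic (Python; MITgcm actually performs these
transfers through MPI or another message-passing library, driven by the
WRAPPER, and none of the names here are real API calls):

```python
# Schematic sketch of distributed-memory communication: each tile owner
# has a private mailbox and data moves only through explicit send/recv,
# analogous to (but much simpler than) MPI_Send/MPI_Recv.
from collections import deque

mailboxes = {"tileA": deque(), "tileB": deque()}

def send(dest, data):            # deliver a message to dest's mailbox
    mailboxes[dest].append(data)

def recv(me):                    # take the oldest pending message
    return mailboxes[me].popleft()

# Each tile sends its interior edge value to its neighbor...
send("tileB", 14.0)              # tileA's right edge
send("tileA", 21.0)              # tileB's left edge
# ...and receives the neighbor's edge into its own overlap region.
halo_of_A = recv("tileA")
halo_of_B = recv("tileB")
print(halo_of_A, halo_of_B)
```

The essential point is that no process ever reads another process's memory
directly; everything crosses the machine boundary as a message.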

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/comm-primm.eps}
}
\end{center}
\caption{Three performance critical parallel primitives are provided
by the WRAPPER. These primitives are always used to communicate data
between tiles. The figure shows four tiles. The curved arrows indicate
exchange primitives which transfer data between the overlap regions at tile
edges and interior regions for nearest-neighbor tiles.}
\end{figure}

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/tiling_detail.eps}
}
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
}
\end{figure}

computing CPU's.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in which
a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_a_code} describes practical details of running codes
in various different parallel modes on contemporary computer systems.
Section \ref{sect:controlling_communication} explains the internal information
that the WRAPPER uses to control how information is communicated between
tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application

\begin{figure}
\begin{center}
\resizebox{5in}{!}{
\includegraphics{part4/size_h.eps}
}
\end{center}
\caption{ The three level domain decomposition hierarchy employed by the
}
\end{figure}

dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}) then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case the tiles will be computed sequentially.
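
The point made above, that one and the same decomposition can be computed
sequentially by a single thread or concurrently by several, can be sketched
as follows (Python, illustrative only; MITgcm's actual tile loops are
Fortran loops over the indices {\em bi} and {\em bj}):

```python
# Illustrative sketch: the same set of tiles can be processed by one
# thread (sequentially) or shared among several threads (concurrently)
# without changing the per-tile computation. All names are assumptions.
from concurrent.futures import ThreadPoolExecutor

nSx, nSy = 2, 2                    # a 2 x 2 arrangement of tiles (assumed)
tiles = [(bi, bj) for bj in range(nSy) for bi in range(nSx)]

def compute_tile(tile):
    bi, bj = tile
    return (bi, bj, bi + 10 * bj)  # stand-in for real per-tile work

# Single thread: the tiles are computed one after another.
sequential = [compute_tile(t) for t in tiles]

# Multiple threads: each thread is given a share of the tiles.
with ThreadPoolExecutor(max_workers=2) as pool:
    concurrent = list(pool.map(compute_tile, tiles))

print(sequential == concurrent)    # same decomposition, same answer
```

Because each tile's work depends only on its own (and its overlap) data, the
two execution modes give identical results.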
computation is performed concurrently over as many processes and threads
as there are physical processors available to compute.

An exception to the use of {\em bi} and {\em bj} in loops arises in the
exchange routines used when the exch2 package is used with the cubed
sphere. In this case {\em bj} is generally set to 1 and the loop runs from
1,{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded within
a single loop over {\em bi} and {\em bj} varies for different parts of the
MITgcm algorithm. Figure \ref{fig:bibj_extract} shows a code extract
forty grid points in y. The two sub-domains in each process will be computed
sequentially if they are given to a single thread within a single process.
Alternatively if the code is invoked with multiple threads per process
the two domains in y may be computed concurrently.
\item
\begin{verbatim}
PARAMETER (
\end{verbatim}
There are six tiles allocated to six separate logical processors ({\em nSx=6}).
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}

\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
by the application code. The startup calling sequence followed by the
WRAPPER is shown in figure \ref{fig:wrapper_startup}.

\begin{figure}
{\footnotesize
\begin{verbatim}

MAIN

\end{verbatim}
}
\caption{Main stages of the WRAPPER startup procedure.
This process precedes transfer of control to application code, which
occurs through the procedure {\em THE\_MODEL\_MAIN()}.}
\label{fig:wrapper_startup}
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition})
configuring and starting a code to run using multiple threads requires the following
steps.\\

\end{enumerate}

\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.
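
As a purely hypothetical illustration (the variable name, here PARALLEL,
and the required value are machine dependent, so treat this as a sketch
rather than a recipe), a Bourne-family shell session might set the
variable like this:

```shell
# Request four parallel threads before launching the executable.
# The variable name varies between systems; PARALLEL is only an example.
PARALLEL=4
export PARALLEL
# csh/tcsh equivalent:  setenv PARALLEL 4
echo "PARALLEL=$PARALLEL"
```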

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate
the number of threads to be used in the x and y directions.
The variables {\em nTx} and {\em nTy} in this file are used to
specify the information required. The product of {\em nTx} and
{\em nTy} must be equal to the number of threads spawned, i.e.\
the setting of the environment variable PARALLEL.
The value of {\em nTx} must subdivide the number of sub-domains
in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
number of sub-domains in y ({\em nSy}) exactly.
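
These consistency rules can be checked mechanically. The snippet below is an
illustrative sketch (Python, not part of MITgcm) that tests a candidate
setting of {\em nSx}, {\em nSy}, {\em nTx} and {\em nTy} against them:

```python
# Check a multi-threaded configuration against the rules stated above:
# nTx*nTy must equal the number of threads spawned, nTx must subdivide
# nSx exactly, and nTy must subdivide nSy exactly.
def threads_setting_ok(nSx, nSy, nTx, nTy, n_threads):
    return (nTx * nTy == n_threads   # matches e.g. the PARALLEL variable
            and nSx % nTx == 0       # nTx subdivides nSx exactly
            and nSy % nTy == 0)      # nTy subdivides nSy exactly

# A domain with two sub-domains in y, run with two threads:
print(threads_setting_ok(nSx=1, nSy=2, nTx=1, nTy=2, n_threads=2))
# The same domain cannot place two threads in x, since nSx is 1:
print(threads_setting_ok(nSx=1, nSy=2, nTx=2, nTy=1, n_threads=2))
```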
|
|
|
|
An example of valid settings for the {\em eedata} file for a
domain with two subdomains in y and running with two threads is shown
below
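A sketch of such a file is given here; the namelist name and comment
conventions shown are illustrative and should be checked against the
{\em eedata} files supplied with the verification experiments:
\begin{verbatim}
# Example "eedata" file:
# nTx - No. of threads per process in x
# nTy - No. of threads per process in y
 &EEPARMS
 nTx=1,
 nTy=2,
 &
\end{verbatim}
Here {\em nTx}$\times${\em nTy}$=2$ threads are requested, with the two
threads working on the two subdomains in y.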

File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
File: {\em model/src/THE\_MODEL\_MAIN.F}\\
File: {\em eesupp/src/MAIN.F}\\
File: {\em tools/genmake2}\\
File: {\em eedata}\\
CPP: {\em TARGET\_SUN}\\
CPP: {\em TARGET\_DEC}\\
} \\

\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this

Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration a decomposition
specification ( see section \ref{sect:specifying_a_decomposition})
is given ( in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
appropriate compile time and run time steps must be taken.
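For illustration, the processor-grid fragment of {\em SIZE.h} for an
eight-process decomposition might read as follows ( all values are
illustrative only; consult the {\em SIZE.h} files in the verification
experiments for complete, working examples):
\begin{verbatim}
C     sNx,sNy - Size in x and y of an individual tile
C     OLx,OLy - Overlap (halo) widths in x and y
C     nSx,nSy - No. of tiles per process in x and y
C     nPx,nPy - No. of processes in x and y
      PARAMETER (
     &           sNx =  30, sNy =  15,
     &           OLx =   2, OLy =   2,
     &           nSx =   1, nSy =   1,
     &           nPx =   4, nPy =   2 )
\end{verbatim}
With these settings the code would be started as eight MPI processes,
each owning a single tile.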
of controlling and coordinating the start up of a large number
(hundreds and possibly even thousands) of copies of the same
program, MPI is used. The calls to the MPI multi-process startup
routines must be activated at compile time. Currently MPI libraries are
invoked by specifying the appropriate options file with the
{\tt -of} flag when running the {\em genmake2}
script, which generates the Makefile for compiling and linking MITgcm.
(Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
{\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More
detailed information about the use of {\em genmake2} for specifying
local compiler flags is located in section \ref{sect:genmake}.\\


\fbox{
\begin{minipage}{4.75in}
Directory: {\em tools/build\_options}\\
File: {\em tools/genmake2}
\end{minipage}
} \\
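For example, an MPI-enabled executable might be configured and built
with a command sequence of the following form ( the options file name
is illustrative; choose one appropriate to your platform from
{\em tools/build\_options}):
\begin{verbatim}
% ../../../tools/genmake2 -mpi -of ../../../tools/build_options/linux_ia32_g77
% make depend
% make
\end{verbatim}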

\paragraph{\bf Execution} The mechanics of starting a program in
\begin{verbatim}
mpirun -np 64 -machinefile mf ./mitgcmuv
\end{verbatim}
In this example the text {\em -np 64} specifies the number of processes
that will be created. The numeric value {\em 64} must be equal to the
product of the processor grid settings of {\em nPx} and {\em nPy}
in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
called ``mf'' will be read to get a list of processor names on
which the sixty-four processes will execute. The syntax of this file
is specified by the MPI distribution.
\\

\fbox{
\end{minipage}
} \\
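For many MPI distributions the machine file is simply a list of host
names, one per line. A four-node example ( the host names are
illustrative) might read:
\begin{verbatim}
node001
node002
node003
node004
\end{verbatim}
Listing a host more than once is commonly used to place several
processes on a multi-processor node.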
\paragraph{Environment variables}
On most systems multi-threaded execution also requires the setting
of a special environment variable. On many machines this variable
is called PARALLEL and its value should be set to the number
of parallel threads required. Generally the help pages associated
with the multi-threaded compiler on a machine will explain
how to set the required environment variables for that machine.

\paragraph{Runtime input parameters}
Finally the file {\em eedata} needs to be configured to indicate
the number of threads to be used in the x and y directions.
The variables {\em nTx} and {\em nTy} in this file are used to
specify the information required. The product of {\em nTx} and
{\em nTy} must be equal to the number of threads spawned, i.e.
the setting of the environment variable PARALLEL.
The value of {\em nTx} must subdivide the number of sub-domains
in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
number of sub-domains in y ({\em nSy}) exactly.

The multiprocess startup of the MITgcm executable {\em mitgcmuv}
is controlled by the routines {\em EEBOOT\_MINIMAL()} and
{\em INI\_PROCS()}. The first routine performs basic steps required
output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc. These files
are used for reporting status and configuration information and
for reporting error conditions on a process by process basis.
The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
{\em myProcId} and {\em MPI\_COMM\_MODEL}.
These variables are related
to processor identification and are used later in the routine
Allocation of processes to tiles is controlled by the routine
{\em INI\_PROCS()}. For each process this routine sets
the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
These variables specify in index space the coordinates
of the southernmost and westernmost corner of the
southernmost and westernmost tile owned by this process.
The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
are also set in this routine. These are used to identify
the processes holding tiles to the west, east, south and north
of this process. These values are stored in global storage
in the header file {\em EESUPPORT.h} for use by
communication routines. The above does not hold when the
exch2 package is used -- exch2 sets its own parameters to
specify the global indices of tiles and their relationships
to each other. See the documentation on the exch2 package
(\ref{sec:exch2}) for details.
\\

\fbox{
The WRAPPER maintains internal information that is used for communication
operations and that can be customized for different platforms. This section
describes the information that is held and used.

\begin{enumerate}
\item {\bf Tile-tile connectivity information}
For each tile the WRAPPER
sets a flag that records the tile number to the north,
south, east and
west of that tile. This number is unique over all tiles in a
configuration. Except when using the cubed sphere and the exch2 package,
the number is held in the variables {\em tileNo}
( this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
{\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
that specifies the type of communication that is used between tiles.
This latter set of variables can take one of the following values:
{\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
A value of {\em COMM\_NONE} is used to indicate that a tile has no
neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
memory communication is required to communicate between
these tile faces ( see section \ref{sect:distributed_memory_communication}).
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
forms of shared memory communication ( see section
\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
that a CPU should communicate by writing to data structures owned by another
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
from data structures owned by another CPU. These flags affect the behavior
(see figure \ref{fig:communication_primitives}). The routine
{\em ini\_communication\_patterns()} is responsible for setting the
communication mode values for each tile.

When using the cubed sphere configuration with the exch2 package, the
relationships between tiles and their communication methods are set
by the package in other variables. See the exch2 package documentation
(\ref{sec:exch2}) for details.


\fbox{
\begin{minipage}{4.75in}
are read from the file {\em eedata}. If the value of {\em nThreads}
is inconsistent with the number of threads requested from the
operating system (for example by using an environment
variable as described in section \ref{sect:multi_threaded_execution})
then usually an error will be reported by the routine
{\em CHECK\_THREADS}.\\

\end{minipage}
}

\item {\bf memsync flags}
As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
The routine {\em MEMSYNC()} is used for this purpose. This routine should
not need modifying and the information below is only provided for
\begin{verbatim}
asm("membar #LoadStore|#StoreStore");
\end{verbatim}
for an Alpha based system the equivalent code reads
\begin{verbatim}
asm("mb");
\end{verbatim}
\end{verbatim}

\item {\bf Cache line size}
As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
coherence traffic on an SMP system. To do this the shared memory data structures
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
are padded. The variables that control the padding are set in the
header file {\em EEPARAMS.h}. These variables are called
{\em lShare8}. The default values should not normally need changing.
\item {\bf \_BARRIER}
This is a CPP macro that is expanded to a call to a routine
which synchronizes all the logical processors running under the
WRAPPER. Using a macro here preserves flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em BARRIER()}. The default

\item {\bf \_GSUM}
This is a CPP macro that is expanded to a call to a routine
which sums up a floating point number
over all the logical processors running under the
WRAPPER. Using a macro here provides extra flexibility to insert
a specialized call in-line into application code. By default this
resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
64-bit floating point operands)
or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
The \_GSUM macro is a performance critical operation, especially for
large processor count, small tile size configurations.
The custom communication example discussed in section \ref{sect:jam_example}
shows how the macro is used to invoke a custom global sum routine
for a specific set of hardware.
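In outline, a definition of this kind has the following flavor ( a
sketch only; the CPP flag {\em USE\_CUSTOM\_GSUM} and the argument
lists shown here are illustrative, and the actual header
{\em CPP\_EEMACROS.h} should be consulted):
\begin{verbatim}
#ifdef USE_CUSTOM_GSUM
#define _GSUM(a,b,c) CALL GSUM_JAM ( a, b, c )
#else
#define _GSUM(a,b,c) CALL GLOBAL_SUM_R8 ( a, b, c )
#endif
\end{verbatim}
Rebinding \_GSUM to a specialized routine in this way requires no
change to the application code that invokes the macro.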
in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
\_EXCH operation plays a crucial role in scaling to small tile,
large logical and physical processor count configurations.
The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.

The \_EXCH operation is also central to supporting grids such as
the cube-sphere grid. In this class of grid a rotation may be required
between tiles. Aligning the coordinate requiring rotation with the
tile decomposition allows the coordinate transformation to
be embedded within a custom form of the \_EXCH primitive. In these
cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
package documentation (\ref{sec:exch2}).

\item {\bf Reverse Mode}
The communication primitives \_EXCH and \_GSUM both employ
hand-written adjoint (or reverse mode) forms.
These reverse mode forms can be found in the
source code directory {\em pkg/autodiff}.
For the global sum primitive the reverse mode form
calls are to {\em GLOBAL\_ADSUM\_R4} and
{\em GLOBAL\_ADSUM\_R8}. The reverse mode forms of the
exchange primitives are found in routines
prefixed {\em ADEXCH}. The exchange routines make calls to
the same low-level communication primitives as the forward mode
operations. However, the routine argument {\em simulationMode}
is set to the value {\em REVERSE\_SIMULATION}. This signifies
to the low-level routines that the adjoint forms of the
appropriate communication operation should be performed.

\item {\bf MAX\_NO\_THREADS}
The variable {\em MAX\_NO\_THREADS} is used to indicate the
maximum number of OS threads that a code will use. This
value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
For single threaded execution it can be reduced to one if required.
The value is largely private to the WRAPPER and application code
will not normally reference the value, except in the following scenario.

For certain physical parametrization schemes it is necessary to have
if this might be unavailable then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} ( as described in section
\ref{sect:specifying_a_decomposition}). However, if the configuration
being specified involves many more tiles than OS threads then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
will be used and to declare the physical parameterization
work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
An example of this is given in the verification experiment
{\em aim.5l\_cs}. Here the default setting of
{\em MAX\_NO\_THREADS} is altered to
\begin{verbatim}
common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
\end{verbatim}
This declaration scheme is not used widely, because most global data
is used for permanent, not temporary, storage of state information.
In the case of permanent state information this approach cannot be used
because there has to be enough storage allocated for all tiles.
However, the technique can sometimes be a useful scheme for reducing memory
requirements in complex physical parameterizations.

\end{enumerate}

\begin{figure}
\begin{verbatim}
C--
C--  Parallel directives for MIPS Pro Fortran compiler
C--
C      Parallel compiler directives for SGI with IRIX
C$PAR  PARALLEL DO
C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
C
      DO I=1,nThreads
        myThid = I

C--     Invoke nThreads instances of the numerical model
        CALL THE_MODEL_MAIN(myThid)

      ENDDO
\end{verbatim}
\caption{Prior to transferring control to
the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
MP directives to spawn multiple threads.
} \label{fig:mp_directives}
\end{figure}


\subsubsection{Specializing the Communication Code}

The isolation of performance critical communication primitives and the
sub-division of the simulation domain into tiles is a powerful tool.
Here we show how it can be used to improve application performance and
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms a big performance boost can be obtained by
binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries ( for example the
\item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F}).
\item A highly specialized form of the exchange operator (optimized
for overlap regions of width one) is substituted into the elliptic
solver routine {\em cg2d.F}.
\end{itemize}
Developing specialized code for other libraries follows a similar
pattern.


\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from
a series of template files, for example {\em exch\_rx.template}.
This is done to allow a large number of variations on the exchange
process to be maintained. One set of variations supports the
cube sphere grid. Support for a cube sphere grid in MITgcm is based
on having each face of the cube as a separate tile or tiles.
The exchange routines are then able to absorb much of the
detailed rotation and reorientation required when moving around the
cube grid. The set of {\em \_EXCH} routines that contain the
word cube in their name perform these transformations.
They are invoked when the run-time logical parameter
{\em useCubedSphereExchange} is set true. To facilitate the
transformations on a staggered C-grid, exchange operations are defined
separately for both vector and scalar quantities and for
grid-centered and for grid-face and corner quantities.
Three sets of exchange routines are defined. Routines
with names of the form {\em exch\_rx} are used to exchange
1424 |
|
|
1425 |
Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sect:calling_sequence}

WRAPPER layer.

{\footnotesize
\begin{verbatim}

MAIN
|--THE_MODEL_MAIN :: Numerical code top-level driver routine

\end{verbatim}
}

Core equations plus packages.

{\footnotesize
\begin{verbatim}
C
C
C |
C |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
C | :: Called from WRAPPER level numerical
C | :: code invocation routine. On entry
C | :: to THE_MODEL_MAIN separate threads and
C | :: separate processes will have been established.
C | :: Each thread and process will have a unique ID
C | | :: By default kernel parameters are read from file
C | | :: "data" in directory in which code executes.
C | |
C | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
C | |
C | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
C | | | :: Grid arrays are held and described in GRID.h.
C | | |
C | | |-INI_VERTICAL_GRID :: Initialize vertical grid arrays.
C | | |
C | | |-INI_CARTESIAN_GRID :: Cartesian horiz. grid initialization
C | | | :: (calculate grid from kernel parameters).
C | | |
C | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
C | | | :: initialization (calculate grid from
C | | | :: kernel parameters).
C | | |
C | | |-INI_CURVILINEAR_GRID :: General orthogonal, structured horiz.
C | | :: grid initializations. ( input from raw
C | | :: grid files, LONC.bin, DXF.bin etc... )
C | |
C | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
C | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
C | |
C | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
C | | :: sphere options are coded.
C | |
C | |-PACKAGES_BOOT :: Start up the optional package environment.
C | | :: Runtime selection of active packages.
C | |-PACKAGES_CHECK
C | | |
C | | |-KPP_CHECK :: KPP Package. pkg/kpp
C | | |-OBCS_CHECK :: Open bndy Package. pkg/obcs
C | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
C | |
C | |-PACKAGES_INIT_FIXED
C |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
C | |
C |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
C | :: Automatically generated by TAMC/TAF.
C | |
C |-CTRL_PACK :: Control vector support package. see pkg/ctrl
C | |
C | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
C | | |
C | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
C | | | :: sphere options are coded.
C | | |
C | | |-INI_CG2D :: 2d con. grad solver initialization.
C | | |-INI_CG3D :: 3d con. grad solver initialization.
C | | |-INI_DYNVARS :: Initialize to zero all DYNVARS.h arrays (dynamical
C | | | :: fields).
C | | |
C | | |-INI_FIELDS :: Control initializing model fields to non-zero
C | | | |-INI_VEL :: Initialize 3D flow field.
C | | | |-INI_THETA :: Set model initial temperature field.
C | | | |-INI_SALT :: Set model initial salinity field.
C/\ | | |-CALC_SURF_DR :: Calculate the new surface level thickness.
C/\ | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
C/\ | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
C/\ | | | | :: Simple interpolation between end-points
C/\ | | | | :: for forcing datasets.
C/\ | | | |
C/\ | | | |-EXCH :: Sync forcing in overlap regions.
C :: events.
C
\end{verbatim}
}

\subsection{Measuring and Characterizing Performance}
