--- manual/s_software/text/sarch.tex	2001/10/09 10:33:17	1.1
+++ manual/s_software/text/sarch.tex	2002/02/28 19:32:20	1.7
@@ -1,9 +1,20 @@
+% $Header: /home/ubuntu/mnt/e9_copy/manual/s_software/text/sarch.tex,v 1.7 2002/02/28 19:32:20 cnh Exp $
 
-In this chapter we describe the software architecture and
-implementation strategy for the MITgcm code. The first part of this
-chapter discusses the MITgcm architecture at an abstract level. In the second
-part of the chapter we described practical details of the MITgcm implementation
-and of current tools and operating system features that are employed.
+This chapter focuses on describing the {\bf WRAPPER} environment within which
+both the core numerics and the pluggable packages operate. The description
+presented here is intended to be a detailed exposition and contains significant
+background material, as well as advanced details on working with the WRAPPER.
+The tutorial sections of this manual (see Chapters
+\ref{chap:tutorialI}, \ref{chap:tutorialII} and \ref{chap:tutorialIII})
+contain more succinct, step-by-step instructions on running basic numerical
+experiments of various types, both sequentially and in parallel. For many
+projects simply starting from an example code and adapting it to suit a
+particular situation
+will be all that is required.
+The first part of this chapter discusses the MITgcm architecture at an
+abstract level. In the second part of the chapter we describe practical
+details of the MITgcm implementation and of current tools and operating system
+features that are employed.
 
 \section{Overall architectural goals}
 
@@ -11,17 +22,13 @@
 three-fold
 
 \begin{itemize}
-
 \item We wish to be able to study a very broad range
 of interesting and challenging rotating fluids problems.
-
 \item We wish the model code to be readily targeted to a wide range
 of platforms
-
 \item On any given platform we would like to be able to achieve
 performance comparable to an implementation developed and specialized
 specifically for that platform.
-
 \end{itemize}
 
 These points are summarized in figure \ref{fig:mitgcm_architecture_goals}
 
@@ -30,26 +37,21 @@
 of
 
 \begin{enumerate}
-
 \item A core set of numerical and support code. This is discussed in detail in
-section \ref{sec:partII}.
-
+section \ref{sect:partII}.
 \item A scheme for supporting optional "pluggable" {\bf packages}
 (containing for example mixed-layer schemes, biogeochemical schemes,
 atmospheric physics). These packages are used both to overlay alternate
 dynamics and to introduce specialized physical content onto the core
 numerical code. An overview of the {\bf package} scheme is given at the
 start of part \ref{part:packages}.
-
-
 \item A support framework called {\bf WRAPPER} (Wrappable
 Application Parallel Programming Environment Resource), within which
 the core numerics and pluggable packages operate.
-
 \end{enumerate}
 
 This chapter focuses on describing the {\bf WRAPPER} environment under which
 both the core numerics and the pluggable packages function. The description
-presented here is intended to be a detailed exposistion and contains significant
+presented here is intended to be a detailed exposition and contains significant
 background material, as well as advanced details on working with the WRAPPER.
 The examples section of this manual (part \ref{part:example}) contains more
 succinct, step-by-step instructions on running basic numerical
 
@@ -57,19 +59,20 @@
 starting from an example code and adapting it to suit a particular
 situation will be all that is required.
+
 \begin{figure}
 \begin{center}
- \resizebox{!}{2.5in}{
-  \includegraphics*[1.5in,2.4in][9.5in,6.3in]{part4/mitgcm_goals.eps}
- }
+\resizebox{!}{2.5in}{\includegraphics{part4/mitgcm_goals.eps}}
 \end{center}
-\caption{The MITgcm architecture is designed to allow simulation of a wide
+\caption{
+The MITgcm architecture is designed to allow simulation of a wide
 range of physical problems on a wide range of hardware. The computational
 resource requirements of the applications targeted range from around
 $10^7$ bytes ( $\approx 10$ megabytes ) of memory to $10^{11}$ bytes
 ( $\approx 100$ gigabytes). Arithmetic operation counts for the applications of
 interest range from $10^{9}$ floating point operations to more than $10^{17}$
-floating point operations.} \label{fig:mitgcm_architecture_goals}
+floating point operations.}
+\label{fig:mitgcm_architecture_goals}
 \end{figure}
 
 \section{WRAPPER}
 
@@ -81,30 +84,31 @@
 to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
 the WRAPPER means that coding has to follow certain, relatively
 straightforward, rules and conventions ( these are discussed further in
-section \ref{sec:specifying_a_decomposition} ).
+section \ref{sect:specifying_a_decomposition} ).
 
 The approach taken by the WRAPPER is illustrated in figure
 \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
 that fits within it from architectural differences between hardware platforms
 and operating systems. This allows numerical code to be easily retargeted.
+
+
 \begin{figure}
 \begin{center}
- \resizebox{6in}{4.5in}{
-  \includegraphics*[0.6in,0.7in][9.0in,8.5in]{part4/fit_in_wrapper.eps}
- }
+\resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
 \end{center}
-\caption{ Numerical code is written too fit within a software support
+\caption{
+Numerical code is written to fit within a software support
 infrastructure called WRAPPER. The WRAPPER is portable and
-can be sepcialized for a wide range of specific target hardware and
+can be specialized for a wide range of specific target hardware and
 programming environments, without impacting numerical code that fits
 within the WRAPPER. Codes that fit within the WRAPPER can generally be
 made to run as fast on a particular platform as codes specially
-optimized for that platform.
-} \label{fig:fit_in_wrapper}
+optimized for that platform.}
+\label{fig:fit_in_wrapper}
 \end{figure}
 
 \subsection{Target hardware}
-\label{sec:target_hardware}
+\label{sect:target_hardware}
 
 The WRAPPER is designed to target as broad as possible a range of computer
 systems. The original development of the WRAPPER took place on a
 
@@ -124,7 +128,7 @@
 
 \subsection{Supporting hardware neutrality}
 
-The different systems listed in section \ref{sec:target_hardware} can be
+The different systems listed in section \ref{sect:target_hardware} can be
 categorized in many different ways. For example, one common distinction is
 between shared-memory parallel systems (SMP's, PVP's) and distributed memory
 parallel systems (for example x86 clusters and large MPP systems). This is one
 
@@ -142,14 +146,14 @@
 class of machines (for example Parallel Vector Processor Systems). Instead
 the WRAPPER provides applications with an
 abstract {\it machine model}. The machine model is very general, however, it can
-easily be specialized to fit, in a computationally effificent manner, any
+easily be specialized to fit, in a computationally efficient manner, any
 computer architecture currently available to the scientific computing community.
 \subsection{Machine model parallelism}
 
 Codes operating under the WRAPPER target an abstract machine that is assumed to
 consist of one or more logical processors that can compute concurrently.
-Computational work is divided amongst the logical
+Computational work is divided among the logical
 processors by allocating ``ownership'' to
 each processor of a certain set (or sets) of calculations. Each set of
 calculations owned by a particular processor is associated with a specific
 
@@ -172,7 +176,7 @@
 space allocated to a particular logical processor, there will be data
 structures (arrays, scalar variables etc...) that hold the simulated state of
 that region. We refer to these data structures as being {\bf owned} by the
-pprocessor to which their
+processor to which their
 associated region of physical space has been allocated.
 Individual regions that are allocated to processors are called {\bf tiles}. A
 processor can own more
 
@@ -186,8 +190,8 @@
 
 \begin{figure}
 \begin{center}
- \resizebox{7in}{3in}{
-  \includegraphics*[0.5in,2.7in][12.5in,6.4in]{part4/domain_decomp.eps}
+ \resizebox{5in}{!}{
+  \includegraphics{part4/domain_decomp.eps}
 }
 \end{center}
 \caption{ The WRAPPER provides support for one and two dimensional
 
@@ -217,13 +221,13 @@
 whenever it requires values that lie outside the domain it owns. Periodically
 processors will make calls to WRAPPER functions to communicate data between
 tiles, in order to keep the overlap regions up to date (see section
-\ref{sec:communication_primitives}). The WRAPPER functions can use a
+\ref{sect:communication_primitives}). The WRAPPER functions can use a
 variety of different mechanisms to communicate data between tiles.
 
 \begin{figure}
 \begin{center}
- \resizebox{7in}{3in}{
-  \includegraphics*[4.5in,3.7in][12.5in,6.7in]{part4/tiled-world.eps}
+ \resizebox{5in}{!}{
+  \includegraphics{part4/tiled-world.eps}
 }
 \end{center}
 \caption{ A global grid subdivided into tiles.
 
@@ -304,7 +308,7 @@
 \end{figure}
 
 \subsection{Shared memory communication}
-\label{sec:shared_memory_communication}
+\label{sect:shared_memory_communication}
 
 Under shared memory communication independent CPU's are operating
 on the exact same global address space at the application level.
 
@@ -330,7 +334,7 @@
 communication very efficient provided it is used appropriately.
 
 \subsubsection{Memory consistency}
-\label{sec:memory_consistency}
+\label{sect:memory_consistency}
 
 When using shared memory communication between
 multiple processors the WRAPPER level shields user applications from
 
@@ -354,7 +358,7 @@
 ensure memory consistency for a particular platform.
 
 \subsubsection{Cache effects and false sharing}
-\label{sec:cache_effects_and_false_sharing}
+\label{sect:cache_effects_and_false_sharing}
 
 Shared-memory machines often have processor-local memory caches
 which contain mirrored copies of main memory. Automatic cache-coherence
 
@@ -373,7 +377,7 @@
 threads operating within a single process is the standard mechanism for
 supporting shared memory that the WRAPPER utilizes. Configuring and launching
 code to run in multi-threaded mode on specific platforms is discussed in
-section \ref{sec:running_with_threads}. However, on many systems, potentially
+section \ref{sect:running_with_threads}. However, on many systems, potentially
 very efficient mechanisms for using shared memory communication between
 multiple processes (in contrast to multiple threads within a single
 process) also exist. In most cases this works by making a limited region of
 
@@ -386,7 +390,7 @@
 nature.
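+
+To make the memory consistency requirement discussed above concrete, the
+sketch below shows one thread depositing a tile edge into shared memory
+and then raising a ready flag. This is an illustration only, not actual
+WRAPPER source: the common block, buffer and flag names are invented for
+the example. The call to the WRAPPER routine {\em MEMSYNC()} between the
+two steps is what prevents a weakly-ordered processor from making the
+flag visible to a reading thread before the buffer contents.
+\begin{verbatim}
+C     Illustrative sketch only -- not actual WRAPPER source.
+      SUBROUTINE SHARED_PUT_SKETCH( edge, n )
+      IMPLICIT NONE
+      INTEGER n
+      REAL*8 edge(n)
+C     Hypothetical buffer and flag, shared by all threads.
+      REAL*8 shareBuf(1024)
+      INTEGER readyFlag
+      COMMON /SKETCH_SHARE/ shareBuf, readyFlag
+      INTEGER I
+C     1. Write the edge data into the shared buffer.
+      DO I=1,n
+       shareBuf(I) = edge(I)
+      ENDDO
+C     2. Force the writes above to complete before the flag is set.
+      CALL MEMSYNC
+C     3. Signal any reading thread that the data is ready.
+      readyFlag = 1
+      RETURN
+      END
+\end{verbatim}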
 \subsection{Distributed memory communication}
-\label{sec:distributed_memory_communication}
+\label{sect:distributed_memory_communication}
 Many parallel systems are not constructed in a way where it is
 possible or practical for an application to use shared memory
 for communication. For example cluster systems consist of individual computers
 
@@ -400,16 +404,16 @@
 highly optimized library.
 
 \subsection{Communication primitives}
-\label{sec:communication_primitives}
+\label{sect:communication_primitives}
 
 \begin{figure}
 \begin{center}
- \resizebox{5in}{3in}{
-  \includegraphics*[1.5in,0.7in][7.9in,4.4in]{part4/comm-primm.eps}
+ \resizebox{5in}{!}{
+  \includegraphics{part4/comm-primm.eps}
 }
 \end{center}
-\caption{Three performance critical parallel primititives are provided
-by the WRAPPER. These primititives are always used to communicate data
+\caption{Three performance critical parallel primitives are provided
+by the WRAPPER. These primitives are always used to communicate data
 between tiles. The figure shows four tiles. The curved arrows indicate
 exchange primitives which transfer data between the overlap regions at tile
 edges and interior regions for nearest-neighbor tiles.
 
@@ -485,8 +489,8 @@
 
 \begin{figure}
 \begin{center}
- \resizebox{5in}{3in}{
-  \includegraphics*[0.5in,1.3in][7.9in,5.7in]{part4/tiling_detail.eps}
+ \resizebox{5in}{!}{
+  \includegraphics{part4/tiling_detail.eps}
 }
 \end{center}
 \caption{The tiling strategy that the WRAPPER supports allows tiles
 
@@ -544,16 +548,16 @@
 computing CPU's.
 \end{enumerate}
 This section describes the details of each of these operations.
-Section \ref{sec:specifying_a_decomposition} explains how the way in which
+Section \ref{sect:specifying_a_decomposition} explains the way in which
 a domain is decomposed (or composed) is expressed. Section
-\ref{sec:starting_a_code} describes practical details of running codes
+\ref{sect:starting_the_code} describes practical details of running codes
 in various parallel modes on contemporary computer systems.
-Section \ref{sec:controlling_communication} explains the internal information
+Section \ref{sect:controlling_communication} explains the internal information
 that the WRAPPER uses to control how information is communicated between
 tiles.
 
 \subsection{Specifying a domain decomposition}
-\label{sec:specifying_a_decomposition}
+\label{sect:specifying_a_decomposition}
 
 At its heart much of the WRAPPER works only in terms of a collection of tiles
 which are interconnected to each other. This is also true of application
 
@@ -589,8 +593,8 @@
 
 \begin{figure}
 \begin{center}
- \resizebox{5in}{7in}{
-  \includegraphics*[0.5in,0.3in][7.9in,10.7in]{part4/size_h.eps}
+ \resizebox{5in}{!}{
+  \includegraphics{part4/size_h.eps}
 }
 \end{center}
 \caption{ The three level domain decomposition hierarchy employed by the
 
@@ -605,7 +609,7 @@
 dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these
 tiles are allocated to different threads of a process that are then bound to
 different physical processors ( see the multi-threaded
-execution discussion in section \ref{sec:starting_the_code} ) then
+execution discussion in section \ref{sect:starting_the_code} ) then
 computation will be performed concurrently on each tile. However, it is also
 possible to run the same decomposition within a process running a single
 thread on a single processor. In this case the tiles will be computed
 sequentially.
 
@@ -795,15 +799,15 @@
 There are six tiles allocated to six separate logical
 processors ({\em nSx=6}).
 This set of values can be used for a cube sphere calculation.
 Each tile of size $32 \times 32$ represents a face of the
-cube. Initialising the tile connectivity correctly ( see section
-\ref{sec:cube_sphere_communication}. allows the rotations associated with
+cube. Initializing the tile connectivity correctly ( see section
+\ref{sect:cube_sphere_communication} ) allows the rotations associated with
 moving between the six cube faces to be embedded within the
 tile-tile communication code.
 \end{enumerate}
 
 \subsection{Starting the code}
-\label{sec:starting_the_code}
+\label{sect:starting_the_code}
 When code is started under the WRAPPER, execution begins in a main routine {\em
 eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
 to the application through a routine called {\em THE\_MODEL\_MAIN()}
 
@@ -812,7 +816,6 @@
 by the application code. The startup calling sequence followed by the
 WRAPPER is shown in figure \ref{fig:wrapper_startup}.
 
-
 \begin{figure}
 \begin{verbatim}
 
@@ -849,12 +852,13 @@
 \end{figure}
 
 \subsubsection{Multi-threaded execution}
+\label{sect:multi-threaded-execution}
 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
 WRAPPER may cause several coarse grain threads to be initialized. The routine
 {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
 stack argument which is the thread number, stored in the
 variable {\em myThid}. In addition to specifying a decomposition with
-multiple tiles per process ( see section \ref{sec:specifying_a_decomposition})
+multiple tiles per process ( see section \ref{sect:specifying_a_decomposition})
 configuring and starting a code to run using multiple threads requires the following
 steps.\\
 
@@ -904,25 +908,6 @@
 \end{enumerate}
 
-
-\paragraph{Environment variables}
-On most systems multi-threaded execution also requires the setting
-of a special environment variable. On many machines this variable
-is called PARALLEL and its values should be set to the number
-of parallel threads required. Generally the help pages associated
-with the multi-threaded compiler on a machine will explain
-how to set the required environment variables for that machines.
-
-\paragraph{Runtime input parameters}
-Finally the file {\em eedata} needs to be configured to indicate
-the number of threads to be used in the x and y directions.
-The variables {\em nTx} and {\em nTy} in this file are used to
-specify the information required. The product of {\em nTx} and
-{\em nTy} must be equal to the number of threads spawned i.e.
-the setting of the environment variable PARALLEL.
-The value of {\em nTx} must subdivide the number of sub-domains
-in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
-number of sub-domains in y ({\em nSy}) exactly.
-
 An example of valid settings for the {\em eedata} file for a
 domain with two subdomains in y and running with two threads is shown
 below
 
@@ -955,6 +940,7 @@
 } \\
 
 \subsubsection{Multi-process execution}
+\label{sect:multi-process-execution}
 
 Despite its appealing programming model, multi-threaded execution remains
 less common than multi-process execution. One major reason for this
 
@@ -966,7 +952,8 @@
 
 Multi-process execution is more ubiquitous.
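+
+As a concrete preview of the decomposition specification discussed next,
+the fragment below sketches the relevant lines of a {\em SIZE.h} file for
+a hypothetical $90 \times 40$ point global grid split between two
+processes, each owning a single $45 \times 40$ tile. The values are
+invented for this illustration; section
+\ref{sect:specifying_a_decomposition} describes the meaning of each
+parameter.
+\begin{verbatim}
+      PARAMETER (
+     &           sNx =  45,
+     &           sNy =  40,
+     &           OLx =   3,
+     &           OLy =   3,
+     &           nSx =   1,
+     &           nSy =   1,
+     &           nPx =   2,
+     &           nPy =   1,
+     &           Nx  = sNx*nSx*nPx,
+     &           Ny  = sNy*nSy*nPy,
+     &           Nr  =  15 )
+\end{verbatim}
+Because the product of {\em nPx} and {\em nPy} is two, a matching run
+would start exactly two processes, for example with
+{\em mpirun -np 2 ./mitgcmuv}.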
 In order to run code in a multi-process configuration a decomposition
-specification is given ( in which the at least one of the
+specification ( see section \ref{sect:specifying_a_decomposition})
+is given ( in which at least one of the
 parameters {\em nPx} or {\em nPy} will be greater than one)
 and then, as for multi-threaded operation,
 appropriate compile time and run time steps must be taken.
 
@@ -1029,7 +1016,7 @@
 \begin{verbatim}
 mpirun -np 64 -machinefile mf ./mitgcmuv
 \end{verbatim}
-In this example the text {\em -np 64} specifices the number of processes
+In this example the text {\em -np 64} specifies the number of processes
 that will be created. The numeric value {\em 64} must be equal to the
 product of the processor grid settings of {\em nPx} and {\em nPy}
 in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
 
@@ -1046,6 +1033,25 @@
 \end{minipage}
 } \\
 
+
+\paragraph{Environment variables}
+On most systems multi-threaded execution also requires the setting
+of a special environment variable. On many machines this variable
+is called PARALLEL and its value should be set to the number
+of parallel threads required. Generally the help pages associated
+with the multi-threaded compiler on a machine will explain
+how to set the required environment variables for that machine.
+
+\paragraph{Runtime input parameters}
+Finally the file {\em eedata} needs to be configured to indicate
+the number of threads to be used in the x and y directions.
+The variables {\em nTx} and {\em nTy} in this file are used to
+specify the information required. The product of {\em nTx} and
+{\em nTy} must be equal to the number of threads spawned, i.e.
+the setting of the environment variable PARALLEL.
+The value of {\em nTx} must subdivide the number of sub-domains
+in x ({\em nSx}) exactly. The value of {\em nTy} must subdivide the
+number of sub-domains in y ({\em nSy}) exactly.
 The multiprocess startup of the MITgcm executable {\em mitgcmuv}
 is controlled by the routines {\em EEBOOT\_MINIMAL()} and
 {\em INI\_PROCS()}. The first routine performs basic steps required
 
@@ -1058,7 +1064,7 @@
 output files {\bf STDOUT.0001} and {\bf STDERR.0001} etc... These files
 are used for reporting status and configuration information and
 for reporting error conditions on a process by process basis.
-The {{\em EEBOOT\_MINIMAL()} procedure also sets the variables
+The {\em EEBOOT\_MINIMAL()} procedure also sets the variables
 {\em myProcId} and {\em MPI\_COMM\_MODEL}.
 These variables are related to processor identification
 and are used later in the routine
 
@@ -1099,6 +1105,7 @@
 The WRAPPER maintains internal information that is used for communication
 operations and that can be customized for different platforms. This section
 describes the information that is held and used.
+
 \begin{enumerate}
 \item {\bf Tile-tile connectivity information}
 For each tile the WRAPPER
 sets a flag that records the tile number to the north,
 south, east and
 
@@ -1112,13 +1119,13 @@
 This latter set of variables can take one of the following values
 {\em COMM\_NONE}, {\em COMM\_MSG}, {\em COMM\_PUT} and {\em COMM\_GET}.
 A value of {\em COMM\_NONE} is used to indicate that a tile has no
-neighbor to cummnicate with on a particular face. A value
+neighbor to communicate with on a particular face. A value
 of {\em COMM\_MSG} is used to indicate that some form of distributed
 memory communication is required to communicate between
-these tile faces ( see section \ref{sec:distributed_memory_communication}).
+these tile faces ( see section \ref{sect:distributed_memory_communication}).
 A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
 forms of shared memory communication ( see section
-\ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value indicates
+\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
 that a CPU should communicate by writing to data structures owned by another
 CPU. A {\em COMM\_GET} value indicates that a CPU
 should communicate by reading from data structures owned by another CPU. These
 flags affect the behavior
 
@@ -1169,7 +1176,7 @@
 are read from the file {\em eedata}. If the value of {\em nThreads}
 is inconsistent with the number of threads requested from the
 operating system (for example by using an environment
-varialble as described in section \ref{sec:multi_threaded_execution})
+variable as described in section \ref{sect:multi-threaded-execution})
 then usually an error will be reported by the routine
 {\em CHECK\_THREADS}.\\
 
@@ -1186,33 +1193,8 @@
 \end{minipage}
 }
 
-\begin{figure}
-\begin{verbatim}
-C--
-C-- Parallel directives for MIPS Pro Fortran compiler
-C--
-C      Parallel compiler directives for SGI with IRIX
-C$PAR  PARALLEL DO
-C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
-C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
-C
-      DO I=1,nThreads
-        myThid = I
-
-C--     Invoke nThreads instances of the numerical model
-        CALL THE_MODEL_MAIN(myThid)
-
-      ENDDO
-\end{verbatim}
-\caption{Prior to transferring control to
-the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
-MP directives to spawn multiple threads.
-} \label{fig:mp_directives}
-\end{figure}
-
-
 \item {\bf memsync flags}
-As discussed in section \ref{sec:memory_consistency}, when using shared memory,
+As discussed in section \ref{sect:memory_consistency}, when using shared memory,
 a low-level system function may be needed to force memory consistency.
 The routine {\em MEMSYNC()} is used for this purpose. This routine should
 not need modifying and the information below is only provided for
 
@@ -1228,7 +1210,7 @@
 \begin{verbatim}
 asm("membar #LoadStore|#StoreStore");
 \end{verbatim}
-for an Alpha based sytem the euivalent code reads
+for an Alpha based system the equivalent code reads
 \begin{verbatim}
 asm("mb");
 \end{verbatim}
 
@@ -1238,9 +1220,9 @@
 \end{verbatim}
 
 \item {\bf Cache line size}
-As discussed in section \ref{sec:cache_effects_and_false_sharing},
+As discussed in section \ref{sect:cache_effects_and_false_sharing},
 multi-threaded codes explicitly avoid penalties associated with excessive
-coherence traffic on an SMP system. To do this the sgared memory data structures
+coherence traffic on an SMP system. To do this the shared memory data structures
 used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
 are padded. The variables that control the padding are set in the
 header file {\em EEPARAMS.h}. These variables are called
 {\em cacheLineSize}, {\em lShare1}, {\em lShare4} and
 {\em lShare8}. The default values should not normally need changing.
 
 \item {\bf \_BARRIER}
 This is a CPP macro that is expanded to a call to a routine
-which synchronises all the logical processors running under the
+which synchronizes all the logical processors running under the
 WRAPPER. Using a macro here preserves flexibility to insert
 a specialized call in-line into application code. By default this
 resolves to calling the procedure {\em BARRIER()}.
 The default
 setting for the \_BARRIER macro is given in the file {\em CPP\_EEMACROS.h}.
 
@@ -1256,17 +1238,17 @@
 
 \item {\bf \_GSUM}
 This is a CPP macro that is expanded to a call to a routine
-which sums up a floating point numner
+which sums up a floating point number
 over all the logical processors running under the WRAPPER.
 Using a macro here provides extra flexibility to insert
 a specialized call in-line into application code. By default this
-resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for
-84=bit floating point operands)
-or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default
+resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
+64-bit floating point operands)
+or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
 setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
 The \_GSUM macro is a performance critical operation, especially for
 large processor count, small tile size configurations.
-The custom communication example discussed in section \ref{sec:jam_example}
+The custom communication example discussed in section \ref{sect:jam_example}
 shows how the macro is used to invoke a custom global sum routine
 for a specific set of hardware.
 
@@ -1280,24 +1262,24 @@
 in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
 \_EXCH operation plays a crucial role in scaling to small tile,
 large logical and physical processor count configurations.
-The example in section \ref{sec:jam_example} discusses defining an
-optimised and specialized form on the \_EXCH operation.
+The example in section \ref{sect:jam_example} discusses defining an
+optimized and specialized form of the \_EXCH operation.
 
 The \_EXCH operation is also central to supporting grids such as
 the cube-sphere grid. In this class of grid a rotation may be required
 between tiles. Aligning the coordinate requiring rotation with the
-tile decomposistion, allows the coordinate transformation to
+tile decomposition allows the coordinate transformation to
 be embedded within a custom form of the \_EXCH primitive.
 
 \item {\bf Reverse Mode}
 The communication primitives \_EXCH and \_GSUM both employ
 hand-written adjoint (or reverse mode) forms.
 These reverse mode forms can be found in the
-sourc code directory {\em pkg/autodiff}.
+source code directory {\em pkg/autodiff}.
 For the global sum primitive the reverse mode form
 calls are to {\em GLOBAL\_ADSUM\_R4} and
 {\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the
-exchamge primitives are found in routines
+exchange primitives is found in routines
 prefixed {\em ADEXCH}. The exchange routines make calls to the
 same low-level communication primitives as the forward mode
 operations. However, the routine argument {\em simulationMode}
 
@@ -1309,7 +1291,7 @@
 maximum number of OS threads that a code will use. This
 value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
 For single threaded execution it can be reduced to one if required.
-The va;lue is largely private to the WRAPPER and application code
+The value is largely private to the WRAPPER and application code
 will not normally reference the value, except in the following scenario.
 
 For certain physical parametrization schemes it is necessary to have
 
@@ -1320,12 +1302,12 @@
 if this might be unavailable then the work arrays can be extended
 with dimensions using the tile dimensioning scheme of {\em nSx}
 and {\em nSy} ( as described in section
-\ref{sec:specifying_a_decomposition}).
+\ref{sect:specifying_a_decomposition}).
 However, if the configuration
 being specified involves many more tiles than OS threads then
 it can save memory resources to reduce the variable
 {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
-will be used and to declare the physical parameterisation
-work arrays with a sinble {\em MAX\_NO\_THREADS} extra dimension.
+will be used and to declare the physical parameterization
+work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
 An example of this is given in the verification experiment
 {\em aim.5l\_cs}. Here the default setting of
 {\em MAX\_NO\_THREADS} is altered to
 
@@ -1338,24 +1320,48 @@
 \begin{verbatim}
       common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
 \end{verbatim}
-This declaration scheme is not used widely, becuase most global data
+This declaration scheme is not used widely, because most global data
 is used for permanent not temporary storage of state information.
 In the case of permanent state information this approach cannot be used
 because there has to be enough storage allocated for all tiles.
 However, the technique can sometimes be a useful scheme for reducing memory
-requirements in complex physical paramterisations.
-
+requirements in complex physical parameterizations.
 \end{enumerate}
 
+\begin{figure}
+\begin{verbatim}
+C--
+C-- Parallel directives for MIPS Pro Fortran compiler
+C--
+C      Parallel compiler directives for SGI with IRIX
+C$PAR  PARALLEL DO
+C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
+C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
+C
+      DO I=1,nThreads
+        myThid = I
+
+C--     Invoke nThreads instances of the numerical model
+        CALL THE_MODEL_MAIN(myThid)
+
+      ENDDO
+\end{verbatim}
+\caption{Prior to transferring control to
+the procedure {\em THE\_MODEL\_MAIN()} the WRAPPER may use
+MP directives to spawn multiple threads.
+} \label{fig:mp_directives}
+\end{figure}
+
+
 \subsubsection{Specializing the Communication Code}
 
 The isolation of performance critical communication primitives
 and the sub-division of the simulation domain into tiles is a powerful
 tool. Here we show how it can be used to improve application
 performance and
 how it can be used to adapt to new gridding approaches.
 
 \subsubsection{JAM example}
-\label{sec:jam_example}
+\label{sect:jam_example}
 On some platforms a big performance boost can be obtained by
 binding the communication routines {\em \_EXCH} and
 {\em \_GSUM} to specialized native libraries ( for example the
 
@@ -1371,28 +1377,28 @@
 \item The {\em \_GSUM} and {\em \_EXCH} macro definitions are replaced
 with calls to custom routines ( see {\em gsum\_jam.F} and {\em exch\_jam.F})
 \item A highly specialized form of the exchange operator (optimized
-for overlap regions of width one) is substitued into the elliptic
+for overlap regions of width one) is substituted into the elliptic
 solver routine {\em cg2d.F}.
 \end{itemize}
 
 Developing specialized code for other libraries follows a similar
 pattern.
 
 \subsubsection{Cube sphere communication}
-\label{sec:cube_sphere_communication}
+\label{sect:cube_sphere_communication}
 Actual {\em \_EXCH} routine code is generated automatically from
 a series of template files, for example {\em exch\_rx.template}.
 This is done to allow a large number of variations on the exchange
 process to be maintained. One set of variations supports the
-cube sphere grid. Support for a cube sphere gris in MITgcm is based
+cube sphere grid. Support for a cube sphere grid in MITgcm is based
 on having each face of the cube as a separate tile (or tiles).
-The exchage routines are then able to absorb much of the
+The exchange routines are then able to absorb much of the
 detailed rotation and reorientation required when moving around the
 cube grid. The set of {\em \_EXCH} routines that contain the
 word cube in their name perform these transformations.
 They are invoked when the run-time logical parameter
 {\em useCubedSphereExchange} is set true. To facilitate the
 transformations on a staggered C-grid, exchange operations are defined
-separately for both vector and scalar quantitities and for
+separately for both vector and scalar quantities and for
 grid-centered and for grid-face and corner quantities.
 Three sets of exchange routines are defined. Routines
 with names of the form {\em exch\_rx} are used to exchange
 
@@ -1411,10 +1417,10 @@
 
 Fitting together the WRAPPER elements, package elements and
 MITgcm core equation elements of the source code produces the calling
-sequence shown in section \ref{sec:calling_sequence}
+sequence shown in section \ref{sect:calling_sequence}.
 
 \subsection{Annotated call tree for MITgcm and WRAPPER}
-\label{sec:calling_sequence}
+\label{sect:calling_sequence}
 
 WRAPPER layer.
 
@@ -1457,7 +1463,7 @@
 C    |
 C    |-THE_MODEL_MAIN :: Primary driver for the MITgcm algorithm
 C    |                :: Called from WRAPPER level numerical
-C    |                :: code innvocation routine. On entry
+C    |                :: code invocation routine. On entry
 C    |                :: to THE_MODEL_MAIN separate thread and
 C    |                :: separate processes will have been established.
 C    |                :: Each thread and process will have a unique ID
 
@@ -1471,22 +1477,22 @@
 C    | |              :: By default kernel parameters are read from file
 C    | |              :: "data" in directory in which code executes.
 C    | |
-C    | |-MON_INIT :: Initialises monitor pacakge ( see pkg/monitor )
+C    | |-MON_INIT :: Initializes monitor package ( see pkg/monitor )
 C    | |
-C    | |-INI_GRID :: Control grid array (vert. and hori.) initialisation.
+C    | |-INI_GRID :: Control grid array (vert. and hori.) initialization.
 C    | | |        :: Grid arrays are held and described in GRID.h.
 C    | | |
-C    | | |-INI_VERTICAL_GRID        :: Initialise vertical grid arrays.
+C    | | |-INI_VERTICAL_GRID        :: Initialize vertical grid arrays.
 C    | | |
-C    | | |-INI_CARTESIAN_GRID       :: Cartesian horiz. grid initialisation
+C    | | |-INI_CARTESIAN_GRID       :: Cartesian horiz. grid initialization
 C    | | |                          :: (calculate grid from kernel parameters).
 C    | | |
 C    | | |-INI_SPHERICAL_POLAR_GRID :: Spherical polar horiz. grid
-C    | | |                          :: initialisation (calculate grid from
+C    | | |                          :: initialization (calculate grid from
 C    | | |                          :: kernel parameters).
 C    | | |
 C    | | |-INI_CURVILINEAR_GRID     :: General orthogonal, structured horiz.
-C    | |                            :: grid initialisations. ( input from raw
+C    | |                            :: grid initializations. ( input from raw
 C    | |                            :: grid files, LONC.bin, DXF.bin etc... )
 C    | |
 C    | |-INI_DEPTHS :: Read (from "bathyFile") or set bathymetry/orography.
 
@@ -1497,7 +1503,7 @@
 C    | |-INI_LINEAR_PHSURF :: Set ref. surface Bo_surf
 C    | |
 C    | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
-C    | |          :: sphere optins are coded.
+C    | |          :: sphere options are coded.
 C    | |
 C    | |-PACKAGES_BOOT :: Start up the optional package environment.
 C    | |               :: Runtime selection of active packages.
 
@@ -1518,7 +1524,7 @@
 C    | |-PACKAGES_CHECK
 C    | | |
 C    | | |-KPP_CHECK    :: KPP Package. pkg/kpp
-C    | | |-OBCS_CHECK   :: Open bndy Pacakge. pkg/obcs
+C    | | |-OBCS_CHECK   :: Open bndy Package. pkg/obcs
 C    | | |-GMREDI_CHECK :: GM Package. pkg/gmredi
 C    | |
 C    | |-PACKAGES_INIT_FIXED
 
@@ -1538,7 +1544,7 @@
 C    |-CTRL_UNPACK :: Control vector support package. see pkg/ctrl
 C    |
 C    |-ADTHE_MAIN_LOOP :: Derivative evaluating form of main time stepping loop
-C    !                 :: Auotmatically gerenrated by TAMC/TAF.
+C    !                 :: Automatically generated by TAMC/TAF.
 C    |
 C    |-CTRL_PACK :: Control vector support package. see pkg/ctrl
 C    |
 
@@ -1552,7 +1558,7 @@
 C    | | |-INI_LINEAR_PHISURF :: Set ref. surface Bo_surf
 C    | | |
 C    | | |-INI_CORI :: Set coriolis term. zero, f-plane, beta-plane,
-C    | | |          :: sphere optins are coded.
+C    | | |          :: sphere options are coded.
 C    | | |
 C    | | |-INI_CG2D :: 2d con. grad solver initialisation.
 C    | | |-INI_CG3D :: 3d con. grad solver initialisation.
 C    | | |
 C    | | |-INI_DYNVARS :: Initialise to zero all DYNVARS.h arrays (dynamical
 C    | | |             :: fields).
 C    | | |
-C    | | |-INI_FIELDS :: Control initialising model fields to non-zero
+C    | | |-INI_FIELDS :: Control initializing model fields to non-zero
 C    | | | |-INI_VEL   :: Initialize 3D flow field.
 C    | | | |-INI_THETA :: Set model initial temperature field.
 C    | | | |-INI_SALT  :: Set model initial salinity field.
 
@@ -1638,7 +1644,7 @@
 C/\  | | |-CALC_SURF_DR  :: Calculate the new surface level thickness.
 C/\  | | |-EXF_GETFORCING :: External forcing package. ( pkg/exf )
 C/\  | | |-EXTERNAL_FIELDS_LOAD :: Control loading time dep. external data.
-C/\  | | | |                    :: Simple interpolcation between end-points
+C/\  | | | |                    :: Simple interpolation between end-points
 C/\  | | | |                    :: for forcing datasets.
 C/\  | | | |
 C/\  | | | |-EXCH :: Sync forcing. in overlap regions.