--- manual/s_software/text/sarch.tex	2001/11/13 18:32:33	1.5
+++ manual/s_software/text/sarch.tex	2004/10/16 03:40:16	1.20
@@ -1,12 +1,25 @@
-% $Header: /home/ubuntu/mnt/e9_copy/manual/s_software/text/sarch.tex,v 1.5 2001/11/13 18:32:33 cnh Exp $
+% $Header: /home/ubuntu/mnt/e9_copy/manual/s_software/text/sarch.tex,v 1.20 2004/10/16 03:40:16 edhill Exp $
-In this chapter we describe the software architecture and
-implementation strategy for the MITgcm code. The first part of this
-chapter discusses the MITgcm architecture at an abstract level. In the second
-part of the chapter we described practical details of the MITgcm implementation
-and of current tools and operating system features that are employed.
+This chapter focuses on describing the {\bf WRAPPER} environment within which
+both the core numerics and the pluggable packages operate. The description
+presented here is intended to be a detailed exposition and contains significant
+background material, as well as advanced details on working with the WRAPPER.
+The tutorial sections of this manual (see sections
+\ref{sect:tutorials} and \ref{sect:tutorialIII})
+contain more succinct, step-by-step instructions on running basic numerical
+experiments, of various types, both sequentially and in parallel. For many
+projects, simply starting from an example code and adapting it to suit a
+particular situation
+will be all that is required.
+The first part of this chapter discusses the MITgcm architecture at an
+abstract level. In the second part of the chapter we describe practical
+details of the MITgcm implementation and of current tools and operating system
+features that are employed.
 \section{Overall architectural goals}
+\begin{rawhtml}
+
+\end{rawhtml}
 Broadly, the goals of the software architecture employed in MITgcm are
 three-fold
@@ -28,7 +41,7 @@
 \begin{enumerate}
 \item A core set of numerical and support code. This is discussed in detail in
-section \ref{sec:partII}.
+section \ref{sect:partII}.
 \item A scheme for supporting optional ``pluggable'' {\bf packages} (containing
 for example mixed-layer schemes, biogeochemical schemes, atmospheric physics).
 These packages are used both to overlay alternate dynamics and to introduce
@@ -66,6 +79,9 @@
 \end{figure}
 \section{WRAPPER}
+\begin{rawhtml}
+
+\end{rawhtml}
 A significant element of the software architecture utilized in
 MITgcm is a software superstructure and substructure collectively
@@ -74,7 +90,7 @@
 to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
 the WRAPPER means that coding has to follow certain, relatively
 straightforward, rules and conventions ( these are discussed further in
-section \ref{sec:specifying_a_decomposition} ).
+section \ref{sect:specifying_a_decomposition} ).
 The approach taken by the WRAPPER is illustrated in figure
 \ref{fig:fit_in_wrapper} which shows how the WRAPPER serves to insulate code
@@ -87,7 +103,7 @@
 \resizebox{!}{4.5in}{\includegraphics{part4/fit_in_wrapper.eps}}
 \end{center}
 \caption{
-Numerical code is written too fit within a software support
+Numerical code is written to fit within a software support
 infrastructure called WRAPPER. The WRAPPER is portable and
 can be specialized for a wide range of specific target hardware and
 programming environments, without impacting numerical code that fits
@@ -98,7 +114,7 @@
 \end{figure}
 \subsection{Target hardware}
-\label{sec:target_hardware}
+\label{sect:target_hardware}
 The WRAPPER is designed to target as broad a range of computer
 systems as possible.
 The original development of the WRAPPER took place on a
@@ -110,7 +126,7 @@
 (UMA) and non-uniform memory access (NUMA) designs. Significant work has also
 been undertaken on x86 cluster systems, Alpha processor based clustered SMP
 systems, and on cache-coherent NUMA (CC-NUMA) systems from Silicon Graphics.
-The MITgcm code, operating within the WRAPPER, is also used routinely used on
+The MITgcm code, operating within the WRAPPER, is also routinely used on
 large scale MPP systems (for example T3E systems and IBM SP systems). In all
 cases numerical code, operating within the WRAPPER, performs and scales very
 competitively with equivalent numerical code that has been modified to contain
@@ -118,7 +134,7 @@
 \subsection{Supporting hardware neutrality}
-The different systems listed in section \ref{sec:target_hardware} can be
+The different systems listed in section \ref{sect:target_hardware} can be
 categorized in many different ways. For example, one common distinction is
 between shared-memory parallel systems (SMP's, PVP's) and distributed memory
 parallel systems (for example x86 clusters and large MPP systems). This is one
@@ -140,6 +156,9 @@
 computer architecture currently available to the scientific computing community.
 \subsection{Machine model parallelism}
+\begin{rawhtml}
+
+\end{rawhtml}
 Codes operating under the WRAPPER target an abstract machine that is assumed to
 consist of one or more logical processors that can compute concurrently.
@@ -211,7 +230,7 @@
 whenever it requires values that lie outside the domain it owns. Periodically
 processors will make calls to WRAPPER functions to communicate data between
 tiles, in order to keep the overlap regions up to date (see section
-\ref{sec:communication_primitives}). The WRAPPER functions can use a
+\ref{sect:communication_primitives}). The WRAPPER functions can use a
 variety of different mechanisms to communicate data between tiles.
 \begin{figure}
@@ -298,7 +317,7 @@
 \end{figure}
 \subsection{Shared memory communication}
-\label{sec:shared_memory_communication}
+\label{sect:shared_memory_communication}
 Under shared memory communication, independent CPU's operate on the
 exact same global address space at the application level.
@@ -324,7 +343,7 @@
 communication very efficient provided it is used appropriately.
 \subsubsection{Memory consistency}
-\label{sec:memory_consistency}
+\label{sect:memory_consistency}
 When using shared memory communication between
 multiple processors the WRAPPER level shields user applications from
@@ -348,7 +367,7 @@
 ensure memory consistency for a particular platform.
 \subsubsection{Cache effects and false sharing}
-\label{sec:cache_effects_and_false_sharing}
+\label{sect:cache_effects_and_false_sharing}
 Shared-memory machines often have memory caches local to each processor,
 which contain mirrored copies of main memory. Automatic cache-coherence
@@ -367,7 +386,7 @@
 threads operating within a single process is the standard mechanism for
 supporting shared memory that the WRAPPER utilizes. Configuring and launching
 code to run in multi-threaded mode on specific platforms is discussed in
-section \ref{sec:running_with_threads}. However, on many systems, potentially
+section \ref{sect:running_with_threads}. However, on many systems, potentially
 very efficient mechanisms for using shared memory communication between
 multiple processes (in contrast to multiple threads within a single
 process) also exist. In most cases this works by making a limited region of
@@ -380,7 +399,7 @@
 nature.
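+The memory consistency issue discussed above can be made concrete with a
+short sketch. The following is not code from the MITgcm source; the buffer
+and flag names are purely illustrative, and only the routine {\em MEMSYNC()}
+is taken from the WRAPPER. It shows the ``put'' style of shared memory
+communication in which a producer thread forces its writes to be visible
+before signaling a consumer thread:
+\begin{verbatim}
+C     Illustrative sketch only. Thread 1 fills a shared buffer,
+C     forces memory consistency, then raises a ready flag. The
+C     other thread spins on the flag and calls MEMSYNC before
+C     reading, so that stale cached values are not used.
+      SUBROUTINE DEMO_SHARED_PUT( myThid )
+      INTEGER myThid
+      INTEGER I
+      REAL*8  sharedBuf(100)
+      INTEGER readyFlag
+      COMMON /DEMO_SHARED/ sharedBuf, readyFlag
+      IF ( myThid .EQ. 1 ) THEN
+       DO I = 1, 100
+        sharedBuf(I) = DBLE(I)
+       ENDDO
+       CALL MEMSYNC
+       readyFlag = 1
+      ELSE
+   10  CONTINUE
+       IF ( readyFlag .NE. 1 ) GOTO 10
+       CALL MEMSYNC
+C      sharedBuf can now be read safely.
+      ENDIF
+      RETURN
+      END
+\end{verbatim}
+A real implementation must also prevent the compiler from optimizing away
+the spin loop (for example by declaring the flag volatile); the WRAPPER
+hides such platform-specific details behind its barrier and
+synchronization routines.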
 \subsection{Distributed memory communication}
-\label{sec:distributed_memory_communication}
+\label{sect:distributed_memory_communication}
 Many parallel systems are not constructed in a way that makes it possible
 or practical for an application to use shared memory
 for communication. For example cluster systems consist of individual computers
@@ -394,7 +413,7 @@
 highly optimized library.
 \subsection{Communication primitives}
-\label{sec:communication_primitives}
+\label{sect:communication_primitives}
 \begin{figure}
 \begin{center}
@@ -527,6 +546,9 @@
 last 50 years.
 \section{Using the WRAPPER}
+\begin{rawhtml}
+
+\end{rawhtml}
 In order to support maximum portability the WRAPPER is implemented primarily
 in sequential Fortran 77. At a practical level the key steps provided by the
@@ -538,16 +560,16 @@
 computing CPU's.
 \end{enumerate}
 This section describes the details of each of these operations.
-Section \ref{sec:specifying_a_decomposition} explains how the way in which
+Section \ref{sect:specifying_a_decomposition} explains how the way in which
 a domain is decomposed (or composed) is expressed. Section
-\ref{sec:starting_a_code} describes practical details of running codes
+\ref{sect:starting_a_code} describes practical details of running codes
 in various parallel modes on contemporary computer systems.
-Section \ref{sec:controlling_communication} explains the internal information
+Section \ref{sect:controlling_communication} explains the internal information
 that the WRAPPER uses to control how information is communicated between
 tiles.
 \subsection{Specifying a domain decomposition}
-\label{sec:specifying_a_decomposition}
+\label{sect:specifying_a_decomposition}
 At its heart much of the WRAPPER works only in terms of a collection of tiles
 which are interconnected to each other. This is also true of application
@@ -599,7 +621,7 @@
 dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles
 are allocated to different threads of a process that are then bound to
 different physical processors ( see the multi-threaded
-execution discussion in section \ref{sec:starting_the_code} ) then
+execution discussion in section \ref{sect:starting_the_code} ) then
 computation will be performed concurrently on each tile. However, it is also
 possible to run the same decomposition within a process running a single thread on
 a single processor. In this case the tiles will be computed sequentially.
@@ -651,6 +673,12 @@
 computation is performed concurrently over as many processes and threads
 as there are physical processors available to compute.
+An exception to the use of {\em bi} and {\em bj} in loops arises in the
+exchange routines used when the exch2 package is used with the cubed
+sphere. In this case {\em bj} is generally set to 1 and the loop runs over
+{\em bi}. Within the loop {\em bi} is used to retrieve the tile number,
+which is then used to reference exchange parameters.
+
 The amount of computation that can be embedded within a single loop over {\em bi}
 and {\em bj} varies for different parts of the MITgcm algorithm.
 Figure \ref{fig:bibj_extract} shows a code extract
@@ -771,7 +799,7 @@
 forty grid points in y. The two sub-domains in each process will be computed
 sequentially if they are given to a single thread within a single process.
 Alternatively if the code is invoked with multiple threads per process
-the two domains in y may be computed on concurrently.
+the two domains in y may be computed concurrently.
 \item
 \begin{verbatim}
      PARAMETER (
@@ -790,14 +818,14 @@
 This set of values can be used for a cube sphere calculation.
 Each tile of size $32 \times 32$ represents a face of the
 cube. Initializing the tile connectivity correctly ( see section
-\ref{sec:cube_sphere_communication}. allows the rotations associated with
+\ref{sect:cube_sphere_communication} ) allows the rotations associated with
 moving between the six cube faces to be embedded within the
 tile-tile communication code.
 \end{enumerate}
 \subsection{Starting the code}
-\label{sec:starting_the_code}
+\label{sect:starting_the_code}
 When code is started under the WRAPPER, execution begins in a main routine {\em
 eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
 to the application through a routine called {\em THE\_MODEL\_MAIN()}
@@ -807,6 +835,7 @@
 WRAPPER is shown in figure \ref{fig:wrapper_startup}.
 \begin{figure}
+{\footnotesize
 \begin{verbatim}
        MAIN
@@ -835,6 +864,7 @@
 \end{verbatim}
+}
 \caption{Main stages of the WRAPPER startup procedure.
 This process precedes transfer of control to application code, which
 occurs through the procedure {\em THE\_MODEL\_MAIN()}.
@@ -842,13 +872,13 @@
 \end{figure}
 \subsubsection{Multi-threaded execution}
-\label{sec:multi-threaded-execution}
+\label{sect:multi-threaded-execution}
 Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
 WRAPPER may cause several coarse grain threads to be initialized. The routine
 {\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
 stack argument which is the thread number, stored in the
 variable {\em myThid}. In addition to specifying a decomposition with
-multiple tiles per process ( see section \ref{sec:specifying_a_decomposition})
+multiple tiles per process ( see section \ref{sect:specifying_a_decomposition})
 configuring and starting a code to run using multiple threads requires the following
 steps.\\
@@ -917,7 +947,7 @@
 File: {\em eesupp/inc/MAIN\_PDIRECTIVES2.h}\\
 File: {\em model/src/THE\_MODEL\_MAIN.F}\\
 File: {\em eesupp/src/MAIN.F}\\
-File: {\em tools/genmake}\\
+File: {\em tools/genmake2}\\
 File: {\em eedata}\\
 CPP: {\em TARGET\_SUN}\\
 CPP: {\em TARGET\_DEC}\\
@@ -930,7 +960,7 @@
 } \\
 \subsubsection{Multi-process execution}
-\label{sec:multi-process-execution}
+\label{sect:multi-process-execution}
 Despite its appealing programming model, multi-threaded execution
 remains less common than multi-process execution. One major reason for
 this
@@ -942,7 +972,7 @@
 Multi-process execution is more widespread.
 In order to run code in a
 multi-process configuration a decomposition
-specification ( see section \ref{sec:specifying_a_decomposition})
+specification ( see section \ref{sect:specifying_a_decomposition})
 is given ( in which at least one of the
 parameters {\em nPx} or {\em nPy} will be greater than one)
 and then, as for multi-threaded operation,
@@ -956,45 +986,21 @@
 of controlling and coordinating the start up of a large number
 (hundreds and possibly even thousands) of copies of the same
 program, MPI is used. The calls to the MPI multi-process startup
-routines must be activated at compile time. This is done
-by setting the {\em ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI}
-flags in the {\em CPP\_EEOPTIONS.h} file.\\
+routines must be activated at compile time. Currently MPI libraries are
+invoked by
+specifying the appropriate options file with the
+{\tt -of} flag when running the {\em genmake2}
+script, which generates the Makefile for compiling and linking MITgcm.
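+For example, an MPI-enabled build might be configured along the
+following lines. This is a sketch only: the options file named here is
+hypothetical, and the files actually available under
+{\em tools/build\_options} vary from platform to platform and from
+release to release.
+\begin{verbatim}
+% ../../../tools/genmake2 -mods=../code \
+    -of=../../../tools/build_options/linux_ia32_g77+mpi
+% make depend
+% make
+\end{verbatim}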
+(Previously this was done by setting the {\em ALLOW\_USE\_MPI} and
+{\em ALWAYS\_USE\_MPI} flags in the {\em CPP\_EEOPTIONS.h} file.) More
+detailed information about the use of {\em genmake2} for specifying
+local compiler flags is located in section \ref{sect:genmake}.\\
-\fbox{
-\begin{minipage}{4.75in}
-File: {\em eesupp/inc/CPP\_EEOPTIONS.h}\\
-CPP: {\em ALLOW\_USE\_MPI}\\
-CPP: {\em ALWAYS\_USE\_MPI}\\
-Parameter: {\em nPx}\\
-Parameter: {\em nPy}
-\end{minipage}
-} \\
-
-Additionally, compile time options are required to link in the
-MPI libraries and header files. Examples of these options
-can be found in the {\em genmake} script that creates makefiles
-for compilation. When this script is executed with the {bf -mpi}
-flag it will generate a makefile that includes
-paths for search for MPI head files and for linking in
-MPI libraries. For example the {\bf -mpi} flag on a
- Silicon Graphics IRIX system causes a
-Makefile with the compilation command
-Graphics IRIX system
-\begin{verbatim}
-mpif77 -I/usr/local/mpi/include -DALLOW_USE_MPI -DALWAYS_USE_MPI
-\end{verbatim}
-to be generated.
-This is the correct set of options for using the MPICH open-source
-version of MPI, when it has been installed under the subdirectory
-/usr/local/mpi.
-However, on many systems there may be several
-versions of MPI installed. For example many systems have both
-the open source MPICH set of libraries and a vendor specific native form
-of the MPI libraries. The correct setup to use will depend on the
-local configuration of your system.\\
 \fbox{
 \begin{minipage}{4.75in}
-File: {\em tools/genmake}
+Directory: {\em tools/build\_options}\\
+File: {\em tools/genmake2}
 \end{minipage}
 } \\
 \paragraph{\bf Execution} The mechanics of starting a program in
@@ -1012,7 +1018,7 @@
 in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
 called ``mf'' will be read to get a list of processor names on
 which the sixty-four processes will execute. The syntax of this file
-is specified by the MPI distribution
+is specified by the MPI distribution.
 \\
 \fbox{
@@ -1063,15 +1069,20 @@
 Allocation of processes to tiles is controlled by the routine
 {\em INI\_PROCS()}. For each process this routine sets
 the variables {\em myXGlobalLo} and {\em myYGlobalLo}.
-These variables specify (in index space) the coordinate
-of the southern most and western most corner of the
-southern most and western most tile owned by this process.
+These variables specify in index space the coordinates
+of the southernmost and westernmost corner of the
+southernmost and westernmost tile owned by this process.
 The variables {\em pidW}, {\em pidE}, {\em pidS} and {\em pidN}
 are also set in this routine. These are used to identify
 processes holding tiles to the west, east, south and north
 of this process. These values are stored in global storage
 in the header file {\em EESUPPORT.h} for use by
-communication routines.
+communication routines. The above does not hold when the
+exch2 package is used -- exch2 sets its own parameters to
+specify the global indices of tiles and their relationships
+to each other. See the documentation on the exch2 package
+(\ref{sec:exch2}) for
+details.
 \\
 \fbox{
 \begin{minipage}{4.75in}
@@ -1097,10 +1108,13 @@
 describes the information that is held and used.
 \begin{enumerate}
-\item {\bf Tile-tile connectivity information} For each tile the WRAPPER
-sets a flag that sets the tile number to the north, south, east and
+\item {\bf Tile-tile connectivity information}
+For each tile the WRAPPER
+sets a flag that records the tile number to the north,
+south, east and
 west of that tile. This number is unique over all tiles in a
-configuration. The number is held in the variables {\em tileNo}
+configuration. Except when using the cubed sphere and the exch2 package,
+the number is held in the variables {\em tileNo}
 ( this holds the tile's own number), {\em tileNoN}, {\em tileNoS},
 {\em tileNoE} and {\em tileNoW}. A parameter is also stored with each tile
 that specifies the type of communication that is used between tiles.
@@ -1112,10 +1126,10 @@
 neighbor to communicate with on a particular face. A value of
 {\em COMM\_MSG} is used to indicate that some form of distributed
 memory communication is required to communicate between
-these tile faces ( see section \ref{sec:distributed_memory_communication}).
+these tile faces ( see section \ref{sect:distributed_memory_communication}).
 A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
 forms of shared memory communication ( see section
-\ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value indicates
+\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
 that a CPU should communicate by writing to data structures owned by another
 CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
 from data structures owned by another CPU. These flags affect the behavior
@@ -1123,7 +1137,13 @@
 (see figure \ref{fig:communication_primitives}). The routine
 {\em ini\_communication\_patterns()} is responsible for setting the
 communication mode values for each tile.
-\\
+
+When using the cubed sphere configuration with the exch2 package, the
+relationships between tiles and their communication methods are set
+by the package in other variables. See the exch2 package documentation
+(\ref{sec:exch2}) for details.
+
+
 \fbox{
 \begin{minipage}{4.75in}
@@ -1166,7 +1186,7 @@
 are read from the file {\em eedata}. If the value of
 {\em nThreads} is inconsistent with the number of threads requested
 from the operating system (for example by using an environment
-variable as described in section \ref{sec:multi_threaded_execution})
+variable as described in section \ref{sect:multi-threaded-execution})
 then usually an error will be reported by the routine
 {\em CHECK\_THREADS}.\\
@@ -1184,7 +1204,7 @@
 }
 \item {\bf memsync flags}
-As discussed in section \ref{sec:memory_consistency}, when using shared memory,
+As discussed in section \ref{sect:memory_consistency}, when using shared memory,
 a low-level system function may be needed to force memory consistency.
 The routine {\em MEMSYNC()} is used for this purpose. This routine should
 not need modifying and the information below is only provided for
@@ -1210,7 +1230,7 @@
 \end{verbatim}
 \item {\bf Cache line size}
-As discussed in section \ref{sec:cache_effects_and_false_sharing},
+As discussed in section \ref{sect:cache_effects_and_false_sharing},
 multi-threaded codes explicitly avoid penalties associated with excessive
 coherence traffic on an SMP system. To do this the shared memory data structures
 used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
@@ -1238,7 +1258,7 @@
 setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
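+As an illustration of how the macro is used, the following sketch
+accumulates a per-thread partial sum over the tiles owned by a thread
+and then combines the result across all threads and processes. The
+field {\em phi} is illustrative and the \_GSUM argument list is shown
+schematically; the actual macro definition should be taken from
+{\em CPP\_EEMACROS.h}.
+\begin{verbatim}
+C     Sum a tiled 2-D field phi over the full domain.
+      INTEGER i, j, bi, bj
+      _RL sumPhi
+      _RL phi( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, nSx, nSy )
+      sumPhi = 0. _d 0
+      DO bj = myByLo(myThid), myByHi(myThid)
+       DO bi = myBxLo(myThid), myBxHi(myThid)
+        DO j = 1, sNy
+         DO i = 1, sNx
+          sumPhi = sumPhi + phi(i,j,bi,bj)
+         ENDDO
+        ENDDO
+       ENDDO
+      ENDDO
+C     Combine partial sums from all threads and processes.
+      _GSUM( sumPhi, myThid )
+\end{verbatim}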
 The \_GSUM macro is a performance critical operation, especially for
 large processor count, small tile size configurations.
-The custom communication example discussed in section \ref{sec:jam_example}
+The custom communication example discussed in section \ref{sect:jam_example}
 shows how the macro is used to invoke a custom global sum routine
 for a specific set of hardware.
@@ -1252,14 +1272,16 @@
 in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
 \_EXCH operation plays a crucial role in scaling to small tile,
 large logical and physical processor count configurations.
-The example in section \ref{sec:jam_example} discusses defining an
+The example in section \ref{sect:jam_example} discusses defining an
 optimized and specialized form of the \_EXCH operation.
 The \_EXCH operation is also central to supporting grids such as
 the cube-sphere grid. In this class of grid a rotation may be required
 between tiles. Aligning the coordinate requiring rotation with the
 tile decomposition allows the coordinate transformation to
-be embedded within a custom form of the \_EXCH primitive.
+be embedded within a custom form of the \_EXCH primitive. In these
+cases \_EXCH is mapped to exch2 routines, as detailed in the exch2
+package documentation (\ref{sec:exch2}).
 \item {\bf Reverse Mode}
 The communication primitives \_EXCH and \_GSUM both employ
@@ -1276,6 +1298,7 @@
 is set to the value {\em REVERSE\_SIMULATION}. This signifies to
 the low-level routines that the adjoint forms of the
 appropriate communication operation should be performed.
+
 \item {\bf MAX\_NO\_THREADS}
 The variable {\em MAX\_NO\_THREADS} is used to indicate the
 maximum number of OS threads that a code will use. This
@@ -1292,7 +1315,7 @@
 if this might be unavailable then the work arrays can be extended with
 dimensions using the tile dimensioning scheme of {\em nSx}
 and {\em nSy} ( as described in section
-\ref{sec:specifying_a_decomposition}). However, if the configuration
+\ref{sect:specifying_a_decomposition}). However, if the configuration
 being specified involves many more tiles than OS threads then
 it can save memory resources to reduce the variable
 {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
@@ -1351,7 +1374,7 @@
 how it can be used to adapt to new gridding approaches.
 \subsubsection{JAM example}
-\label{sec:jam_example}
+\label{sect:jam_example}
 On some platforms a big performance boost can be obtained by
 binding the communication routines {\em \_EXCH} and
 {\em \_GSUM} to specialized native libraries ( for example the
@@ -1374,13 +1397,13 @@
 pattern.
 \subsubsection{Cube sphere communication}
-\label{sec:cube_sphere_communication}
+\label{sect:cube_sphere_communication}
 Actual {\em \_EXCH} routine code is generated automatically from
 a series of template files, for example {\em exch\_rx.template}.
 This is done to allow a large number of variations on the exchange
 process to be maintained. One set of variations supports the cube
 sphere grid. Support for a cube sphere grid in MITgcm is based
-on having each face of the cube as a separate tile (or tiles).
+on having each face of the cube as a separate tile or tiles.
 The exchange routines are then able to absorb much of the
 detailed rotation and reorientation required when moving around the
 cube grid.
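+As a usage sketch, a tiled field declared with overlap (halo) regions
+is brought up to date in those overlap points by a single exchange
+call. The field name here is illustrative; the exchange routines
+themselves are generated from the template files described above.
+\begin{verbatim}
+C     A tiled 2-D field with overlap regions of width OLx, OLy.
+      _RL theta( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, nSx, nSy )
+C     ... compute on interior points 1..sNx, 1..sNy ...
+C     Fill the overlap regions from neighboring tiles. On the
+C     cube any required rotations happen inside the exchange.
+      CALL EXCH_XY_RL( theta, myThid )
+\end{verbatim}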
 The set of {\em \_EXCH} routines that contain the
@@ -1404,16 +1427,20 @@
 \section{MITgcm execution under WRAPPER}
+\begin{rawhtml}
+
+\end{rawhtml}
 Fitting together the WRAPPER elements, package elements and
 MITgcm core equation elements of the source code produces the calling
-sequence shown in section \ref{sec:calling_sequence}
+sequence shown in section \ref{sect:calling_sequence}.
 \subsection{Annotated call tree for MITgcm and WRAPPER}
-\label{sec:calling_sequence}
+\label{sect:calling_sequence}
 WRAPPER layer.
+{\footnotesize
 \begin{verbatim}
        MAIN
@@ -1441,9 +1468,11 @@
 |--THE_MODEL_MAIN :: Numerical code top-level driver routine
 \end{verbatim}
+}
 Core equations plus packages.
+{\footnotesize
 \begin{verbatim}
 C
 C
@@ -1782,6 +1811,7 @@
 C  :: events.
 C
 \end{verbatim}
+}
 \subsection{Measuring and Characterizing Performance}