
Diff of /manual/s_software/text/sarch.tex


revision 1.5 by cnh, Tue Nov 13 18:32:33 2001 UTC
revision 1.6 by adcroft, Tue Nov 13 20:13:55 2001 UTC
# Line 28  of

\begin{enumerate}
\item A core set of numerical and support code. This is discussed in detail in
section \ref{sect:partII}.
\item A scheme for supporting optional ``pluggable'' {\bf packages} (containing,
for example, mixed-layer schemes, biogeochemical schemes, atmospheric physics).
These packages are used both to overlay alternate dynamics and to introduce
# Line 74  Environment Resource). All numerical and
to ``fit'' within the WRAPPER infrastructure. Writing code to ``fit'' within
the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sect:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to insulate code
# Line 98  optimized for that platform.}
\end{figure}

\subsection{Target hardware}
\label{sect:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible. The original development of the WRAPPER took place on a
# Line 118  native optimizations for a particular sy

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sect:target_hardware} can be
categorized in many different ways. For example, one common distinction is
between shared-memory parallel systems (SMPs, PVPs) and distributed-memory
parallel systems (for example, x86 clusters and large MPP systems). This is one
# Line 211  computational phases a processor will re
whenever it requires values that lie outside the domain it owns. Periodically,
processors will make calls to WRAPPER functions to communicate data between
tiles, in order to keep the overlap regions up to date (see section
\ref{sect:communication_primitives}). The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.

\begin{figure}
# Line 298  value to be communicated between CPU's.
\end{figure}

\subsection{Shared memory communication}
\label{sect:shared_memory_communication}

Under shared memory communication, independent CPUs operate
on the exact same global address space at the application level.
# Line 324  the systems main-memory interconnect. Th
communication very efficient provided it is used appropriately.

\subsubsection{Memory consistency}
\label{sect:memory_consistency}

When using shared memory communication between
multiple processors, the WRAPPER level shields user applications from
# Line 348  memory, the WRAPPER provides a place to
ensure memory consistency for a particular platform.

\subsubsection{Cache effects and false sharing}
\label{sect:cache_effects_and_false_sharing}

Shared-memory machines often have memory caches local to each processor,
which contain mirrored copies of main memory. Automatic cache-coherence
# Line 367  in an application are potentially visibl
threads operating within a single process is the standard mechanism for
supporting shared memory that the WRAPPER utilizes. Configuring and launching
code to run in multi-threaded mode on specific platforms is discussed in
section \ref{sect:running_with_threads}. However, on many systems, potentially
very efficient mechanisms for using shared memory communication between
multiple processes (in contrast to multiple threads within a single
process) also exist. In most cases, this works by making a limited region of
# Line 380  distributed with the default WRAPPER sou
nature.

\subsection{Distributed memory communication}
\label{sect:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory
for communication. For example, cluster systems consist of individual computers
# Line 394  described in \ref{hoe-hill:99} substitut
highly optimized library.

\subsection{Communication primitives}
\label{sect:communication_primitives}

\begin{figure}
\begin{center}
# Line 538  WRAPPER are
computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sect:specifying_a_decomposition} explains how the way in which
a domain is decomposed (or composed) is expressed. Section
\ref{sect:starting_the_code} describes practical details of running codes
in various different parallel modes on contemporary computer systems.
Section \ref{sect:controlling_communication} explains the internal information
that the WRAPPER uses to control how information is communicated between
tiles.

\subsection{Specifying a domain decomposition}
\label{sect:specifying_a_decomposition}

At its heart, much of the WRAPPER works only in terms of a collection of tiles
which are interconnected to each other. This is also true of application
# Line 599  be created within a single process. Each
dimensions of {\em sNx} and {\em sNy}. If, when the code is executed, these tiles are
allocated to different threads of a process that are then bound to
different physical processors (see the multi-threaded
execution discussion in section \ref{sect:starting_the_code}), then
computation will be performed concurrently on each tile. However, it is also
possible to run the same decomposition within a process running a single thread on
a single processor. In this case, the tiles will be computed sequentially.
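The fragment below is an illustrative sketch, not code taken from the model
source, of the tile-loop idiom that code written to ``fit'' the WRAPPER
typically follows; the names {\em theta}, {\em gTheta} and {\em deltaT} are
placeholders, and {\em Real*8} stands in for the WRAPPER's own type macros.
Each thread visits only the range of tile indices that the WRAPPER associates
with its thread number {\em myThid} (through {\em myBxLo}, {\em myBxHi},
{\em myByLo} and {\em myByHi}); with a single thread these ranges span every
tile, and the same loop simply works through them sequentially.
\begin{verbatim}
C     Illustrative sketch of the WRAPPER tile-loop idiom; theta,
C     gTheta and deltaT are placeholder names.  Header names follow
C     the usual WRAPPER convention.
      SUBROUTINE SKETCH_TILE_UPDATE( theta, gTheta, deltaT, myThid )
      IMPLICIT NONE
#include "SIZE.h"
#include "EEPARAMS.h"
#include "EESUPPORT.h"
      Real*8  theta (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      Real*8  gTheta(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      Real*8  deltaT
      INTEGER myThid
      INTEGER bi, bj, i, j
C     Each thread traverses only the tiles assigned to it
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j = 1, sNy
         DO i = 1, sNx
          theta(i,j,bi,bj) = theta(i,j,bi,bj)
     &                     + deltaT*gTheta(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
      RETURN
      END
\end{verbatim}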
# Line 790  There are six tiles allocated to six sep
This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sect:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code (an illustrative sketch of such a layout is
given below).
\end{enumerate}
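As an illustration of the last item above, one way of expressing such a layout
in a {\em SIZE.h}-style parameter block is sketched below: six $32 \times 32$
tiles held within a single process. The overlap widths and the vertical
dimension {\em Nr} are placeholder values, and the same six tiles could
instead be spread over several threads or processes by adjusting {\em nSx},
{\em nSy}, {\em nPx} and {\em nPy}.
\begin{verbatim}
C     Illustrative SIZE.h-style settings (placeholder values) for six
C     32x32 tiles, one per cube face, held within a single process.
      INTEGER sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy, Nx, Ny, Nr
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   2,
     &           OLy =   2,
     &           nSx =   6,
     &           nSy =   1,
     &           nPx =   1,
     &           nPy =   1,
     &           Nx  = sNx*nSx*nPx,
     &           Ny  = sNy*nSy*nPy,
     &           Nr  =  15 )
\end{verbatim}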
\subsection{Starting the code}
\label{sect:starting_the_code}
When code is started under the WRAPPER, execution begins in a main routine {\em
eesupp/src/main.F} that is owned by the WRAPPER. Control is transferred
to the application through a routine called {\em THE\_MODEL\_MAIN()}
# Line 842  occurs through the procedure {\em THE\_M
\end{figure}

\subsubsection{Multi-threaded execution}
\label{sect:multi-threaded-execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()}, the
WRAPPER may cause several coarse-grain threads to be initialized. The routine
{\em THE\_MODEL\_MAIN()} is called once for each thread and is passed a single
stack argument, which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sect:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires the
following steps.\\

# Line 930  Parameter:  {\em nTy}
} \\

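As an illustrative sketch only, a run using a $2 \times 2$ arrangement of
threads per process would set {\em nTx} and {\em nTy} in the {\em eedata}
file along the following lines; the namelist group name and terminator should
follow the {\em eedata} file distributed with the model, and the values shown
here are placeholders.
\begin{verbatim}
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}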
\subsubsection{Multi-process execution}
\label{sect:multi-process-execution}

Despite its appealing programming model, multi-threaded execution remains
less common than multi-process execution. One major reason for this
# Line 942  models varies between systems.

Multi-process execution is more ubiquitous.
In order to run code in a multi-process configuration, a decomposition
specification (see section \ref{sect:specifying_a_decomposition})
is given (in which at least one of the
parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation,
# Line 1112  A value of {\em COMM\_NONE} is used to i
neighbor to communicate with on a particular face. A value
of {\em COMM\_MSG} is used to indicate that some form of distributed
memory communication is required to communicate between
these tile faces (see section \ref{sect:distributed_memory_communication}).
A value of {\em COMM\_PUT} or {\em COMM\_GET} is used to indicate
forms of shared memory communication (see section
\ref{sect:shared_memory_communication}). The {\em COMM\_PUT} value indicates
that a CPU should communicate by writing to data structures owned by another
CPU. A {\em COMM\_GET} value indicates that a CPU should communicate by reading
from data structures owned by another CPU. These flags affect the behavior
# Line 1166  the product of the parameters {\em nTx}
are read from the file {\em eedata}. If the value of {\em nThreads}
is inconsistent with the number of threads requested from the
operating system (for example, by using an environment
variable as described in section \ref{sect:multi-threaded-execution})
then an error will usually be reported by the routine
{\em CHECK\_THREADS}.\\

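For instance, with the illustrative settings sketched earlier ({\em nTx}=2,
{\em nTy}=2), the WRAPPER expects {\em nThreads} = {\em nTx} $\times$ {\em nTy}
= $2 \times 2$ = 4 threads to have been requested from the operating system;
any mismatch is normally trapped by {\em CHECK\_THREADS}.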
# Line 1184  Parameter: {\em nTy} \\
}

\item {\bf memsync flags}
As discussed in section \ref{sect:memory_consistency}, when using shared memory,
a low-level system function may be needed to force memory consistency.
The routine {\em MEMSYNC()} is used for this purpose. This routine should
not need modifying and the information below is only provided for
# Line 1210  asm("lock; addl $0,0(%%esp)": : :"memory
\end{verbatim}

\item {\bf Cache line size}
As discussed in section \ref{sect:cache_effects_and_false_sharing},
multi-threaded codes explicitly avoid penalties associated with excessive
coherence traffic on an SMP system. To do this, the shared memory data structures
used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
# Line 1238  or {\em GLOBAL\_SUM\_R4()} (for 32-bit f
setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
The \_GSUM macro is a performance-critical operation, especially for
large processor count, small tile size configurations.
The custom communication example discussed in section \ref{sect:jam_example}
shows how the macro is used to invoke a custom global sum routine
for a specific set of hardware.

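Purely as an illustration of what such a compile-time binding looks like, a
\_GSUM definition might take roughly the following form. The guard name and
argument list shown here are assumptions, not the distributed definition,
which should be taken from {\em CPP\_EEMACROS.h}; only the routine names
{\em GLOBAL\_SUM\_R8()} and {\em GLOBAL\_SUM\_R4()} come from the text above.
\begin{verbatim}
C     Hypothetical sketch of a _GSUM binding; not the distributed form.
#ifdef GLOBAL_SUM_USE_R4
#define _GSUM(a,myThid)  CALL GLOBAL_SUM_R4 ( a, myThid )
#else
#define _GSUM(a,myThid)  CALL GLOBAL_SUM_R8 ( a, myThid )
#endif
\end{verbatim}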
# Line 1252  physical fields and whether fields are 3
in the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the
\_EXCH operation plays a crucial role in scaling to small tile,
large logical and physical processor count configurations.
The example in section \ref{sect:jam_example} discusses defining an
optimized and specialized form of the \_EXCH operation.

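To make the role of the exchange more concrete, the fragment below sketches,
for a single tile face, what an overlap update amounts to when the neighboring
tile lives in the same process: the western overlap columns of one tile are
filled from the eastern interior columns of its neighbor. This is an
illustrative sketch only; the routine name, the assumption of simple periodic
tile numbering in $x$, and the use of {\em Real*8} in place of the WRAPPER's
own type macros are all simplifications. The real generated {\em \_EXCH}
routines handle every face and corner and select between the {\em COMM\_PUT},
{\em COMM\_GET} and {\em COMM\_MSG} mechanisms described earlier.
\begin{verbatim}
C     SKETCH_EXCH_WEST: illustrative sketch of a west-face overlap
C     update between tiles held in the same process.  Not part of
C     the WRAPPER source.
      SUBROUTINE SKETCH_EXCH_WEST( phi, bi, bj, myThid )
      IMPLICIT NONE
#include "SIZE.h"
      Real*8  phi(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
      INTEGER bi, bj, myThid
      INTEGER i, j, biW
C     Western neighbour of tile bi, assuming periodic numbering in x
      biW = bi - 1
      IF ( biW .LT. 1 ) biW = nSx
      DO j = 1, sNy
       DO i = 1, OLx
        phi( 1-i, j, bi, bj ) = phi( sNx+1-i, j, biW, bj )
       ENDDO
      ENDDO
      RETURN
      END
\end{verbatim}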
The \_EXCH operation is also central to supporting grids such as
# Line 1292  This can be achieved using a Fortran 90
if this might be unavailable, then the work arrays can be extended
with dimensions using the tile dimensioning scheme of {\em nSx}
and {\em nSy} (as described in section
\ref{sect:specifying_a_decomposition}). However, if the configuration
being specified involves many more tiles than OS threads, then
it can save memory resources to reduce the variable
{\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
# Line 1351  Here we show how it can be used to impro
how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sect:jam_example}
On some platforms, a big performance boost can be obtained by
binding the communication routines {\em \_EXCH} and
{\em \_GSUM} to specialized native libraries (for example, the
# Line 1374  Developing specialized code for other li
pattern.

\subsubsection{Cube sphere communication}
\label{sect:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from
a series of template files, for example {\em exch\_rx.template}.
This is done to allow a large number of variations on the exchange
# Line 1407  quantities at the C-grid vorticity point

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sect:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sect:calling_sequence}

WRAPPER layer.
