Diff of /manual/s_software/text/sarch.tex

revision 1.4 by cnh, Thu Oct 25 18:36:55 2001 UTC revision 1.5 by cnh, Tue Nov 13 18:32:33 2001 UTC
# Line 136  particular machine (for example an IBM S Line 136  particular machine (for example an IBM S
136  class of machines (for example Parallel Vector Processor Systems). Instead the  class of machines (for example Parallel Vector Processor Systems). Instead the
137  WRAPPER provides applications with an  WRAPPER provides applications with an
138  abstract {\it machine model}. The machine model is very general, however, it can  abstract {\it machine model}. The machine model is very general, however, it can
139  easily be specialized to fit, in a computationally effificent manner, any  easily be specialized to fit, in a computationally efficient manner, any
140  computer architecture currently available to the scientific computing community.  computer architecture currently available to the scientific computing community.
141    
142  \subsection{Machine model parallelism}  \subsection{Machine model parallelism}
143    
144   Codes operating under the WRAPPER target an abstract machine that is assumed to   Codes operating under the WRAPPER target an abstract machine that is assumed to
145  consist of one or more logical processors that can compute concurrently.    consist of one or more logical processors that can compute concurrently.  
146  Computational work is divided amongst the logical  Computational work is divided among the logical
147  processors by allocating ``ownership'' to  processors by allocating ``ownership'' to
148  each processor of a certain set (or sets) of calculations. Each set of  each processor of a certain set (or sets) of calculations. Each set of
149  calculations owned by a particular processor is associated with a specific  calculations owned by a particular processor is associated with a specific
# Line 402  highly optimized library. Line 402  highly optimized library.
402    \includegraphics{part4/comm-primm.eps}    \includegraphics{part4/comm-primm.eps}
403   }   }
404  \end{center}  \end{center}
405  \caption{Three performance critical parallel primititives are provided  \caption{Three performance critical parallel primitives are provided
406  by the WRAPPER. These primititives are always used to communicate data  by the WRAPPER. These primitives are always used to communicate data
407  between tiles. The figure shows four tiles. The curved arrows indicate  between tiles. The figure shows four tiles. The curved arrows indicate
408  exchange primitives which transfer data between the overlap regions at tile  exchange primitives which transfer data between the overlap regions at tile
409  edges and interior regions for nearest-neighbor tiles.  edges and interior regions for nearest-neighbor tiles.
# Line 1006  using a command such as Line 1006  using a command such as
1006  \begin{verbatim}  \begin{verbatim}
1007  mpirun -np 64 -machinefile mf ./mitgcmuv  mpirun -np 64 -machinefile mf ./mitgcmuv
1008  \end{verbatim}  \end{verbatim}
1009  In this example the text {\em -np 64} specifices the number of processes  In this example the text {\em -np 64} specifies the number of processes
1010  that will be created. The numeric value {\em 64} must be equal to the  that will be created. The numeric value {\em 64} must be equal to the
1011  product of the processor grid settings of {\em nPx} and {\em nPy}  product of the processor grid settings of {\em nPx} and {\em nPy}
1012  in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file  in the file {\em SIZE.h}. The parameter {\em mf} specifies that a text file
# Line 1212  asm("lock; addl $0,0(%%esp)": : :"memory Line 1212  asm("lock; addl $0,0(%%esp)": : :"memory
1212  \item {\bf Cache line size}  \item {\bf Cache line size}
1213  As discussed in section \ref{sec:cache_effects_and_false_sharing},  As discussed in section \ref{sec:cache_effects_and_false_sharing},
1214  multi-threaded codes explicitly avoid penalties associated with excessive  multi-threaded codes explicitly avoid penalties associated with excessive
1215  coherence traffic on an SMP system. To do this the sgared memory data structures  coherence traffic on an SMP system. To do this the shared memory data structures
1216  used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines  used by the {\em GLOBAL\_SUM}, {\em GLOBAL\_MAX} and {\em BARRIER} routines
1217  are padded. The variables that control the padding are set in the  are padded. The variables that control the padding are set in the
1218  header file {\em EEPARAMS.h}. These variables are called  header file {\em EEPARAMS.h}. These variables are called
# Line 1220  header file {\em EEPARAMS.h}. These vari Line 1220  header file {\em EEPARAMS.h}. These vari
1220  {\em lShare8}. The default values should not normally need changing.  {\em lShare8}. The default values should not normally need changing.
1221  \item {\bf \_BARRIER}  \item {\bf \_BARRIER}
1222  This is a CPP macro that is expanded to a call to a routine  This is a CPP macro that is expanded to a call to a routine
1223  which synchronises all the logical processors running under the  which synchronizes all the logical processors running under the
1224  WRAPPER. Using a macro here preserves flexibility to insert  WRAPPER. Using a macro here preserves flexibility to insert
1225  a specialized call in-line into application code. By default this  a specialized call in-line into application code. By default this
1226  resolves to calling the procedure {\em BARRIER()}. The default  resolves to calling the procedure {\em BARRIER()}. The default
# Line 1228  setting for the \_BARRIER macro is given Line 1228  setting for the \_BARRIER macro is given
1228    
1229  \item {\bf \_GSUM}  \item {\bf \_GSUM}
1230  This is a CPP macro that is expanded to a call to a routine  This is a CPP macro that is expanded to a call to a routine
1231  which sums up a floating point numner  which sums up a floating point number
1232  over all the logical processors running under the  over all the logical processors running under the
1233  WRAPPER. Using a macro here provides extra flexibility to insert  WRAPPER. Using a macro here provides extra flexibility to insert
1234  a specialized call in-line into application code. By default this  a specialized call in-line into application code. By default this
1235  resolves to calling the procedure {\em GLOBAL\_SOM\_R8()} ( for  resolves to calling the procedure {\em GLOBAL\_SUM\_R8()} ( for
1236  84=bit floating point operands)  64-bit floating point operands)
1237  or {\em GLOBAL\_SOM\_R4()} (for 32-bit floating point operands). The default  or {\em GLOBAL\_SUM\_R4()} (for 32-bit floating point operands). The default
1238  setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.  setting for the \_GSUM macro is given in the file {\em CPP\_EEMACROS.h}.
1239  The \_GSUM macro is a performance critical operation, especially for  The \_GSUM macro is a performance critical operation, especially for
1240  large processor count, small tile size configurations.  large processor count, small tile size configurations.
# Line 1253  in the header file {\em CPP\_EEMACROS.h} Line 1253  in the header file {\em CPP\_EEMACROS.h}
1253  \_EXCH operation plays a crucial role in scaling to small tile,  \_EXCH operation plays a crucial role in scaling to small tile,
1254  large logical and physical processor count configurations.  large logical and physical processor count configurations.
1255  The example in section \ref{sec:jam_example} discusses defining an  The example in section \ref{sec:jam_example} discusses defining an
1256  optimised and specialized form on the \_EXCH operation.  optimized and specialized form of the \_EXCH operation.
1257    
1258  The \_EXCH operation is also central to supporting grids such as  The \_EXCH operation is also central to supporting grids such as
1259  the cube-sphere grid. In this class of grid a rotation may be required  the cube-sphere grid. In this class of grid a rotation may be required
1260  between tiles. Aligning the coordinate requiring rotation with the  between tiles. Aligning the coordinate requiring rotation with the
1261  tile decomposistion, allows the coordinate transformation to  tile decomposition, allows the coordinate transformation to
1262  be embedded within a custom form of the \_EXCH primitive.  be embedded within a custom form of the \_EXCH primitive.
1263    
1264  \item {\bf Reverse Mode}  \item {\bf Reverse Mode}
1265  The communication primitives \_EXCH and \_GSUM both employ  The communication primitives \_EXCH and \_GSUM both employ
1266  hand-written adjoint (or reverse mode) forms.  hand-written adjoint (or reverse mode) forms.
1267  These reverse mode forms can be found in the  These reverse mode forms can be found in the
1268  sourc code directory {\em pkg/autodiff}.  source code directory {\em pkg/autodiff}.
1269  For the global sum primitive the reverse mode form  For the global sum primitive the reverse mode form
1270  calls are to {\em GLOBAL\_ADSUM\_R4} and  calls are to {\em GLOBAL\_ADSUM\_R4} and
1271  {\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the  {\em GLOBAL\_ADSUM\_R8}. The reverse mode form of the
1272  exchamge primitives are found in routines  exchange primitives are found in routines
1273  prefixed {\em ADEXCH}. The exchange routines make calls to  prefixed {\em ADEXCH}. The exchange routines make calls to
1274  the same low-level communication primitives as the forward mode  the same low-level communication primitives as the forward mode
1275  operations. However, the routine argument {\em simulationMode}  operations. However, the routine argument {\em simulationMode}
# Line 1281  The variable {\em MAX\_NO\_THREADS} is u Line 1281  The variable {\em MAX\_NO\_THREADS} is u
1281  maximum number of OS threads that a code will use. This  maximum number of OS threads that a code will use. This
1282  value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.  value defaults to thirty-two and is set in the file {\em EEPARAMS.h}.
1283  For single threaded execution it can be reduced to one if required.  For single threaded execution it can be reduced to one if required.
1284  The va;lue is largely private to the WRAPPER and application code  The value is largely private to the WRAPPER and application code
1285  will not normally reference the value, except in the following scenario.  will not normally reference the value, except in the following scenario.
1286    
1287  For certain physical parametrization schemes it is necessary to have  For certain physical parametrization schemes it is necessary to have
# Line 1296  and {\em nSy} ( as described in section Line 1296  and {\em nSy} ( as described in section
1296  being specified involves many more tiles than OS threads then  being specified involves many more tiles than OS threads then
1297  it can save memory resources to reduce the variable  it can save memory resources to reduce the variable
1298  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads that
1299  will be used and to declare the physical parameterisation  will be used and to declare the physical parameterization
1300  work arrays with a sinble {\em MAX\_NO\_THREADS} extra dimension.  work arrays with a single {\em MAX\_NO\_THREADS} extra dimension.
1301  An example of this is given in the verification experiment  An example of this is given in the verification experiment
1302  {\em aim.5l\_cs}. Here the default setting of  {\em aim.5l\_cs}. Here the default setting of
1303  {\em MAX\_NO\_THREADS} is altered to  {\em MAX\_NO\_THREADS} is altered to
# Line 1310  created with declarations of the form. Line 1310  created with declarations of the form.
1310  \begin{verbatim}  \begin{verbatim}
1311        common /FORCIN/ sst1(ngp,MAX_NO_THREADS)        common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1312  \end{verbatim}  \end{verbatim}
1313  This declaration scheme is not used widely, becuase most global data  This declaration scheme is not used widely, because most global data
1314  is used for permanent not temporary storage of state information.  is used for permanent not temporary storage of state information.
1315  In the case of permanent state information this approach cannot be used  In the case of permanent state information this approach cannot be used
1316  because there has to be enough storage allocated for all tiles.  because there has to be enough storage allocated for all tiles.
1317  However, the technique can sometimes be a useful scheme for reducing memory  However, the technique can sometimes be a useful scheme for reducing memory
1318  requirements in complex physical paramterisations.  requirements in complex physical parameterizations.
1319  \end{enumerate}  \end{enumerate}
1320    
1321  \begin{figure}  \begin{figure}
# Line 1348  MP directives to spawn multiple threads. Line 1348  MP directives to spawn multiple threads.
1348  The isolation of performance critical communication primitives and the  The isolation of performance critical communication primitives and the
1349  sub-division of the simulation domain into tiles is a powerful tool.  sub-division of the simulation domain into tiles is a powerful tool.
1350  Here we show how it can be used to improve application performance and  Here we show how it can be used to improve application performance and
1351  how it can be used to adapt to new gridding approaches.  how it can be used to adapt to new gridding approaches.
1352    
1353  \subsubsection{JAM example}  \subsubsection{JAM example}
1354  \label{sec:jam_example}  \label{sec:jam_example}
