% manual/s_software/text/sarch.tex (revision 1.26)
% ...
operate. The description presented here is intended to be a detailed
exposition and contains significant background material, as well as
advanced details on working with the WRAPPER.  The tutorial sections
of this manual (see sections \ref{sec:modelExamples} and
\ref{sec:tutorialIII}) contain more succinct, step-by-step
instructions on running basic numerical experiments, of various types,
both sequentially and in parallel. For many projects simply starting
from an example code and adapting it to suit a particular situation
% ...

\begin{figure}
\begin{center}
\resizebox{!}{2.5in}{\includegraphics{s_software/figs/mitgcm_goals.eps}}
\end{center}
\caption{ The MITgcm architecture is designed to allow simulation of a
  wide range of physical problems on a wide range of hardware. The
% ...

``fit'' within the WRAPPER infrastructure. Writing code to ``fit''
within the WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
section \ref{sec:specifying_a_decomposition}).

The approach taken by the WRAPPER is illustrated in figure
\ref{fig:fit_in_wrapper}, which shows how the WRAPPER serves to
% ...

\begin{figure}
\begin{center}
\resizebox{!}{4.5in}{\includegraphics{s_software/figs/fit_in_wrapper.eps}}
\end{center}
\caption{
Numerical code is written to fit within a software support
% ...

\end{figure}

\subsection{Target hardware}
\label{sec:target_hardware}

The WRAPPER is designed to target as broad a range of computer
systems as possible.  The original development of the WRAPPER took place
% ...

IBM SP systems). In all cases numerical code, operating within the
WRAPPER, performs and scales very competitively with equivalent
numerical code that has been modified to contain native optimizations
for a particular system \cite{hoe-hill:99}.

\subsection{Supporting hardware neutrality}

The different systems listed in section \ref{sec:target_hardware} can
be categorized in many different ways. For example, one common
distinction is between shared-memory parallel systems (SMP and PVP)
and distributed-memory parallel systems (for example x86 clusters and
% ...

scientific computing community.

\subsection{Machine model parallelism}
\label{sec:domain_decomposition}
\begin{rawhtml}
<!-- CMIREDIR:domain_decomp: -->
\end{rawhtml}
% ...

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/domain_decomp.eps}
 }
\end{center}
\caption{ The WRAPPER provides support for one and two dimensional
% ...

domain it owns.  Periodically processors will make calls to WRAPPER
functions to communicate data between tiles, in order to keep the
overlap regions up to date (see section
\ref{sec:communication_primitives}).  The WRAPPER functions can use a
variety of different mechanisms to communicate data between tiles.
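As an illustration of this pattern, the sketch below updates the
interior points of the tiles a thread owns and then calls a WRAPPER
exchange to bring the overlap regions up to date. The loop bounds and
the exchange routine follow WRAPPER naming conventions, but the
fragment should be read as a schematic rather than verbatim model code:

\begin{verbatim}
C     Step forward the interior of every tile this thread owns.
      DO bj = myByLo(myThid), myByHi(myThid)
       DO bi = myBxLo(myThid), myBxHi(myThid)
        DO j = 1, sNy
         DO i = 1, sNx
          phi(i,j,bi,bj) = phi(i,j,bi,bj)
     &                   + deltaT*gPhi(i,j,bi,bj)
         ENDDO
        ENDDO
       ENDDO
      ENDDO
C     Refresh overlap regions: each tile receives the interior
C     values of its neighbours, whatever mechanism is in use.
      CALL EXCH_XY_RL( phi, myThid )
\end{verbatim}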

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiled-world.eps}
 }
\end{center}
\caption{ A global grid subdivided into tiles.
% ...

  call a function in the API of the communication library to
  communicate data from a tile that it owns to a tile that another CPU
  owns. By default the WRAPPER binds to the MPI communication library
  \cite{MPI-std-20} for this style of communication.
\end{itemize}

The WRAPPER assumes that communication will use one of these two styles
% ...

\end{figure}

\subsection{Shared memory communication}
\label{sec:shared_memory_communication}

Under shared memory communication independent CPUs operate on the
exact same global address space at the application level.  This means
% ...

appropriately.

\subsubsection{Memory consistency}
\label{sec:memory_consistency}

When using shared memory communication between multiple processors the
WRAPPER level shields user applications from certain counter-intuitive
% ...

particular platform.

\subsubsection{Cache effects and false sharing}
\label{sec:cache_effects_and_false_sharing}

Shared-memory machines often have processor-local memory caches
which contain mirrored copies of main memory.  Automatic cache-coherence
% ...

the standard mechanism for supporting shared memory that the WRAPPER
utilizes. Configuring and launching code to run in multi-threaded mode
on specific platforms is discussed in section
\ref{sec:multi_threaded_execution}.  However, on many systems,
potentially very efficient mechanisms for using shared memory
communication between multiple processes (in contrast to multiple
threads within a single process) also exist. In most cases this works
by making a limited region of memory shared between processes. The
MMAP and IPC facilities in UNIX systems provide this capability, as do
vendor-specific tools like LAPI and IMC. Extensions exist for the
WRAPPER that allow these mechanisms to be used for shared memory
communication. However, these mechanisms are not distributed with the
default WRAPPER sources, because of their proprietary nature.

\subsection{Distributed memory communication}
\label{sec:distributed_memory_communication}
Many parallel systems are not constructed in a way where it is
possible or practical for an application to use shared memory for
communication. For example cluster systems consist of individual
% ...

communication library used is MPI \cite{MPI-std-20}. However, it is
relatively straightforward to implement bindings to optimized platform
specific communication libraries. For example the work described in
\cite{hoe-hill:99} replaced standard MPI communication with a highly
optimized library.
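To make the notion of a binding concrete, the sketch below shows how a
global-sum primitive can be implemented on top of standard MPI. The
routine name and argument list here are illustrative only, not the
actual WRAPPER interface:

\begin{verbatim}
      SUBROUTINE GLOBAL_SUM_SKETCH( sumVal )
C     Illustrative sketch: combine per-process partial sums and
C     return the global total to every process via MPI.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      REAL*8 sumVal
      REAL*8 globalVal
      INTEGER mpiRC
      CALL MPI_ALLREDUCE( sumVal, globalVal, 1,
     &                    MPI_DOUBLE_PRECISION, MPI_SUM,
     &                    MPI_COMM_WORLD, mpiRC )
      sumVal = globalVal
      RETURN
      END
\end{verbatim}

An optimized platform-specific binding would keep this calling
interface but route the reduction through the native library instead.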

\subsection{Communication primitives}
\label{sec:communication_primitives}

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/comm-primm.eps}
 }
\end{center}
\caption{Three performance critical parallel primitives are provided
% ...

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/tiling_detail.eps}
 }
\end{center}
\caption{The tiling strategy that the WRAPPER supports allows tiles
% ...

  computing CPUs.
\end{enumerate}
This section describes the details of each of these operations.
Section \ref{sec:specifying_a_decomposition} explains how to express
the way in which a domain is decomposed (or composed). Section
\ref{sec:starting_the_code} describes practical details of running
codes in various different parallel modes on contemporary computer
systems.  Section \ref{sec:controlling_communication} explains the
internal information that the WRAPPER uses to control how information
is communicated between tiles.

\subsection{Specifying a domain decomposition}
\label{sec:specifying_a_decomposition}

At its heart much of the WRAPPER works only in terms of a collection
of interconnected tiles. This is also true of application
% ...

\begin{figure}
\begin{center}
 \resizebox{5in}{!}{
  \includegraphics{s_software/figs/size_h.eps}
 }
\end{center}
\caption{ The three level domain decomposition hierarchy employed by the
% ...

dimensions of {\em sNx} and {\em sNy}. If, when the code is executed,
these tiles are allocated to different threads of a process that are
then bound to different physical processors (see the multi-threaded
execution discussion in section \ref{sec:starting_the_code}) then
computation will be performed concurrently on each tile. However, it
is also possible to run the same decomposition within a process
running a single thread on a single processor. In this case the tiles
will be computed sequentially.
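For orientation, the sketch below shows the style of {\em SIZE.h}
settings that express such a decomposition. The values are purely
illustrative: they describe a $90 \times 40$ point domain split into
$2 \times 2$ tiles of $45 \times 20$ points held within a single
process:

\begin{verbatim}
C     Illustrative SIZE.h fragment (example values only)
      INTEGER sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy
      INTEGER Nx, Ny, Nr
      PARAMETER (
     &           sNx =  45,
     &           sNy =  20,
     &           OLx =   3,
     &           OLy =   3,
     &           nSx =   2,
     &           nSy =   2,
     &           nPx =   1,
     &           nPy =   1,
     &           Nx  = sNx*nSx*nPx,
     &           Ny  = sNy*nSy*nPy,
     &           Nr  =  15 )
\end{verbatim}

Running this decomposition with four threads assigns one tile to each
thread; running it single-threaded simply computes the four tiles in
sequence.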
% ...

This set of values can be used for a cube sphere calculation.
Each tile of size $32 \times 32$ represents a face of the
cube. Initializing the tile connectivity correctly (see section
\ref{sec:cube_sphere_communication}) allows the rotations associated with
moving between the six cube faces to be embedded within the
tile-tile communication code.
\end{enumerate}

\subsection{Starting the code}
\label{sec:starting_the_code}
When code is started under the WRAPPER, execution begins in a main
routine {\em eesupp/src/main.F} that is owned by the WRAPPER. Control
is transferred to the application through a routine called
{\em THE\_MODEL\_MAIN()}
% ...

\end{figure}

\subsubsection{Multi-threaded execution}
\label{sec:multi_threaded_execution}
Prior to transferring control to the procedure {\em THE\_MODEL\_MAIN()} the
WRAPPER may cause several coarse-grained threads to be initialized. The
routine {\em THE\_MODEL\_MAIN()} is called once for each thread and is
passed a single stack argument which is the thread number, stored in the
variable {\em myThid}. In addition to specifying a decomposition with
multiple tiles per process (see section \ref{sec:specifying_a_decomposition}),
configuring and starting a code to run using multiple threads requires
the following steps.\\

% ...

} \\

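For reference, the thread counts are read at run time from the {\em
eedata} file; a minimal sketch of the relevant entries (values
illustrative) is:

\begin{verbatim}
# Illustrative "eedata" fragment: each process runs
# nTx*nTy = 4 threads.
 &EEPARMS
 nTx=2,
 nTy=2,
 &
\end{verbatim}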
\subsubsection{Multi-process execution}
\label{sec:multi_process_execution}

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution. One major reason for
% ...

Multi-process execution is more widespread.  In order to run code in a
multi-process configuration a decomposition specification (see section
\ref{sec:specifying_a_decomposition}) is given (in which at least
one of the parameters {\em nPx} or {\em nPy} will be greater than one)
and then, as for multi-threaded operation, appropriate compile time
and run time steps must be taken.
% ...

  ALLOW\_USE\_MPI} and {\em ALWAYS\_USE\_MPI} flags in the {\em
  CPP\_EEOPTIONS.h} file.)  More detailed information about the use of
{\em genmake2} for specifying
local compiler flags is located in section \ref{sec:genmake}.\\
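As an illustration, a typical multi-process build-and-launch sequence
has the following shape (the tool path, process count and MPI
launcher are all site-specific):

\begin{verbatim}
% genmake2 -mpi -mods=../code
% make depend
% make
% mpirun -np 4 ./mitgcmuv
\end{verbatim}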

\fbox{
% ...
}

\subsection{Controlling communication}
\label{sec:controlling_communication}
The WRAPPER maintains internal information that is used for communication
operations and that can be customized for different platforms. This section
describes the information that is held and used.
% ...

  a particular face. A value of {\em COMM\_MSG} is used to indicate
  that some form of distributed memory communication is required to
  communicate between these tile faces (see section
  \ref{sec:distributed_memory_communication}).  A value of {\em
    COMM\_PUT} or {\em COMM\_GET} is used to indicate forms of shared
  memory communication (see section
  \ref{sec:shared_memory_communication}). The {\em COMM\_PUT} value
  indicates that a CPU should communicate by writing to data
  structures owned by another CPU. A {\em COMM\_GET} value indicates
  that a CPU should communicate by reading from data structures owned
% ...

  the file {\em eedata}. If the value of {\em nThreads} is
  inconsistent with the number of threads requested from the operating
  system (for example by using an environment variable as described in
  section \ref{sec:multi_threaded_execution}) then usually an error
  will be reported by the routine {\em CHECK\_THREADS}.

\fbox{
% ...
}

\item {\bf memsync flags}
  As discussed in section \ref{sec:memory_consistency}, a low-level
  system function may be needed to force memory consistency on some
  shared memory systems.  The routine {\em MEMSYNC()} is used for this
  purpose. This routine should not need modifying and the information
% ...

asm("lock; addl $0,0(%%esp)": : :"memory")
\end{verbatim}
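Within WRAPPER Fortran code the barrier is reached through a plain
subroutine call; a minimal usage sketch is:

\begin{verbatim}
C     Make this thread's writes to shared buffers visible to
C     other threads before they are signalled to proceed.
      CALL MEMSYNC
\end{verbatim}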

\item {\bf Cache line size}
  As discussed in section \ref{sec:cache_effects_and_false_sharing},
  multi-threaded codes explicitly avoid penalties associated with
  excessive coherence traffic on an SMP system. To do this the shared
  memory data structures used by the {\em GLOBAL\_SUM}, {\em
% ...

    CPP\_EEMACROS.h}.  The \_GSUM macro is a performance critical
  operation, especially for large processor count, small tile size
  configurations.  The custom communication example discussed in
  section \ref{sec:jam_example} shows how the macro is used to invoke
  a custom global sum routine for a specific set of hardware.
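  The rebinding itself is a one-line CPP change. The sketch below
  shows the general shape; the macro body and the replacement routine
  name are illustrative, not the actual contents of {\em
  CPP\_EEMACROS.h}:

\begin{verbatim}
C     Default binding (illustrative):
#define _GSUM(a,b) CALL GLOBAL_SUM_R8( a, b )
C     Rebound to a hypothetical platform-specific routine:
C#define _GSUM(a,b) CALL GSUM_JAM( a, b )
\end{verbatim}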

\item {\bf \_EXCH}
% ...

  the header file {\em CPP\_EEMACROS.h}. As with \_GSUM, the \_EXCH
  operation plays a crucial role in scaling to small tile, large
  logical and physical processor count configurations.  The example in
  section \ref{sec:jam_example} discusses defining an optimized and
  specialized form of the \_EXCH operation.

  The \_EXCH operation is also central to supporting grids such as the
% ...

  if this mechanism is unavailable then the work arrays can be extended
  with dimensions using the tile dimensioning scheme of {\em nSx} and
  {\em nSy} (as described in section
  \ref{sec:specifying_a_decomposition}). However, if the
  configuration being specified involves many more tiles than OS
  threads then it can save memory resources to reduce the variable
  {\em MAX\_NO\_THREADS} to be equal to the actual number of threads
% ...

how it can be used to adapt to new gridding approaches.

\subsubsection{JAM example}
\label{sec:jam_example}
On some platforms a big performance boost can be obtained by binding
the communication routines {\em \_EXCH} and {\em \_GSUM} to
specialized native libraries (for example, the shmem library on CRAY
% ...

pattern.

\subsubsection{Cube sphere communication}
\label{sec:cube_sphere_communication}
Actual {\em \_EXCH} routine code is generated automatically from a
series of template files, for example {\em exch\_rx.template}.  This
is done to allow a large number of variations on the exchange process
% ...

Fitting together the WRAPPER elements, package elements and
MITgcm core equation elements of the source code produces the calling
sequence shown in section \ref{sec:calling_sequence}.

\subsection{Annotated call tree for MITgcm and WRAPPER}
\label{sec:calling_sequence}

WRAPPER layer.
