$Header$

Miscellaneous notes relating to MITgcm UV
=========================================

o Something really weird is happening - variables keep
  changing value!

  Apart from the usual problems of out-of-bounds array references
  and various bugs, it is important to be sure that "stack"
  variables really are stack variables in multi-threaded execution.
  Some compilers put a subroutine's local variables in static storage.
  This can result in an apparently private variable in a local
  routine being mysteriously changed by a concurrently executing
  thread.
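  A minimal sketch of the hazard (THREAD_DEMO and tmpVal are hypothetical
  names, not MITgcm routines): tmpVal must be private to each calling
  thread, and declaring the routine RECURSIVE (Fortran 90) or using the
  compiler's automatic/stack-variable option is one way to force local
  variables onto the stack.

C     Hypothetical routine illustrating the problem. If tmpVal were
C     placed in static storage, two threads calling THREAD_DEMO at the
C     same time could overwrite each other's copy. The RECURSIVE
C     attribute obliges the compiler to allocate locals on the stack.
      RECURSIVE SUBROUTINE THREAD_DEMO( myThid, result )
      INTEGER myThid
      REAL    result
      REAL    tmpVal
      tmpVal = FLOAT( myThid )
      result = tmpVal*tmpVal
      RETURN
      END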

=====================================

o Something really weird is happening - the code gets stuck in
  a loop somewhere!

  The routines in barrier.F should be compiled without any
  optimisation. These routines check variables that are updated by other
  threads. Compiler optimisations generally assume that the code being
  optimised will obey the sequential semantics of regular Fortran. That
  means they will assume that a variable is not going to change value
  unless the code being optimised changes it. Obviously this can cause
  problems.

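  A minimal sketch of the kind of spin-wait that breaks under such
  optimisation (barrierCount and nThreads are illustrative names, not
  the actual variables in barrier.F):

C     Hypothetical spin-wait. Other threads increment barrierCount in
C     the common block; if the optimiser caches barrierCount in a
C     register, the loop below never sees the updates and spins forever.
      COMMON /DEMO_BARRIER/ barrierCount
      INTEGER barrierCount
      INTEGER nThreads
      nThreads = 4
   10 CONTINUE
      IF ( barrierCount .LT. nThreads ) GOTO 10
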
=====================================

o Is the Fortran SAVE statement a problem?

  Yes. On the whole the Fortran SAVE statement should not be used
  for data in a multi-threaded code. SAVE causes data to be held in
  static storage, meaning that all threads will see the same location.
  Therefore, in general, if one thread updates the location all other
  threads will see the update. Note - there is often no specification of
  what should happen in this situation in a multi-threaded environment,
  so this is not a robust mechanism for sharing data.
  For most cases where SAVE might be appropriate, either of the following
  recipes should be used instead. Both schemes are potential
  performance bottlenecks if they are over-used.
  Method 1
  ********
  1. Put the SAVE variable in a common block.
  2. Update the SAVE variable in a _BEGIN_MASTER, _END_MASTER block.
  3. Include a _BARRIER after the _BEGIN_MASTER, _END_MASTER block.
  e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter

      _BEGIN_MASTER(myThid)
      nIter = nIter+1
      _END_MASTER(myThid)
      _BARRIER

  Note. The _BARRIER operation is potentially expensive. Be conservative
  in your use of this scheme.

  Method 2
  ********
  1. Put the SAVE variable in a common block but with an extra dimension
     for the thread number.
  2. Change the updates and references to the SAVE variable to a
     per-thread basis.
  e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter(MAX_NO_THREADS)

      nIter(myThid) = nIter(myThid)+1

  Note. nIter(myThid) and nIter(myThid+1) will share the same
  cache line. The update will cause extra low-level memory
  traffic to maintain cache coherence. If the update is in
  a tight loop this will be a problem and nIter will need
  padding.
  In a NUMA system nIter(1:MAX_NO_THREADS) is likely to reside
  in a single page of physical memory on a single box. Again, in
  a tight loop this would cause lots of remote/far memory references
  and would be a problem. Some compilers provide a mechanism
  for helping overcome this problem.
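  A minimal sketch of one such padding scheme (CACHE_LINE_WORDS is an
  illustrative constant, not an MITgcm parameter; choose it to match the
  target machine's cache-line size):

C     Padded variant of Method 2 (illustrative sizes). Each thread's
C     counter occupies its own slot of CACHE_LINE_WORDS integers
C     (64 bytes here), so updates by different threads fall on
C     different cache lines.
      INTEGER CACHE_LINE_WORDS
      PARAMETER ( CACHE_LINE_WORDS = 16 )
      COMMON /PARAMS/ nIter
      INTEGER nIter(CACHE_LINE_WORDS,MAX_NO_THREADS)

      nIter(1,myThid) = nIter(1,myThid)+1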

=====================================

o Can I debug using write statements?

  Many systems do not have "thread-safe" Fortran I/O libraries.
  On these systems I/O generally works, but output from different threads
  gets a bit intermingled! Occasionally doing multi-threaded I/O with an
  unsafe Fortran I/O library will actually cause the program to fail.
  Note: SGI has a "thread-safe" Fortran I/O library.
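  One simple precaution, using the _BEGIN_MASTER/_END_MASTER macros shown
  above, is to let a single thread do the debug write (the unit number
  and message here are placeholders):

C     Debug print issued by one thread only, so output from different
C     threads cannot interleave within the same statement.
      _BEGIN_MASTER(myThid)
      WRITE(6,*) 'DEBUG: nIter = ', nIter
      _END_MASTER(myThid)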

=====================================

o Mapping virtual memory to physical memory.

  The current code declares arrays as
      real aW2d (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
  This raises an issue on shared virtual-memory machines that have
  an underlying non-uniform memory subsystem, e.g. HP Exemplar, SGI
  Origin, DG, Sequent etc. What most machines implement is a scheme
  in which the physical memory that backs the virtual memory is
  allocated on a page basis at run-time. The OS manages this allocation
  and, without exception, pages are assigned to physical memory on the
  box where the thread which caused the page-fault is running. Pages
  are typically 4-8KB in size. This means that in some environments it
  would make sense to declare arrays as
      real aW2d (1-OLx:sNx+OLx+PX,1-OLy:sNy+OLy+PY,nSx,nSy)
  where PX and PY are chosen so that the divides between near and
  far memory coincide with the boundaries of the virtual memory
  regions a thread works on. In principle this is easy, but it is
  also inelegant and really one would like the OS/hardware to take
  care of this issue. Doing it oneself requires PX and PY to be
  recalculated whenever the mapping of the nSx, nSy blocks to nTx and
  nTy threads is changed. Also, different PX and PY are required
  depending on
    - page size
    - array element size ( real*4, real*8 )
    - array dimensions ( 2d, 3d Nz, 3d Nz+1 ) - in 3d a PZ would also
      be needed!
  Note: 1. A C implementation would be a lot easier. An F90
           implementation using allocation would also be fairly
           straightforward.
        2. The padding really ought to be between the "collection" of
           blocks that all the threads using the same near memory work
           on; putting it there would waste less memory. The PX, PY, PZ
           mechanism pads three levels further down the hierarchy and so
           wastes more memory.
        3. For large problems this is less of an issue. For a large
           problem, even for a 2d array, there might be say 16 pages per
           array per processor and at least 4 processors in a uniform
           memory access box. Assuming a sensible mapping of processors
           to blocks, only one page (1.5% of the memory) would be
           referenced by processors in another box.
           On the other hand, for a very small per-processor problem
           size, e.g. 32x32 per processor and again four processors per
           box, as many as 50% of the memory references could be to far
           memory for 2d fields. This could be very bad!
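  As a rough worked example of choosing PX and PY (all numbers here are
  illustrative assumptions, not MITgcm defaults): with real*8 data,
  sNx=sNy=32 and OLx=OLy=3, one unpadded 2d block occupies
  (32+6)*(32+6)*8 = 11552 bytes, which is not a whole number of 8KB
  pages. Padding with PX=26 and PY=10 makes each block 64*48*8 = 24576
  bytes, exactly three 8KB pages, so successive blocks start at page
  boundaries (assuming the array itself starts on one), at the cost of
  the extra memory discussed in note 2.

C     Illustrative padding only - PX and PY must be recomputed for the
C     actual page size, element size and block dimensions in use.
      INTEGER PX, PY
      PARAMETER ( PX = 26, PY = 10 )
      real aW2d (1-OLx:sNx+OLx+PX,1-OLy:sNy+OLy+PY,nSx,nSy)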

=====================================
