$Header: /u/gcmpack/models/MITgcmUV/doc/notes,v 1.2 1998/04/24 02:36:52 cnh Exp $

Miscellaneous notes relating to MITgcm UV
=========================================

This file's form is close to that of an FAQ. If you are having
a problem getting the model to behave as you might expect, there
may be some helpful clues in this file.


o Something really weird is happening - variables keep
  changing value!

  Apart from the usual problems of out-of-bounds array references
  and various bugs, it is important to be sure that "stack"
  variables really are stack variables in multi-threaded execution.
  Some compilers put a subroutine's local variables in static storage.
  This can result in an apparently private variable in a local
  routine being mysteriously changed by a concurrently executing
  thread.

=====================================

o Something really weird is happening - the code gets stuck in
  a loop somewhere!

  The routines in barrier.F should be compiled without any
  optimisation. These routines check variables that are updated by
  other threads. Compiler optimisations generally assume that the
  code being optimised will obey the sequential semantics of regular
  Fortran. That means they will assume that a variable is not going
  to change value unless the code being optimised changes it.
  Obviously this can cause problems.

=====================================

o Is the Fortran SAVE statement a problem?

  Yes. On the whole the Fortran SAVE statement should not be used
  for data in a multi-threaded code. SAVE causes data to be held in
  static storage, meaning that all threads will see the same
  location. Therefore, generally, if one thread updates the location
  all other threads will see the update. Note - there is often no
  specification for what should happen in this situation in a
  multi-threaded environment, so this is not a robust mechanism for
  sharing data.
  For most cases where SAVE might be appropriate, either of the
  following recipes should be used instead. Both schemes are
  potential performance bottlenecks if they are over-used.
  Method 1
  ********
  1. Put the SAVE variable in a common block.
  2. Update the SAVE variable in a _BEGIN_MASTER, _END_MASTER block.
  3. Include a _BARRIER after the _BEGIN_MASTER, _END_MASTER block.
  e.g.
  C     nIter - Current iteration counter
        COMMON /PARAMS/ nIter
        INTEGER nIter

        _BEGIN_MASTER(myThid)
         nIter = nIter+1
        _END_MASTER(myThid)
        _BARRIER

  Note. The _BARRIER operation is potentially expensive. Be
  conservative in your use of this scheme.
| 65 |
|
| 66 |
Method 2 |
| 67 |
******** |
| 68 |
1. Put the SAVE variable in a common block but with an extra dimension |
| 69 |
for the thread number. |
| 70 |
2. Change the updates and references to the SAVE variable to a per thread |
| 71 |
basis. |
| 72 |
e.g |
| 73 |
C nIter - Current iteration counter |
| 74 |
COMMON /PARAMS/ nIter |
| 75 |
INTEGER nIter(MAX_NO_THREADS) |
| 76 |
|
| 77 |
nIter(myThid) = nIter(myThid)+1 |
| 78 |
|
| 79 |
Note. nIter(myThid) and nIter(myThid+1) will share the same |
| 80 |
cache line. The update will cause extra low-level memory |
| 81 |
traffic to maintain cache coherence. If the update is in |
| 82 |
a tight loop this will be a problem and nIter will need |
| 83 |
padding. |
| 84 |
In a NUMA system nIter(1:MAX_NO_THREADS) is likely to reside |
| 85 |
in a single page of physical memory on a single box. Again in |
| 86 |
a tight loop this would cause lots of remote/far memory references |
| 87 |
and would be a problem. Some compilers provide a machanism |
| 88 |
for helping overcome this problem. |

=====================================

o Can I debug using write statements?

  Many systems do not have "thread-safe" Fortran I/O libraries.
  On these systems I/O generally works, but it gets a bit
  intermingled! Occasionally, doing multi-threaded I/O with an
  unsafe Fortran I/O library will actually cause the program to
  fail. Note: SGI has a "thread-safe" Fortran I/O library.

=====================================

o Mapping virtual memory to physical memory.

  The current code declares arrays as
      real aW2d (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
  This raises an issue on shared virtual-memory machines that have
  an underlying non-uniform memory subsystem, e.g. HP Exemplar, SGI
  Origin, DG, Sequent etc. What most machines implement is a scheme
  in which the physical memory that backs the virtual memory is
  allocated on a page basis at run-time. The OS manages this
  allocation and, without exception, pages are assigned to physical
  memory on the box where the thread which caused the page-fault is
  running. Pages are typically 4-8KB in size. This means that in
  some environments it would make sense to declare arrays
      real aW2d (1-OLx:sNx+OLx+PX,1-OLy:sNy+OLy+PY,nSx,nSy)
  where PX and PY are chosen so that the divides between near and
  far memory will coincide with the boundaries of the virtual memory
  regions a thread works on. In principle this is easy, but it is
  also inelegant, and really one would like the OS/hardware to take
  care of this issue. Doing it oneself requires PX and PY to be
  recalculated whenever the mapping of the nSx, nSy blocks to nTx
  and nTy threads is changed. Different PX and PY are also required
  depending on
      page size
      array element size ( real*4, real*8 )
      array dimensions ( 2d, 3d Nz, 3d Nz+1 ) - in 3d a PZ would
      also be needed!
  Note: 1. A C implementation would be a lot easier. An F90
           implementation using allocation would also be fairly
           straightforward.
        2. The padding really ought to sit between the "collections"
           of blocks that the threads sharing the same near memory
           work on; to save on wasted memory, padding should appear
           only between these collections. The PX, PY, PZ mechanism
           instead pads three levels down in the hierarchy, which
           wastes more memory.
        3. For large problems this is less of an issue. For a large
           problem, even for a 2d array there might be, say, 16
           pages per array per processor and at least 4 processors
           in a uniform-memory-access box. Assuming a sensible
           mapping of processors to blocks, only one page (1.5% of
           the memory) is referenced by processors in another box.
           On the other hand, for a very small per-processor problem
           size, e.g. 32x32 per processor and again four processors
           per box, as many as 50% of the memory references could be
           to far memory for 2d fields. This could be very bad!

=====================================