$Header: /u/gcmpack/models/MITgcmUV/doc/notes,v 1.2 1998/04/24 02:36:52 cnh Exp $

Miscellaneous notes relating to MITgcm UV
=========================================

This file's form is close to that of an FAQ. If you are having
a problem getting the model to behave as you might expect, there
may be some helpful clues in this file.


o Something really weird is happening - variables keep
changing value!

Apart from the usual problems of out-of-bounds array references
and various bugs, it is important to be sure that "stack"
variables really are stack variables in multi-threaded execution.
Some compilers place a subroutine's local variables in static storage.
This can result in an apparently private variable in a local
routine being mysteriously changed by a concurrently executing
thread.

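A minimal sketch of the failure mode (the routine and variable names
are invented for illustration): if the compiler gives locVal static
storage, all threads share the one copy and concurrent calls can
overwrite each other's value.

      SUBROUTINE CHECK_LOCAL( myThid )
C     locVal looks private to each call, but in static storage it is
C     a single location shared by every thread.
      INTEGER myThid
      INTEGER locVal
      locVal = myThid
C     ... enough work here for another thread to enter this routine ...
      IF ( locVal .NE. myThid ) THEN
       WRITE(*,*) ' thread ', myThid, ' saw corrupted local ', locVal
      ENDIF
      RETURN
      END

Many compilers have an option that forces local variables onto the
stack; the flag name varies by vendor, so check the compiler
documentation.
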
===================================== |

o Something really weird is happening - the code gets stuck in
a loop somewhere!

The routines in barrier.F should be compiled without any
optimisation. These routines poll variables that are updated by other
threads. Compiler optimisations generally assume that the code being
optimised obeys the sequential semantics of regular Fortran: they
assume a variable will not change value unless the code being
optimised changes it, so the read of a flag set by another thread can
be hoisted out of a polling loop. Obviously this can cause problems.

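A minimal sketch of the kind of loop involved (this is not the actual
barrier.F code; the subroutine, common block and flag names are
invented):

C     threadFlag is set to 1 by another thread.  An optimiser that
C     assumes sequential semantics may load threadFlag into a register
C     once and spin on the stale value forever, which is why barrier.F
C     must be compiled unoptimised (or the flag declared VOLATILE
C     where the compiler supports that extension).
      SUBROUTINE WAIT_FOR_FLAG
      INTEGER threadFlag
      COMMON /SYNC_VARS/ threadFlag
   10 CONTINUE
      IF ( threadFlag .EQ. 0 ) GOTO 10
      RETURN
      END
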
===================================== |

o Is the Fortran SAVE statement a problem?

Yes. On the whole the Fortran SAVE statement should not be used
for data in multi-threaded code. SAVE causes data to be held in
static storage, meaning that all threads see the same location.
Therefore, generally, if one thread updates the location all other
threads will see the update. Note - there is often no specification of
what should happen in this situation in a multi-threaded environment,
so this is not a robust mechanism for sharing data.
For most cases where SAVE might be appropriate, one of the following
two recipes should be used instead. Both schemes are potential
performance bottlenecks if they are over-used.
Method 1
********
1. Put the SAVE variable in a common block.
2. Update the SAVE variable in a _BEGIN_MASTER, _END_MASTER block.
3. Include a _BARRIER after the _BEGIN_MASTER, _END_MASTER block.
e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter

      _BEGIN_MASTER(myThid)
      nIter = nIter+1
      _END_MASTER(myThid)
      _BARRIER

Note. The _BARRIER operation is potentially expensive. Be conservative
in your use of this scheme.
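
Once the _BARRIER completes, every thread can safely read the value
the master thread wrote, for example (nTimeSteps is just an
illustrative name, not a model variable):

C     All threads see the updated counter only after the _BARRIER.
      IF ( nIter .GE. nTimeSteps ) THEN
       WRITE(*,*) ' run complete at iteration ', nIter
      ENDIF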

Method 2
********
1. Put the SAVE variable in a common block, but with an extra dimension
for the thread number.
2. Change the updates and references to the SAVE variable to a
per-thread basis.
e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter(MAX_NO_THREADS)

      nIter(myThid) = nIter(myThid)+1

Note. nIter(myThid) and nIter(myThid+1) will share the same
cache line. The update will cause extra low-level memory
traffic to maintain cache coherence. If the update is in
a tight loop this will be a problem and nIter will need
padding.
In a NUMA system nIter(1:MAX_NO_THREADS) is likely to reside
in a single page of physical memory on a single box. Again, in
a tight loop this would cause lots of remote/far memory references
and would be a problem. Some compilers provide a mechanism
for helping overcome this problem.

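If padding is needed, one possible layout (a sketch only; the
CACHE_LINE_WORDS parameter is illustrative and must be matched to the
target machine) gives each thread's counter its own cache line:

C     Pad so that each thread's counter occupies its own cache line;
C     32 INTEGER*4 words corresponds to a 128-byte line, for example.
      INTEGER CACHE_LINE_WORDS
      PARAMETER ( CACHE_LINE_WORDS = 32 )
      COMMON /PARAMS/ nIter
      INTEGER nIter(CACHE_LINE_WORDS,MAX_NO_THREADS)

      nIter(1,myThid) = nIter(1,myThid)+1
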
===================================== |

o Can I debug using write statements?

Many systems do not have "thread-safe" Fortran I/O libraries.
On these systems I/O generally works, but output from different
threads gets a bit intermingled! Occasionally doing multi-threaded I/O
with an unsafe Fortran I/O library will actually cause the program to
fail. Note: SGI has a "thread-safe" Fortran I/O library.

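Two sketches that make multi-threaded debug output more usable
(dbgVal is a stand-in for whatever quantity is being inspected):

C     Tag each line with the thread number so intermingled output can
C     still be attributed to a thread.
      WRITE(*,*) ' thread ', myThid, ' dbgVal = ', dbgVal

C     Or write one copy only, from the master thread, using the same
C     macros as in the SAVE recipes above.
      _BEGIN_MASTER(myThid)
      WRITE(*,*) ' dbgVal = ', dbgVal
      _END_MASTER(myThid)
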
===================================== |

o Mapping virtual memory to physical memory.

The current code declares arrays as
      real aW2d (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
This raises an issue on shared virtual-memory machines that have an
underlying non-uniform memory subsystem, e.g. HP Exemplar, SGI Origin,
DG, Sequent etc. What most machines implement is a scheme in which the
physical memory that backs the virtual memory is allocated on a page
basis at run-time. The OS manages this allocation and, without
exception, pages are assigned to physical memory on the box where the
thread which caused the page fault is running. Pages are typically
4-8KB in size. This means that in some environments it would make
sense to declare arrays
      real aW2d (1-OLx:sNx+OLx+PX,1-OLy:sNy+OLy+PY,nSx,nSy)
where PX and PY are chosen so that the divides between near and far
memory coincide with the boundaries of the virtual-memory regions a
thread works on. In principle this is easy, but it is also inelegant,
and really one would like the OS/hardware to take care of this issue.
Doing it oneself requires PX and PY to be recalculated whenever the
mapping of the nSx, nSy blocks to nTx and nTy threads is changed.
Also, different PX and PY are required depending on
      page size
      array element size ( real*4, real*8 )
      array dimensions ( 2d, 3d Nz, 3d Nz+1 ) - in 3d a PZ would also
        be needed!
Note: 1. A C implementation would be a lot easier. An F90
         implementation using allocatable arrays would also be fairly
         straightforward.
      2. The padding really ought to go between the "collections" of
         blocks that the threads sharing the same near memory work on;
         that placement wastes the least memory. The PX, PY, PZ
         mechanism instead pads three levels further down the
         hierarchy, which wastes more memory.
      3. For large problems this is less of an issue. For a large
         problem, even for a 2d array, there might be say 16 pages per
         array per processor and at least 4 processors in a
         uniform-memory-access box. Assuming a sensible mapping of
         processors to blocks, only one page (1.5% of the memory)
         would be referenced by processors in another box.
         On the other hand, for a very small per-processor problem
         size, e.g. 32x32 per processor and again four processors per
         box, as many as 50% of the memory references could be to far
         memory for 2d fields. This could be very bad!

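As an illustration of the arithmetic only (the 8KB page size, real*8
element size, and the idea of rounding a block's x-extent up to whole
pages are assumptions made for this sketch, not the model's actual
scheme; sNx and OLx are the sub-grid size and overlap parameters from
the declarations above), a candidate PX could be computed as:

C     Round the x-extent of one block up to a whole number of pages of
C     real*8 elements and pad by the difference.  In practice PX would
C     have to be fixed at compile time from numbers like these.
      INTEGER pageBytes, elemBytes, elemsPerPage, rowElems, PX
      PARAMETER ( pageBytes = 8192, elemBytes = 8 )
      elemsPerPage = pageBytes/elemBytes
      rowElems     = sNx + 2*OLx
      PX           = MOD( elemsPerPage - MOD(rowElems,elemsPerPage),
     &                    elemsPerPage )
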
===================================== |