$Header$

Miscellaneous notes relating to MITgcm UV
=========================================

o Something really weird is happening - variables keep
  changing value!

  Apart from the usual problems of out-of-bounds array references
  and other bugs, it is important to be sure that "stack" variables
  really are stack variables in multi-threaded execution. Some
  compilers place a subroutine's local variables in static storage.
  This can result in an apparently private variable in a local
  routine being mysteriously changed by a concurrently executing
  thread.
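
  For example, the sketch below (hypothetical, not taken from the model
  source) shows the kind of routine that can go wrong. If the compiler
  gives the local variable tmp static storage, two threads calling the
  routine at the same time will overwrite each other's value. Check
  your compiler documentation for options that control whether local
  variables are placed on the stack.

C     Hypothetical example - not from the model source code.
C     If tmp is given static storage it is shared by all threads.
      SUBROUTINE SCALE_BY_SUM( a, n )
      INTEGER n
      REAL a(n)
      REAL tmp
      INTEGER i
      tmp = 0.
      DO i = 1, n
       tmp = tmp + a(i)
      ENDDO
      DO i = 1, n
       a(i) = a(i)/tmp
      ENDDO
      RETURN
      END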

=====================================

o Something really weird is happening - the code gets stuck in
  a loop somewhere!

  The routines in barrier.F should be compiled without any
  optimisation. These routines poll variables that are updated by
  other threads. Compiler optimisations generally assume that the
  code being optimised obeys the sequential semantics of regular
  Fortran, so they will assume that a variable is not going to change
  value unless the code being optimised changes it. Obviously this
  can cause problems.
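
  To see why, here is a sketch of the kind of spin-wait such a barrier
  relies on (illustrative only - this is not the actual barrier.F
  code). An optimising compiler may keep barrierDone in a register and
  hoist the test out of the loop, so the thread never sees the update
  made by another thread and spins forever.

C     Illustrative only - not the actual barrier.F code.
C     Spin until some other thread sets barrierDone in the shared
C     common block.
      SUBROUTINE WAIT_FOR_BARRIER
      LOGICAL barrierDone
      COMMON /BARRIER_FLAGS/ barrierDone
   10 CONTINUE
      IF ( .NOT. barrierDone ) GOTO 10
      RETURN
      END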

=====================================

o Is the Fortran SAVE statement a problem?

  Yes. On the whole the Fortran SAVE statement should not be used
  for data in multi-threaded code. SAVE causes data to be held in
  static storage, meaning that all threads see the same location, so
  generally if one thread updates the location all other threads will
  see the new value. Note, however, that there is often no
  specification of what should happen in this situation in a
  multi-threaded environment, so this is not a robust mechanism for
  sharing data.
  For most cases where SAVE might seem appropriate, one of the
  following two recipes should be used instead. Both schemes are
  potential performance bottlenecks if they are over-used.

  Method 1
  ********
  1. Put the SAVE variable in a common block.
  2. Update the SAVE variable in a _BEGIN_MASTER, _END_MASTER block.
  3. Include a _BARRIER after the _BEGIN_MASTER, _END_MASTER block.
  e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter

      _BEGIN_MASTER(myThid)
       nIter = nIter+1
      _END_MASTER(myThid)
      _BARRIER

  Note. The _BARRIER operation is potentially expensive. Be
  conservative in your use of this scheme.

  Method 2
  ********
  1. Put the SAVE variable in a common block, but with an extra
     dimension for the thread number.
  2. Change the updates and references to the SAVE variable to a
     per-thread basis.
  e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter(MAX_NO_THREADS)

      nIter(myThid) = nIter(myThid)+1

  Note. nIter(myThid) and nIter(myThid+1) will share the same
  cache line. The update will cause extra low-level memory
  traffic to maintain cache coherence. If the update is in
  a tight loop this will be a problem and nIter will need
  padding.
  In a NUMA system nIter(1:MAX_NO_THREADS) is likely to reside
  in a single page of physical memory on a single box. Again, in
  a tight loop this would cause lots of remote/far memory references
  and would be a problem. Some compilers provide a mechanism
  for helping to overcome this problem.
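
  One possible padded form is sketched below. It is illustrative only:
  the cache-line size is an assumption (128 bytes, i.e. 32 INTEGER*4
  words) and should be matched to the target machine, and the name
  CACHE_LINE_WORDS is not part of the model code.

C     Illustrative sketch only. Giving each thread's counter its own
C     (assumed 128-byte) cache line stops neighbouring threads from
C     sharing a line.
      INTEGER CACHE_LINE_WORDS
      PARAMETER ( CACHE_LINE_WORDS = 32 )
      COMMON /PARAMS/ nIter
      INTEGER nIter(CACHE_LINE_WORDS,MAX_NO_THREADS)

      nIter(1,myThid) = nIter(1,myThid)+1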

=====================================

o Can I debug using write statements?

  Many systems do not have "thread-safe" Fortran I/O libraries.
  On these systems I/O generally works, but output from different
  threads gets a bit intermingled! Occasionally, doing multi-threaded
  I/O with an unsafe Fortran I/O library will actually cause the
  program to fail. Note: SGI has a "thread-safe" Fortran I/O library.
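
  Two simple workarounds are sketched below (illustrative only; myVar
  is a hypothetical variable, not part of the model code): tag each
  line with the thread number so that interleaved output can still be
  attributed, or restrict debug output to the master thread so that
  only one thread calls the I/O library.

C     (a) Tag output with the thread number.
      WRITE(*,*) 'thread ', myThid, ': myVar = ', myVar
C     (b) Or write only from the master thread.
      _BEGIN_MASTER(myThid)
       WRITE(*,*) 'myVar = ', myVar
      _END_MASTER(myThid)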

=====================================

o Mapping virtual memory to physical memory.

  The current code declares arrays as
      real aW2d (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
  This raises an issue on shared virtual-memory machines that have an
  underlying non-uniform memory subsystem, e.g. HP Exemplar, SGI
  Origin, DG, Sequent etc. What most machines implement is a scheme
  in which the physical memory that backs the virtual memory is
  allocated on a page basis at run-time. The OS manages this
  allocation and, without exception, pages are assigned to physical
  memory on the box where the thread which caused the page fault is
  running. Pages are typically 4-8KB in size. This means that in some
  environments it would make sense to declare arrays as
      real aW2d (1-OLx:sNx+OLx+PX,1-OLy:sNy+OLy+PY,nSx,nSy)
  where PX and PY are chosen so that the divides between near and far
  memory coincide with the boundaries of the virtual-memory regions
  each thread works on. In principle this is easy, but it is also
  inelegant, and really one would like the OS/hardware to take care
  of this issue. Doing it oneself requires PX and PY to be
  recalculated whenever the mapping of the nSx, nSy blocks to nTx and
  nTy threads is changed. Also, different PX and PY are required
  depending on (a rough sketch of the calculation is given after this
  list):
    page size
    array element size ( real*4, real*8 )
    array dimensions ( 2d, 3d Nz, 3d Nz+1 ) - in 3d a PZ would also
    be needed!
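
  As a rough illustration of the arithmetic involved (none of these
  names are in the model code; the 4KB page size and real*8 element
  size are assumptions, and in practice PY would have to be fixed as
  a compile-time PARAMETER rather than computed at run time), the
  following computes the smallest number of extra rows PY such that
  one tile's 2d slab covers at least a whole number of pages:

C     Rough illustration only, using the usual SIZE.h parameters
C     sNx, sNy, OLx, OLy.
      INTEGER pageBytes, elemBytes
      PARAMETER ( pageBytes = 4096, elemBytes = 8 )
      INTEGER rowBytes, tileBytes, paddedBytes, PY
      rowBytes    = (sNx+2*OLx)*elemBytes
      tileBytes   = rowBytes*(sNy+2*OLy)
C     round the tile size up to a whole number of pages
      paddedBytes = ((tileBytes+pageBytes-1)/pageBytes)*pageBytes
C     extra rows needed to cover the rounded-up size
      PY          = (paddedBytes-tileBytes+rowBytes-1)/rowBytes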

  Note: 1. A C implementation would be a lot easier. An F90
           implementation using allocation would also be fairly
           straightforward.
        2. The padding really ought to go between the "collections"
           of blocks that the threads sharing the same near memory
           work on; placing the padding there would waste the least
           memory. The PX, PY, PZ mechanism instead pads three levels
           further down the hierarchy, which wastes more memory.
        3. For large problems this is less of an issue. For a large
           problem, even for a 2d array, there might be say 16 pages
           per array per processor and at least 4 processors in a
           uniform-memory-access box. Assuming a sensible mapping of
           processors to blocks, only about one page in 64 (16 pages
           x 4 processors, roughly 1.5% of the memory) would be
           referenced by processors in another box.
           On the other hand, for a very small per-processor problem
           size, e.g. 32x32 per processor and again four processors
           per box, as many as 50% of the memory references could be
           to far memory for 2d fields. This could be very bad!

=====================================