Miscellaneous notes relating to MITgcm UV
=========================================

o Something really weird is happening - variables keep
changing value!

Apart from the usual problems of out-of-bounds array references
and various bugs, it is important to be sure that "stack"
variables really are stack variables in multi-threaded execution.
Some compilers put a subroutine's local variables in static storage.
This can result in an apparently private variable in a local
routine being mysteriously changed by a concurrently executing
thread.
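For example, a routine like the sketch below (the routine and variable
names are hypothetical) is only correct if tmpSum really is a stack
variable; with statically allocated locals every thread shares the one
tmpSum and can corrupt another thread's partial sum.

      SUBROUTINE DO_WORK( myThid )
      INTEGER myThid
C     tmpSum looks private to this call, but if the compiler puts
C     locals in static storage it behaves like a SAVE variable and
C     all threads update the same storage location.
      REAL tmpSum
      INTEGER i
      tmpSum = 0.
      DO i = 1, 10
       tmpSum = tmpSum + FLOAT(i)
      ENDDO
      RETURN
      END

Most compilers have an option that forces automatic (stack) allocation
of local variables - check the compiler documentation.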

=====================================

o Something really weird is happening - the code gets stuck in
a loop somewhere!

The routines in barrier.F should be compiled without any
optimisation. These routines check variables that are updated by other
threads. Compiler optimisations generally assume that the code being
optimised will obey the sequential semantics of regular Fortran. That
means they will assume that a variable is not going to change value
unless the code it is optimising changes it. Obviously this can cause
problems.
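A minimal sketch of the kind of polling loop this refers to (the
routine, common block and variable names are illustrative, not the
actual barrier.F code):

      SUBROUTINE SPIN_WAIT( nThreads )
      INTEGER nThreads
C     nDone is incremented by other threads as they reach the
C     synchronisation point.
      COMMON /SYNC_VARS/ nDone
      INTEGER nDone
C     If the optimiser decides nDone cannot change inside this loop
C     it may load it into a register just once, and the loop then
C     spins forever.
   10 CONTINUE
      IF ( nDone .LT. nThreads ) GOTO 10
      RETURN
      END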

=====================================

o Is the Fortran SAVE statement a problem?

Yes. On the whole the Fortran SAVE statement should not be used
for data in a multi-threaded code. SAVE causes data to be held in
static storage, meaning that all threads will see the same location.
Therefore, generally, if one thread updates the location all other
threads will see it. Note - there is often no specification for what
should happen in this situation in a multi-threaded environment, so
this is not a robust mechanism for sharing data.
For most cases where SAVE might be appropriate either of the following
recipes should be used instead. Both of these schemes are potential
performance bottlenecks if they are over-used.
Method 1
********
1. Put the SAVE variable in a common block.
2. Update the SAVE variable in a _BEGIN_MASTER, _END_MASTER block.
3. Include a _BARRIER after the _BEGIN_MASTER, _END_MASTER block.
e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter

      _BEGIN_MASTER(myThid)
       nIter = nIter+1
      _END_MASTER(myThid)
      _BARRIER

Note. The _BARRIER operation is potentially expensive. Be conservative
in your use of this scheme.

Method 2
********
1. Put the SAVE variable in a common block but with an extra dimension
for the thread number.
2. Change the updates and references to the SAVE variable to a per-thread
basis.
e.g.
C     nIter - Current iteration counter
      COMMON /PARAMS/ nIter
      INTEGER nIter(MAX_NO_THREADS)

      nIter(myThid) = nIter(myThid)+1

Note. nIter(myThid) and nIter(myThid+1) will share the same
cache line. The update will cause extra low-level memory
traffic to maintain cache coherence. If the update is in
a tight loop this will be a problem and nIter will need
padding.
In a NUMA system nIter(1:MAX_NO_THREADS) is likely to reside
in a single page of physical memory on a single box. Again, in
a tight loop this would cause lots of remote/far memory references
and would be a problem. Some compilers provide a mechanism
for helping overcome this problem.
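The padding referred to above might look like the sketch below.
CACHE_LINE_WORDS is an assumed value, not an MITgcm parameter (a
64-byte cache line holds 16 INTEGER*4 words), and would need to match
the target processor; each thread's counter is pushed onto its own
cache line at the cost of some wasted memory.

C     nIter - Current iteration counter, padded to one cache line
C     per thread so that updates by different threads do not cause
C     false sharing.
      INTEGER CACHE_LINE_WORDS
      PARAMETER ( CACHE_LINE_WORDS = 16 )
      COMMON /PARAMS/ nIter
      INTEGER nIter(CACHE_LINE_WORDS,MAX_NO_THREADS)

      nIter(1,myThid) = nIter(1,myThid)+1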

=====================================

o Can I debug using write statements?

Many systems do not have "thread-safe" Fortran I/O libraries.
On these systems I/O generally works but it gets a bit intermingled!
Occasionally doing multi-threaded I/O with an unsafe Fortran I/O library
will actually cause the program to fail. Note: SGI has a "thread-safe"
Fortran I/O library.
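One way to keep debug output legible, and to avoid driving an unsafe
I/O library from several threads at once, is to write only from the
master thread, or to tag every line with the thread number. A sketch
using the _BEGIN_MASTER/_END_MASTER macros shown earlier:

C     Only the master thread writes; the other threads skip the I/O.
      _BEGIN_MASTER(myThid)
       WRITE(*,*) 'nIter = ', nIter
      _END_MASTER(myThid)

C     Alternatively, tag each line with the thread number so that
C     intermingled output can still be sorted out afterwards.
      WRITE(*,*) 'thread ', myThid, ': nIter = ', nIter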

=====================================

o Mapping virtual memory to physical memory.

The current code declares arrays as
      real aW2d (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
This raises an issue on shared virtual-memory machines that have
an underlying non-uniform memory subsystem e.g. HP Exemplar, SGI
Origin, DG, Sequent etc. What most machines implement is a scheme
in which the physical memory that backs the virtual memory is allocated
on a page basis at run-time. The OS manages this allocation and, without
exception, pages are assigned to physical memory on the box where the
thread which caused the page-fault is running. Pages are typically
4-8KB in size. This means that in some environments it would make
sense to declare arrays
      real aW2d (1-OLx:sNx+OLx+PX,1-OLy:sNy+OLy+PY,nSx,nSy)
where PX and PY are chosen so that the divides between near and
far memory will coincide with the boundaries of the virtual memory
regions a thread works on. In principle this is easy, but it is
also inelegant and really one would like the OS/hardware to take
care of this issue. Doing it oneself requires PX and PY to be
recalculated whenever the mapping of the nSx, nSy blocks to nTx and
nTy threads is changed. Also, different PX and PY are required
depending on
  page size
  array element size ( real*4, real*8 )
  array dimensions ( 2d, 3d Nz, 3d Nz+1 ) - in 3d a PZ would also be needed!
A sketch of the arithmetic involved is given at the end of this note.
Note: 1. A C implementation would be a lot easier. An F90 implementation
using dynamic allocation would also be fairly straightforward.
2. The padding really ought to be between the "collection" of blocks
that all the threads using the same near memory work on; putting it
there saves on wasted memory. The PX, PY, PZ mechanism instead pads
three levels down the hierarchy and so wastes more memory.
3. For large problems this is less of an issue. For a large problem,
even for a 2d array, there might be say 16 pages per array per processor
and at least 4 processors in a uniform memory access box. Assuming a
sensible mapping of processors to blocks, only one page (1.5% of the
memory) would be referenced by processors in another box.
On the other hand, for a very small per-processor problem size e.g.
32x32 per processor, and again four processors per box, as many as
50% of the memory references could be to far memory for 2d fields.
This could be very bad!
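The sketch below shows the sort of arithmetic involved, assuming 4KB
pages and real*8 elements; pageBytes, elemBytes, pageWords, blockWords
and padWords are illustrative names, not MITgcm code, and sNx, sNy,
OLx, OLy are the usual size parameters.

      INTEGER pageBytes, elemBytes, pageWords
      PARAMETER ( pageBytes = 4096, elemBytes = 8 )
      PARAMETER ( pageWords = pageBytes/elemBytes )
      INTEGER blockWords, padWords
C     Words in one un-padded block of a 2d field
      blockWords = (sNx+2*OLx)*(sNy+2*OLy)
C     Extra words to round one block up to a whole number of pages
      padWords   = MOD( pageWords - MOD(blockWords,pageWords),
     &                  pageWords )
C     padWords then has to be split between PX and PY (and PZ for a
C     3d field), and recomputed whenever the block-to-thread mapping,
C     the element size or the page size changes.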

=====================================