Miscellaneous notes relating to MITgcm UV
=========================================

o Something really weird is happening - variables keep
  changing value!

  Apart from the usual problems of out-of-bounds array references
  and various bugs, it is important to be sure that "stack"
  variables really are stack variables in multi-threaded execution.
  Some compilers put a subroutine's local variables in static storage.
  This can result in an apparently private variable in a local
  routine being mysteriously changed by a concurrently executing
  thread.

=====================================

o Something really weird is happening - the code gets stuck in
  a loop somewhere!

  The routines in barrier.F should be compiled without any
  optimisation. These routines poll variables that are updated by
  other threads. Compiler optimisations generally assume that the
  code being optimised will obey the sequential semantics of regular
  Fortran. That means they will assume that a variable is not going
  to change value unless the code being optimised changes it.
  Obviously this can cause problems.

=====================================

o Is the Fortran SAVE statement a problem?

  Yes. On the whole the Fortran SAVE statement should not be used
  for data in a multi-threaded code. SAVE causes data to be held in
  static storage, meaning that all threads will see the same location.
  Therefore, generally, if one thread updates the location all other
  threads will see it. Note - there is often no specification for
  what should happen in this situation in a multi-threaded
  environment, so this is not a robust mechanism for sharing data.
  For most cases where SAVE might be appropriate, either of the
  following recipes should be used instead. Both these schemes are
  potential performance bottlenecks if they are over-used.

  Method 1
  ********
  1. Put the SAVE variable in a common block.
  2. Update the SAVE variable in a _BEGIN_MASTER, _END_MASTER block.
  3. Include a _BARRIER after the _BEGIN_MASTER, _END_MASTER block.
  e.g.
     C nIter - Current iteration counter
           COMMON /PARAMS/ nIter
           INTEGER nIter

           _BEGIN_MASTER(myThid)
            nIter = nIter+1
           _END_MASTER(myThid)
           _BARRIER

  Note. The _BARRIER operation is potentially expensive. Be
  conservative in your use of this scheme.

  Method 2
  ********
  1. Put the SAVE variable in a common block, but with an extra
     dimension for the thread number.
  2. Change the updates and references to the SAVE variable to a
     per-thread basis.
  e.g.
     C nIter - Current iteration counter
           COMMON /PARAMS/ nIter
           INTEGER nIter(MAX_NO_THREADS)

            nIter(myThid) = nIter(myThid)+1

  Note. nIter(myThid) and nIter(myThid+1) will share the same
  cache line. The update will cause extra low-level memory
  traffic to maintain cache coherence. If the update is in
  a tight loop this will be a problem and nIter will need
  padding.
  In a NUMA system nIter(1:MAX_NO_THREADS) is likely to reside
  in a single page of physical memory on a single box. Again, in
  a tight loop this would cause lots of remote/far memory references
  and would be a problem. Some compilers provide a mechanism
  for helping overcome this problem.

=====================================

o Can I debug using write statements?

  Many systems do not have "thread-safe" Fortran I/O libraries.
  On these systems I/O generally works, but it gets a bit
  intermingled! Occasionally doing multi-threaded I/O with an unsafe
  Fortran I/O library will actually cause the program to fail.
  Note: SGI has a "thread-safe" Fortran I/O library.

=====================================

o Mapping virtual memory to physical memory.

  The current code declares arrays as
     real aW2d (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
  This raises an issue on shared virtual-memory machines that have
  an underlying non-uniform memory subsystem, e.g. HP Exemplar, SGI
  Origin, DG, Sequent etc. What most machines implement is a scheme
  in which the physical memory that backs the virtual memory is
  allocated on a page basis at run-time. The OS manages this
  allocation and, without exception, pages are assigned to physical
  memory on the box where the thread which caused the page-fault is
  running. Pages are typically 4-8KB in size. This means that in
  some environments it would make sense to declare arrays
     real aW2d (1-OLx:sNx+OLx+PX,1-OLy:sNy+OLy+PY,nSx,nSy)
  where PX and PY are chosen so that the divides between near and
  far memory will coincide with the boundaries of the virtual memory
  regions a thread works on. In principle this is easy, but it is
  also inelegant and really one would like the OS/hardware to take
  care of this issue. Doing it oneself requires PX and PY to be
  recalculated whenever the mapping of the nSx, nSy blocks to nTx
  and nTy threads is changed. Also, different PX and PY are required
  depending on
     page size
     array element size ( real*4, real*8 )
     array dimensions ( 2d, 3d Nz, 3d Nz+1 ) - in 3d a PZ would
     also be needed!
  Note: 1. A C implementation would be a lot easier. An F90
           implementation including allocation would also be fairly
           straightforward.
        2. The padding really ought to go between the "collection"
           of blocks that all the threads using the same near memory
           work on; to save on wasted memory the padding should sit
           between these collections. The PX, PY, PZ mechanism
           instead pads three levels down in the hierarchy, which
           wastes more memory.
        3. For large problems this is less of an issue. For a large
           problem, even for a 2d array, there might be say 16 pages
           per array per processor and at least 4 processors in a
           uniform memory access box. Assuming a sensible mapping of
           processors to blocks, only one page (1.5% of the memory)
           would be referenced by processors in another box.
           On the other hand, for a very small per-processor problem
           size, e.g. 32x32 per processor and again four processors
           per box, as many as 50% of the memory references could be
           to far memory for 2d fields. This could be very bad!

=====================================