# $Id: README,v 1.2 2006/05/12 22:32:02 ce107 Exp $
Benchmarking routine of the CG2D solver in MITgcm (barotropic solve)

To build:

a) Parameterizations:

   i) of SIZE.h:
      sNx = size of tile in x-direction (ideally fits in cache, 30-60)
      sNy = size of tile in y-direction (ideally fits in cache, 30-60)
      OLx = overlap size in x-direction (usually 1 or 3)
      OLy = overlap size in y-direction (usually 1 or 3)

  ii) of ini_parms.F:
      nTimeSteps   = number of pseudo-timesteps to run for
      cg2dMaxIters = maximum number of CG iterations per timestep
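
   For example, a cache-friendly 60x60 tile with a 3-point overlap, run for
   100 pseudo-timesteps of at most 1000 CG iterations each, would be set
   roughly as follows (all numbers here are illustrative assumptions, not
   tuned recommendations):

      In SIZE.h:       sNx = 60, sNy = 60, OLx = 3, OLy = 3
      In ini_parms.F:  nTimeSteps = 100, cg2dMaxIters = 1000
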
b) Compilation:

   $CC $CFLAGS -c tim.c
   $FC $DEFINES $INCLUDES $FCFLAGS -o cg2d *.F tim.o $LIBS -lm

   (A fully worked example is given after the $LIBS section below.)

$DEFINES:
 1) For single precision add
       -DUSE_SINGLE_PRECISION
 2) For mixed precision (single for most operations, double for reductions)
    add -DUSE_MIXED_PRECISION in addition to -DUSE_SINGLE_PRECISION
 3) For parallel (MPI) operation add
       -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH
 4) To use the MPI timing routines add
       -DUSE_MPI_TIME
 5) To use MPI_Sendrecv() instead of MPI_Isend()/MPI_Irecv()/MPI_Waitall() add
       -DUSE_SNDRCV
 6) To use JAM for the exchanges (not available without the hardware) add
       -DUSE_JAM_EXCH
 7) To use JAM for the global sum (not available without the hardware) add
       -DUSE_JAM_GSUM
 8) To avoid doing the global sum in MPI, do not define
       -DUSE_MPI_GSUM
    Each processor will then see only its own local residual instead (dangerous).
 9) To avoid doing the exchanges in MPI, do not define
       -DUSE_MPI_EXCH
    Processors will then not exchange their shadow (overlap) regions (dangerous).
10) For performance counters add
       -DUSE_PAPI_FLOPS  to use PAPI to report Mflop/s, or
       -DUSE_PAPI_FLIPS  to use PAPI to report Mflip/s
    To produce this information for every iteration instead of once per
    "timestep", also add -DPAPI_PER_ITERATION.
11) For extra (nearest-neighbor) exchange steps to stress communications add
       -DTEN_EXTRA_EXCHS
12) For extra global sum steps to stress communications add
       -DHUNDRED_EXTRA_SUMS
13) For a 2D (PxQ) instead of a 1D decomposition add
       -DDECOMP2D
14) To output the residual every iteration add
       -DRESIDUAL_PER_ITERATION

$INCLUDES (if using PAPI):
   -I$PAPI_ROOT/include

$LIBS (if using PAPI; depending on the platform extra libraries may be needed):
   -L$PAPI_ROOT/lib -lpapi
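
   As a worked example, an MPI build with PAPI Mflop/s reporting might look
   as follows (mpicc/mpif77, the -O2 flag and the PAPI_ROOT path are
   illustrative assumptions; substitute your platform's compilers and paths):

   # Hypothetical build: MPI operation plus PAPI flop counting
   PAPI_ROOT=/usr/local/papi
   DEFINES="-DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH \
            -DUSE_MPI_TIME -DUSE_PAPI_FLOPS"
   mpicc -O2 -c tim.c
   mpif77 $DEFINES -I$PAPI_ROOT/include -O2 -o cg2d *.F tim.o \
          -L$PAPI_ROOT/lib -lpapi -lm
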
c) Running:

   1) To let the system choose the PxQ decomposition (if built for it):

      mpiexec -n $NPROCS ./cg2d

   2) To choose the decomposition yourself, create a decomp.touse file with
      the P and Q dimensions declared as two integers on its first two
      lines, e.g.

cat > decomp.touse << EOF
10
20
EOF

      mpiexec -n 200 ./cg2d

      Note that P x Q (10 x 20 = 200 here) matches the number of MPI
      processes.
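
   Putting the pieces together, a complete illustrative session (compiler
   wrappers, flags, decomposition and process count are all assumptions to
   adapt) might be:

   # Hypothetical end-to-end session: plain MPI build, explicit 4x8 grid
   mpicc -O2 -c tim.c
   mpif77 -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH \
          -DDECOMP2D -O2 -o cg2d *.F tim.o -lm
   printf "4\n8\n" > decomp.touse    # P=4, Q=8 on the first two lines
   mpiexec -n 32 ./cg2d              # 4 x 8 = 32 MPI processes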