# $Id: README,v 1.2 2006/05/12 22:32:02 ce107 Exp $
Benchmarking routine for the CG2D solver in MITgcm (barotropic solve)

To build:
|
a) Parameterizations:
   i) In SIZE.h:
      sNx = tile size in the x-direction (ideally fits in cache; typically 30-60)
      sNy = tile size in the y-direction (ideally fits in cache; typically 30-60)
      OLx = overlap size in the x-direction (usually 1 or 3)
      OLy = overlap size in the y-direction (usually 1 or 3)
   ii) In ini_parms.F:
      nTimeSteps   = number of pseudo-timesteps to run
      cg2dMaxIters = maximum number of CG iterations per timestep
|
b) Compilation:
   $CC $CFLAGS -c tim.c
   $FC $DEFINES $INCLUDES $FCFLAGS -o cg2d *.F tim.o $LIBS -lm
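
For example, a minimal serial build might look as follows (a sketch only; the
GNU compilers are an illustrative choice, any C/Fortran compiler pair and
optimization flags should do):

   gcc -O2 -c tim.c
   gfortran -O2 -o cg2d *.F tim.o -lm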
|
$DEFINES:
1) For single precision add
     -DUSE_SINGLE_PRECISION
2) For mixed precision (single for most operations, double for reductions) add
     -DUSE_MIXED_PRECISION in addition to -DUSE_SINGLE_PRECISION
3) For parallel (MPI) operation add
     -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH
4) To use the MPI timing routines add
     -DUSE_MPI_TIME
5) To use MPI_Sendrecv() instead of MPI_Isend()/MPI_Irecv()/MPI_Waitall() add
     -DUSE_SNDRCV
6) To use JAM for the exchanges (not available without the hardware) add
     -DUSE_JAM_EXCH
7) To use JAM for the global sum (not available without the hardware) add
     -DUSE_JAM_GSUM
8) To avoid doing the global sum in MPI, do not define
     -DUSE_MPI_GSUM
   and each processor will then see only its own residual (dangerous)
9) To avoid doing the exchanges in MPI, do not define
     -DUSE_MPI_EXCH
   and processors will then not exchange their shadow regions (dangerous)
10) For performance counters add
     -DUSE_PAPI_FLOPS to use PAPI to report Mflop/s
    or
     -DUSE_PAPI_FLIPS to use PAPI to report Mflip/s
    To report this information every iteration instead of every "timestep",
    also add -DPAPI_PER_ITERATION
11) For extra (nearest-neighbor) exchange steps to stress communications add
     -DTEN_EXTRA_EXCHS
12) For extra global-sum steps to stress communications add
     -DHUNDRED_EXTRA_SUMS
13) For a 2D (PxQ) instead of a 1D decomposition add
     -DDECOMP2D
14) To output the residual every iteration add
     -DRESIDUAL_PER_ITERATION
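
For example, a typical parallel run in the default (double) precision with MPI
timing enabled might use (a sketch; pick the flags that match your case):

   DEFINES="-DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH -DUSE_MPI_TIME"

and a mixed-precision variant of the same build would additionally use:

   DEFINES="$DEFINES -DUSE_SINGLE_PRECISION -DUSE_MIXED_PRECISION"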
|
$INCLUDES (if using PAPI):
   -I$PAPI_ROOT/include

$LIBS (if using PAPI; depending on the platform extra libraries may be needed):
   -L$PAPI_ROOT/lib -lpapi
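
Putting it together, a PAPI-instrumented MPI build might look like this (a
sketch assuming PAPI is installed under $PAPI_ROOT and that the mpicc/mpif77
compiler wrappers are available):

   DEFINES="-DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH -DUSE_PAPI_FLOPS"
   INCLUDES="-I$PAPI_ROOT/include"
   LIBS="-L$PAPI_ROOT/lib -lpapi"
   mpicc -O2 -c tim.c
   mpif77 $DEFINES $INCLUDES -O2 -o cg2d *.F tim.o $LIBS -lm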
|
c) Running:

1) To let the program choose the PxQ decomposition (if the code was built for
   it; cf. -DDECOMP2D above):
   mpiexec -n $NPROCS ./cg2d
2) To choose the decomposition yourself, create a decomp.touse file with the
   P and Q dimensions given as two integers on its first two lines, e.g.

cat > decomp.touse << EOF
10
20
EOF

mpiexec -n 200 ./cg2d

Note that the number of MPI processes given to mpiexec must equal P*Q
(here 10*20 = 200).