#	$Id: README,v 1.2 2006/05/12 22:32:02 ce107 Exp $
Benchmarking routine for the CG2D solver in MITgcm (the barotropic solve)

To build:

a) Parameters:
i) In SIZE.h:
sNx = size of tile in x-direction (ideally fits in cache, 30-60)
sNy = size of tile in y-direction (ideally fits in cache, 30-60)
OLx = overlap (halo) size in x-direction (usually 1 or 3)
OLy = overlap (halo) size in y-direction (usually 1 or 3)
ii) In ini_parms.F:
nTimeSteps = number of pseudo-timesteps to run for
cg2dMaxIters = maximum number of CG iterations per timestep
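A quick, optional way to check the values that will be compiled in is to grep
for them (assuming a POSIX shell and that SIZE.h and ini_parms.F sit in the
build directory):

grep -E 'sNx|sNy|OLx|OLy' SIZE.h
grep -E 'nTimeSteps|cg2dMaxIters' ini_parms.F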

b) Compilation
$CC $CFLAGS -c tim.c
$FC $DEFINES $INCLUDES $FCFLAGS -o cg2d *.F tim.o $LIBS -lm
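For example, a plain serial, double-precision build might look like the
following (gcc and gfortran are only placeholders; substitute the local C and
Fortran compilers and preferred optimization flags):

gcc -O2 -c tim.c
gfortran -O2 -o cg2d *.F tim.o -lm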

$DEFINES:
1) For single precision add
-DUSE_SINGLE_PRECISION
2) For mixed precision (single for most operations, double for the reductions) add
-DUSE_MIXED_PRECISION in addition to -DUSE_SINGLE_PRECISION
3) For parallel (MPI) operation add
-DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH
4) To use the MPI timing routines add
-DUSE_MPI_TIME
5) To use MPI_Sendrecv() instead of MPI_Isend()/MPI_Irecv()/MPI_Waitall() add
-DUSE_SNDRCV
6) To use JAM for the exchanges (not available without the hardware) add
-DUSE_JAM_EXCH
7) To use JAM for the global sum (not available without the hardware) add
-DUSE_JAM_GSUM
8) To avoid doing the global sum in MPI, do not define
-DUSE_MPI_GSUM
and each processor will then see only its own residual (dangerous)
9) To avoid doing the exchanges in MPI, do not define
-DUSE_MPI_EXCH
and processors will then not exchange their shadow (halo) regions (dangerous)
10) Performance counters:
-DUSE_PAPI_FLOPS	to use PAPI to report Mflop/s
or
-DUSE_PAPI_FLIPS	to use PAPI to report Mflip/s
To report this information for every iteration instead of every "timestep",
add -DPAPI_PER_ITERATION to the above.
11) For extra (nearest-neighbor) exchange steps to stress communications add
-DTEN_EXTRA_EXCHS
12) For extra global sum steps to stress communications add
-DHUNDRED_EXTRA_SUMS
13) For a 2D (PxQ) instead of a 1D decomposition add
-DDECOMP2D
14) To output the residual every iteration add
-DRESIDUAL_PER_ITERATION
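For instance, a mixed-precision MPI build that also stresses the
nearest-neighbor exchanges could use (this particular combination is only an
illustration):

DEFINES="-DUSE_SINGLE_PRECISION -DUSE_MIXED_PRECISION -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH -DUSE_MPI_TIME -DTEN_EXTRA_EXCHS"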

$INCLUDES (if using PAPI):
-I$PAPI_ROOT/include

$LIBS (if using PAPI - depending on the platform extra libs may be needed):
-L$PAPI_ROOT/lib -lpapi
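Putting the pieces together, an MPI build with PAPI flop counting might look
like the following, where mpicc and mpif77 stand in for the local MPI compiler
wrappers and PAPI_ROOT for the local PAPI installation prefix:

mpicc -O2 -c tim.c
mpif77 -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH \
  -DUSE_PAPI_FLOPS -I$PAPI_ROOT/include -O2 \
  -o cg2d *.F tim.o -L$PAPI_ROOT/lib -lpapi -lm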

c) Running

1) To let the code choose the PxQ decomposition (if set up for it, cf. -DDECOMP2D above):
mpiexec -n $NPROCS ./cg2d
2) To specify the decomposition yourself, create a file decomp.touse containing
the P and Q dimensions as two integers on its first two lines, e.g.

cat > decomp.touse << EOF
10
20
EOF

mpiexec -n 200 ./cg2d

(The 200 MPI processes correspond to the 10 x 20 = 200 tiles of this decomposition.)