# $Id: README,v 1.3 2006/05/12 22:32:31 ce107 Exp $

Benchmarking routine of the CG2D solver in MITgcm (barotropic solve)

To build:

a) Parameterizations (example SIZE.h settings are sketched at the end of
   this README):

   i) in SIZE.h:
      sNx = size of tile in x-direction (ideally fits in cache, 30-60)
      sNy = size of tile in y-direction (ideally fits in cache, 30-60)
      OLx = overlap size in x-direction (usually 1 or 3)
      OLy = overlap size in y-direction (usually 1 or 3)

   ii) in ini_parms.F:
      nTimeSteps   = number of pseudo-timesteps to run for
      cg2dMaxIters = maximum number of CG iterations per timestep

b) Compilation (a worked example is given at the end of this README):

   $CC $CFLAGS -c tim.c
   $FC $DEFINES $INCLUDES $FCFLAGS -o cg2d *.F tim.o $LIBS -lm

   $DEFINES:
   1)  For single precision add:
       -DUSE_SINGLE_PRECISION
   2)  For mixed precision (single for most operations, double for
       reductions) add, in addition to -DUSE_SINGLE_PRECISION:
       -DUSE_MIXED_PRECISION
   3)  For parallel (MPI) operation:
       -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH
   4)  To use the MPI timing routines:
       -DUSE_MPI_TIME
   5)  To use MPI_Sendrecv() instead of MPI_Isend()/MPI_Irecv()/MPI_Waitall():
       -DUSE_SNDRCV
   6)  To use JAM for the exchanges (not available without the hardware):
       -DUSE_JAM_EXCH
   7)  To use JAM for the global sum (not available without the hardware):
       -DUSE_JAM_GSUM
   8)  To avoid doing the global sum in MPI, do not define -DUSE_MPI_GSUM;
       each processor will then see only its own residual (dangerous).
   9)  To avoid doing the exchanges in MPI, do not define -DUSE_MPI_EXCH;
       processors will then not exchange their shadow regions (dangerous).
   10) Performance counters:
       -DUSE_PAPI_FLOPS to use PAPI to report Mflop/s, or
       -DUSE_PAPI_FLIPS to use PAPI to report Mflip/s.
       To produce this information for every iteration instead of every
       "timestep", also add:
       -DPAPI_PER_ITERATION
   11) Extra (nearest-neighbor) exchange steps to stress communications:
       -DTEN_EXTRA_EXCHS
   12) Extra global-sum steps to stress communications:
       -DHUNDRED_EXTRA_SUMS
   13) 2D (PxQ) instead of 1D decomposition:
       -DDECOMP2D
   14) To output the residual every iteration:
       -DRESIDUAL_PER_ITERATION

   $INCLUDES (if using PAPI):
   -I$PAPI_ROOT/include

   $LIBS (if using PAPI; depending on the platform, extra libraries may be
   needed):
   -L$PAPI_ROOT/lib -lpapi

c) Running

   1) Letting the system choose the PxQ decomposition, if the binary was
      set up for it (-DDECOMP2D):

      mpiexec -n $NPROCS ./cg2d

   2) Forcing a particular decomposition: create a file decomp.touse with
      the P and Q dimensions given as two integers on the first two lines,
      e.g.

      cat > decomp.touse << EOF
      10
      20
      EOF
      mpiexec -n 200 ./cg2d

      (PxQ must match the number of MPI processes; here 10x20 = 200.)
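
Example SIZE.h settings:

The lines below are only a sketch; the 32x32 tile with a 1-point overlap is
an illustrative choice, and the actual SIZE.h shipped with the benchmark may
declare additional parameters.

C     Tile extents and overlap widths (illustrative values only)
      INTEGER sNx, sNy, OLx, OLy
      PARAMETER ( sNx = 32, sNy = 32, OLx = 1, OLy = 1 )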
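
Example build recipe:

The following is only a sketch assembled from the flags above, under assumed
site-specific settings: the mpicc/mpif77 compiler wrappers and the PAPI
install path are assumptions, not requirements of the benchmark. It produces
a double-precision MPI binary with MPI timing and PAPI Mflop/s reporting:

   # site-specific assumptions: MPI compiler wrappers and PAPI location
   CC=mpicc
   FC=mpif77
   PAPI_ROOT=/usr/local/papi   # hypothetical install path
   DEFINES="-DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH \
            -DUSE_MPI_TIME -DUSE_PAPI_FLOPS"
   INCLUDES="-I$PAPI_ROOT/include"
   LIBS="-L$PAPI_ROOT/lib -lpapi"
   # compile the C timer, then compile and link the Fortran sources
   $CC $CFLAGS -c tim.c
   $FC $DEFINES $INCLUDES $FCFLAGS -o cg2d *.F tim.o $LIBS -lm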