# $Id: README,v 1.2 2006/05/12 22:32:02 ce107 Exp $
Benchmarking routine of the CG2D solver in MITgcm (barotropic solve)

To build:

a) Parameterizations:

   i) of SIZE.h:
      sNx = size of tile in x-direction (ideally fits in cache, 30-60)
      sNy = size of tile in y-direction (ideally fits in cache, 30-60)
      OLx = overlap size in x-direction (usually 1 or 3)
      OLy = overlap size in y-direction (usually 1 or 3)

  ii) of ini_parms.F:
      nTimeSteps   = number of pseudo-timesteps to run for
      cg2dMaxIters = maximum number of CG iterations per timestep
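
   For example, a cache-friendly 60x60 tile with a 3-point overlap, run for
   100 pseudo-timesteps of at most 1000 CG iterations each, would be set
   roughly as follows (all numbers here are illustrative assumptions, not
   tuned recommendations):

      In SIZE.h:       sNx = 60, sNy = 60, OLx = 3, OLy = 3
      In ini_parms.F:  nTimeSteps = 100, cg2dMaxIters = 1000
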
b) Compilation:

   $CC $CFLAGS -c tim.c
   $FC $DEFINES $INCLUDES $FCFLAGS -o cg2d *.F tim.o $LIBS -lm

   (A fully worked example is given after the $LIBS section below.)

$DEFINES:
 1) For single precision add
       -DUSE_SINGLE_PRECISION
 2) For mixed precision (single for most operations, double for reductions)
    add -DUSE_MIXED_PRECISION in addition to -DUSE_SINGLE_PRECISION
 3) For parallel (MPI) operation add
       -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH
 4) To use the MPI timing routines add
       -DUSE_MPI_TIME
 5) To use MPI_Sendrecv() instead of MPI_Isend()/MPI_Irecv()/MPI_Waitall() add
       -DUSE_SNDRCV
 6) To use JAM for the exchanges (not available without the hardware) add
       -DUSE_JAM_EXCH
 7) To use JAM for the global sum (not available without the hardware) add
       -DUSE_JAM_GSUM
 8) To avoid doing the global sum in MPI, do not define
       -DUSE_MPI_GSUM
    Each processor will then see only its own local residual instead (dangerous).
 9) To avoid doing the exchanges in MPI, do not define
       -DUSE_MPI_EXCH
    Processors will then not exchange their shadow (overlap) regions (dangerous).
10) For performance counters add
       -DUSE_PAPI_FLOPS  to use PAPI to report Mflop/s, or
       -DUSE_PAPI_FLIPS  to use PAPI to report Mflip/s
    To produce this information for every iteration instead of once per
    "timestep", also add -DPAPI_PER_ITERATION.
11) For extra (nearest-neighbor) exchange steps to stress communications add
       -DTEN_EXTRA_EXCHS
12) For extra global sum steps to stress communications add
       -DHUNDRED_EXTRA_SUMS
13) For a 2D (PxQ) instead of a 1D decomposition add
       -DDECOMP2D
14) To output the residual every iteration add
       -DRESIDUAL_PER_ITERATION

$INCLUDES (if using PAPI):
   -I$PAPI_ROOT/include

$LIBS (if using PAPI; depending on the platform extra libraries may be needed):
   -L$PAPI_ROOT/lib -lpapi
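
   As a worked example, an MPI build with PAPI Mflop/s reporting might look
   as follows (mpicc/mpif77, the -O2 flag and the PAPI_ROOT path are
   illustrative assumptions; substitute your platform's compilers and paths):

   # Hypothetical build: MPI operation plus PAPI flop counting
   PAPI_ROOT=/usr/local/papi
   DEFINES="-DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH \
            -DUSE_MPI_TIME -DUSE_PAPI_FLOPS"
   mpicc -O2 -c tim.c
   mpif77 $DEFINES -I$PAPI_ROOT/include -O2 -o cg2d *.F tim.o \
          -L$PAPI_ROOT/lib -lpapi -lm
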
c) Running:

   1) To let the system choose the PxQ decomposition (if built for it):

      mpiexec -n $NPROCS ./cg2d

   2) To choose the decomposition yourself, create a decomp.touse file with
      the P and Q dimensions declared as two integers on its first two
      lines, e.g.

cat > decomp.touse << EOF
10
20
EOF

      mpiexec -n 200 ./cg2d

      Note that P x Q (10 x 20 = 200 here) matches the number of MPI
      processes.
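
   Putting the pieces together, a complete illustrative session (compiler
   wrappers, flags, decomposition and process count are all assumptions to
   adapt) might be:

   # Hypothetical end-to-end session: plain MPI build, explicit 4x8 grid
   mpicc -O2 -c tim.c
   mpif77 -DALLOW_MPI -DUSE_MPI_INIT -DUSE_MPI_GSUM -DUSE_MPI_EXCH \
          -DDECOMP2D -O2 -o cg2d *.F tim.o -lm
   printf "4\n8\n" > decomp.touse    # P=4, Q=8 on the first two lines
   mpiexec -n 32 ./cg2d              # 4 x 8 = 32 MPI processes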