Scalasca  (Scalasca 2.6, revision 748ac9e9)
Scalable Performance Analysis of Large-Scale Applications
Preparing a reference execution

As a first step of every performance analysis, a reference execution using an uninstrumented executable should be performed. First, this step verifies that the code executes cleanly and produces correct results. Second, it later makes it possible to assess the run-time overhead introduced by instrumentation and measurement. Finally, it provides a baseline to compare against after applying code optimizations. At this stage an appropriate test configuration should be chosen, one that is both repeatable and long enough to be representative. (Note that excessively long execution durations can make measurement and analysis inconvenient or even prohibitive, and should therefore be avoided.)
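
The overhead assessment mentioned above boils down to comparing two wall-clock times. The following is a minimal sketch of that arithmetic; the helper name overhead_pct and the two example times are placeholders chosen for illustration, not values or tools from this guide.

```shell
# Relative run-time overhead of a measurement run versus the reference run,
# in percent: (t_measured - t_reference) / t_reference * 100.
# Both arguments are wall-clock times in seconds.
overhead_pct() {
    awk -v ref="$1" -v meas="$2" 'BEGIN {
        if (ref <= 0) {
            print "reference time must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.1f\n", (meas - ref) / ref * 100
    }'
}

# Placeholder numbers: 200.0 s for the reference run, 210.0 s with measurement.
overhead_pct 200.0 210.0
```

With the placeholder times shown, the call prints 5.0, that is, the measured run took 5% longer than the reference.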

After unpacking the NPB-MPI source archive, the build system has to be adjusted to the respective environment. For the NAS benchmarks, this is accomplished by a Makefile snippet defining a number of variables used by a generic Makefile. This snippet is called make.def and has to reside in the config/ subdirectory, which already contains a template file that can be copied and adjusted appropriately. In particular, the MPI Fortran compiler wrapper and flags need to be specified, for example:

  MPIF77     = mpifort
  FFLAGS     = -O2
  FLINKFLAGS = -O2

Note that the MPI C compiler wrapper and flags are not used for building BT. They may nevertheless be set in the config/make.def file as well, in order to experiment with other NPB benchmarks.
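
Where the adjustment of make.def needs to be repeated on several systems, it can be scripted. The sketch below is one possible way to do so; the helper name set_make_var is hypothetical and not part of NPB, and the template file name config/make.def.template is an assumption based on the NPB 3.3 distribution that should be checked against the unpacked archive.

```shell
# Set a "NAME = value" assignment in a make.def-style file.
# An existing "NAME = ..." line is replaced; otherwise the assignment is
# appended. Only assignments written with whitespace around "=" are matched.
set_make_var() {
    mk_file=$1 mk_name=$2 mk_value=$3
    mk_tmp=$(mktemp) || return 1
    awk -v name="$mk_name" -v value="$mk_value" '
        $1 == name && $2 == "=" {
            printf "%-10s = %s\n", name, value
            found = 1
            next
        }
        { print }
        END { if (!found) printf "%-10s = %s\n", name, value }
    ' "$mk_file" > "$mk_tmp" && cat "$mk_tmp" > "$mk_file"
    mk_status=$?
    rm -f "$mk_tmp"
    return "$mk_status"
}

# Intended use from the top-level NPB directory (left commented out here,
# because it requires the unpacked source tree; template name is assumed):
#   cp config/make.def.template config/make.def
#   set_make_var config/make.def MPIF77     mpifort
#   set_make_var config/make.def FFLAGS     -O2
#   set_make_var config/make.def FLINKFLAGS -O2
```

Editing the copied file by hand is equally valid; the helper merely makes the three settings reproducible.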

Next, the benchmark can be built from the top-level directory by running make, specifying on the command line the number of MPI ranks via the NPROCS variable (for BT, this must be a square number) and the problem size via the CLASS variable. Valid problem classes, in order of increasing size, are S, W, A, B, C, D, and E; they can be used to adjust the benchmark runtime to the execution environment. For example, class S or W is appropriate for execution on a laptop with 4 MPI ranks, while the larger problem sizes are more suitable for "real" configurations. For the example run on JURECA, 144 MPI ranks and problem class D were chosen:

  $ make bt NPROCS=144 CLASS=D
     =========================================
     =      NAS Parallel Benchmarks 3.3      =
     =      MPI/F77/C                        =
     =========================================

  cd BT; make NPROCS=144 CLASS=D SUBTYPE= VERSION=
  make[1]: Entering directory `/tmp/NPB3.3-MPI/BT'
  make[2]: Entering directory `/tmp/NPB3.3-MPI/sys'
  cc -g  -o setparams setparams.c
  make[2]: Leaving directory `/tmp/NPB3.3-MPI/sys'
  ../sys/setparams bt 144 D
  make[2]: Entering directory `/tmp/NPB3.3-MPI/BT'
  mpifort -c  -O2 bt.f
  mpifort -c  -O2 make_set.f
  mpifort -c  -O2 initialize.f
  mpifort -c  -O2 exact_solution.f
  mpifort -c  -O2 exact_rhs.f
  mpifort -c  -O2 set_constants.f
  mpifort -c  -O2 adi.f
  mpifort -c  -O2 define.f
  mpifort -c  -O2 copy_faces.f
  mpifort -c  -O2 rhs.f
  mpifort -c  -O2 solve_subs.f
  mpifort -c  -O2 x_solve.f
  mpifort -c  -O2 y_solve.f
  mpifort -c  -O2 z_solve.f
  mpifort -c  -O2 add.f
  mpifort -c  -O2 error.f
  mpifort -c  -O2 verify.f
  mpifort -c  -O2 setup_mpi.f
  cd ../common; mpifort -c  -O2 print_results.f
  cd ../common; mpifort -c  -O2 timers.f
  make[3]: Entering directory `/tmp/NPB3.3-MPI/BT'
  mpifort -c  -O2 btio.f
  mpifort -O2 -o ../bin/bt.D.144 bt.o make_set.o initialize.o exact_solution.o \
      exact_rhs.o set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o \
      x_solve.o y_solve.o z_solve.o add.o error.o verify.o setup_mpi.o \
      ../common/print_results.o ../common/timers.o btio.o
  make[3]: Leaving directory `/tmp/NPB3.3-MPI/BT'
  make[2]: Leaving directory `/tmp/NPB3.3-MPI/BT'
  make[1]: Leaving directory `/tmp/NPB3.3-MPI/BT'
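
Because BT requires a square number of MPI ranks, it can be convenient to check the intended NPROCS value before starting a build. The following sketch shows one way to do this in plain shell; the function name is_square is hypothetical and not provided by NPB or Scalasca.

```shell
# Succeeds if $1 is a positive perfect square (1, 4, 9, ...), the
# requirement stated for NPROCS when building BT.
is_square() {
    sq_n=$1 sq_r=0
    # Reject empty or non-numeric input.
    case $sq_n in ''|*[!0-9]*) return 1 ;; esac
    # Find the smallest r with r*r >= n.
    while [ $((sq_r * sq_r)) -lt "$sq_n" ]; do sq_r=$((sq_r + 1)); done
    [ "$sq_n" -gt 0 ] && [ $((sq_r * sq_r)) -eq "$sq_n" ]
}

nprocs=144 class=D
if is_square "$nprocs"; then
    # Only print the command here; run it from the top-level NPB directory.
    echo "make bt NPROCS=$nprocs CLASS=$class"
else
    echo "NPROCS=$nprocs is not a square number, as BT requires" >&2
fi
```

For the configuration used in this example (144 = 12 x 12) the check passes and the make command is printed.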

The resulting executable encodes the benchmark configuration in its name and is placed into the bin/ subdirectory. For the example make command above, it is named bt.D.144. This binary can now be executed, either by submitting an appropriate batch job (which is beyond the scope of this user guide) or directly in an interactive session.

  $ cd bin
  $ mpiexec -n 144 ./bt.D.144


   NAS Parallel Benchmarks 3.3 -- BT Benchmark

   No input file inputbt.data. Using compiled defaults
   Size:  408x 408x 408
   Iterations:  250    dt:   0.0000200
   Number of active processes:   144

   Time step    1
   Time step   20
   Time step   40
   Time step   60
   Time step   80
   Time step  100
   Time step  120
   Time step  140
   Time step  160
   Time step  180
   Time step  200
   Time step  220
   Time step  240
   Time step  250
   Verification being performed for class D
   accuracy setting for epsilon =  0.1000000000000E-07
   Comparison of RMS-norms of residual
             1 0.2533188551738E+05 0.2533188551738E+05 0.1497879774166E-12
             2 0.2346393716980E+04 0.2346393716980E+04 0.8488743310506E-13
             3 0.6294554366904E+04 0.6294554366904E+04 0.3034271788588E-14
             4 0.5352565376030E+04 0.5352565376030E+04 0.8308967344119E-13
             5 0.3905864038618E+05 0.3905864038618E+05 0.6650300273080E-13
   Comparison of RMS-norms of solution error
             1 0.3100009377557E+03 0.3100009377557E+03 0.1373406191445E-12
             2 0.2424086324913E+02 0.2424086324913E+02 0.1600422929406E-12
             3 0.7782212022645E+02 0.7782212022645E+02 0.4090394153928E-13
             4 0.6835623860116E+02 0.6835623860116E+02 0.3617356324816E-13
             5 0.6065737200368E+03 0.6065737200368E+03 0.2605201960010E-13
   Verification Successful


   BT Benchmark Completed.
   Class           =                        D
   Size            =            408x 408x 408
   Iterations      =                      250
   Time in seconds =                   216.00
   Total processes =                      144
   Compiled procs  =                      144
   Mop/s total     =                270070.08
   Mop/s/process   =                  1875.49
   Operation type  =           floating point
   Verification    =               SUCCESSFUL
   Version         =                    3.3.1
   Compile date    =              18 Mar 2019

   Compile options:
      MPIF77       = mpifort
      FLINK        = $(MPIF77)
      FMPI_LIB     = (none)
      FMPI_INC     = (none)
      FFLAGS       = -O2
      FLINKFLAGS   = -O2
      RAND         = (none)


   Please send feedbacks and/or the results of this run to:

   NPB Development Team
   Internet: npb@nas.nasa.gov


In the selected configuration, the BT benchmark executes 250 iterations of the time step loop, and then verifies that the result matches the expected outcome. Before exiting, the benchmark also reports some configuration details, as well as the wall-clock execution time (216.00 seconds) for the core computation.
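
To keep the reference timing for later comparison, the relevant fields can be extracted from a saved copy of the benchmark output. The sketch below assumes the "Key = value" layout of the result summary shown above; the helper name npb_field and the log file name in the usage comment are hypothetical.

```shell
# Print the value of the first "Key = value" line whose key matches $1,
# reading the NPB result summary from standard input.
npb_field() {
    awk -F= -v key="$1" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
        k == key {
            v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            print v
            exit
        }
    '
}

# Two lines copied from the result summary above, used as sample input.
summary='   Time in seconds =                   216.00
   Verification    =               SUCCESSFUL'

printf '%s\n' "$summary" | npb_field "Time in seconds"
printf '%s\n' "$summary" | npb_field "Verification"

# With a real run, the output could be saved and queried like this
# (log file name is an arbitrary choice):
#   mpiexec -n 144 ./bt.D.144 | tee bt.D.144.ref.log
#   npb_field "Time in seconds" < bt.D.144.ref.log
```

For the sample input, the two calls print 216.00 and SUCCESSFUL, the values worth recording alongside the reference run.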




Scalasca    Copyright © 1998–2021 Forschungszentrum Jülich GmbH, Jülich Supercomputing Centre
Copyright © 2009–2015 German Research School for Simulation Sciences GmbH, Laboratory for Parallel Programming