Trace collection and analysis

While summary profiles only provide process- or thread-local data aggregated over time, event traces contain detailed time-stamped event data that also makes it possible to reconstruct the dynamic behavior of an application. This enables tools such as the Scalasca trace analyzer to provide even deeper insights into the performance behavior of an application, for example, whether the time spent in MPI communication is real message processing time or is lost in significant wait states (i.e., intervals where a process sits idle without doing useful work, waiting for data from other processes to arrive).

Trace collection and subsequent automatic analysis by the Scalasca trace analyzer can be enabled using the -t option of scalasca -analyze. Since this option enables trace collection in addition to collecting a summary measurement, it is often used in conjunction with the -q option which turns off measurement entirely. (Note that the order in which these two options are specified matters: First turn off measurement using -q, then enable tracing with -t.)
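
To make the effect of the option order concrete, the sketch below contrasts the two orderings for the example measurement used throughout this section; the second invocation is shown only as a counter-example and follows from the option semantics described above:

  # Trace-only measurement: -q first disables everything, -t then re-enables tracing
  $ scalasca -analyze -q -t -f npb-bt.filt mpiexec -n 144 ./bt.D.144

  # Reversed order: -q would switch off the just-enabled tracing again,
  # leaving measurement disabled entirely
  $ scalasca -analyze -t -q -f npb-bt.filt mpiexec -n 144 ./bt.D.144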

Attention
Do not forget to specify an appropriate measurement configuration (i.e., a filtering file and SCOREP_TOTAL_MEMORY setting)! Otherwise, you may easily fill up your disks and suffer from uncoordinated intermediate trace buffer flushes, which typically render such traces of little (or no) value!

For our example measurement, scoring of the initial summary report in Section Optimizing the measurement configuration with the filter applied estimated a total memory requirement of 30 MiB per process (which can be verified by re-scoring the filtered summary measurement). As this exceeds the default SCOREP_TOTAL_MEMORY setting of 16 MiB, the prepared filtering file alone is not yet sufficient to avoid intermediate trace buffer flushes. In addition, the SCOREP_TOTAL_MEMORY setting has to be adjusted accordingly before starting the trace collection and analysis. For the example measurement shown below, a slightly larger memory buffer of 32 MiB is used, although this is not strictly necessary. As an alternative, the filtering file could be extended to exclude additional routines from measurement (e.g., binvrhs) and thus further reduce the trace buffer requirements, at the expense of losing detail.
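
As a sketch of this alternative, the filtering file could be extended with one more exclude rule using the Score-P region-name filter syntax; the comment stands in for the rules already present in npb-bt.filt, and the exact region name to use can be checked with scorep-score:

  SCOREP_REGION_NAMES_BEGIN
    EXCLUDE
      # ... rules already present in npb-bt.filt ...
      binvrhs
  SCOREP_REGION_NAMES_END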

With this setting in place, a trace measurement can now be collected and subsequently analyzed by the Scalasca trace analyzer. Note that renaming or removing the summary experiment directory is not necessary, as trace experiments are created with the suffix "_trace".

  $ export SCOREP_TOTAL_MEMORY=32M
  $ scalasca -analyze -q -t -f npb-bt.filt mpiexec -n 144 ./bt.D.144
  S=C=A=N: Scalasca 2.5 trace collection and analysis
  S=C=A=N: ./scorep_bt_144_trace experiment archive
  S=C=A=N: Mon Mar 18 13:57:08 2019: Collect start
  mpiexec -n 144 ./bt.D.144


   NAS Parallel Benchmarks 3.3 -- BT Benchmark

   No input file inputbt.data. Using compiled defaults
   Size:  408x 408x 408
   Iterations:  250    dt:   0.0000200
   Number of active processes:   144

   Time step    1
   Time step   20
   Time step   40
   Time step   60
   Time step   80
   Time step  100
   Time step  120
   Time step  140
   Time step  160
   Time step  180
   Time step  200
   Time step  220
   Time step  240
   Time step  250
   Verification being performed for class D
   accuracy setting for epsilon =  0.1000000000000E-07
   Comparison of RMS-norms of residual
             1 0.2533188551738E+05 0.2533188551738E+05 0.1499315900507E-12
             2 0.2346393716980E+04 0.2346393716980E+04 0.8546885387975E-13
             3 0.6294554366904E+04 0.6294554366904E+04 0.2745293523008E-14
             4 0.5352565376030E+04 0.5352565376030E+04 0.8376934357159E-13
             5 0.3905864038618E+05 0.3905864038618E+05 0.6650300273080E-13
   Comparison of RMS-norms of solution error
             1 0.3100009377557E+03 0.3100009377557E+03 0.1373406191445E-12
             2 0.2424086324913E+02 0.2424086324913E+02 0.1600422929406E-12
             3 0.7782212022645E+02 0.7782212022645E+02 0.4090394153928E-13
             4 0.6835623860116E+02 0.6835623860116E+02 0.3596566920650E-13
             5 0.6065737200368E+03 0.6065737200368E+03 0.2605201960010E-13
   Verification Successful


   BT Benchmark Completed.
   Class           =                        D
   Size            =            408x 408x 408
   Iterations      =                      250
   Time in seconds =                   227.78
   Total processes =                      144
   Compiled procs  =                      144
   Mop/s total     =                256110.17
   Mop/s/process   =                  1778.54
   Operation type  =           floating point
   Verification    =               SUCCESSFUL
   Version         =                    3.3.1
   Compile date    =              18 Mar 2019

   Compile options:
      MPIF77       = scorep mpifort
      FLINK        = $(MPIF77)
      FMPI_LIB     = (none)
      FMPI_INC     = (none)
      FFLAGS       = -O2
      FLINKFLAGS   = -O2
      RAND         = (none)


   Please send feedbacks and/or the results of this run to:

   NPB Development Team
   Internet: npb@nas.nasa.gov


  S=C=A=N: Mon Mar 18 14:01:04 2019: Collect done (status=0) 236s
  S=C=A=N: Mon Mar 18 14:01:04 2019: Analyze start
  mpiexec -n 144 scout.mpi ./scorep_bt_144_trace/traces.otf2
  SCOUT   (Scalasca 2.5)
  Copyright (c) 1998-2019 Forschungszentrum Juelich GmbH
  Copyright (c) 2009-2014 German Research School for Simulation Sciences GmbH

  Analyzing experiment archive ./scorep_bt_144_trace/traces.otf2

  Opening experiment archive ... done (0.008s).
  Reading definition data    ... done (0.017s).
  Reading event trace data   ... done (0.280s).
  Preprocessing              ... done (0.366s).
  Analyzing trace data       ... done (2.511s).
  Writing analysis report    ... done (0.104s).

  Max. memory usage         : 170.812MB
  *** WARNING ***
  40472 clock condition violations detected:
     Point-to-point:    40472
     Collective    :        0
  This usually leads to inconsistent analysis results.

  Try running the analyzer using the '--time-correct' command-line
  option to apply a correction algorithm.


  Total processing time     : 3.383s
  S=C=A=N: Mon Mar 18 14:01:09 2019: Analyze done (status=0) 5s
  Warning: 3.686GB of analyzed trace data retained in ./scorep_bt_144_trace/traces!
  S=C=A=N: ./scorep_bt_144_trace complete.

  $ ls scorep_bt_144_trace
  MANIFEST.md  scorep.filter  scout.cubex  trace.cubex  traces.def   trace.stat
  scorep.cfg   scorep.log     scout.log    traces       traces.otf2

As can be seen, the scalasca -analyze convenience command automatically initiates the trace analysis after successful trace collection. Besides the already known files MANIFEST.md, scorep.cfg, scorep.filter, and scorep.log, the generated experiment directory scorep_bt_144_trace now also contains artifacts related to trace measurement and analysis, as shown in the directory listing above: most notably the event trace data itself (the traces/ subdirectory together with the traces.def and traces.otf2 files) as well as the trace analyzer's report, statistics, and log files (scout.cubex, trace.stat, scout.log).

The Scalasca trace analyzer also warned about a number of point-to-point clock condition violations it detected. A clock condition violation is a violation of the logical event order that can occur when the local clocks of the individual compute nodes are insufficiently synchronized. For example, based on the measured timestamps, a receive operation may appear to have finished before the corresponding send operation started—something that is obviously not possible. The Scalasca trace analyzer includes a correction algorithm [1] that can be applied in such cases to restore the logical event order, while trying to preserve the length of intervals between local events in the vicinity of the violation.

To enable this correction algorithm, the --time-correct command-line option has to be passed to the Scalasca trace analyzer. However, since the analyzer is implicitly started through the scalasca -analyze command, this option has to be set using the SCAN_ANALYZE_OPTS environment variable, which holds command-line options that scalasca -analyze should forward to the trace analyzer. Instead of collecting and analyzing a new experiment, an existing trace measurement can also be re-analyzed using the -a option of the scalasca -analyze command:

  $ export SCAN_ANALYZE_OPTS="--time-correct"
  $ scalasca -analyze -a mpiexec -n 144 ./bt.D.144
  S=C=A=N: Scalasca 2.5 trace analysis
  S=C=A=N: ./scorep_bt_144_trace experiment archive
  S=C=A=N: Mon Mar 18 14:02:57 2019: Analyze start
  mpiexec -n 144 scout.mpi --time-correct ./scorep_bt_144_trace/traces.otf2
  SCOUT   (Scalasca 2.5)
  Copyright (c) 1998-2019 Forschungszentrum Juelich GmbH
  Copyright (c) 2009-2014 German Research School for Simulation Sciences GmbH

  Analyzing experiment archive ./scorep_bt_144_trace/traces.otf2

  Opening experiment archive ... done (0.013s).
  Reading definition data    ... done (0.023s).
  Reading event trace data   ... done (0.615s).
  Preprocessing              ... done (0.478s).
  Timestamp correction       ... done (0.463s).
  Analyzing trace data       ... done (2.653s).
  Writing analysis report    ... done (0.106s).

  Max. memory usage         : 174.000MB

  	# passes        : 1
  	# violated      : 9450
  	# corrected     : 23578735
  	# reversed-p2p  : 9450
  	# reversed-coll : 0
  	# reversed-omp  : 0
  	# events        : 301330922
  	max. error      : 0.000048 [s]
  	error at final. : 0.000000 [%]
  	Max slope       : 0.394363334

  Total processing time     : 4.448s
  S=C=A=N: Mon Mar 18 14:03:07 2019: Analyze done (status=0) 10s
  Warning: 3.686GB of analyzed trace data retained in ./scorep_bt_144_trace/traces!
  S=C=A=N: ./scorep_bt_144_trace complete.
Note
The additional time required to execute the timestamp correction algorithm is typically small compared to the trace data I/O time and waiting times in the batch queue for starting a second analysis job. On platforms where clock condition violations are likely to occur (i.e., clusters), it is therefore often convenient to enable the timestamp correction algorithm by default.
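
One way to achieve this (a sketch; adapt to the shell and batch system in use) is to set the corresponding environment variable once in the shell startup file or at the top of the job script, so that every subsequent scalasca -analyze run picks it up:

  # e.g., in ~/.bashrc or in the batch job script
  export SCAN_ANALYZE_OPTS="--time-correct"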

Similar to the summary report, the trace analysis report can finally be post-processed and interactively explored using the Cube report browser, again using the scalasca -examine convenience command:

  $ scalasca -examine scorep_bt_144_trace
  INFO: Post-processing trace analysis report (scout.cubex)...
  INFO: Displaying ./scorep_bt_144_trace/trace.cubex...

The report generated by the Scalasca trace analyzer is again a profile in CUBE4 format, however, enriched with additional performance properties. Examination shows that about two-thirds of the time spent in MPI point-to-point communication is waiting time, split into roughly 60% in Late Sender and 40% in Late Receiver wait states (see Figure TraceAnalysis). While the execution time in the solve_cells routines looked relatively balanced in the summary profile, examination of the Critical path imbalance metric shows that these routines in fact exhibit a small amount of imbalance, which is likely to cause the wait states at the next synchronization point. This can be verified using the Late Sender delay costs metric, which confirms that the solve_cells as well as the y_backsubstitute and z_backsubstitute routines are responsible for almost all of the Late Sender wait states. Likewise, the Late Receiver delay costs metric shows that the majority of the Late Receiver wait states are caused by the solve_cells routines as well as the MPI_Wait calls in the solver routines, where the latter indicates a communication imbalance or an inefficient communication pattern.

Figure "TraceAnalysis" (start_trace.png): Screenshot of a trace analysis result in the Cube report browser.

