Scalasca (Scalasca 2.6, revision 748ac9e9)
Scalable Performance Analysis of Large-Scale Applications
Optimizing the measurement configuration

To avoid drawing wrong conclusions based on skewed performance data due to excessive measurement overhead, it is often necessary to optimize the measurement configuration before conducting any additional experiments. This can be achieved in various ways, for example, using runtime filtering, selective recording, or manual instrumentation controlling the measurement. Please consult the Score-P Manual [17] for details on the available options. However, in many cases it is already sufficient to filter a small number of frequently executed but computationally inexpensive user functions to reduce the measurement overhead to an acceptable level. In this context, filtering means that those functions are still executed, but no measurements are taken and recorded for them. Filtered functions therefore no longer show up in the measurement report, and the execution time spent in them is attributed to the parent function from which they are called (similar to what happens when the compiler inlines them). The routines to be filtered have to be selected with care, though, as filtering affects the granularity of the measurement, and overly aggressive filtering might "blur" the location of important hotspots.
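For illustration, runtime filtering is typically activated by pointing Score-P to a filter file before the instrumented application is launched; the file name below is only a placeholder, and the relevant variable and option names should be double-checked against the Score-P and Scalasca documentation:

  $ export SCOREP_FILTERING_FILE=my.filt    # placeholder name; filter rules are developed below
  $ # ...launch the instrumented application as usual, or pass the file directly:
  $ # scalasca -analyze -f my.filt <application launch command>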

To assist in identifying candidate functions for runtime filtering, the initial summary report can be scored using the -s option of the scalasca -examine command:

  $ scalasca -examine -s scorep_bt_144_sum
  INFO: Post-processing runtime summarization report (profile.cubex)...
  scorep-score -r ./scorep_bt_144_sum/profile.cubex \
      > ./scorep_bt_144_sum/scorep.score
  INFO: Score report written to ./scorep_bt_144_sum/scorep.score

  $ head -n 25 scorep_bt_144_sum/scorep.score

  Estimated aggregate size of event trace:                   3701GB
  Estimated requirements for largest trace buffer (max_buf): 26GB
  Estimated memory requirements (SCOREP_TOTAL_MEMORY):       26GB
  (warning: The memory requirements cannot be satisfied by Score-P to avoid
   intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
   maximum supported memory or reduce requirements using USR regions filters.)

  flt   type     max_buf[B]          visits  time[s] time[%] time/     region
                                                             visit[us]
         ALL 27,597,828,625 152,806,258,921 60341.59   100.0      0.39 ALL
         USR 27,592,847,542 152,791,289,689 50827.11    84.2      0.33 USR
         MPI      4,086,824      10,016,496  9177.48    15.2    916.24 MPI
         COM        894,218       4,952,592   336.80     0.6     68.01 COM
      SCOREP             41             144     0.19     0.0   1305.21 SCOREP

         USR  9,123,406,734  50,517,453,756  4716.33     7.8      0.09 matvec_sub_
         USR  9,123,406,734  50,517,453,756  8774.13    14.5      0.17 binvcrhs_
         USR  9,123,406,734  50,517,453,756  6520.04    10.8      0.13 matmul_sub_
         USR    200,157,360   1,108,440,240    89.94     0.1      0.08 exact_solution_
         USR     22,632,168     124,121,508    13.43     0.0      0.11 binvrhs_
         MPI      1,608,942       2,603,232     9.57     0.0      3.68 MPI_Irecv
         MPI      1,608,942       2,603,232    14.83     0.0      5.70 MPI_Isend
         MPI        861,432       4,771,008  7936.43    13.2   1663.47 MPI_Wait
         USR        234,936       1,301,184     3.24     0.0      2.49 lhsabinit_
         USR         78,312         433,728  6213.68    10.3  14326.21 x_solve_cell_

As can be seen from the top of the score report, the estimated size for an event trace measurement without filtering applied is approximately 3.7 TiB, with the process-local maximum across all ranks being roughly 26 GiB. Considering the 128 GiB of main memory available on JURECA's compute nodes, the 24 MPI ranks per node, and the fact that Score-P's internal memory buffer is limited to 4 GiB per process, a tracing experiment with this configuration is clearly prohibitive if disruptive intermediate trace buffer flushes are to be avoided.
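A quick back-of-the-envelope calculation using the figures from the score report makes this concrete:

  26 GiB per process  x  24 processes per node  =  624 GiB of trace buffers per node
  624 GiB per node  >>  128 GiB of available node memory
  (and 26 GiB per process far exceeds the 4 GiB Score-P buffer limit)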

The next part of the score output is a table showing how the trace memory requirements of a single process (column max_buf) as well as the overall number of visits and CPU allocation time are distributed among certain function groups. For traces that can be handled by the Scalasca Trace Tools, the most relevant groups appearing in this pure-MPI measurement are:

- MPI: pure MPI library functions
- COM: user functions/regions that appear on a call path to MPI functions and thus provide the calling context of communication
- USR: user functions/regions involved in purely local computation only

(The SCOREP group accounts for activities of the measurement system itself.)

The detailed breakdown by region below the summary provides a classification according to these function groups (column type) for each region found in the summary report. Investigation of this part of the score report reveals that most of the trace data would be generated by roughly 50 billion calls to each of the three routines matvec_sub, binvcrhs and matmul_sub, which are classified as USR. Although the percentage of time spent in these routines suggests at first glance that they are important, the average time per visit is at most 170 nanoseconds (column time/visit). That is, the relative measurement overhead for these functions is substantial, and a significant fraction of the reported time is very likely spent in the Score-P measurement system rather than in the application itself. These routines therefore constitute good candidates for being filtered (just as they are good candidates for being inlined by the compiler). If the exact_solution routine, which generates roughly 200 MB of event data on a single rank while contributing very little to the runtime, is selected as well, a reasonable Score-P filter file looks like this:

  SCOREP_REGION_NAMES_BEGIN
      EXCLUDE
          binvcrhs_
          matvec_sub_
          matmul_sub_
          exact_solution_
  SCOREP_REGION_NAMES_END

Please refer to the Score-P User Manual [17] for a detailed description of the filter file format, including how to filter based on file names, how to define (and combine) blacklists and whitelists, and how to use wildcards for convenience. Also note that the runtime filtering approach used in this example only affects routines in the USR and COM groups. Measurements for other groups can, to a certain degree, be controlled by other means, as the generated events have to meet various consistency requirements.
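As an illustration of the wildcard support, a variant of the filter above that does not depend on compiler-specific name mangling (here the trailing underscore appended to the Fortran symbols) might look as follows; the region names actually recorded should always be taken from the score report of the measurement at hand:

  SCOREP_REGION_NAMES_BEGIN
      EXCLUDE
          binvcrhs*
          matvec_sub*
          matmul_sub*
          exact_solution*
  SCOREP_REGION_NAMES_END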

The effectiveness of this filter, in terms of the amount of trace data generated, can be examined by scoring the initial summary report again, this time also specifying the filter file via the -f option of the scalasca -examine command. In this way a filter file can be developed incrementally, avoiding the need to conduct a new measurement for each change in order to investigate the effect of filtering individual functions step by step.

  $ scalasca -examine -s -f npb-bt.filt scorep_bt_144_sum
  scorep-score -f npb-bt.filt -r ./scorep_bt_144_sum/profile.cubex \
      > ./scorep_bt_144_sum/scorep.score_npb-bt.filt
  INFO: Score report written to ./scorep_bt_144_sum/scorep.score_npb-bt.filt

  $ head -n 30 scorep_bt_144_sum/scorep.score_npb-bt.filt

  Estimated aggregate size of event trace:                   3920MB
  Estimated requirements for largest trace buffer (max_buf): 28MB
  Estimated memory requirements (SCOREP_TOTAL_MEMORY):       30MB
  (hint: When tracing set SCOREP_TOTAL_MEMORY=30MB to avoid intermediate flushes
   or reduce requirements using USR regions filters.)

  flt   type     max_buf[B]          visits  time[s] time[%] time/     region
                                                             visit[us]
   -     ALL 27,597,828,625 152,806,258,921 60341.59   100.0      0.39 ALL
   -     USR 27,592,847,542 152,791,289,689 50827.11    84.2      0.33 USR
   -     MPI      4,086,824      10,016,496  9177.48    15.2    916.24 MPI
   -     COM        894,218       4,952,592   336.80     0.6     68.01 COM
   -  SCOREP             41             144     0.19     0.0   1305.21 SCOREP

   *     ALL     28,762,789     145,457,413 40241.14    66.7    276.65 ALL-FLT
   +     FLT 27,570,377,562 152,660,801,508 20100.44    33.3      0.13 FLT
   *     USR     23,781,706     130,488,181 30726.67    50.9    235.47 USR-FLT
   -     MPI      4,086,824      10,016,496  9177.48    15.2    916.24 MPI-FLT
   *     COM        894,218       4,952,592   336.80     0.6     68.01 COM-FLT
   -  SCOREP             41             144     0.19     0.0   1305.21 SCOREP-FLT

   +     USR  9,123,406,734  50,517,453,756  4716.33     7.8      0.09 matvec_sub_
   +     USR  9,123,406,734  50,517,453,756  8774.13    14.5      0.17 binvcrhs_
   +     USR  9,123,406,734  50,517,453,756  6520.04    10.8      0.13 matmul_sub_
   +     USR    200,157,360   1,108,440,240    89.94     0.1      0.08 exact_solution_
   -     USR     22,632,168     124,121,508    13.43     0.0      0.11 binvrhs_
   -     MPI      1,608,942       2,603,232     9.57     0.0      3.68 MPI_Irecv
   -     MPI      1,608,942       2,603,232    14.83     0.0      5.70 MPI_Isend
   -     MPI        861,432       4,771,008  7936.43    13.2   1663.47 MPI_Wait
   -     USR        234,936       1,301,184     3.24     0.0      2.49 lhsabinit_

Below the (original) function group summary, the score report now also includes a second summary with the filter applied. Here, an additional group FLT is added, which subsumes all filtered regions. Moreover, the column flt indicates whether a region/function group is filtered ("+"), not filtered ("-"), or possibly partially filtered ("*", only used for function groups).

As expected, the estimate for the aggregate event trace size drops to 3.9 GiB, and the process-local maximum across all ranks is reduced to 28 MiB. Since the Score-P measurement system also creates a number of internal data structures (e.g., to track MPI requests and communicators), the report suggests setting the SCOREP_TOTAL_MEMORY environment variable, which controls the maximum amount of memory used by the Score-P memory management, to 30 MiB when tracing is configured (see Section Trace collection and analysis).
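For a subsequent trace experiment, these two pieces would typically be combined, for example along the following lines (a sketch only; the launch command is a placeholder, and the exact procedure is covered in Section Trace collection and analysis):

  $ export SCOREP_TOTAL_MEMORY=30MB
  $ scalasca -analyze -t -f npb-bt.filt <application launch command>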


