Scalasca
(Scalasca 2.6.2-dev, revision 0037b52e)
Scalable Performance Analysis of Large-Scale Applications
To avoid drawing wrong conclusions based on skewed performance data due to excessive measurement overhead, it is often necessary to optimize the measurement configuration before conducting any additional experiments. This can be achieved in various ways, for example, using runtime filtering, selective recording, or manual instrumentation controlling measurement. Please consult the Score-P Manual [17] for details on the available options. However, in many cases it is already sufficient to filter a small number of frequently executed but computationally inexpensive user functions to reduce the measurement overhead to an acceptable level. In this context, filtering means that those functions are still executed, but no measurements are taken and recorded for them. Therefore, filtered functions no longer show up in the measurement report, and the associated execution time is attributed to the parent function from which they are called (similar to inlining performed by the compiler). The selection of the routines to be filtered has to be done with care, though, as it affects the granularity of the measurement and too aggressive filtering might "blur" the location of important hotspots.
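Once a filter file has been written (an example is developed later in this section), it can be applied at measurement time without re-instrumenting or re-linking the application, for instance via Score-P's SCOREP_FILTERING_FILE environment variable. A sketch of such a launch; the launcher and executable name are placeholders and have to be adapted to the actual job setup:

    export SCOREP_FILTERING_FILE=npb-bt.filt   # filter file to apply at run time
    srun -n 144 ./bt.D.144                     # hypothetical launcher and binary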
To assist in identifying candidate functions for runtime filtering, the initial summary report can be scored using the -s option of the scalasca -examine command:
    $ scalasca -examine -s scorep_bt_144_sum
    INFO: Post-processing runtime summarization report (profile.cubex)...
    scorep-score -r ./scorep_bt_144_sum/profile.cubex \
        > ./scorep_bt_144_sum/scorep.score
    INFO: Score report written to ./scorep_bt_144_sum/scorep.score

    $ head -n 25 scorep_bt_144_sum/scorep.score
    Estimated aggregate size of event trace:                   3701GB
    Estimated requirements for largest trace buffer (max_buf): 26GB
    Estimated memory requirements (SCOREP_TOTAL_MEMORY):       26GB
    (warning: The memory requirements cannot be satisfied by Score-P to avoid
     intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
     maximum supported memory or reduce requirements using USR regions filters.)

    flt     type     max_buf[B]          visits  time[s] time[%] time/    region
                                                                 visit[us]
            ALL  27,597,828,625 152,806,258,921 60341.59  100.0     0.39  ALL
            USR  27,592,847,542 152,791,289,689 50827.11   84.2     0.33  USR
            MPI       4,086,824      10,016,496  9177.48   15.2   916.24  MPI
            COM         894,218       4,952,592   336.80    0.6    68.01  COM
         SCOREP              41             144     0.19    0.0  1305.21  SCOREP

            USR   9,123,406,734  50,517,453,756  4716.33    7.8     0.09  matvec_sub_
            USR   9,123,406,734  50,517,453,756  8774.13   14.5     0.17  binvcrhs_
            USR   9,123,406,734  50,517,453,756  6520.04   10.8     0.13  matmul_sub_
            USR     200,157,360   1,108,440,240    89.94    0.1     0.08  exact_solution_
            USR      22,632,168     124,121,508    13.43    0.0     0.11  binvrhs_
            MPI       1,608,942       2,603,232     9.57    0.0     3.68  MPI_Irecv
            MPI       1,608,942       2,603,232    14.83    0.0     5.70  MPI_Isend
            MPI         861,432       4,771,008  7936.43   13.2  1663.47  MPI_Wait
            USR         234,936       1,301,184     3.24    0.0     2.49  lhsabinit_
            USR          78,312         433,728  6213.68   10.3 14326.21  x_solve_cell_
As can be seen from the top of the score report, the estimated size for an event trace measurement without filtering applied is approximately 3.7 TiB, with the process-local maximum across all ranks being roughly 26 GiB. Considering the 128 GiB of main memory available on JURECA's compute nodes, the 24 MPI ranks per node, and the fact that Score-P's internal memory buffer is limited to 4 GiB per process, a tracing experiment with this configuration is clearly prohibitive if disruptive intermediate trace buffer flushes are to be avoided.
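These figures can be cross-checked with a quick back-of-the-envelope calculation; the following Python sketch uses only the max_buf value from the score report and the rank and node parameters quoted above:

```python
# Values taken from the score report and the text above.
max_buf_bytes = 27_597_828_625   # largest per-process trace buffer (ALL row)
ranks = 144                      # MPI ranks in this run
ranks_per_node = 24              # JURECA: MPI ranks per compute node
node_memory_gib = 128            # JURECA: main memory per compute node
GiB = 2 ** 30

# Aggregate trace size if every rank needed a buffer of roughly this size:
print(f"aggregate: {max_buf_bytes * ranks / GiB:.0f} GiB")  # ~3701 GiB = ~3.7 TiB

# Per-node buffer demand versus available main memory:
demand_gib = ranks_per_node * max_buf_bytes / GiB
print(f"per node: {demand_gib:.0f} GiB needed, {node_memory_gib} GiB available")
```

The per-node demand exceeds the installed memory several times over, which is why the warning in the score report cannot be resolved by simply raising SCOREP_TOTAL_MEMORY.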
The next section of the score output provides a table which shows how the trace memory requirements of a single process (column max_buf) as well as the overall number of visits and CPU allocation time are distributed among certain function groups. For traces that can be handled by the Scalasca Trace Tools, the most relevant groups are:
- MPI: MPI API functions.
- OMP: OpenMP constructs and API functions.
- PTHREAD: POSIX threads API functions.
- COM: User functions/regions that appear on a call path to a parallelization API call or construct (MPI/OpenMP/POSIX threads). These functions provide the context of parallelization API usage and should therefore only be filtered with care.
- USR: User functions/regions that do not appear on a call path to a parallelization API call or construct (MPI/OpenMP/POSIX threads).
- SCOREP: Artificial regions generated by the Score-P measurement system.

The detailed breakdown by region below the summary provides a classification according to these function groups (column type) for each region found in the summary report. Investigation of this part of the score report reveals that most of the trace data would be generated by about 50 billion calls to each of the three routines matvec_sub, binvcrhs, and matmul_sub, all of which are classified as USR. Although the percentage of time spent in these routines at first glance suggests that they are important, the average time per visit is below 170 nanoseconds (column time/visit). That is, the relative measurement overhead for these functions is substantial, and a significant amount of the reported time is therefore very likely spent in the Score-P measurement system rather than in the application itself. These routines thus constitute good candidates for being filtered (just as they are good candidates for being inlined by the compiler). Additionally selecting the exact_solution routine, which generates about 200 MB of event data on a single rank with very little runtime impact, a reasonable Score-P filtering file would look like this:
    SCOREP_REGION_NAMES_BEGIN
      EXCLUDE
        binvcrhs_
        matvec_sub_
        matmul_sub_
        exact_solution_
    SCOREP_REGION_NAMES_END
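The time/visit figures that make these routines filter candidates can be re-derived from the score table; a small Python cross-check using the values quoted above:

```python
# time[s] and visits columns taken from the score report above.
regions = {
    "matvec_sub_": (4716.33, 50_517_453_756),
    "binvcrhs_":   (8774.13, 50_517_453_756),
    "matmul_sub_": (6520.04, 50_517_453_756),
    "MPI_Wait":    (7936.43, 4_771_008),
}
for name, (time_s, visits) in regions.items():
    # Average cost per call in microseconds, as in the time/visit[us] column.
    print(f"{name:12s} {time_s / visits * 1e6:8.2f} us/visit")
```

The USR candidates come out at well under a microsecond per call, while a genuinely expensive region such as MPI_Wait costs over a millisecond per call and is left unfiltered.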
Please refer to the Score-P User Manual [17] for a detailed description of the filter file format, how to filter based on file names, how to define (and combine) blacklists and whitelists, and how to use wildcards for convenience. Also note that the runtime filtering approach used in this example only affects routines in the USR and COM groups. Measurements for other groups can, to a certain degree, be controlled by other means, as the generated events have to meet various consistency requirements.
The effectiveness of this filter, in terms of generated trace data, can be examined by scoring the initial summary report again, this time also specifying the filter file using the -f option of the scalasca -examine command. This way a filter file can be developed incrementally, avoiding the need to conduct many measurements to investigate the effect of filtering individual functions step by step.
    $ scalasca -examine -s -f npb-bt.filt scorep_bt_144_sum
    scorep-score -f npb-bt.filt -r ./scorep_bt_144_sum/profile.cubex \
        > ./scorep_bt_144_sum/scorep.score_npb-bt.filt
    INFO: Score report written to ./scorep_bt_144_sum/scorep.score_npb-bt.filt

    $ head -n 25 scorep_bt_144_sum/scorep.score_npb-bt.filt
    Estimated aggregate size of event trace:                   3920MB
    Estimated requirements for largest trace buffer (max_buf): 28MB
    Estimated memory requirements (SCOREP_TOTAL_MEMORY):       30MB
    (hint: When tracing set SCOREP_TOTAL_MEMORY=30MB to avoid intermediate
     flushes or reduce requirements using USR regions filters.)

    flt     type     max_buf[B]          visits  time[s] time[%] time/    region
                                                                 visit[us]
     -      ALL  27,597,828,625 152,806,258,921 60341.59  100.0     0.39  ALL
     -      USR  27,592,847,542 152,791,289,689 50827.11   84.2     0.33  USR
     -      MPI       4,086,824      10,016,496  9177.48   15.2   916.24  MPI
     -      COM         894,218       4,952,592   336.80    0.6    68.01  COM
     -   SCOREP              41             144     0.19    0.0  1305.21  SCOREP

     *      ALL      28,762,789     145,457,413 40241.14   66.7   276.65  ALL-FLT
     +      FLT  27,570,377,562 152,660,801,508 20100.44   33.3     0.13  FLT
     *      USR      23,781,706     130,488,181 30726.67   50.9   235.47  USR-FLT
     -      MPI       4,086,824      10,016,496  9177.48   15.2   916.24  MPI-FLT
     *      COM         894,218       4,952,592   336.80    0.6    68.01  COM-FLT
     -   SCOREP              41             144     0.19    0.0  1305.21  SCOREP-FLT

     +      USR   9,123,406,734  50,517,453,756  4716.33    7.8     0.09  matvec_sub_
     +      USR   9,123,406,734  50,517,453,756  8774.13   14.5     0.17  binvcrhs_
     +      USR   9,123,406,734  50,517,453,756  6520.04   10.8     0.13  matmul_sub_
     +      USR     200,157,360   1,108,440,240    89.94    0.1     0.08  exact_solution_
     -      USR      22,632,168     124,121,508    13.43    0.0     0.11  binvrhs_
     -      MPI       1,608,942       2,603,232     9.57    0.0     3.68  MPI_Irecv
     -      MPI       1,608,942       2,603,232    14.83    0.0     5.70  MPI_Isend
     -      MPI         861,432       4,771,008  7936.43   13.2  1663.47  MPI_Wait
     -      USR         234,936       1,301,184     3.24    0.0     2.49  lhsabinit_
Below the (original) function group summary, the score report now also includes a second summary with the filter applied. Here, an additional group FLT is added, which subsumes all filtered regions. Moreover, the column flt indicates whether a region/function group is filtered ("+"), not filtered ("-"), or possibly partially filtered ("*", only used for function groups).
As expected, the estimate for the aggregate event trace size drops to 3.9 GB, and the process-local maximum across all ranks is reduced to 28 MB. Since the Score-P measurement system also creates a number of internal data structures (e.g., to track MPI requests and communicators), the suggested setting for the SCOREP_TOTAL_MEMORY environment variable, which adjusts the maximum amount of memory used by the Score-P memory management, is 30 MB when tracing is configured (see Section Trace collection and analysis).
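Putting these findings together, the subsequent trace experiment could then be configured roughly as follows. This is a job-script sketch: the launcher and executable name are assumptions, while SCOREP_TOTAL_MEMORY and the scalasca -analyze options -t (enable trace collection and analysis) and -f (apply a filter file) are as documented for Score-P and Scalasca:

    export SCOREP_TOTAL_MEMORY=30MB              # suggested by the score report
    scalasca -analyze -t -f npb-bt.filt \
        srun -n 144 ./bt.D.144                   # hypothetical launcher/binary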
Copyright © 1998–2022 Forschungszentrum Jülich GmbH,
Jülich Supercomputing Centre
Copyright © 2009–2015 German Research School for Simulation Sciences GmbH, Laboratory for Parallel Programming