This chapter demonstrates a typical performance analysis workflow using Score-P. It consists of the following steps: instrumenting the application, collecting a summary measurement, examining the summary report, scoring the measurement and defining a filter to reduce overhead, repeating the summary collection with the tuned configuration, and finally collecting and examining an event trace.
The workflow is demonstrated using the NPB BT-MZ benchmark as an example. BT-MZ solves a discretized version of the unsteady, compressible Navier-Stokes equations in three spatial dimensions. It performs 200 time steps on a regular 3-dimensional grid using ADI and verifies that the solution error is within acceptable limits. It parallelizes intra-zone computation with OpenMP and inter-zone communication with MPI. The benchmark can be built for a predefined problem class (S, W, A, B, C, D, E, F) and any number of MPI processes and OpenMP threads.
An NPB BT-MZ distribution already prepared for this example is available for download.
In order to collect performance measurements, BT-MZ has to be instrumented. There are various types of instrumentation supported by Score-P which cover a broad spectrum of performance analysis use cases (see Chapter 'Application Instrumentation' for more details).
In this example we start with automatic compiler instrumentation by prepending the compiler/linker specification variable MPIF77, found in config/make.def, with scorep:
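The modified line might look as follows (assuming the default MPI Fortran wrapper mpif77; adjust to the compiler wrapper used on your system):

  MPIF77 = scorep mpif77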
After the makefile is modified and saved, it is recommended to return to the root folder of the application and clean up previously built files:
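For example, assuming the standard NPB build system with its clean target:

  % make clean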
Now the application is ready to be instrumented by simply issuing the standard build command:
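For example, building problem class W for 4 MPI processes, matching the binary name used below (other classes and process counts work the same way):

  % make bt-mz CLASS=W NPROCS=4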
After the command is issued, make should produce output similar to the one below:
When make finishes, the built and instrumented application can be found under bin.scorep/bt-mz_W.4.
The instrumented BT-MZ is now ready to be executed and profiled by Score-P at the same time. However, before doing so, one has an opportunity to configure the Score-P measurement by setting Score-P environment variables. For the complete list of variables please refer to Appendix 'Score-P Measurement Configuration'.
The typical Score-P performance analysis workflow collects summary performance information first, followed by detailed performance exploration using execution traces. Therefore, Score-P has to be configured to perform profiling while tracing is disabled. This is done by setting the corresponding environment variables:
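For example (SCOREP_ENABLE_PROFILING and SCOREP_ENABLE_TRACING are the standard Score-P switches; a Bourne-compatible shell is assumed):

  % export SCOREP_ENABLE_PROFILING=true
  % export SCOREP_ENABLE_TRACING=false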
Performance data collected by Score-P will be stored in an experiment directory. In order to efficiently manage multiple experiments, one can specify a meaningful name for the experiment directory by setting the SCOREP_EXPERIMENT_DIRECTORY environment variable:
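For example, using a name that records the benchmark, problem class, and process/thread configuration (this is the directory name referenced later in this chapter):

  % export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_sum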
After Score-P is prepared for summary collection, the instrumented application can be started as usual:
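A sketch of a possible launch, assuming 4 MPI ranks with 4 OpenMP threads each and an mpirun-style launcher (use the launcher appropriate for your system):

  % cd bin.scorep
  % export OMP_NUM_THREADS=4
  % mpirun -np 4 ./bt-mz_W.4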
The BT-MZ output should look similar to the listing below:
After application execution is finished, the summary performance data collected by Score-P is stored in the experiment directory bin.scorep/scorep_bt-mz_W_4x4_sum. The directory contains the following files:
scorep.cfg - a record of the measurement configuration used in the run
profile.cubex - the analysis report that was collated after measurement
The summary performance data measured by Score-P can then be investigated with the CUBE or ParaProf interactive report exploration tools.
CUBE:
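For example, assuming the CUBE GUI is installed and available as the cube command:

  % cube scorep_bt-mz_W_4x4_sum/profile.cubex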
ParaProf:
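Or, assuming TAU's ParaProf is installed and available as the paraprof command:

  % paraprof scorep_bt-mz_W_4x4_sum/profile.cubex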
Both tools will reveal the call-path profile of BT-MZ annotated with metrics: time, visit counts, and MPI message statistics (bytes sent/received). For more information on using these tools please refer to the corresponding documentation (CUBE, ParaProf).
Though we were able to collect the profile data, one may notice that the execution took longer in comparison to an un-instrumented run, even when the time spent on measurement start-up/finalization is disregarded. A longer execution time of the instrumented application is a sign of high instrumentation/measurement overhead. Furthermore, this overhead might result in large trace files and buffer flushes in the later tracing experiment if Score-P is not properly configured.
In order to investigate the sources of the overhead and to tune the measurement configuration for the subsequent trace collection, the scorep-score tool can be used (see Section 'Scoring a Profile Measurement' for more details about scorep-score):
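A sketch of the invocation, applied to the summary experiment produced above:

  % scorep-score scorep_bt-mz_W_4x4_sum/profile.cubex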
The textual output of the tool provides an estimate of the size of the OTF2 trace that would be produced if Score-P were run with the current configuration. Here the trace size estimate can also be seen as a measure of overhead, since both are proportional to the number of recorded events. Additionally, the tool shows the distribution of the required trace size over call-path classes. From the report above one can see that the estimated trace size is 1 GB in total, or 256 MB per MPI rank, which is significant. From the breakdown per call-path class one can see that most of the overhead is due to user-level computations. In order to further localize the source of the overhead, scorep-score can print a breakdown of the buffer size on a per-region basis:
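A sketch of the corresponding invocation, using the -r option:

  % scorep-score -r scorep_bt-mz_W_4x4_sum/profile.cubex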
The regions marked as USR type contribute around 32% of the total time; however, much of that is very likely to be measurement overhead due to frequently-executed small routines. Therefore, it is highly recommended to exclude these routines from measurement. This can be achieved by placing them into a filter file (please refer to Section 'Defining and testing a filter' for more details about the filter file specification) as regions to be excluded from measurement. There is already a filter file prepared for BT-MZ which can be used:
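Its exact contents are not reproduced here, but a filter file excluding frequently-called small user routines generally has the following shape (the region names below are placeholders, not the actual BT-MZ routine names):

  SCOREP_REGION_NAMES_BEGIN
    EXCLUDE
      small_routine_a*
      small_routine_b*
  SCOREP_REGION_NAMES_END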
One can use scorep-score once again to verify the effect of the filter file:
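The filter file is passed with the -f option (the file name scorep.filt is illustrative):

  % scorep-score -f scorep.filt scorep_bt-mz_W_4x4_sum/profile.cubex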
Now one can see that the trace size is reduced to just 20 MB in total, or 6 MB per MPI rank. The regions filtered out will be marked with "+" in the left-most column of the per-region report.
After the filter file is prepared to exclude the sources of the overhead, it is recommended to repeat the summary collection in order to obtain more precise measurements.
In order to specify the filter file to be used during measurements, the corresponding environment variable has to be set:
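For example, assuming the filter file above is saved as scorep.filt (an illustrative name), the standard Score-P variable is:

  % export SCOREP_FILTERING_FILE=scorep.filt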
It is also recommended to adjust the experiment directory name for the new run:
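For example, a name distinguishing the filtered run from the first one (the name is illustrative):

  % export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_sum_filtered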
Score-P can also record hardware counters (see Section 'PAPI Hardware Performance Counters') and operating system resource usage (see Section 'Resource Usage Counters') in addition to the default time and visit count metrics. Hardware counters are configured by setting the Score-P environment variable SCOREP_METRIC_PAPI to a comma-separated list of PAPI events (another separator can be specified by setting SCOREP_METRIC_PAPI_SEP):
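For example, requesting two common PAPI preset events (the events actually available depend on the hardware and can be checked as described below):

  % export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_FP_INS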
Please use the papi_avail and papi_native_avail tools in order to get the list of the available PAPI generic and native events. Operating system resource usage metrics are configured by setting the following variable:
Additionally, Score-P can be configured to record only a subset of the MPI functions. This is achieved by setting the SCOREP_MPI_ENABLE_GROUPS variable to a comma-separated list of sub-groups of MPI functions to be recorded (see Appendix 'MPI wrapper affiliation'):
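For example, restricting recording to communicator/group management, collective, point-to-point, and non-blocking completion events (the group names are those defined in the appendix):

  % export SCOREP_MPI_ENABLE_GROUPS=cg,coll,p2p,xnonblock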
In case performance of the CUDA code is of interest, Score-P can be configured to measure CUPTI metrics as follows (see Section 'CUDA Performance Measurement'):
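A minimal example, assuming Score-P was built with CUDA support; SCOREP_CUDA_ENABLE accepts yes for the default feature set (finer-grained options are listed in the referenced section):

  % export SCOREP_CUDA_ENABLE=yes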
In case performance of the OpenCL code is of interest, Score-P can be configured to measure OpenCL events as follows (see Section 'OpenCL Performance Measurement'):
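Similarly, assuming OpenCL support was built in, the default OpenCL measurement can be enabled as follows:

  % export SCOREP_OPENCL_ENABLE=yes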
When the granularity offered by the automatic compiler instrumentation is not sufficient, one can use the Score-P manual user instrumentation API (see Section 'Manual Region Instrumentation') for more fine-grained annotation of particular code segments. For example, initialization code, the solver, or any other code segment of interest can be annotated.
In order to enable user instrumentation, the --user argument has to be passed to the scorep command prepending the compiler and linker specification:
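In config/make.def this might look as follows (again assuming the mpif77 wrapper):

  MPIF77 = scorep --user mpif77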
Below, the loop found on line ... in file ... is annotated as a user region:
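A minimal Fortran sketch of such an annotation using the Score-P user API (the subroutine, loop, and region name below are hypothetical and not the actual BT-MZ code; the include requires the source file to be run through the C preprocessor):

  #include "scorep/SCOREP_User.inc"

  subroutine compute(n, a)
    integer :: n, i
    double precision :: a(n)
    SCOREP_USER_REGION_DEFINE( my_loop_handle )

    SCOREP_USER_REGION_BEGIN( my_loop_handle, "my_loop", SCOREP_USER_REGION_TYPE_COMMON )
    do i = 1, n
       a(i) = a(i) * 2.0d0
    end do
    SCOREP_USER_REGION_END( my_loop_handle )
  end subroutine compute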
This will appear as an additional region in the report.
BT-MZ has to be recompiled and relinked in order to complete instrumentation.
After applying advanced configurations described above, summary collection with Score-P can be started as usual:
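As before, a sketch assuming an mpirun-style launcher; the configuration variables set above remain in effect in the same shell session:

  % mpirun -np 4 ./bt-mz_W.4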
After execution is finished, one can use the scorep-score tool to verify the effect of filtering:
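For example, applied to the new experiment directory (the name matches the illustrative one chosen above):

  % scorep-score scorep_bt-mz_W_4x4_sum_filtered/profile.cubex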
The report above shows a significant reduction in runtime (due to elimination of the overhead) not only in USR regions but also in MPI/OMP regions.
Now, the extended summary report can be interactively explored using CUBE:
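For example, with the cube launcher as before:

  % cube scorep_bt-mz_W_4x4_sum_filtered/profile.cubex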
or ParaProf:
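or with paraprof:

  % paraprof scorep_bt-mz_W_4x4_sum_filtered/profile.cubex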
After exploring the extended summary report, additional insight into the performance of BT-MZ can be gained by performing trace collection. In order to do so, Score-P has to be configured to perform tracing by setting the corresponding variable to true and disabling profile generation:
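For example, using the same switches as before:

  % export SCOREP_ENABLE_TRACING=true
  % export SCOREP_ENABLE_PROFILING=false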
Additionally it is recommended to set a meaningful experiment directory name:
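For example (the directory name is illustrative):

  % export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_trace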
After BT-MZ execution is finished, a separate trace file per thread is written into the new experiment directory. In order to explore it, the Vampir tool can be used:
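For example, assuming Vampir is installed and available as the vampir command, opening the OTF2 anchor file of the experiment:

  % vampir scorep_bt-mz_W_4x4_trace/traces.otf2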
Please consider that traces can become extremely large and unwieldy, because the size of the trace is proportional to the number of processes/threads (width), the duration (length), and the detail (depth) of the measurement. When the trace is too large to fit into the memory allocated by Score-P, flushes can happen. Unfortunately, the resulting traces are of little value, since uncoordinated flushes result in cascades of distortion.
Traces should be written to a parallel file system, e.g., to /work or /scratch, which are typically provided for this purpose.