------------------- Released version 7.0 -----------------------------

Features and improvements:

- Add support for recording calls to OpenCL 2.1/2.2 functions.
- Add support for recording events from the Kokkos tools interface. The Kokkos CUDA and HIP back ends are stable on a single device (see OPEN_ISSUES). The OpenMP and Pthread back ends should be treated as experimental.
- Issue individual I/O events in POSIX vectorized I/O operations.
- Add recording of transfer offsets of POSIX I/O operations.
- Add wrapping of more vectorized I/O operations:
  - `preadv2`, `preadv64`, `preadv64v2`
  - `pwritev2`, `pwritev64`, `pwritev64v2`
- Add stripe count/size for recorded files on the Lustre file system.
- Add process ID (PID) and thread ID (TID) as attributes on program begin or thread creation events, respectively.
- Record node-level unique identifiers for NVIDIA and AMD GPUs as CUDA and OpenCL location properties to separate devices in a multi-GPU environment.
- A new mutex implementation based on atomic intrinsics replaces all existing mutex implementations.
- Change default of CUDA instrumentation to force a flush of CUDA activity buffers at program exit. This should resolve issues with measurements failing to include CUDA activity. `SCOREP_CUDA_ENABLE=flushatexit` is deprecated and replaced with the new `SCOREP_CUDA_ENABLE=dontflushatexit` option for programs that already perform a device synchronize or reset before exit and don't need an additional flush.

User tools and API improvements and changes:

- Remove the configure option `--with-extra-instrumentation-flags`. It was introduced to work around GCC compiler instrumentation issues that vanished with the advent of the recommended GCC compiler instrumentation plug-in.
- Remove the instrumenter option `--config=` as it was considered of little use.
- Add ability to generate an initial filter file with optional control parameters using buffer values, visits and region types.
  This includes the ability to iteratively refine the generated filter file using existing filters.
- Compile-time filtering via `scorep --instrument-filter` is now also available for builds using Intel compilers.
- Add additional `scorep-score` sorting modes `name`, `totaltime`, `timepervisit`, and `visits`, besides the default `maxbuffer`. Select a sorting mode via `-s `.
- Remove the `scorep` and `scorep-config` option `--mutex` due to changes in the mutex implementation, see above.
- Allow building against the `libcuda.so` stubs library from the CUDA SDK. Specify `--with-libcuda-lib=/lib64/stubs` when configuring. At runtime, however, the `libcuda.so` library must still be found via the system library path.

Bugfixes:

- Support the BFD API changes introduced by binutils-2.34.
- Fix aborts when user library wrappers were first called in a thread-parallel context.
- Unify and fix representation of artificial root nodes for threads, GPU kernels, and OpenMP tasks in profiling.
- Fix loss of allocation metrics on MPI RMA window allocation functions.
- Honor `CUDA_VISIBLE_DEVICES` when creating CUDA location names.
- Improve error handling of calls to `realpath` on kernel files in `/proc` or `/sys` when recording I/O activities.
- Allow selecting 'runtime' wrapping of OpenCL in the instrumenter again.
- Fix event sequence and attributes when recording non-blocking `lio_listio` operations.
- Improve thread-safety of CUDA adapter.
- Improve mount point extraction for some corner cases.

Compatibility:

- Score-P now requires an MPI implementation which is compliant with at least the MPI 2.2 standard and provides the `USE mpi` Fortran bindings, instead of the discouraged `INCLUDE 'mpif.h'`. Note that `USE mpi_f08` is not yet supported and Score-P will abort during MPI initialization if this is detected.
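
The 7.0 entries on vectorized POSIX I/O state that individual I/O events are issued and that transfer offsets are recorded. As a rough illustration of what this means for a call such as `preadv`, the following Python sketch splits one vectored transfer into per-buffer (offset, length) pieces. It is not Score-P's implementation: the function name `split_vectored_transfer` is hypothetical, and the sketch assumes that buffers are filled in order, as POSIX specifies for `readv`.

```python
from typing import List, Tuple


def split_vectored_transfer(
    start_offset: int, buffer_lengths: List[int], bytes_transferred: int
) -> List[Tuple[int, int]]:
    """Split one vectored transfer into per-buffer (file_offset, length) pieces.

    Hypothetical helper for illustration only. Buffers are assumed to be
    filled in order, so a short transfer leaves trailing buffers untouched.
    """
    if start_offset < 0 or bytes_transferred < 0:
        raise ValueError("offset and byte count must be non-negative")
    if any(length < 0 for length in buffer_lengths):
        raise ValueError("buffer lengths must be non-negative")
    if bytes_transferred > sum(buffer_lengths):
        raise ValueError("cannot transfer more bytes than the buffers hold")

    pieces: List[Tuple[int, int]] = []
    offset = start_offset
    remaining = bytes_transferred
    for length in buffer_lengths:
        if remaining == 0:
            break  # short transfer: later buffers were not touched
        used = min(length, remaining)
        if used > 0:  # zero-length buffers produce no piece
            pieces.append((offset, used))
        offset += used
        remaining -= used
    return pieces
```

For example, `split_vectored_transfer(100, [4, 8, 16], 28)` yields three pieces starting at offsets 100, 104, and 112, one per buffer.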
------------------- Released version 6.0 -----------------------------

Major features:

- Support for recording I/O activities: Calls to POSIX I/O and MPI-I/O are wrapped and metadata about individual I/O operations is recorded. Whereas MPI-I/O events are recorded by default, POSIX I/O recording needs to be activated using the instrumenter option --io=posix.

Features and improvements:

- Created separate enable group for request handling functions in MPI. MPI functions dealing with the completion of non-blocking requests (i.e., the Test/Wait family of calls) are no longer part of the P2P enable group and moved to a separate enable group, which is enabled or disabled automatically by the Score-P runtime system.
- Adapted remapper specification to reflect that Test/Wait functionality is no longer specific to point-to-point communication.
- Added support for the Clang compiler suite. Select via `--with-nocross-compiler-suite=clang`. Additionally, experimental support for macOS-based systems was added, but needs to be enabled explicitly with `--enable-experimental-platform`.
- Building with the PGI compiler suite now selects the 'pgfortran' compiler for F77 and FC. Added support for the PGI/LLVM variant.
- Added support for tracking MPI-3 one-sided communication.
- The previously unused environment variable `SCOREP_MPI_MAX_ACCESS_EPOCHS` was renamed to `SCOREP_MPI_MAX_EPOCHS` and is now used in tracking MPI one-sided communication.
- Changed the presentation of parameter-based profiling. Instead of nested call tree nodes under the source code region, create multiple nodes for the region on the same level and attach Cube-Parameters to them. In this context, the API of libscorep-estimator (used for scoring profiles, e.g., in scorep-score) changed. Consider this API 'experimental'.

Bugfixes:

- For OPARI2-instrumented codes that use OpenMP criticals, the mapping to Score-P critical objects was erroneous.
  As a consequence, lock-contention analysis for these criticals was unfortunately erroneous too.

------------------- Released version 5.0 -----------------------------

Major features:

- Orphan thread support: Score-P now records events from POSIX threads that were not instrumented, e.g., threads created from `std::thread`, Intel TBB, Intel Cilk Plus, or any other runtime which is based on POSIX threads. Previously, events from such threads caused a 'TPD == 0' measurement abort. Note that if your link-line does not need a POSIX thread option like -pthread, you need to use the Score-P option `--thread=pthread` to activate this feature. This feature also includes support for POSIX threads that are running longer than main. For these threads, Score-P will exit all active regions and end the thread (from the measurement point of view).
- Added support for Cartesian topologies. Supported topology types:
  1) MPI Cartesian topologies via MPI_Cart_create.
  2) Platform/Hardware specific topologies:
     - IBM Blue Gene/Q
     - K Computer
  3) Process x Threads topology: Generic 2D topology, currently only for CPU threads.
  4) User topologies via user instrumentation API.
  By default all available topology types will be recorded. They can selectively be disabled based on type through environment variables, see `scorep-info config-vars`. Viable topology results require a distinct thread binding.

Features and improvements:

- Score-P now generates a dynamic `MANIFEST.md` file for each experiment and copies files, like the filter or selective configuration files, to the experiment directory.
- In profiling mode, add the file `/scorep/scorep.spec` to the `profile.cubex` container, thus making the profile output more self-contained.
- On thread creation, request internal memory on the fly instead of in advance. Depending on the measurement configuration this will save some memory.
- As Open MPI has provided a C++ compiler wrapper for SHMEM since version 3.0, Score-P also provides an instrumentation wrapper `scorep-oshcxx` in this case.
- Values in config variables of type Set can now be negated by preceding them with '~', e.g., 'SCOREP_MPI_ENABLE_GROUPS=default,~cg'.
- Functions excluded from instrumentation by the GCC plug-in, because they were declared as inline, can now be instrumented by providing an instrumentation filter to 'scorep' where the function is matched by an explicit 'INCLUDE' rule, which is not the match-all '*' one. Functions excluded from instrumentation can be listed by adding `--verbose=2` to the `scorep` command-line.
- Changes to the experimental `scorep-preload-init` script:
  - Also preloads the Score-P constructor to be able to initialize the measurement early.
  - Issues a warning for options which are not suitable for uninstrumented applications.
- 'MPI_Comm_idup' is now supported and does not abort the measurement anymore.
- Added support for the high bandwidth memory interface (hbw_malloc) of the memkind library, allowing memory tracking for the Intel KNL MCDRAM with Score-P.
- All Fortran wrappers now support 64-bit character length arguments with GCC 8.
- Multiple improvements in the `scorep` instrumenter command to better interact with build systems:
  - All warnings and errors are prefixed with '[Score-P] ', for better identification.
  - All output goes to stderr, to not interfere when catching output from the compiler/linker in process substitutions.
  - When no source files could be identified, the command is executed as is.
- Since Score-P version 2.x, measurement initialization is done before entering 'main' using compiler-provided constructor functions, if available. As a consequence, MPI- or SHMEM-only instrumented programs lacked the artificial 'PARALLEL' region that was used to enclose all following regions.
  Instead of the 'PARALLEL' region, Score-P now generates program-begin and program-end events that enclose the entire application. If program arguments are given, these are recorded as well. In tracing mode program-begin/end are mapped to ProgramBegin/End event records; in profiling mode this feature is modeled as enter/exit of an additional region with the name of the executable, if available.

Bugfixes:

- Instrumentation of Fortran OpenMP programs that use untied tasks failed with undefined references. Fixed.
- Previously, programs that call `pthread_exit()` on the main thread crashed because the program's main thread was required to perform the measurement finalization. This requirement was removed, accompanied by multiple improvements for threads lasting longer than main.
- Restored the ability to run with `SCOREP_TOTAL_MEMORY=4G`.
- Instrumentation failed for codes that include system headers via local headers of the same name. This is fixed for compilers that support the '-iquote' option (most of the compilers do, PGI doesn't). Note that this bugfix is overruled if scorep's '--pdt' option is used.
- Fix memory recording of C++14 applications, because Score-P did not wrap the `delete`/`delete[]` operators with size argument.
- Fix possible overflow of send/recv bytes in MPI_Bcast, MPI_Sendrecv, and MPI_Sendrecv_replace.
- In selecting MPI groups to be recorded (SCOREP_MPI_ENABLE_GROUPS), fix handling of MPI subgroups.

------------------- Released version 4.1 -----------------------------

Bugfixes:

- scorep-score: fixed potentially wrong output of SCOREP_TOTAL_MEMORY which was caused by an uninitialized variable.
- Improve robustness of wrapping memory-related function calls during link-time.
- Fixed PGI compiler adapter to prevent the corruption of register values in some cases.
- Fixed calculation of memory statistics in out-of-memory condition.
- Honor --libdir and --dis|enable-shared|static when building and installing libscorep_estimator.
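
The 5.0 notes above introduce negation in config variables of type Set, e.g., 'SCOREP_MPI_ENABLE_GROUPS=default,~cg'. The Python sketch below shows one plausible reading of that syntax: plain entries are added, and entries prefixed with '~' are removed afterwards. It is not taken from the Score-P sources. The function `resolve_set_value`, the alias table `EXAMPLE_ALIASES`, and the member names in it are hypothetical, and Score-P's actual expansion of 'default' and its handling of entry order may differ.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical alias table: what 'default' expands to here is an assumption
# made for this example, not Score-P's actual group list.
EXAMPLE_ALIASES: Dict[str, FrozenSet[str]] = {
    "default": frozenset({"cg", "coll", "p2p"}),
}


def resolve_set_value(value: str, aliases: Dict[str, FrozenSet[str]]) -> Set[str]:
    """Resolve a comma-separated Set value with '~' negation (illustrative)."""
    included: Set[str] = set()
    excluded: Set[str] = set()
    for raw in value.split(","):
        entry = raw.strip()
        if not entry:
            continue  # tolerate empty entries such as a trailing comma
        negated = entry.startswith("~")
        name = entry[1:] if negated else entry
        # An alias expands to its members; any other name stands for itself.
        members = aliases.get(name, frozenset({name}))
        (excluded if negated else included).update(members)
    # Assumption: negation wins regardless of where it appears in the list.
    return included - excluded
```

With the example table, `resolve_set_value("default,~cg", EXAMPLE_ALIASES)` keeps every member of the hypothetical 'default' set except 'cg'.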
------------------- Released version 4.0 -----------------------------

Major features:

- User Library Wrapping: Using scorep-libwrap-init, you can now automatically generate library wrappers supplying only the headers and library files of the target library. You then install the wrapper into SCOREP_LIBWRAP_PATH and use it with the new instrumenter flag --libwrap=. For this, only linking with Score-P is necessary, unless the library is called from threads, in which case the threading paradigm has to be instrumented as well.

Features and improvements:

- The utility "scorep-score" is now provided as a library application to allow using its functionality in third-party software. Obtain compile flags via "scorep-config --target score --cflags|--ldflags|--libs".
- Improve detection and compiler selection for SGI MPT implementations.
- Provide the Substrate Plugin interface, which enables plugins to consume Score-P runtime events for recording, analysis, and optimization purposes.
- Added the option SCOREP_FORCE_CFG_FILES, which enables users to force the creation of the experiment directory even if there are no active substrates that write any output. Defaults to true.
- Provided the option to use sequence definitions for the system tree. They provide a constant size system tree description. The trade-off is the loss of individual names and properties for locations, location groups and system tree nodes. Currently supported only for MPI.
- Added possibilities to aggregate the locations within a thread to reduce the report size. The aggregation can be enabled via the SCOREP_PROFILING_FORMAT environment variable. The new formats THREAD_SUM, THREAD_TUPLE, KEY_THREADS, and CLUSTER_THREADS are available.
- Replace the two threading variants --thread=omp:pomp_tpd and --thread=omp:ancestry by only one: --thread=omp. The possible options are detected at configure time. If both are available, the ancestry variant will be used by default.
- As compressing OTF2 traces was not supported by any OTF2 release in the past and probably won't be in the foreseeable future either, the support for this feature in Score-P was removed.
- Score-P no longer ships with the Cube GUI. Cube was componentized and Score-P just includes Cube's library components that are necessary for measurements and scoring. The configure option --with-cube was replaced by --with-cubew and --with-cubelib. They need to be provided a PATH to cubew-config and cubelib-config, respectively, if not already in PATH. The Cube GUI is separately available from http://www.scalasca.org.
- An experimental script named `scorep-preload-init` is provided which helps set up a measurement done through the `LD_PRELOAD` mechanism. Score-P needs to be built with shared libraries to enable this feature, and not all instrumentations are supported.

Bugfixes:

- Improve the extraction of topology information from the Slurm topology/tree plugin to create the system tree. There were cases where the Slurm topology information wasn't correctly distributed to the individual compute nodes. This resulted in a system tree with a single node parenting all processes instead of several nodes parenting subsets of processes.
- Recording of synchronous metrics (SCOREP_METRIC_SYNC), i.e., per-process metrics or metrics provided by a 'sync' plugin, resulted in wrong values in profiling mode. Fixed.
- Added a time-based string to temporary results files of the preprocessing step during instrumentation. This should avoid name clashes if the same source file is concurrently processed twice during the build process.
- The support for a modularized OPARI2, introduced in Score-P 2.0, attributed wrong names for the inner regions of the OpenMP constructs critical, ordered, section, single, and task. This is fixed now.
------------------- Released version 3.1 -----------------------------

Features and improvements:

- The induced penalty to access thread-local storage variables was considerably reduced for some compilers, notably for the Intel compilers.
- If both OpenMP instrumentation options, omp:tpd and omp:ancestry, are supported, use omp:ancestry as default. This works around a problem found with recent Intel compilers (e.g., 17.0.0) and the omp:tpd option.
- The GCC compiler instrumentation plug-in now instruments functions that will not return in the usual way, e.g., a Pthread start_routine that calls pthread_exit.

Bugfixes:

- Fix compilation error during instrumentation, if the command line contains a header file.
- Fix losing parameter call-paths by avoiding multiple definitions of the same parameters.
- Fix that memory allocation measurements are disabled if the user explicitly specifies --memory.
- Fix conflict of function wrapping with IPA on BlueGene systems.
- Do not preprocess assembler files anymore.
- Fix race condition in parallel make (make -j). Note that parallel 'make check' still exhibits race conditions due to Fortran dependency issues.
- Fix segmentation fault in the profile when memory operations and metric counters are recorded at the same time.
- Improve detection of ARM and Cray platforms.
- Allow for shell variables in configure options. Options like '--includedir=\${prefix}/include' caused configure to fail.

------------------- Released version 3.0 -----------------------------

Note: In this version, we switch from a 'major.minor.bugfix' versioning scheme to a 'major.bugfix' scheme. New user-relevant features will be introduced by increasing the major number. Bugfix releases will not add new user-relevant features but might contain, in addition to bugfixes, Score-P-internal improvements.

Major features:

- Support for instrumentation of OpenACC codes based on the profiling interface specified in OpenACC 2.5.

Features and improvements:

- Extract topology information from the Slurm topology/tree plugin to create the system tree. This feature is available in Slurm since version 2.1 (around 09/2009) and documented since 01/2014. Please refer to the Slurm documentation for how to enable this feature: http://slurm.schedmd.com/topology.html
- Change PGI C++ compiler settings (selected via --with-nocross-compiler-suite=pgi) from pgCC to pgc++. PGI removed pgCC in version 16.1. If your installation still provides pgCC and you want to use it, please add CXX=pgCC to your configure line.

Bugfixes:

- Prevent sampling/unwinding when Intel MPI is used. This combination, even when sampling is not active, may mysteriously alter the application output just by linking libunwind.
- Fixed possible underestimation of the trace size and memory footprint in scorep-score due to counting timestamps only for enter/leave records.
- Fixed function signatures of SHMEM API functions that changed in Open MPI 2.0.

------------------- Released version 2.0.2 ---------------------------

Bugfixes:

- The preprocessing of source files before they are instrumented with OPARI2 was broken. This is fixed.
- Prevent potential division by zero error during calculation of tsc timer frequency.
- Compiler-specific CXXFLAGS might break the 'build-score' configure as the CXX used to build 'scorep-score' might differ from the CXX used to build the Score-P libraries. CXXFLAGS in build-score are now ignored. To set build-score related CXXFLAGS, use CXXFLAGS_FOR_BUILD_SCORE.
- Fix bug in configuration of SHMEM support triggered by change in shmem.fh header of Open MPI 1.10.2.
- Fix PAPI configure check when additional libraries are needed to successfully link to PAPI. This was a regression introduced with version 2.0.
- Fix typos in remapping specification file which caused the point-to-point and collective bytes transferred metrics to always be zero.
- Build-system hardening.
- The configure check for libunwind now also works if libunwind depends on liblzma.
- Documentation improvements.
- Fixed memory leaks in sampling and CUDA mode.

------------------- Released version 2.0.1 ---------------------------

Bugfixes:

- Prevent the memory adapter from initializing the measurement system as this leads to program crashes if it happens too early, e.g., on Blue Gene systems. If memory instrumentation is the only means of instrumentation, the measurement system is initialized via the feature 'compiler constructor'. If this feature isn't available (search for 'compiler constructor: yes' in 'scorep-info config-summary'), you need to add e.g., user instrumentation to initialize the measurement system.

------------------- Released version 2.0 -----------------------------

Major features:

- Score-P supports a new data collecting mode based on sampling. Sampling can be used in conjunction with the usual instrumentation of parallel paradigms. Therefore it combines the lower overhead of statistical sampling and the accuracy of instrumentation. Both call-path profiling and event tracing are supported. As this is rather a major change in the Score-P internals and also for the user experience we appreciate any feedback but need to declare the sampling support as experimental in this first release.
- Support for OPARI2 2.0 was integrated. OPARI2 is now more flexible to enable support for other pragma/directive based paradigms.
- Support for MPI-3.1 functions (except 'MPI_Comm_idup'). Most new functions currently provide plain enter/exit wrappers.
- Support for tracking memory allocations was added to Score-P. This includes C/C++, MPI, and SHMEM API calls. The instrumentation is done by default, though must be enabled at measurement time explicitly.

Features and improvements:

- When using compiler instrumentation with GNU (not the gcc-plugin but the '-finstrument-functions' variant), Cray, or Fujitsu compilers, one can provide a file containing symbols that will trigger measurement events when the corresponding function is called. These symbols are subject to filtering. Providing symbols this way is useful when obtaining symbols during measurement via 'nm' or 'libbfd' is not an option, e.g., on Blue Gene systems. The symbol file needs to be specified in the environment variable 'SCOREP_NM_SYMBOLS'. The accepted format is as in 'nm -l '.
- Transparent changes to the event-dispatching. Currently events are consumed by either the profiling or tracing substrate (or both).
- The timer selection was moved from configure time to measurement time. During configure we detect all available timers and provide the environment variable 'SCOREP_TIMER' to select one. The timer defaults to a low-overhead time stamp counter, if available. Note that we assume all processes to use the same timer and time stamp counter timers to run at the same frequency.
- Building the entire Score-P package on Blue Gene/Q systems using GNU compilers is now supported. The installation currently needs some extra steps, please see 'share/bg-gnu/README' for details. The installation on older Blue Gene systems, though not tested, might work as well.
- Source-to-source instrumentation via PDT on Blue Gene systems was re-enabled for PDT versions newer than 3.18.
- Score-P takes advantage of compilers to initialize the measurement system automatically before triggering any event. This also ensures that the interrupt sources for sampling are registered as early as possible, also in cases where no compiler instrumentation is available.
- Score-P now uses the '-Minstrument=functions' flag for PGI compiler instrumentation (64-bit targets only). The '-Mprof=func' flag is no longer supported by PGI compiler version 16.
  To our knowledge, '-Minstrument=functions' is available at least since PGI compiler version 11. However, older PGI compiler versions may not support '-Minstrument=functions' and are not supported by Score-P anymore.
- A synchronization callback was added to the metric plugin API. A metric plugin can register a synchronization callback which is called every time Score-P starts clock synchronization. The synchronization callback contains one argument specifying the point in time in more detail. At the moment we distinguish synchronization at initialization, during measurement run, and at finalization. As a result, the synchronization callback allows metric plugins to detect start and end points of measurement intervals.
- The manual user instrumentation for Fortran 90 now performs region initialization checks based on handle values instead of comparing names. This reduces overhead. It does not apply when using PGI compilers though.
- Support tracing of applications with more than 500000 tasks.

User tools and API improvements and changes:

- A Score-P installation provides new instrumentation wrappers which simplify the application instrumentation of autotools and CMake based projects. Please consult the usage instruction of the 'scorep-wrapper' command.
- The option '--pomp' does not take any options any more.
- Specific options for OPARI2 are passed via the '--opari=' option.
- To control instrumentation of OpenMP the options '--openmp' and '--noopenmp' have been added. Note that for compilations using the OpenMP compiler-flag, instrumentation is enabled by default. However, when manually disabling instrumentation via '--noopenmp', some instrumentation must still be carried out to ensure a thread-safe execution of the measurement system.
- POMP user instrumentation is no longer automatically activated together with OpenMP instrumentation. The '--pomp' flag has to be explicitly specified with the 'scorep' command.
- On Cray systems, compiler instrumentation no longer adds the '-G2' option because '-G2' disables some optimizations.
- The instrumenter now warns the user if the provided instrumentation filter won't be used by the active instrumentations.
- The option '--disable-preprocessing' was added to the instrumenter. It tells the instrumenter to skip all preprocessing related activities. Useful, e.g., if the input files are already preprocessed.

Bugfixes:

- Fixed possible mistreatment of a profile node as being in an untied task.
- Fixed bug in obtaining executable names longer than 512 characters when using the GNU compiler adapter (applies also to Cray and Fujitsu compilers).
- The GCC compiler instrumentation plug-in was non-functional for GCC 5 because of an unnoticed API change. Additionally, the custom demangling of Fortran module functions is working again.
- The GCC instrumentation plug-in does not instrument the `main` function in Fortran programs anymore as the main entry point for the user is `MAIN__`.
- Names assigned to MPI communicators by calls to 'MPI_Comm_set_name' are now also tracked, even if the corresponding API calls won't be recorded.
- Fixed MPI library interposition if the link command lists explicitly 'libmpifort' or 'libmpigi'.

------------------- Released version 1.4.2 ---------------------------

Features and improvements:

- The GCC plug-in can also be built on cross build machines and with the GCC 5 release series.

Bugfixes:

- The OpenMP flag for PGI compilers (-mp) may have a value appended. In this case, the instrumenter did not detect the OpenMP paradigm properly. Fixed.
- On Cray systems, a conflict between the -eZ and the -eP flag occurred if the instrumenter performed preprocessing before OPARI2 instrumentation and the command line contained -eZ. Fixed.
- If the user explicitly requires static Score-P libraries by specifying --static on the command line, scorep-config provides also full paths to the dependencies of its libraries, which might cause problems if the libraries are linked with dynamic libraries. Fixed.
- The preprocessing step of CUDA source files for the OPARI2 instrumentation did not add preprocessing flags to the preprocessor invocation. Thus, it became a full compilation step. Fixed.
- Fix exponent in the CUDA metric definitions.
- Fix scorep-config bug on MIC, which always showed an 'Unsupported target mic. Abort' message.
- Configure checks for PAPI on MIC failed with unresolved symbols to libpfm. Fixed.
- Help text for --target attribute of scorep-config added.

------------------- Released version 1.4.1 ---------------------------

Bugfixes:

- BG/Q: use optimized MPI rank to SION file mapping (one file per I/O node).
- Fixes in the OpenCL adapter:
  - The Score-P instrumenter misinterpreted the OpenCL library as an input file, if it was given as '-l opencl' on the command line. Fixed.
  - Fixed segmentation fault of clReleaseEvent during Score-P OpenCL flush.
  - Fixed wrappers of OpenCL 2.0 functions.
  - Revised mutex locking.
- Apply filtering also to CUDA API exit events.
- The Score-P instrumenter misinterpreted the Pthread library as an input file, if it was given as '-l pthread' on the command line. Fixed.
- The collapse node post-processing in the profile happened for the master location and led to errors if a collapse node appeared on another location. Fixed.
- Fixed detection of building a shared library on Cray in the instrumenter.
- Fixed failed OpenMP detection on K if the -Kopenmp flag was combined with other flags in a comma separated list.
- Fixed erroneous calculation and presentation of task migration metrics.
- The GCC instrumentation plug-in can now also be built if the used GCC installation does not provide a `gmp.h` header.
- Fixed missing DESTDIR support for installing `scorep-config` delegate on Xeon Phi.
- Instrumented C/Fortran OpenMP programs on Fujitsu systems showed race conditions. Furthermore, C++ applications failed at initialization time. This was due to a bug in the Fujitsu compiler and OpenMP runtime. Fujitsu provided a workaround that fixed these issues.
- Calls to functions, instrumented by the GCC plug-in, after the finalization of the measurement, aborted the application. Fixed.
- In shared Score-P builds using recent Intel MPI a 'MPIR_Thread: TLS definition ... mismatches non-TLS definition ...' error was encountered. Fixed.
- The OpenSHMEM measurement adapter recorded request-lock instead of acquire-lock events. Fixed.
- Instrumentation of applications compiled with PGI compilers and Open MPI 1.8 failed with an 'undefined reference to pgf90_compiled'. Fixed by adding the '-pgf90libs' option when using MPI with PGI compilers.

------------------- Released version 1.4 -----------------------------

Major features:

- If the used OTF2 version supports SIONlib, then it is now possible to write also traces with SIONlib that include an arbitrary number of threads, asynchronous metric plugins, and accelerator (CUDA/OpenCL/...) streams.
- Basic support for OpenCL instrumentation.
- For GCC versions 4.5 through 4.9 a new function instrumentation is available via the plug-in interface of the compiler. This new function instrumentation greatly improves the measurement performance. It also provides compile-time instrumentation filtering using the same filter file format as the run-time filtering. On some systems the GCC plug-in dev package needs to be installed, in order to provide the necessary header files.
- Score-P now ships with the entire Cube package included.
  I.e., a Cube installation is no longer a hard requirement when building Score-P from a tarball (this requirement was introduced with Score-P 1.2 and was needed to build scorep-score, a tool to score profile experiments to prepare a filter for subsequent trace experiments). A Cube installation will be favored if cube-config is in PATH (as with OTF2 and OPARI2 installations). To use the internal Cube even if a cube-config is in PATH, specify --without-cube on the configure command-line. To prevent building the Cube GUI, add --without-gui to the configure command-line.

Features and improvements:

- Support for pthread_exit and pthread_cancel was added.
- Added support for task migration in the profiling system.
- Basic support for Fujitsu FX100 systems added.
- Added support for Intel Xeon Phi systems (native mode only).
- Score-P now requires at least OTF2 1.5.
- Added new user instrumentation macros (e.g., SCOREP_USER_REGION_BY_NAME_BEGIN( name, type ) and SCOREP_USER_REGION_BY_NAME_END( name )). These macros can annotate user regions without the need to take care of the handle struct.

User tools and API improvements and changes:

- Due to the added task migration support, the default for the invocation of OPARI2 in the instrumenter was changed. Until now, the instrumenter let OPARI2 make all tasks tied and print a warning if an untied task was encountered. The new default is that the untied tasks are left untied and no warning is printed.
- The task related data storage mechanism was changed. The profiling backend does not use a hash table to associate a task id with a data structure anymore, but gets a pointer from the task management in the measurement core. Thus, the environment variable SCOREP_PROFILING_TASK_TABLE_SIZE to specify the size of the hash table disappeared.
- Added the environment variable SCOREP_PROFILING_TASK_EXCHANGE_NUM to specify how often the profiling system returns reallocated memory objects that have migrated to another thread.
- Support for cobi was removed. - SCOREP_User_RegionBegin / SCOREP_User_RegionInit accept NULL as parameter value for lastFileName and lastFileHandle. This simplifies the calls to these functions when used directly without the provided macros. - scorep-score got a new option: -m displays mangled region names. Furthermore, the filter evaluation in scorep-score can also use mangled names. Bugfixes: - In some cases, not all regions were exited at measurement finalization time. Fixed. - Using PGI compiler instrumentation in conjunction with tasks could lead to wrong region handles in region exits. Fixed. - Fix building of MPI wrapper if compiler issues unrelated warnings at configure time. - The SCOREP_USER_METRIC_UINT64 macro used signed values. Fixed. - Add conflict in the instrumenter between --thread=pthread and --mutex=pthread. - Fixed errors with libmpigf during linking of the instrumented application. - Fix wrong acquisition order in pthread_cond_timedwait by modifying the nesting level (analogous to pthread_cond_wait). - Internal CUDA driver calls were recorded. Fixed. - Fix a potential deadlock in the CUDA adapter for multithreaded CUDA. - Fortran OpenMP applications instrumented with OPARI2 and preprocessing reported wrong file names ending in '.input.F' for POMP2 regions. Fixed except for Oracle/Studio and Cray compilers. ------------------- Released version 1.3 ----------------------------- Major features: - Basic support for the K Computer and Fujitsu FX10 systems added. The Tofu network topology will be supported in a subsequent release. Note that some C++ OpenMP programs fail during measurement initialization for unknown reasons. - Add support for instrumenting programs which use SHMEM library calls for one-sided communication. Score-P currently supports the SHMEM implementations of Cray, Open MPI, OpenSHMEM, and SGI. - Basic support for POSIX thread instrumentation.
Supported POSIX thread routines are pthread_create, pthread_join, pthread_mutex_init, pthread_mutex_destroy, pthread_mutex_lock, pthread_mutex_trylock, pthread_mutex_unlock, pthread_cond_init, pthread_cond_destroy, pthread_cond_signal, pthread_cond_broadcast, pthread_cond_wait, and pthread_cond_timedwait. Following thread management functions are currently not supported and will abort the program: pthread_exit and pthread_cancel. The usage of pthread_detach will cause the program to fail if the detached thread is still running after the end of main. These limitations will be addressed in an upcoming version of Score-P. Note that you need to instrument every thread creation. Features and improvements: - Use Process Manager Interface (PMI) to get fine-granular information about the system topology on Cray machines. - Implemented the possibility to write CUBE profiles with the tuple values containing sum, minimum, maximum, number of samples, sum of squares. - The new SIONlib integration of OTF2 extends the support of writing SION traces to all multi-process paradigms, not only MPI. Though only pure multi-process measurements are supported for now. No threads, no CUDA, no non-CPU metrics. Score-P itself does not depend on SIONlib any longer, only OTF2 does now. The configure option '--with-sionlib' (formerly '--with-sionconfig') is passed to OTF2. As part of this integration the measurement configuration variable 'SCOREP_TRACING_NLOCATIONS_PER_SION_FILE' was renamed to 'SCOREP_TRACING_MAX_PROCS_PER_SION_FILE' to clarify that Score-P can only distribute whole processes into a multi-file SION trace. - Improved initialization of adapters which results in a reduced number of libraries needed to be linked into the application. - Extended the TAU adapter to allow input of location properties, which are location specific meta data presented as key/value pair. - The option --thread=[:] gives users the possibility to choose the threading model and to fine-tune certain aspects. 
Currently OpenMP and POSIX threads are supported with either --thread=omp or --thread=pthread. For OpenMP we provide the two variants --thread=omp:pomp_tpd (default) and --thread=omp:ancestry. The former tells OPARI2 to insert code for thread tracking, whereas the latter uses the ancestry functions in OpenMP 3.0 and later to accomplish the same task. User tools and API improvements and changes: - Improved automatic MPI detection in the instrumenter (helpful on Cray, as cc/CC/ftn is the compile command for both MPI and non-MPI). - Changed paradigm selection in the instrumenter to match the selection options in the scorep-config tool. Thus, introduced --mpp= and --thread= flags for the instrumenter to select the multi-process paradigm and the threading paradigm. The old options --mpi, --nompi, --openmp, --noopenmp are marked as deprecated and are no longer documented. - Added handling for special characters, like space, in file names and path names. However, there are still some limitations when using special characters: The PDT parser cannot deal with these characters and, thus, fails if PDT instrumentation is enabled and special characters appear. Furthermore, compilation fails when double quotes appear in source file names and preprocessing is enabled. - Unified naming of macros in the user adapter. In C/C++ the macros to define global region handles (SCOREP_GLOBAL_REGION_DEFINE and SCOREP_GLOBAL_REGION_EXTERNAL) and in Fortran the parameter macros (SCOREP_PARAMETER_DEFINE, SCOREP_PARAMETER_INT64, SCOREP_PARAMETER_UINT64, SCOREP_PARAMETER_STRING) got the prefix SCOREP_USER instead of only SCOREP. - Added selection for mutex locking: the parameter --mutex= switches between known locking mechanisms within the measurement system (omp,pthread,pthread:spinlock,pthread:wrap). - Improved event size estimation in scorep-score using otf2-estimator. - Install Cube remap specification file and provide its location via the scorep-config tool.
- The scorep-info tool can now show known and open issues regarding the measurement with Score-P. It is highly advised to consult this list before reporting problems. CUDA support improvements and changes: - Added support for CUDA 5.5 and CUDA 6.0: The CUPTI activity buffer handling has changed. The SCOREP_CUDA_BUFFER_CHUNK environment variable has therefore been introduced (see user documentation). The default size for SCOREP_CUDA_BUFFER was changed to '1M'. - New options for SCOREP_CUDA_ENABLE: 'references' : track references between CUDA host and device activities in the OTF2 trace 'flushatexit' : forces pending CUDA activities to be flushed at program exit (avoids records to be dropped in OpenACC programs) 'kernel_serial': serialize recording of (potentially concurrent) kernels - Obsolete options for SCOREP_CUDA_ENABLE: 'concurrent' : recording of (potentially concurrent) kernels is the default 'stream_reuse': feature has been removed 'device_reuse': feature has been removed - Added support for runtime filtering of CUDA device and host activities. Bugfixes: - When using the Intel compiler, functions from shared libraries now appear in the measurement output. Previously we inspected the symbol table of the executable and evaluated the filtering on all functions in the executable. Thus, compiler instrumented functions from shared libraries were automatically filtered, when using the Intel compiler. Now, the filters are evaluated when the functions appear the first time. - Fix handling of Intel compiler options starting with "-o". - The pgCC compiler version 13.9 and newer preinclude omp.h if OpenMP is enabled. This leads to multiply defined symbols if the source file is preprocessed before compilation. Prevent the preinclusion for the compilation of preprocessed files if an appropriate compiler option exists (exists since pgCC version 14.1). - Fix a deadlock on AIX, if MPI_Abort was called. 
- If a system provides only shared OpenMP runtime libraries and a compiler does not add rpath information but relies on LD_LIBRARY_PATH, the Score-P instrumenter failed execution. Fixed. - Fix missing flags in OPARI2 call to disable OpenMP instrumentation, if the user selected POMP instrumentation for a serial program without specifying that the program is serial. - Prepend link calls to the Intel compiler by setting VT_LIB_DIR and VT_LIBS to avoid remarks. - Changed enumeration of threads in the profile from a global enumeration to an enumeration from 0 to N-1 on each process. - Use "-G2" if the Cray compiler instrumentation is used. The previous "-g" flag disabled all optimizations. - Fix creation of experiment directory if the monitored application makes use of the 'chdir' operation. - The Score-P instrumenter tool moved compiler selection flags for the MPI compiler wrapper to a different location in the command line. Fixed. - Fixed broken instrumentation if the application's link step explicitly links libc. - Fixed wrong acquisition order attribute passed to acquire lock events from OpenMP critical sections. ------------------- Released version 1.2.3 --------------------------- - Fixed a failed assertion that occurred if selective recording was enabled in profiling mode. - Fixed wrong path names in the instrumenter, when Score-P was configured with the --bindir flag. - Install scorep-score in the correct directory, if Score-P was configured with the --bindir flag. - Reduce per-event measurement overhead by improving Score-P's assert and error handling. - Adapt configure to recent Cray installations. - Score-P measurements provided with a SCOREP_EXPERIMENT_DIRECTORY, say foo, used to overwrite an existing foo even if this foo was not a directory. Will now abort with a meaningful message. - Metric plugin component: handling of multiple metrics improved. - Don't remove source files during make distclean in an in-place build.
- Fix failing detection of nvcc in case it was called with a path. - The measurement configuration (stored in the file `scorep.cfg') is now also preserved in the experiment directory in case of a failed measurement. - Added compiler instrumentation flags also to the ldflags to fix missing instrumentation if high optimization levels recompile parts of the code. - Changed the region names of OPARI2 instrumented named criticals. If a name for the critical region is provided, the enclosing region will have the name '!$omp critical ' and the structured block '!$omp critical sblock'. Replace by the given name. ------------------- Released version 1.2.2 --------------------------- - The Fortran Cray compiler instrumentation did not create an exit event. Thus, we add an exit on Score-P finalization. - Removed remark of the Intel compiler during instrumentation that VT_ROOT is not set, if preprocessing was used. - MPI parallel measurements with just one process were fixed. - Fixed a race condition during initialization of the TRACE_BUFFER_FLUSH region that could lead to incomplete profiles if a user runs a hybrid (MPI + OpenMP) application and enables profiling and tracing at the same time. - Fix error message when scorep-config is called without arguments in a non-MPI installation. - In scorep-config's rpath options, omit paths searched by ldconfig, even if Score-P was installed there, in order to comply with packaging guidelines of some Linux distributions. - Fixed broken MPI detection in the instrumenter if the MPI compiler wrapper is specified with the full path. - If Score-P is built with static and dynamic libraries, the selection of using static or dynamic libraries was improved. Using -Bstatic or -Bshared had some side effects and was sometimes unreliable. - On Cray systems, change libtool's default to prefer static linking of external libraries. - Suppress failed assertion messages when initializing compiler instrumentation with Intel compilers without libbfd.
The measurement completes even if these messages exist. - Added options to scorep-config and the scorep instrumenter to enable/disable online access support. - Fixed broken --includedir configure option that installed Score-P headers in a wrong directory. - Fix SCOREP_RECORDING_IS_ON(isOn) user macro; in Fortran codes, isOn was not set to false when instrumented with --nouser. - Fixed instrumentation compilation error that occurred if --opari="--disable=atomic" was specified without OpenMP compilation flags. - Improvements in obtaining region information via libbfd. - Improved configure checks to determine values of MPI constants. Previous tests failed on AIX. - Improvements of measurement reconfiguration in Online Access mode. - Honor --without-mpi when --with-custom-compilers is given at configure time. - Several smaller fixes. ------------------- Released version 1.2.1 --------------------------- - Allow configuration without support for the MPI programming model by specifying --without-mpi on the configure line. - Abort during instrumentation with a meaningful error message if a user requests MPI but the Score-P installation does not support MPI - On Blue Gene/Q, detect PAMI library at configure time. The location and names of the PAMI files changes during a system upgrade. Search all known directories and library names. - Improve --with-custom-compilers, customization files are now recognized also in the build directory (see INSTALL). - On SGI MPT systems, or more generally on systems that don't use compiler wrappers for building MPI programs, improve the automatic detection of the MPI programming paradigm during instrumentation. - Abort with an error message during instrumentation if the user wants to build a shared library with static Score-P libraries. - Abort if the user specified a filter file which cannot be opened. - Improved the auto-detection in the instrumenter for MPI libraries. 
This should fix some failures with MPI programs that do not use a compiler wrapper, e.g., when using SGI MPT. - Fixed the instrumenter failing to detect whether an application uses OpenMP with the XL compiler if the user specifies more than one option to '-qsmp='. - Abort configuration when the user specified --without-cube on the command line as cube is a required component. ------------------- Released version 1.2 ----------------------------- - Simplified MPI compiler detection, passing '--with-mpi' to configure is usually not necessary if your MPI compiler is in PATH. - Support for Cray systems. PrgEnv-(cray|gnu|intel|pgi) are supported in static mode (static is the default). Please note that OpenMP instrumentation is currently broken for PrgEnv-cray. - Compilation units getting processed by OPARI2 are now being preprocessed by the C/C++ preprocessor. This way it is possible to instrument OpenMP directives in header files. It also solves instrumentation problems caused by OpenMP pragmas within preprocessor defines. Preprocessing is the default but can be deactivated using --nopreprocess. When using PDT instrumentation, preprocessing is deactivated. - To reduce the memory demands of dynamic regions in profiling mode, this version provides a lossy compression mechanism called 'clustering'; similar subtrees of a dynamic region are clustered into one. This feature is enabled by default. There are three new environment variables for customization, please see the documentation for details. - The new keyword 'MANGLED' was added to the filter file format to deal with cases where the displayed name and mangled name are different. The keyword 'FORTRAN' was removed. - External metric sources can be utilized via a plug-in mechanism. This feature is controlled via the SCOREP_METRIC_PLUGIN environment variable. Please see the documentation for details and an example. - The CUDA adapter got refactored and extended to provide much more useful metrics.
There are several new values for the environment variable SCOREP_CUDA_ENABLE. Please see the documentation for details. - The machine name used in the profile and trace output is now configurable at build time with the --with-machine-name flag or at run-time with the SCOREP_MACHINE_NAME measurement configuration variable. - Full support for tracking the incurred OpenMP thread teams, utilizing the new generic threading records of OTF2. - The Score-P internals were significantly refactored in order to increase flexibility to adapt to new programming paradigms and event sources. - Please note that the feature 'selective tracing' was renamed to 'selective recording' as it also applies to profiling. - Please note that CUBE is a hard requirement when building Score-P from a tarball. This is due to the fact that we want to provide the user with 'scorep-score', which cannot be built without the CUBE reader library available. ------------------- Released version 1.1 ----------------------------- - Rewind, a new event-trace recording mode for long-running experiments, triggered by user-instrumentation macros. Writes semantics information in OTF2 anchor file as rewind might affect analysis. - ARM support (detection + compiler adapter). - Metric service improvements. Support for per-process metrics and per-system-tree-class metrics. - Support for OpenMP-task profiling and tracing alongside improvements of the POMP adapter. - Component separation: Score-P can now use pre-installed OTF2, OPARI2, and CUBE packages instead of the internal ones. - Removed dependency on external repository that was used by Score-P, OTF2, and OPARI2 in order to prevent version conflicts. - Support for CUDA profiling and tracing. - Easier experiment configuration via scorep-info which provides a list of all measurement configuration variables. - scorep-info also provides the improved configure-summary of the installation.
- Scoring of profile experiments via scorep-score (if configured with external CUBE) to prepare a filter for subsequent trace experiment. - Documentation improvements. - Numerous configure improvements. Let external libraries use generic configure options (tbc). Fixed portability issues. - Numerous instrumenter improvements. All possible combinations of options supported. - MPI profiling improvements. - OpenMP nesting supported although little tested. - Several compiler-dependent OpenMP-related bugfixes. ------------------- Released version 1.0.2 --------------------------- - Several instrumentation fixes: - Improvements for PDT Fortran instrumentation. - Improvements for C++ user instrumentation. - Return real failure if instrumentation is erroneous. Failures may have gone undetected previously. - Allow for out-of-place builds. - Provide correct parameter to SCOREP_USER_REGION_ENTER macro. - Provide correct timestamp to OmpTaskCreate events. - Fix invalid order of arguments provided to MpiCollectiveEnd events. - Fix bug in parameter profiling. - Enable SIONlib support, currently just for MPI applications. - Various fixes for the generated OpenMP region names: - Inner and outer blocks got different names. - Regions with the ordered clause got a special name. - All region names got '@file:lno' appended, to make them distinguishable. ------------------- Released version 1.0.1 --------------------------- - Renaming of the configure related variable LD_FLAGS_FOR_BUILD to LDFLAGS_FOR_BUILD for consistency. - Renaming of installed tool and options for consistency, i.e. changing underscores to dashes. Also, the --(no)openmp_support option changed to --(no)openmp. - Improved linking on AIX systems. - Robustness improvements when instrumenting with PDT. - On x86 platforms, be more cautious using the tsc counter. If /proc/cpuinfo reports constant_tsc but not nonstop_tsc, then it is likely that the counter is unreliable. - Improved configure summary.
- configure will not fail if -q or --silent is passed. ------------------- Released version 1.0 -----------------------------