3.1-rc3 (revision d9ca08bb)
Usage in writing mode - MPI example

This is a short example of how to use the OTF2 writing interface with MPI. This example is available as source code in the file otf2_mpi_writer_example.c.

We start by including some standard headers.

#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <time.h>

Then we include the MPI and OTF2 headers.

#include <mpi.h>
#include <otf2/otf2.h>

Now prepare for the inclusion of the <otf2/OTF2_MPI_Collectives.h> header. As it is a header-only interface, it needs some information about the MPI environment in use, in particular the MPI datatypes that match the C99 types uint64_t and int64_t. If you have an MPI 3.0 conforming implementation you can skip this. If not, provide #define's for the following macros prior to the #include statement. In this example, we assume an LP64 platform.

#if MPI_VERSION < 3
#define OTF2_MPI_UINT64_T MPI_UNSIGNED_LONG
#define OTF2_MPI_INT64_T MPI_LONG
#endif

After this preparatory step, we can include the <otf2/OTF2_MPI_Collectives.h> header.
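The include itself is then simply:

```c
#include <otf2/OTF2_MPI_Collectives.h>
```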

We use MPI_Wtime to get timestamps for our events, but need to convert the returned seconds to an integral value. We use a nanosecond resolution.

static OTF2_TimeStamp
get_time( void )
{
    double t = MPI_Wtime() * 1e9;
    return ( uint64_t )t;
}

Define a pre and a post flush callback. If no memory is left in OTF2's internal memory buffer, or the writer handle is closed, a buffer flush is triggered. The pre flush callback is invoked right before a buffer flush; it must return either OTF2_FLUSH, to flush the recorded data to a file, or OTF2_NO_FLUSH, to suppress flushing data to a file. The post flush callback is invoked right after a buffer flush; it must return a current timestamp, which is recorded to mark the time spent in the flush. The callbacks are passed to OTF2 via a struct.

static OTF2_FlushType
pre_flush( void*            userData,
           OTF2_FileType    fileType,
           OTF2_LocationRef location,
           void*            callerData,
           bool             final )
{
    return OTF2_FLUSH;
}

static OTF2_TimeStamp
post_flush( void*            userData,
            OTF2_FileType    fileType,
            OTF2_LocationRef location )
{
    return get_time();
}

static OTF2_FlushCallbacks flush_callbacks =
{
    .otf2_pre_flush  = pre_flush,
    .otf2_post_flush = post_flush
};

We also declare some enums for MPI regions and communicators. Now everything is prepared to begin with the main program.
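A sketch of those enums, using the identifiers that appear later in this example; the ordering, and hence the concrete definition IDs, is an assumption:

```c
/* Region definition IDs; names taken from the calls below,
 * the numeric ordering is illustrative. */
enum
{
    REGION_MPI_INIT,
    REGION_MPI_FINALIZE,
    REGION_MPI_COMM_SPLIT,
    REGION_MPI_INTERCOMM_CREATE,
    REGION_MPI_COMM_FREE,
    REGION_MPI_BCAST,
    REGION_MPI_IBARRIER,
    REGION_MPI_TEST,
    REGION_MPI_WAIT
};

/* Communicator definition IDs. */
enum
{
    COMM_WORLD,
    COMM_SPLIT_0,
    COMM_SPLIT_1,
    COMM_INTERCOMM
};
```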

int
main( int argc,
char** argv )
{

First initialize the MPI environment and query the size and rank.

MPI_Init( &argc, &argv );
int size;
MPI_Comm_size( MPI_COMM_WORLD, &size );
int rank;
MPI_Comm_rank( MPI_COMM_WORLD, &rank );

Create new archive handle.

OTF2_Archive* archive = OTF2_Archive_Open( "ArchivePath",
                                           "ArchiveName",
                                           OTF2_FILEMODE_WRITE,
                                           1024 * 1024 /* event chunk size */,
                                           4 * 1024 * 1024 /* def chunk size */,
                                           OTF2_SUBSTRATE_POSIX,
                                           OTF2_COMPRESSION_NONE );
Set the previously defined flush callbacks.
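A sketch of this step, assuming the callback struct defined earlier is named flush_callbacks:

```c
OTF2_Archive_SetFlushCallbacks( archive,
                                &flush_callbacks,
                                NULL /* user data */ );
```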

Now we provide the OTF2 archive object the MPI collectives. As all ranks in MPI_COMM_WORLD write into the archive, we use this communicator as the global one. We set the local communicator to MPI_COMM_NULL, as we don't care about file optimization here.

OTF2_MPI_Archive_SetCollectiveCallbacks( archive,
                                         MPI_COMM_WORLD,
                                         MPI_COMM_NULL );

Now we can create the event files, though no physical files are created yet.
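This step presumably maps to the archive's event-file call:

```c
OTF2_Archive_OpenEvtFiles( archive );
```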

Each rank now requests an event writer with its rank number as the location id.

OTF2_EvtWriter* evt_writer = OTF2_Archive_GetEvtWriter( archive,
rank );

We note the start time in each rank; this is later used to determine the global epoch.
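The start timestamp is captured into a variable; the name epoch_start matches its later use in the reduction:

```c
uint64_t epoch_start = get_time();
```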

OTF2_EvtWriter_Enter( evt_writer,
NULL,
get_time(),
REGION_MPI_INIT );

This writes an enter record for the REGION_MPI_INIT region to the local event writer; the matching leave record follows later.

We also record an MPI_Barrier in the trace. For this we generate an event before we do the MPI call.

OTF2_EvtWriter_MpiCollectiveBegin( evt_writer,
                                   NULL,
                                   get_time() );

Now we can do the MPI_Barrier call.
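That is:

```c
MPI_Barrier( MPI_COMM_WORLD );
```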

After we passed the MPI_Barrier, we can note the end of the collective operation inside the event stream.

OTF2_EvtWriter_MpiCollectiveEnd( evt_writer,
                                 NULL,
                                 get_time(),
                                 OTF2_COLLECTIVE_OP_BARRIER,
                                 COMM_WORLD,
                                 OTF2_UNDEFINED_UINT32 /* root */,
                                 0 /* bytes provided */,
                                 0 /* bytes obtained */ );

Finally we leave the region again with a leave record.

OTF2_EvtWriter_Leave( evt_writer,
NULL,
get_time(),
REGION_MPI_INIT );

TODO: add more detailed annotation for the remainder of the example

The event recording is now done, note the end time in each rank.
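The end timestamp is captured analogously to the start; the name epoch_end matches its later use in the reduction:

```c
uint64_t epoch_end = get_time();
```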

Now close the event writer, before closing the event files collectively.
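This presumably uses the archive's writer-handle call:

```c
OTF2_Archive_CloseEvtWriter( archive, evt_writer );
```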

After we wrote all of the events we close the event files again.
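The counterpart to opening the event files above:

```c
OTF2_Archive_CloseEvtFiles( archive );
```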

The per-rank definition files are optional, but create them nevertheless to please the reader. In a real application, this should be used to write ClockOffset records for time synchronisations.

OTF2_DefWriter* def_writer = OTF2_Archive_GetDefWriter( archive,
rank );
OTF2_Archive_CloseDefWriter( archive, def_writer );

We now collect all of the epoch_start and epoch_end timestamps by calculating the minimum and maximum, and provide these to the root rank.

/* Take a wall-clock timestamp for the epoch. */
struct timespec epoch_timestamp_spec;
clock_gettime( CLOCK_REALTIME, &epoch_timestamp_spec );
uint64_t epoch_timestamp = ( uint64_t )epoch_timestamp_spec.tv_sec * 1000000000
                           + epoch_timestamp_spec.tv_nsec;
struct
{
uint64_t timestamp;
int index;
} epoch_start_pair, global_epoch_start_pair;
epoch_start_pair.timestamp = epoch_start;
epoch_start_pair.index = rank;
MPI_Allreduce( &epoch_start_pair,
&global_epoch_start_pair,
1, MPI_LONG_INT, MPI_MINLOC,
MPI_COMM_WORLD );
if ( global_epoch_start_pair.index != 0 )
{
    if ( rank == 0 )
    {
        MPI_Recv( &epoch_timestamp, 1, OTF2_MPI_UINT64_T,
                  global_epoch_start_pair.index, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    }
    else if ( rank == global_epoch_start_pair.index )
    {
        MPI_Send( &epoch_timestamp, 1, OTF2_MPI_UINT64_T,
                  0, 0, MPI_COMM_WORLD );
    }
}
uint64_t global_epoch_end;
MPI_Reduce( &epoch_end,
&global_epoch_end,
1, OTF2_MPI_UINT64_T, MPI_MAX,
0, MPI_COMM_WORLD );

Only the root rank writes the global definitions, thus only it requests a writer object from the archive.

if ( 0 == rank )
{
OTF2_GlobalDefWriter* global_def_writer = OTF2_Archive_GetGlobalDefWriter( archive );

We need to define the clock used for this trace and the overall timestamp range.

OTF2_GlobalDefWriter_WriteClockProperties( global_def_writer,
                                           1000000000 /* ticks per second */,
                                           global_epoch_start_pair.timestamp,
                                           global_epoch_end - global_epoch_start_pair.timestamp + 1,
                                           epoch_timestamp );

Now we can start writing the referenced definitions, starting with the strings.

OTF2_GlobalDefWriter_WriteString( global_def_writer, 0, "" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 1, "Initial Thread" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 2, "MPI_Init" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 3, "PMPI_Init" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 4, "MPI_Finalize" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 5, "PMPI_Finalize" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 6, "MPI_Comm_split" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 7, "PMPI_Comm_split" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 8, "MPI_Intercomm_create" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 9, "PMPI_Intercomm_create" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 10, "MPI_Comm_free" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 11, "PMPI_Comm_free" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 12, "MPI_Bcast" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 13, "PMPI_Bcast" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 14, "MPI_Ibarrier" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 15, "PMPI_Ibarrier" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 16, "MPI_Test" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 17, "PMPI_Test" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 18, "MPI_Wait" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 19, "PMPI_Wait" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 20, "MyHost" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 21, "node" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 22, "MPI" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 23, "MPI_COMM_WORLD" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 24, "SPLIT 0" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 25, "SPLIT 1" );
OTF2_GlobalDefWriter_WriteString( global_def_writer, 26, "INTERCOMM" );

Write the definitions for the code regions used in this example to the global definition writer.

OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_INIT,
                                  2 /* region name */,
                                  3 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_FINALIZE,
                                  4 /* region name */,
                                  5 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_COMM_SPLIT,
                                  6 /* region name */,
                                  7 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_INTERCOMM_CREATE,
                                  8 /* region name */,
                                  9 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_COMM_FREE,
                                  10 /* region name */,
                                  11 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_BCAST,
                                  12 /* region name */,
                                  13 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_IBARRIER,
                                  14 /* region name */,
                                  15 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_TEST,
                                  16 /* region name */,
                                  17 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );
OTF2_GlobalDefWriter_WriteRegion( global_def_writer,
                                  REGION_MPI_WAIT,
                                  18 /* region name */,
                                  19 /* alternative name */,
                                  0 /* description */,
                                  OTF2_REGION_ROLE_FUNCTION,
                                  OTF2_PARADIGM_MPI,
                                  OTF2_REGION_FLAG_NONE,
                                  22 /* source file */,
                                  0 /* begin lno */,
                                  0 /* end lno */ );

Write the system tree to the global definition writer.

OTF2_GlobalDefWriter_WriteSystemTreeNode( global_def_writer,
                                          0 /* id */,
                                          20 /* name */,
                                          21 /* class */,
                                          OTF2_UNDEFINED_SYSTEM_TREE_NODE /* parent */ );

For each rank we define a new location group and one location. We also provide a unique name string for each location group.

for ( int r = 0; r < size; r++ )
{
    char process_name[ 32 ];
    snprintf( process_name, sizeof( process_name ), "MPI Rank %d", r );
    OTF2_GlobalDefWriter_WriteString( global_def_writer,
                                      27 + r,
                                      process_name );
    OTF2_GlobalDefWriter_WriteLocationGroup( global_def_writer,
                                             r /* id */,
                                             27 + r /* name */,
                                             OTF2_LOCATION_GROUP_TYPE_PROCESS,
                                             0 /* system tree */,
                                             OTF2_UNDEFINED_LOCATION_GROUP /* creating process */ );
    OTF2_GlobalDefWriter_WriteLocation( global_def_writer,
                                        r /* id */,
                                        1 /* name */,
                                        OTF2_LOCATION_TYPE_CPU_THREAD,
                                        43 /* # events */,
                                        r /* location group */ );
}

The last step is to define the MPI communicators. This is a three-step process. First we define that this trace was actually recorded in the MPI paradigm and enumerate all locations which participate in this paradigm. As we used the MPI ranks directly as the location ids, the array of locations is the identity.

uint64_t comm_locations[ size ];
for ( int r = 0; r < size; r++ )
{
comm_locations[ r ] = r;
}

Now we can define sub-groups of the previously defined list of communication locations. For MPI_COMM_WORLD this is the whole group. Note that these sub-groups are created using indices into the list of communication locations, not by enumerating location ids again. In this example the sub-group is the identity again, though.

OTF2_GlobalDefWriter_WriteGroup( global_def_writer,
                                 0 /* id */,
                                 24 /* name */,
                                 OTF2_GROUP_TYPE_COMM_LOCATIONS,
                                 OTF2_PARADIGM_MPI,
                                 OTF2_GROUP_FLAG_NONE,
                                 size,
                                 comm_locations );
OTF2_GlobalDefWriter_WriteGroup( global_def_writer,
                                 1 /* id */,
                                 0 /* name */,
                                 OTF2_GROUP_TYPE_COMM_GROUP,
                                 OTF2_PARADIGM_MPI,
                                 OTF2_GROUP_FLAG_NONE,
                                 size,
                                 comm_locations );

Finally we can write the definition of the MPI_COMM_WORLD communicator, followed by the split and inter-communicators. This finalizes the writing of the global definitions, after which we close the writer object.

OTF2_GlobalDefWriter_WriteComm( global_def_writer,
                                COMM_WORLD,
                                23 /* name */,
                                1 /* group */,
                                OTF2_UNDEFINED_COMM /* parent */,
                                OTF2_COMM_FLAG_NONE );

The even ranks form the group for the first split communicator.

for ( int r = 0; r < size; r += 2 )
{
    comm_locations[ r / 2 ] = r;
}
OTF2_GlobalDefWriter_WriteGroup( global_def_writer,
                                 2 /* id */,
                                 0 /* name */,
                                 OTF2_GROUP_TYPE_COMM_GROUP,
                                 OTF2_PARADIGM_MPI,
                                 OTF2_GROUP_FLAG_NONE,
                                 ( size + 1 ) / 2,
                                 comm_locations );
OTF2_GlobalDefWriter_WriteComm( global_def_writer,
                                COMM_SPLIT_0,
                                24 /* name */,
                                2 /* group */,
                                COMM_WORLD /* parent */,
                                OTF2_COMM_FLAG_NONE );

The odd ranks form the group for the second split communicator.

for ( int r = 1; r < size; r += 2 )
{
    comm_locations[ r / 2 ] = r;
}
OTF2_GlobalDefWriter_WriteGroup( global_def_writer,
                                 3 /* id */,
                                 0 /* name */,
                                 OTF2_GROUP_TYPE_COMM_GROUP,
                                 OTF2_PARADIGM_MPI,
                                 OTF2_GROUP_FLAG_NONE,
                                 size / 2,
                                 comm_locations );
OTF2_GlobalDefWriter_WriteComm( global_def_writer,
                                COMM_SPLIT_1,
                                25 /* name */,
                                3 /* group */,
                                COMM_WORLD /* parent */,
                                OTF2_COMM_FLAG_NONE );

The two split communicators are connected via an inter-communicator, and the global definition writer is then closed.

OTF2_GlobalDefWriter_WriteInterComm( global_def_writer,
                                     COMM_INTERCOMM,
                                     26 /* name */,
                                     2 /* groupA */,
                                     3 /* groupB */,
                                     COMM_WORLD /* common communicator */,
                                     OTF2_COMM_FLAG_NONE );
OTF2_Archive_CloseGlobalDefWriter( archive,
                                   global_def_writer );
}

All the other ranks wait inside this barrier so that root can write the global definitions.
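That barrier is a plain MPI barrier entered by every rank, including root once it has finished writing:

```c
MPI_Barrier( MPI_COMM_WORLD );
```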

At the end, close the archive, finalize the MPI environment, and exit.
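A sketch of these closing steps, assuming the archive handle from above; the final brace ends main:

```c
OTF2_Archive_Close( archive );

MPI_Finalize();

return EXIT_SUCCESS;
}
```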

To compile your program use a command like the following. Note that we need to activate the C99 standard explicitly for GCC.

$(MPICC) -std=c99 $(CFLAGS) $(OTF2_CFLAGS) -c otf2_mpi_writer_example.c -o otf2_mpi_writer_example.o

Now you can link your program with:

$(MPICC) $(CFLAGS) otf2_mpi_writer_example.o $(OTF2_LDFLAGS) $(OTF2_LIBS) -o otf2_mpi_writer_example