
likwid perfctr

Thomas Gruber edited this page Sep 12, 2024 · 60 revisions

likwid-perfctr: Measuring applications' interaction with the hardware using the hardware performance counters

While there are already many tools for measuring hardware performance counters, a lightweight command line tool for simple end-to-end measurements was still missing. The Linux MSR module, which provides an interface to access model specific registers from user space, allows us to read out hardware performance counters with an unmodified Linux kernel. Moreover, recent Intel systems provide Uncore hardware counters through PCI interfaces.

likwid-perfctr supports the following modes:

  • wrapper mode: Use likwid-perfctr as a wrapper to your application. You can measure without altering your code.
  • stethoscope mode: Measure performance counters for a variable time duration independent of any code running.
  • timeline mode: Output performance metrics at a specified interval (given in ms or s).
  • marker API: Measure only selected regions of your code; likwid-perfctr still controls what is measured.

There are pre-configured event sets, called performance groups, with useful pre-selected events and derived metrics. Alternatively, you can specify a custom event set. In a single event set, you can measure as many events as there are physical counters on a given CPU or socket. See the architecture-specific pages for more details. At startup, likwid-perfctr validates whether each event can be measured on the configured counter.

Because likwid-perfctr performs simple end-to-end measurements and does not know anything about the code which gets executed, it is crucial to pin your application. The relation between the measurement and your code is established solely through pinning. Because LIKWID works in user space, it cannot measure only a single process; it always measures CPUs or sockets. likwid-perfctr has all the pinning functionality of likwid-pin built in, so you need no additional tool for pinning. You can still control affinity yourself if you prefer.

likwid-perfctr's performance groups are simple text files and can be easily changed or extended. It is simple to create your own performance groups with custom derived metrics. In contrast to previous versions of LIKWID, no recompilation is needed anymore after changing a performance group.

Supported architectures

For architecture specific information for likwid-perfctr about available counters and related counter options click on the architecture links below.

Intel:

  • P6 processors : Pentium 2, Pentium 3 (deprecated)
  • Pentium M : Banias, Dothan (deprecated)
  • Core2 : 65nm Dual Core, 45nm Dual and Quad Core
  • Nehalem : Full support for Uncore.
  • NehalemEX : Full support for Uncore.
  • Westmere : Full support for Uncore.
  • WestmereEX : Full support for Uncore.
  • Silvermont : Support for energy counters.
  • Goldmont : Support for energy counters.
  • SandyBridge : Support for energy counters.
  • SandyBridge EP/EN : Support for energy counters. Full support for Uncore.
  • IvyBridge : Support for energy counters.
  • IvyBridge EP/EN/EX : Support for energy counters. Full support for Uncore.
  • Haswell : Support for energy counters. Full support for Uncore.
  • Haswell EP/EN/EX : Support for energy counters. Full support for Uncore.
  • Broadwell : Support for energy counters.
  • Broadwell Xeon D : Support for energy counters. Full support for Uncore.
  • Broadwell EP : Support for energy counters. Full support for Uncore.
  • Skylake : Support for energy counters. Full support for Uncore.
  • Skylake X : Support for energy counters. Full support for Uncore.
  • Intel Xeon Phi (KNC) : Intel Xeon Phi KNC, support for energy counters. Full support for Uncore.
  • Intel Xeon Phi (KNL, KNM) : Intel Xeon Phi KNL and KNM, support for energy counters. Full support for Uncore.
  • Cascadelake X : Support for energy counters. Full support for Uncore.
  • Tigerlake : Full support for core-local, energy and Uncore counters.
  • Icelake : Full support for core-local, energy and Uncore counters.
  • Icelake SP : Full support for core-local, energy and Uncore counters.
  • SapphireRapids : Full support for core-local, energy and Uncore counters.

AMD:

  • K8 : All variants
  • K10 : Barcelona, Shanghai, Istanbul and MagnyCours
  • Interlagos : Full support including NorthBridge counters
  • Kabini : Full support including L2 and NorthBridge counters
  • Zen: Full support for core-local, L3, data fabric and energy counters.
  • Zen2: Full support for core-local, L3, data fabric and energy counters.
  • Zen3: Full support for core-local, L3, data fabric and energy counters.
  • Zen4: Full support for core-local, L3, data fabric and energy counters.

ARM:

IBM:

Prerequisites

Depending on the selection of the access mode (direct or accessdaemon) the prerequisites are different.

Always required prerequisites

The MSR device files must be present. This can be checked with ls /dev/cpu/*/msr, which should list one msr device file per available CPU. If the files are missing, load the msr kernel module with sudo modprobe msr and check for the MSR device files again. To load the module at startup, add a line containing msr to /etc/modules (the filename might be different for your distribution).

Prerequisites for direct access mode

The direct access mode has less overhead than the access daemon, but it requires higher privileges for the users. Set ACCESSMODE=direct in config.mk to use this feature.

  • Make sure your user has enough rights to read and write the MSR device files. You can grant read and write access to the MSR device files like this: sudo chmod +rw /dev/cpu/*/msr
  • The MSR device files are strongly protected to avoid security vulnerabilities. To overcome these protections, do one of the following:
    • Set the capabilities of LIKWID's Lua interpreter: sudo setcap cap_sys_rawio+ep <PREFIX>/bin/likwid-lua where <PREFIX> is the installation path. Since the capabilities system is operating system dependent, this might not be enough for your system. This provides access only to core-local counters, without Uncore support.
    • Set the Lua interpreter setuid root. This is not recommended, since it allows anybody who uses LIKWID's Lua interpreter to execute code with root privileges.
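
As a quick sanity check of the file permissions described above, the following sketch counts the MSR device files the calling user may read and write. The helper name count_accessible_msr is hypothetical and not part of LIKWID. Passing this check is necessary but not sufficient, because the kernel additionally enforces the capability requirements discussed in this section.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not part of LIKWID: counts the CPUs 0..max_cpus-1
 * whose MSR device file below dev_root (normally "/dev/cpu") is readable
 * and writable according to the file permissions. */
int count_accessible_msr(const char *dev_root, int max_cpus)
{
    int accessible = 0;

    for (int cpu = 0; cpu < max_cpus; cpu++) {
        char path[256];
        int len = snprintf(path, sizeof(path), "%s/%d/msr", dev_root, cpu);
        if (len < 0 || (size_t)len >= sizeof(path))
            continue;                       /* path did not fit, skip it */
        if (access(path, R_OK | W_OK) == 0) /* permission check only */
            accessible++;
    }
    return accessible;
}
```

A call such as count_accessible_msr("/dev/cpu", 8) should return 8 on a system with eight CPUs and suitable permissions; a smaller value hints at a missing msr module or missing rights.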

Prerequisites for access daemon mode

In order to provide common users access to the hardware performance registers, you can use the access daemon. It is written with security in mind. It restricts accesses to hardware performance related registers, so users cannot read or write system related registers. When you select ACCESSMODE=accessdaemon in config.mk, you only need to install LIKWID with sudo make install. This sets the proper rights for the access daemon. Do not change the CHOWN variables in config.mk unless you want to use different permissions (group that is allowed to access the MSR device files, different name of the root user, etc.).

Update for Linux kernel 5.9 and newer: With Linux 5.9, the msr kernel module got some security fixes. The major change for LIKWID is that all MSRs are now non-writable by default. To enable writes again, add msr.allow_writes=on to the boot options of your operating system. This affects only ACCESSMODE=direct and ACCESSMODE=accessdaemon. If you use the perf_event backend, you don't have to change anything.

Update for Linux kernel 5.10 and newer: We received reports that with Linux 5.10 the PCI accesses are also restricted by security mechanisms. To fix this, the access daemon requires an additional capabilities flag: sudo setcap cap_sys_admin,cap_sys_rawio=ep EXECUTABLE

See also the file INSTALL for further details. In security-sensitive environments, such as multi-user systems or HPC clusters, uncontrolled access to all MSR registers is a security problem. For solutions to this issue, have a look at Build and likwid-accessD.

Options

-h, --help		 Help message
-v, --version		 Version information
-V, --verbose <level>	 Verbose output, 0 (only errors), 1 (info), 2 (details), 3 (developer)
-c <list>		 Processor ids to measure (required), e.g. 1,2-4,8
-C <list>		 Processor ids to pin threads and measure, e.g. 1,2-4,8
			 For information about the <list> syntax, see likwid-pin
-G <list>		 GPU ids to measure
-g, --group <string>	 Performance group or custom event set string for CPUs
-W <string>		 Performance group or custom event set string for Nvidia GPUs
-H			 Get group help (together with -g switch)
-s, --skip <hex>	 Bitmask with threads to skip
-M <0|1>		 Set how MSR registers are accessed, 0=direct, 1=accessDaemon
-a			 List available performance groups
-e			 List available events and counter registers
-E <string>              List available events and corresponding counters that match <string> (case-insensitive)
-i, --info		 Print CPU info
-T <time>		 Switch eventsets with given frequency
Modes:
-S <time>		 Stethoscope mode with duration in s, ms or us, e.g 20ms
-t <time>		 Timeline mode with frequency in s, ms or us, e.g. 300ms
-m, --marker		 Use Marker API inside code
Output options:
-o, --output <file>	 Store output to file. (Optional: Apply text filter according to filename suffix)
-O			 Output easily parseable CSV instead of fancy tables

Basic Usage (Wrapper mode)

Output help text with

$ likwid-perfctr -h

There are two required flags: -c to configure the cores for which the counters should be measured and -g to specify which group or event set you want to measure. The core ID list is a comma-separated list which can also contain ranges, e.g. 1,2,4-7. This list can be specified in all variants supported by likwid-pin, from physical processor IDs to different logical variants. To figure out the thread and cache topology, you can use likwid-topology. As likwid-perfctr measures processors and has no knowledge about your process or threads, you have to ensure that the code you want to measure really runs on the processors you measure with likwid-perfctr. likwid-perfctr includes all functionality of likwid-pin for pinning a threaded application. Alternatively, you can take care of the pinning yourself with another tool or from within the code.
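
To illustrate the basic list syntax, here is a minimal sketch in C that expands a string such as 1,2,4-7 into individual processor IDs. The function expand_cpu_list is hypothetical and not part of LIKWID. It covers only plain numeric IDs and ranges, not the logical variants (for example S0:1) that likwid-pin supports.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of LIKWID: expands a plain list such as
 * "1,2,4-7" into ids[]. Returns the number of IDs written, or -1 for a
 * malformed list or when more than max_ids entries would be produced. */
int expand_cpu_list(const char *list, int *ids, int max_ids)
{
    int n = 0;
    const char *p = list;

    while (*p != '\0') {
        char *end;
        long first = strtol(p, &end, 10);
        if (end == p || first < 0)
            return -1;                      /* no valid number found */
        long last = first;
        p = end;

        if (*p == '-') {                    /* range of the form a-b */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
            p = end;
        }

        for (long id = first; id <= last; id++) {
            if (n >= max_ids)
                return -1;                  /* output array too small */
            ids[n++] = (int)id;
        }

        if (*p == ',') {
            p++;
            if (*p == '\0')
                return -1;                  /* trailing comma */
        } else if (*p != '\0') {
            return -1;                      /* unexpected character */
        }
    }
    return n;
}
```

With the list 1,2,4-7 the function fills the array with 1, 2, 4, 5, 6, 7 and returns 6.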

For gathering information about hardware performance capabilities and performance groups use the -a, -g and -H switches.

Print all supported groups on a processor to stdout:

$ likwid-perfctr -a

To get a list with all supported counter registers and events, call:

$ likwid-perfctr -e | less

To get a list with all supported events and corresponding counter registers that match a string (case insensitive), call:

$ likwid-perfctr -E <string>

A help text explaining a specific event group can be requested with -H together with the -g switch:

$ likwid-perfctr -H -g MEM

This prints the text below LONG in the performance group file. For custom performance groups, it is recommended to add a describing text and the formulas of the derived metrics.

To use likwid-perfctr for a serial application execute:

$ likwid-perfctr  -C S0:1  -g BRANCH  ./a.out

This will pin the application to the second core (index 1) on socket zero (S0) and measure the performance group BRANCH on this core. An explanation of the CPU string notation can be found on the page likwid-pin. The output for the serial application looks like this:

--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
CPU type:	Intel Core Haswell processor
CPU clock:	3.39 GHz
--------------------------------------------------------------------------------
YOUR PROGRAM OUTPUT
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+---------+
|             Event            | Counter |  Core 1 |
+------------------------------+---------+---------+
|       INSTR_RETIRED_ANY      |  FIXC0  |  201137 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  |  375590 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1595994 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  44079  |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |   3982  |
+------------------------------+---------+---------+

+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    | 3.522605e-03 |
|    Runtime unhalted [s]    | 1.107221e-04 |
|         Clock [MHz]        | 7.982933e+02 |
|             CPI            | 1.867334e+00 |
|         Branch rate        | 2.191491e-01 |
|  Branch misprediction rate | 1.979745e-02 |
| Branch misprediction ratio | 9.033780e-02 |
|   Instructions per branch  | 4.563103e+00 |
+----------------------------+--------------+

The output will always consist of a table with the raw event counts and another table with derived metrics. The columns are the processor ids measured. If you measure more than one core, there is another table with statistical data like sum, minimum, maximum and average of all measured cores.
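
The derived metrics in the second table follow from the raw counts in the first. The sketch below recomputes some of them for the serial run above. The formulas are inferred from the printed values, not copied from the group file, so treat the output of likwid-perfctr -H -g BRANCH as the authoritative source. The struct and function names are made up for this illustration.

```c
/* Raw counts of one core, as listed in the event table. */
typedef struct {
    double instr_retired;     /* INSTR_RETIRED_ANY            (FIXC0) */
    double clk_unhalted_core; /* CPU_CLK_UNHALTED_CORE        (FIXC1) */
    double branches;          /* BR_INST_RETIRED_ALL_BRANCHES (PMC0)  */
    double mispredicted;      /* BR_MISP_RETIRED_ALL_BRANCHES (PMC1)  */
} BranchCounts;

/* Subset of the derived metrics of the BRANCH group. */
typedef struct {
    double cpi;                  /* cycles per instruction        */
    double branch_rate;          /* branches per instruction      */
    double misprediction_rate;   /* mispredictions per instruction */
    double misprediction_ratio;  /* mispredictions per branch     */
    double instr_per_branch;     /* instructions per branch       */
} BranchMetrics;

/* Assumes nonzero instruction and branch counts (no division guard). */
BranchMetrics branch_metrics(BranchCounts c)
{
    BranchMetrics m;
    m.cpi                 = c.clk_unhalted_core / c.instr_retired;
    m.branch_rate         = c.branches / c.instr_retired;
    m.misprediction_rate  = c.mispredicted / c.instr_retired;
    m.misprediction_ratio = c.mispredicted / c.branches;
    m.instr_per_branch    = c.instr_retired / c.branches;
    return m;
}
```

Feeding in the counts of Core 1 from the table (201137, 375590, 44079 and 3982) reproduces the CPI of about 1.867 and the branch misprediction ratio of about 0.0903 shown in the metric table.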

In general, the events have the same naming as in the official processor manuals (substituted "." by "_"). The relevant manuals are the Intel Software Development Manual 3B Appendix A and for AMD the BIOS and Kernel Developers Guides (BKDG) of the appropriate processor. You can also have a look in the optimization manuals provided by the vendors for interesting event sets or at Intel's Performance monitoring database (https://github.com/intel/perfmon, https://github.com/TomTheBear/perfmondb). There are the OFFCORE_RESPONSE events on Intel systems that don't follow the Intel notation. You have to specify the bits for the filter registers yourself using the OFFCORE_RESPONSE_0/1_OPTIONS event with the event options match0 (lower register part) and match1 (higher register part). LIKWID also introduces some events that cannot be found in the official documentation. They are commonly known events with pre-configured event options.

LIKWID counts all events in user-space by default. Kernel-space counting is deactivated but for some architectures, it can be enabled by adding the KERNEL counter option. See description of counter options for more details. It is not possible to count only kernel-space.

Basic threaded usage

For threaded use nothing changes apart from the -C command line argument. The application must be compiled with threading support. You do not need to set OMP_NUM_THREADS or CILK_WORKERS, this is done by likwid-perfctr according to the given CPU list. When the environment variables are already set, likwid-perfctr does not overwrite them.

$ likwid-perfctr -C 0-3 -g BRANCH ./a.out
--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
CPU type:	Intel Core Haswell processor
CPU clock:	3.39 GHz
--------------------------------------------------------------------------------
YOUR PROGRAM OUTPUT
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+----------+---------+----------+---------+
|             Event            | Counter |  Core 0  |  Core 1 |  Core 2  |  Core 3 |
+------------------------------+---------+----------+---------+----------+---------+
|       INSTR_RETIRED_ANY      |  FIXC0  | 15585960 | 5526616 |  7679943 | 4045942 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 15025112 | 4660629 |  7745757 | 3406840 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 44696128 | 9473964 | 22825288 | 3762474 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  1470984 |  752872 |  1163894 |  345736 |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |   9457   |   8238  |   25573  |   1025  |
+------------------------------+---------+----------+---------+----------+---------+

+-----------------------------------+---------+----------+---------+----------+------------+
|               Event               | Counter |    Sum   |   Min   |    Max   |     Avg    |
+-----------------------------------+---------+----------+---------+----------+------------+
|       INSTR_RETIRED_ANY STAT      |  FIXC0  | 32838461 | 4045942 | 15585960 | 8209615.25 |
|     CPU_CLK_UNHALTED_CORE STAT    |  FIXC1  | 30838338 | 3406840 | 15025112 |  7709584.5 |
|     CPU_CLK_UNHALTED_REF STAT     |  FIXC2  | 80757854 | 3762474 | 44696128 | 20189463.5 |
| BR_INST_RETIRED_ALL_BRANCHES STAT |   PMC0  |  3733486 |  345736 |  1470984 |  933371.5  |
| BR_MISP_RETIRED_ALL_BRANCHES STAT |   PMC1  |   44293  |   1025  |   25573  |  11073.25  |
+-----------------------------------+---------+----------+---------+----------+------------+

+----------------------------+--------------+--------------+--------------+--------------+
|           Metric           |    Core 0    |    Core 1    |    Core 2    |    Core 3    |
+----------------------------+--------------+--------------+--------------+--------------+
|     Runtime (RDTSC) [s]    | 6.292864e-02 | 6.292864e-02 | 6.292864e-02 | 6.292864e-02 |
|    Runtime unhalted [s]    | 4.429985e-03 | 1.374134e-03 | 2.283749e-03 | 1.004468e-03 |
|         Clock [MHz]        | 1.140153e+03 | 1.668508e+03 | 1.150968e+03 | 3.071098e+03 |
|             CPI            | 9.640158e-01 | 8.433061e-01 | 1.008570e+00 | 8.420388e-01 |
|         Branch rate        | 9.437879e-02 | 1.362266e-01 | 1.515498e-01 | 8.545253e-02 |
|  Branch misprediction rate | 6.067640e-04 | 1.490605e-03 | 3.329842e-03 | 2.533403e-04 |
| Branch misprediction ratio | 6.429030e-03 | 1.094210e-02 | 2.197193e-02 | 2.964690e-03 |
|   Instructions per branch  | 1.059560e+01 | 7.340711e+00 | 6.598490e+00 | 1.170240e+01 |
+----------------------------+--------------+--------------+--------------+--------------+

+---------------------------------+--------------+--------------+-------------+----------------+
|              Metric             |      Sum     |      Min     |     Max     |       Avg      |
+---------------------------------+--------------+--------------+-------------+----------------+
|     Runtime (RDTSC) [s] STAT    |  0.25171456  |  0.06292864  |  0.06292864 |   0.06292864   |
|    Runtime unhalted [s] STAT    |  0.009092336 |  0.001004468 | 0.004429985 |   0.002273084  |
|         Clock [MHz] STAT        |   7030.727   |   1140.153   |   3071.098  |   1757.68175   |
|             CPI STAT            |   3.6579307  |   0.8420388  |   1.00857   |   0.914482675  |
|         Branch rate STAT        |  0.46760772  |  0.08545253  |  0.1515498  |   0.11690193   |
|  Branch misprediction rate STAT | 0.0056805513 | 0.0002533403 | 0.003329842 | 0.001420137825 |
| Branch misprediction ratio STAT |  0.04230775  |  0.00296469  |  0.02197193 |  0.0105769375  |
|   Instructions per branch STAT  |   36.237201  |    6.59849   |   11.7024   |   9.05930025   |
+---------------------------------+--------------+--------------+-------------+----------------+

Please note that in previous versions of LIKWID you had to specify the threading implementation used. This is not necessary anymore. LIKWID uses a pinning library that overloads the call of pthread_create, the thread creation procedure used by many threading solutions (of course PThreads but also OpenMP, Cilk+, C++11 threads).

On newer processors there is one issue related to Uncore events. The Uncore counters measure per socket. Therefore likwid-perfctr has a socket lock which ensures that only one thread per socket starts the counters and only one thread per socket stops them. The first CPU initialized per socket gets and keeps the lock for the whole execution time. Be aware that in the statistics tables, the processors that haven't measured the Uncore event are included, so only the values MAX and SUM are usable.

Using custom event sets

likwid-perfctr allows you to specify custom event sets. You can measure as many events in one event set as there are physical counters on an architecture. You specify the event set as a comma separated list of event/counter pairs. This is highly architecture dependent! On Intel architectures, the fixed purpose events are added automatically to the event set if not already present. These fixed events are retired instructions (INSTR_RETIRED_ANY:FIXC0), clock cycles with the current frequency (CPU_CLK_UNHALTED_CORE:FIXC1) and clock cycles of the reference clock (CPU_CLK_UNHALTED_REF:FIXC2) while the CPU is running in unhalted state.

This could look like:

$ likwid-perfctr -C 0-3 -g FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE:PMC0,FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE:PMC1 ./a.out

--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
CPU type:	Intel Xeon IvyBridge EN/EP/EX processor
CPU clock:	3.00 GHz
--------------------------------------------------------------------------------
YOUR PROGRAM OUTPUT
--------------------------------------------------------------------------------
Group 1:
+--------------------------------------+---------+--------------+--------------+--------------+--------------+
|                 Event                | Counter |    Core 0    |    Core 1    |    Core 2    |    Core 3    |
+--------------------------------------+---------+--------------+--------------+--------------+--------------+
|          Runtime (RDTSC) [s]         |   TSC   | 2.058991e+01 | 2.058991e+01 | 2.058991e+01 | 2.058991e+01 |
|           INSTR_RETIRED_ANY          |  FIXC0  |  99177052283 |  70451946660 |  42327093707 |  14201949658 |
|         CPU_CLK_UNHALTED_CORE        |  FIXC1  |  52549591060 |  37828491423 |  23914813640 |  9075382636  |
|         CPU_CLK_UNHALTED_REF         |  FIXC2  |  45643583430 |  33342410670 |  21380293590 |  8250340590  |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE |   PMC0  |  23240920556 |  16616581676 |  9974859861  |  3342307010  |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE |   PMC1  |    9889323   |   10181092   |   10184277   |   10276947   |
+--------------------------------------+---------+--------------+--------------+--------------+--------------+

+-------------------------------------------+---------+--------------+-------------+-------------+----------------+
|                   Event                   | Counter |      Sum     |     Min     |     Max     |       Avg      |
+-------------------------------------------+---------+--------------+-------------+-------------+----------------+
|          Runtime (RDTSC) [s] STAT         |   TSC   |   82.35964   |   20.58991  |   20.58991  |    20.58991    |
|           INSTR_RETIRED_ANY STAT          |  FIXC0  | 226158042308 | 14201949658 | 99177052283 |   56539510577  |
|         CPU_CLK_UNHALTED_CORE STAT        |  FIXC1  | 123368278759 |  9075382636 | 52549591060 | 30842069689.75 |
|         CPU_CLK_UNHALTED_REF STAT         |  FIXC2  | 108616628280 |  8250340590 | 45643583430 |   27154157070  |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE STAT |   PMC0  |  53174669103 |  3342307010 | 23240920556 | 13293667275.75 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE STAT |   PMC1  |   40531639   |   9889323   |   10276947  |   10132909.75  |
+-------------------------------------------+---------+--------------+-------------+-------------+----------------+

The custom event set is FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE:PMC0,FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE:PMC1. It defines two events programmed on the counters PMC0 and PMC1. The events for the counters FIXC0, FIXC1 and FIXC2 are added automatically on Intel platforms. The event Runtime (RDTSC) [s] with its counter TSC is a virtual event. It is not possible to define this event or counter in a custom event set. Normally the runtime of an application is printed in the metrics tables defined by performance groups. Custom event sets have no derived metrics. To give a complete overview for custom event sets as well, the runtime is printed as a virtual event.

Performance groups

For common tasks there exist pre-configured event sets. These groups provide useful event sets and compute common derived metrics. We try to provide a basic set of groups on all architectures. Due to the differing capabilities some groups may be processor specific. You can print available groups on an architecture with likwid-perfctr -a. For processor specific information about what events are chosen for the groups use the -H -g group switch. This gives you detailed documentation from which events the derived metrics are computed.

Using multiple event sets or performance groups in a single run

Starting with LIKWID version 4.0.0, it is possible to measure multiple performance groups or custom event sets in a single run. Each event set can use all available performance counters. The event sets are switched periodically in a round-robin fashion. The default switch period is 2 seconds, but you can alter this time on the command line using the -T <time> switch. Please be aware that not all architectures provide the L2 and L3 performance groups; we can only provide a performance group if the architecture offers the required events.

$ likwid-perfctr -C S0:0-3 -g L2 -g L3 -T 500ms ./a.out

This call measures the application pinned to the first 4 cores of socket 0. It starts by measuring the L2 performance group and switches to the L3 group after 500 milliseconds. After another 500 ms, the L2 group is programmed and measured again. The output prints the results of both performance groups but only for the time that they were measured, hence runtime_L2 + runtime_L3 = runtime_a.out. The derived metrics use the group runtime, not the overall runtime. No extrapolation or similar processing is done. If you want to know, for example, the L2 data volume for the whole run (under the assumption that it grows linearly), you have to scale it yourself. Multiple groups are not supported by the stethoscope mode!
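
The manual scaling mentioned above can be sketched as follows. The helper extrapolate_to_full_run is hypothetical and not part of LIKWID. It simply multiplies a value by the ratio of overall runtime to group runtime, which is only meaningful if the measured quantity really grows linearly over the run.

```c
/* Hypothetical helper, not part of LIKWID: linearly extrapolates a value
 * that was measured only while its group was active to the whole run.
 *   measured       value reported for the group (e.g. a data volume)
 *   group_runtime  time during which the group was measured, in seconds
 *   total_runtime  runtime of the whole application, in seconds
 * Returns 0.0 if the group was never measured. */
double extrapolate_to_full_run(double measured,
                               double group_runtime,
                               double total_runtime)
{
    if (group_runtime <= 0.0)
        return 0.0;
    return measured * (total_runtime / group_runtime);
}
```

For example, a data volume of 10 GByte measured during 2 s of a 4 s run extrapolates to 20 GByte.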

Using the Marker API

The Marker API allows you to measure named regions of your code. Overlap or nesting of the regions is allowed. You can also enter a region multiple times, e.g. in a loop. The counters for each region are accumulated. In the threaded case, you can have serial and threaded regions.

The Marker API only reads out the counters. The configuration of the counters is still handled by the wrapper application likwid-perfctr. In order to use the LIKWID Marker API, you must include the file likwid-marker.h and link your code against the LIKWID library. In some cases you also need Pthreads enabled during linking, commonly done by setting -pthread on the compiler's command line. To let you quickly toggle the Marker API, the LIKWID header contains a set of macros that are activated only when LIKWID_PERFMON is defined during the build of your software. Guard the include of the LIKWID header and provide empty fallback macros so that your code also compiles if LIKWID is not available.

For gcc or icc, this looks like, for example:

$ gcc -O3 -fopenmp -pthread -o test dofp.c -DLIKWID_PERFMON -I<PATH_TO_LIKWID>/include -L<PATH_TO_LIKWID>/lib -llikwid -lm

Below is an example showing the usage of the Marker API for a serial code:

// This block allows the code to compile with and without the LIKWID header in place
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_SWITCH
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif

LIKWID_MARKER_INIT;
LIKWID_MARKER_THREADINIT;

LIKWID_MARKER_START("Compute");
// Your code to measure
LIKWID_MARKER_STOP("Compute");
LIKWID_MARKER_CLOSE;

For a threaded code it is important to call the following sequence of function calls from the serial part of the program:

LIKWID_MARKER_INIT;
[...]
LIKWID_MARKER_CLOSE;

If you use the Marker API together with likwid-accessD, it is highly recommended to call

LIKWID_MARKER_REGISTER(string);

for each code region and application thread you want to measure with the used identifier strings. This creates basic structures and establishes the connection to the access daemon. If you don't do it and your code runs only for a short time, the values of the first region in the code will be off/lower.

For convenience there is also a simple API to pin your code or process or get the processor id.

likwid_pinProcess(int processorId);
likwid_pinThread(int processorId);
likwid_getProcessorId();

LIKWID starting with release 4.0.0 introduces some more Marker API calls: Switch between multiple event sets (causes much overhead compared to the other API functions):

LIKWID_MARKER_SWITCH;

Moreover, if you want to reduce the overhead of LIKWID_MARKER_START, you can register the region names in advance. This avoids creating the hash tables serially, which can cause timing problems. It is optional but highly recommended!

LIKWID_MARKER_REGISTER("Compute")

If you want to process the aggregated measurement values inside of your application:

LIKWID_MARKER_GET("Compute", nevents, events, time, count)

where nevents is int* defining the length of the given array events (type double*) and contains the number of filled entries at return. time has type double* and count has type int*.
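
A minimal sketch of such a call is shown below. The function print_region_results and the limit MAX_EVENTS are assumptions made for this illustration. The sketch also assumes that the call is issued by the thread that measured the region, after the region was stopped and before LIKWID_MARKER_CLOSE. Without LIKWID_PERFMON the macro expands to nothing, so the sketch reports zero filled entries.

```c
#include <stdio.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif

#define MAX_EVENTS 16   /* assumed upper bound for this sketch */

/* Prints the accumulated values of one region for the calling thread and
 * returns the number of event values that were filled in. */
int print_region_results(const char *tag)
{
    double events[MAX_EVENTS] = {0.0};
    double seconds = 0.0;
    int count = 0;
    int nevents = MAX_EVENTS;   /* in: array length, out: filled entries */

    LIKWID_MARKER_GET(tag, &nevents, events, &seconds, &count);
#ifndef LIKWID_PERFMON
    nevents = 0;                /* Marker API compiled out: nothing filled */
#endif

    printf("Region %s: %d calls, %f s\n", tag, count, seconds);
    for (int i = 0; i < nevents; i++)
        printf("  event %d: %f\n", i, events[i]);

    return nevents;
}
```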

Note: No whitespace characters are allowed in the region tag!

If you want to reset the counts for a region: (available in 4.3.3 and later)

LIKWID_MARKER_RESET("Compute")

The call has to be performed by each thread to reset its own values.

In order to run an executable with instrumentation, you have to activate the Marker API for likwid-perfctr using the -m switch:

$ likwid-perfctr -C 0-3 -g BRANCH -m ./a.out

Since the CPU list and the event set are given to likwid-perfctr and not programmed into the executable, you have full flexibility for measurements without further modifying the executable itself.

Notice: Each thread reads the counters individually, so when one thread has less work, it will read the counters before the other thread(s). This can be crucial when the thread with little work is the one that reads the Uncore counters. The Uncore counters are socket-specific, and LIKWID commonly uses the first hardware thread in the affinity list of a socket to measure them. For example, if you have 2 sockets with the hardware threads 0,1,2,3 and 4,5,6,7 and you run with 1,2,3,5,6,7, the hardware threads 1 and 5 will measure the Uncore counters. Consequently, their execution of user code is delayed, and the other threads may already have performed some of their iterations before the Uncore threads start to execute the user code.

Example: You want the memory data volume of a loop executed by multiple threads. While the threads responsible for reading the Uncore counters are still reading the stopped counters, the other threads already load and store data from/to memory. This data volume is not counted, and consequently the memory data volume will be lower than expected.

Use barriers or conditional waits to synchronize the threads if you want exact measurements.
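
One possible way to place such barriers is sketched below for an OpenMP code. The function measured_sum and the region name Sum are made up for this illustration, and the fallback macros let the code compile without LIKWID. The explicit barrier after LIKWID_MARKER_START keeps all threads from touching the data until every thread has started its counters, and the implicit barrier at the end of the omp for loop ensures that all work is finished before any thread stops them.

```c
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Sums data[0..n-1] inside a measured region named "Sum".
 * Sketch only: INIT and CLOSE are placed here for brevity, so this
 * function should be called once per process. */
double measured_sum(const double *data, int n)
{
    double sum = 0.0;

    LIKWID_MARKER_INIT;                 /* serial part of the program */

#pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;
        LIKWID_MARKER_REGISTER("Sum");  /* set up the region in advance */

        LIKWID_MARKER_START("Sum");
        /* Wait until every thread has started its counters. */
#pragma omp barrier

#pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += data[i];
        /* Implicit barrier of "omp for": all work is done at this point. */

        LIKWID_MARKER_STOP("Sum");
    }

    LIKWID_MARKER_CLOSE;                /* serial part of the program */
    return sum;
}
```

The barriers add waiting time to the measured region, so this trades a slightly longer region runtime for counts that cover the work of all threads.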

We are currently considering Marker API calls that include barriers and/or an environment variable that activates barriers in all Marker API calls.

Using the Marker API with Fortran 90

There is a native interface for using the LIKWID Marker API with Fortran 90 programs. You have to enable it in the config.mk file as it is not enabled by default. If you enable it, the Intel Fortran compiler flags are set. To change this to gfortran, edit ./make/include_GCC.mk and set gfortran with the appropriate flags. Make sure that the Fortran interface module likwid.mod is in your module include path and that your program is linked against the LIKWID library.

For the Intel Fortran compiler this can look as follows:

$ ifort -I<PATH_TO_LIKWID>/include -O3 -o fortran chaos.F90 -L<PATH_TO_LIKWID>/lib -llikwid  -lpthread -lm -DLIKWID_PERFMON

There is an example of how to use the Marker API in Fortran in the test directory (chaos.F90) and in the examples directory (F-markerAPI.F90). Code example:

call likwid_markerInit()
call likwid_markerThreadInit()

call likwid_markerStartRegion("sub")
! Do stuff
call likwid_markerStopRegion("sub")

call likwid_markerClose()

All functions that are available in the C Marker API are also available for Fortran 90, including likwid_markerRegisterRegion, likwid_markerNextGroup and likwid_markerGetRegion.

Syntax of the intermediate Marker API file

When the instrumented code closes the Marker API, it writes a file with the results to disk. By default, the /tmp directory is used and the common file name is likwid_<PID_OF_PERFCTR>.txt. The syntax of the file is:

<nrThreads> <nrRegions> <nrGroups>
<regionID_1>:<regionName_1>
[...]
<regionID_n>:<regionName_n>
<regionID_1> <groupID> <cpuID_1> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>
<regionID_1> <groupID> <cpuID_2> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>
[...]
<regionID_n> <groupID> <cpuID_L-1> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>
<regionID_n> <groupID> <cpuID_L> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>

where <regionName_1> is the actual user-provided string suffixed with -<groupId>.
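The layout above can be read with a few lines of code. The following is a minimal sketch in Python, not the LIKWID helper functions listed below: it assumes whitespace-separated fields exactly as in the syntax description, and the sample contents (region name, counts, times) are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegionResult:
    region_id: int
    group_id: int
    cpu_id: int
    call_count: int
    region_time: float
    events: List[float]


def parse_marker_file(text: str):
    """Parse the intermediate Marker API file layout described above."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    nr_threads, nr_regions, nr_groups = (int(x) for x in lines[0].split())

    # Region table: one "<regionID>:<regionName>" line per region.
    regions: Dict[int, str] = {}
    for ln in lines[1:1 + nr_regions]:
        region_id, name = ln.split(":", 1)
        regions[int(region_id)] = name

    # Result lines: six leading fields, then <nrEvents> event values.
    results: List[RegionResult] = []
    for ln in lines[1 + nr_regions:]:
        f = ln.split()
        nr_events = int(f[5])
        results.append(RegionResult(
            region_id=int(f[0]), group_id=int(f[1]), cpu_id=int(f[2]),
            call_count=int(f[3]), region_time=float(f[4]),
            events=[float(v) for v in f[6:6 + nr_events]],
        ))
    return (nr_threads, nr_regions, nr_groups), regions, results


# Hypothetical file contents: 2 threads, 1 region, 1 group, 2 events each.
sample = """2 1 1
0:sub-0
0 0 0 10 0.5 2 1000 2000
0 0 1 10 0.4 2 900 1800
"""
header, regions, results = parse_marker_file(sample)
```

Each result line belongs to one region, group and hardware thread, so per-region totals can be obtained by summing the events of all entries with the same region_id.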

For parsing this file the LIKWID library provides helper functions:

  • int perfmon_readMarkerFile(const char* filename)
  • int perfmon_getNumberOfRegions(): Get number of regions
  • int perfmon_getGroupOfRegion(int region): If the groups are switched in the application (see LIKWID_MARKER_SWITCH), you can get the group identifier of the region's group. This only works if you called perfmon_init() and gid = perfmon_addEventSet(group) before. The group identifier is the same as gid.
  • char* perfmon_getTagOfRegion(int region): Get the name tag of the region
  • int perfmon_getEventsOfRegion(int region): Get number of events measured in a region
  • int perfmon_getMetricsOfRegion(int region): Get number of derived metrics measured in a region
  • int perfmon_getThreadsOfRegion(int region): Returns the number of threads that executed the region
  • int perfmon_getCpulistOfRegion(int region, int count, int* cpulist): Get the hardware thread IDs that executed the region.
  • double perfmon_getTimeOfRegion(int region, int thread): Get aggregated runtime of thread in region
  • int perfmon_getCountOfRegion(int region, int thread): Get the count of region executions by a thread
  • double perfmon_getResultOfRegionThread(int region, int event, int thread): Get the measurement for an event for a thread and region
  • double perfmon_getMetricOfRegionThread(int region, int metricId, int threadId): Get the measurement for a derived metric for a thread and region

Marker API in other programming languages

The LIKWID team currently has no plans to provide the Marker API for other programming languages. But since LIKWID is open-source, everybody is welcome to create a module for his/her favorite programming language. Here is a list of projects offering the API in other languages:

Defining custom performance groups

With recent versions of LIKWID it is easy to specify your own performance groups or change existing ones. All groups are specified as text files in the directory $HOME/.likwid/groups/ARCH/, where ARCH is a short name for the processor microarchitecture. You can get the short name by running likwid-perfctr -i (since 5.11.2015). Adding a new group or changing an existing group is nothing more than editing a text file. The format is best explained by an example:

SHORT Double Precision MFlops/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
PMC0  FP_COMP_OPS_EXE_SSE_FP_PACKED
PMC1  FP_COMP_OPS_EXE_SSE_FP_SCALAR
PMC2  FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
PMC3  FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION

METRICS
Runtime [s] FIXC1*inverseClock
CPI  FIXC1/FIXC0
DP MFlops/s (DP assumed) 1.0E-06*(PMC0*2.0+PMC1)/time
Packed MUOPS/s   1.0E-06*PMC0/time
Scalar MUOPS/s 1.0E-06*PMC1/time
SP MUOPS/s 1.0E-06*PMC2/time
DP MUOPS/s 1.0E-06*PMC3/time

LONG
Double Precision MFlops/s

The order of the statements is important. The first tag marks a SHORT description of the group. It is followed by the EVENTSET tag and the list of events to measure. Only as many events as there are physical counters on an architecture can be measured, but LIKWID skips non-present counters at initialization. Each line names the performance counter register first and then, separated by spaces, the event. The supported events can be taken, for example, from the list printed with the -e flag. You can specify further options for an event by appending them to the register definition as :OPT1=VAL1:OPT2=VAL2. After the METRICS tag, a list of derived metrics is specified, one metric per line. Every metric consists of a short description (as it will appear in the table) followed by a formula (without spaces). The formula must follow C syntax. Preset variables that can be used in a metric are time for the runtime and inverseClock for the inverse clock of the current processor. Since the performance groups are interpreted by Lua, LIKWID no longer needs to be rebuilt after a group changes.

With the commit https://github.com/RRZE-HPC/likwid/commit/4849571d833315c96e2e14c8e16ea1a222587730 we added an additional folder that is checked for performance group files. Since every user should be able to create their own custom groups, LIKWID now also checks the folder $HOME/.likwid/groups/ARCH.
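To make the file layout concrete, here is a minimal sketch in Python that splits a group file into its sections. It is an illustration of the format described above, not LIKWID's own Lua-based loader; the function name parse_group and the returned dictionary layout are hypothetical. It relies on the formula containing no spaces, so the formula is the last whitespace-separated token of a metric line.

```python
def parse_group(text):
    """Split a performance group file into SHORT, EVENTSET, METRICS, LONG."""
    group = {"short": "", "eventset": [], "metrics": [], "long": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SHORT"):
            group["short"] = line[len("SHORT"):].strip()
            section = None
        elif line in ("EVENTSET", "METRICS", "LONG"):
            section = line
        elif section == "EVENTSET":
            # "<register>[:OPT=VAL...] <event>"
            counter, event = line.split(None, 1)
            register, *options = counter.split(":")
            group["eventset"].append((register, options, event.strip()))
        elif section == "METRICS":
            # Description may contain spaces, the formula may not.
            description, formula = line.rsplit(None, 1)
            group["metrics"].append((description, formula))
        elif section == "LONG":
            group["long"].append(line)
    return group


sample = """SHORT Double Precision MFlops/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
PMC0  FP_COMP_OPS_EXE_SSE_FP_PACKED

METRICS
CPI  FIXC1/FIXC0
Packed MUOPS/s   1.0E-06*PMC0/time

LONG
Double Precision MFlops/s
"""
group = parse_group(sample)
```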

With the commits https://github.com/RRZE-HPC/likwid/commit/75bb5737f039c0bf739ebfe728f78768d5748eeb and https://github.com/RRZE-HPC/likwid/commit/88dedc39e02d3b7e171503807836a1d70f1b13ec we changed the internal calculator to a Lua-based version. Starting with these commits, you don't need to specify event options anymore in the metric formulas. All performance groups were changed to reflect this.

The timeline mode

likwid-perfctr allows you to measure a time-resolved profile. With

$ likwid-perfctr -c N:0-7 -g BRANCH -t 2s > out.txt

you can measure the branching behavior of the machine on CPU cores 0-7 with a measurement every 2 seconds. This means the counters will be read out every 2 seconds. This is implemented in a lightweight fashion. The output is written to stderr. The syntax of the timeline mode output lines with a custom event set is:

<groupID> <numberOfEvents> <numberOfThreads> <Timestamp> <Event1_Thread1> <Event1_Thread2> ... <EventN_ThreadN>

The output of the timeline mode is different for custom event sets and performance groups. For custom event sets, EventX refers to the raw count of event X. For performance groups, EventX means MetricX, so the line lists the derived metrics for the different threads. In general, when using a performance group one is more interested in the derived metrics than in the raw counts, so I changed the behavior. So for a performance group:

<groupID> <numberOfEvents> <numberOfThreads> <Timestamp> <Metric1_Thread1> <Metric1_Thread2> ... <MetricN_ThreadN>

You can set multiple event sets on the command line. After each measurement period, the event set is switched to the next one in a round-robin fashion. Please note that when you request a readout every 2s with multiple event sets, each group is read out only every 2s*<numberOfEventSets>.
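The line syntax and the round-robin arithmetic above can be sketched as follows. This is a minimal Python illustration that assumes whitespace-separated fields in exactly the order shown in the syntax lines; the function names and the sample line are made up, and real output may need additional handling.

```python
def parse_timeline_line(line):
    """Split one timeline output line into group, timestamp and values."""
    fields = line.split()
    group_id, n_events, n_threads = (int(x) for x in fields[:3])
    timestamp = float(fields[3])
    values = [float(v) for v in fields[4:4 + n_events * n_threads]]
    # Values are event-major: all threads of event 1, then event 2, ...
    per_event = [values[e * n_threads:(e + 1) * n_threads]
                 for e in range(n_events)]
    return group_id, timestamp, per_event


def effective_period(interval_s, n_event_sets):
    """Time between two readouts of the same event set (round-robin)."""
    return interval_s * n_event_sets


# Hypothetical line: group 1, 2 events, 2 threads, timestamp 2.0 s.
group_id, timestamp, per_event = parse_timeline_line("1 2 2 2.0 10 20 30 40")
```

With -t 2s and three event sets, effective_period(2, 3) gives 6, so each set is sampled every 6 seconds.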

If you want to cancel the measurement, you can send a SIGINT signal to the likwid-perfctr process.

Notice: Although LIKWID allows measurements at microsecond granularity, there are some points to consider. Tests have shown that for measurement intervals below 100 milliseconds, the periodically printed values are no longer accurate (they are higher than expected), but their behavior over time is still valid. For example, if you try to resolve burst memory transfers, you need results for small intervals. The memory bandwidth for each measurement may be higher than expected (possibly even higher than the theoretical maximum of the machine), but the burst and non-burst traffic is clearly identifiable by the highs and lows of the memory bandwidth results.

The stethoscope mode

likwid-perfctr allows you to listen for a specific time to what is happening on a node. This is useful if you want to see how a long-running application currently performs. We use it to profile MPI codes where we may not have access to the code. Stethoscope mode is also suited for monitoring, as done by likwid-agent. Be careful not to rely too much on these measurements. Because you do not know what your code is actually doing, the result may be volatile depending on the time period you measured. Still, it can give you a first idea of what is going on with regard to basic performance properties.

Monitor branching behavior on the first eight processors for 10 seconds:

$ likwid-perfctr -c N:0-7 -g BRANCH  -S 10s

If you want to cancel the measurement before the specified duration, you can send a SIGINT signal to the likwid-perfctr process.

Output filters

You can use the option -o to write the output to a file. As described in the next section, the file name can also include placeholders for things such as the PID (process ID) or the MPI rank. You must also give a file extension, and this is where the output subsystem of LIKWID comes into play. LIKWID natively supports the common text output and the CSV format. If you specify .txt as suffix, the raw text output is written to this file. If you specify another suffix, likwid-perfctr writes the CSV output to a temporary file and calls a script with the name of the suffix to convert the output to an arbitrary format or to filter the results. In the LIKWID tree all filters are located in the filters directory. At the moment there are filters for XML and JSON output. You can add further output filters by adding new scripts to this directory. My scripts use Perl, but you can of course use any scripting language. If no filter script can be found, the temporary file is renamed to the desired file name and consequently contains CSV output.

This allows you to tailor the output of likwid-perfctr so that it fits well into your tool chain. There is also a switch (-O) for directly generating CSV output without calling a script. This is useful if likwid-perfctr is used as a monitoring backend.

Using likwid-perfctr for MPI programs

Notice: likwid-mpirun offers an easier interface to measure hardware performance counters for MPI and hybrid applications.

To use likwid-perfctr with an MPI program, the most important point is to take care of pinning the processes to cores and to tell likwid-perfctr which cores belong to which process. The current solution to this problem is to pin the MPI process with taskset and call likwid-perfctr with the -c N:0 notation to enable logical pinning inside this cpuset. In order to distinguish the output of multiple processes, the -o option allows you to write all results to a file.

The following placeholders can be used in the output file name:

  • %j - Environment variable PBS_JOBID
  • %r - MPI Rank for Intel MPI (env var PMI_RANK) and OpenMPI (env var OMPI_COMM_WORLD_RANK)
  • %h - Hostname
  • %p - Process ID (PID)

The output filename is specified as:

$ mpiexec -np <numberofProcesses> likwid-perfctr -c L:N:0 -g BRANCH -o test_%h_%p.txt  ./a.out
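To illustrate what the placeholders stand for, the following Python sketch mimics the substitution. It is not LIKWID's implementation: the function name expand_output_name is hypothetical, and replacing an unset environment variable with an empty string is an assumption, since the behavior for unset variables is not stated here.

```python
import os
import socket


def expand_output_name(template, env=None, hostname=None, pid=None):
    """Mimic the %j, %r, %h and %p placeholders of the -o option."""
    env = os.environ if env is None else env
    # %r: Intel MPI sets PMI_RANK, OpenMPI sets OMPI_COMM_WORLD_RANK.
    rank = env.get("PMI_RANK", env.get("OMPI_COMM_WORLD_RANK", ""))
    values = {
        "%j": env.get("PBS_JOBID", ""),  # assumption: empty if unset
        "%r": rank,
        "%h": hostname if hostname is not None else socket.gethostname(),
        "%p": str(os.getpid() if pid is None else pid),
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


# Fixed hostname and PID so the result is reproducible.
name = expand_output_name("test_%h_%p.txt", env={}, hostname="node01", pid=4242)
```

Because every MPI process has its own PID (and usually its own rank), names built from %p or %r keep the per-process result files apart.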

A more detailed explanation of the combination LIKWID and MPI can be found at this page.

Using the perf_event backend with likwid-perfctr

Before building LIKWID, set USE_PERF_EVENT in config.mk to true. This disables the native backend of LIKWID and integrates perf_event. During initialization, the file /proc/sys/kernel/perf_event_paranoid is checked and some functions are restricted to what the paranoid value allows:

  • -1 - Allows measurements of the whole CPU, not only a specific PID. Allows reading of Uncore counters. The additional functionality of reading raw tracepoints is not supported by LIKWID, so there is no difference to 0.
  • 0 - Allows measurements of the whole CPU, not only a specific PID. Allows reading of Uncore counters.
  • 1 - Allows measurements of the PID identifying the started application (user- and kernel-space possible). No Uncore!
  • 2 - Allows measurements of the PID identifying the started application (only user-space possible). No Uncore!

For the paranoid levels 1 and 2 it is required to set the --execpid option on the command line to ensure only the started application is measured. For 0 and -1 the --execpid option can be left out to measure anything that happens on the selected CPU (like the default backend of LIKWID does).
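The mapping from paranoid level to what can be measured can be summarized in a small lookup. This Python sketch only restates the list above; the function names and dictionary keys are hypothetical, and values outside the four listed levels are rejected because their meaning is not described here.

```python
def perf_event_capabilities(paranoid):
    """Summarize what the listed paranoid levels allow."""
    if paranoid not in (-1, 0, 1, 2):
        raise ValueError("paranoid level not covered by the list above")
    system_wide = paranoid <= 0
    return {
        "whole_cpu": system_wide,         # measure CPUs, not only one PID
        "uncore": system_wide,            # Uncore counters readable
        "kernel_space": paranoid <= 1,    # level 2 is user-space only
        "needs_execpid": paranoid >= 1,   # --execpid required
    }


def read_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Read the current level from procfs (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


level_one = perf_event_capabilities(1)
```

On a Linux system, perf_event_capabilities(read_paranoid()) tells you up front whether --execpid is needed.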

Specific flags for perf_event can be set with the --perfflags command line option.

Options are currently not supported because a translation table for LIKWID option names to perf_event option names is required.

If you want to measure a different application, you can use the command line option --perfpid.

For more information, see the "LIKWID and perf_event" page.

Using the Nvidia GPU backend

With LIKWID 5.0 the Nvidia GPU backend is introduced. It allows measurements of single kernels using the GPU MarkerAPI (and later for all kernels in an application). For this to work, the library paths of CUDA and CUPTI must be in $LD_LIBRARY_PATH. To check whether it works, run likwid-topology and see whether it lists your GPU.

In the meantime, support for CUPTI as well as PerfWorks got implemented. You should be able to use LIKWID with any Nvidia GPU now.

For measurements, you have to specify which GPUs should be measured with -G <gpulist>, like -G 0,2 or -G 0-1. Moreover, you need to specify an event set with -W <eventset>, where the events can be comma-separated. To see the available events, call likwid-perfctr -e and pipe the output to a pager like less. In contrast to CPUs, the number of available counters for a GPU and its subdomains is unclear. You still need to specify a counter for LIKWID like GPU0, GPU1, etc., as LIKWID uses them in the metric formulas of the performance groups. So an example event set could be WRAPS_ACTIVE:GPU0,INST_ISSUED:GPU1. For convenience, the NVIDIA GPU backend also provides performance groups, which are listed by likwid-perfctr -a below the ones for CPUs. They can be used as an event set and define a list of events and derived metrics.

As noted above, the NVIDIA GPU backend currently supports only measurements of kernels, that is, regions marked with the NvMarkerAPI. The NvMarkerAPI is very similar to the CPU instrumentation API MarkerAPI. Here are the calls for C:

  • LIKWID_NVMARKER_INIT: Initialize the library and set up event set for counting
  • LIKWID_NVMARKER_REGISTER(str): Register a code region. Reduces the overhead but optional
  • LIKWID_NVMARKER_START(str): Start a code region. The NvMarkerAPI measures only GPU activity, so CPU code in the region is not covered as long as no interaction with the GPU happens (data copy from memory to/from GPU, ...)
  • LIKWID_NVMARKER_STOP(str): Stop a code region. Counter results are summed up for each traversal of the code region
  • LIKWID_NVMARKER_CLOSE: Close the library and write out results
  • LIKWID_NVMARKER_SWITCH: Switch to next event set in a round-robin fashion (experimental)
  • LIKWID_NVMARKER_GET(regionTag, ngpus, nevents, events, time, count): Get the current results of a code region (not implemented yet)

Compile the code with -DLIKWID_NVMON and proper include -I and library path -L settings.

Example code

Nvidia GPU Permissions

The Nvidia libraries provide an option to allow profiling as a user. If you try out likwid-perfctr and it reports error 35 (CUPTI_ERROR_INSUFFICIENT_PRIVILEGES) in the output, see this page and follow the instructions here.

There is a temporary method using sudo, but it has its difficulties and is therefore not recommended. The LIKWID code loads the Nvidia libraries at runtime (dlopen) and thus requires common environment variables like LD_LIBRARY_PATH to be available. But some of these environment variables are rightfully deleted by sudo for security reasons. To override this behavior, you have to forward all required variables into the sudo execution like this:

sudo PATH="$PATH" HOME="$HOME" LD_LIBRARY_PATH="$LD_LIBRARY_PATH" CUDA_HOME="$CUDA_HOME" likwid-perfctr ...

(This is not an exhaustive list of the required environment variables, just an example)

Using the GENERIC_EVENT

With LIKWID 5.0, there is a new event called GENERIC_EVENT which is available for all supported counter types.

Sometimes, an event definition is not an event name but the setting for the configuration registers in hexadecimal format:

  • all fields are mentioned separately like config=0x05,umask=0x78. An example for this are the perf_event event files /sys/devices/<unit>/events/*
  • one value for the whole register like 0x437805. The important part for x86 systems is 0x7805, which can be translated to config=0x05,umask=0x78. The 0x430000 refers to the enable bit (0x400000), the "count in user-space bit" (0x10000) and the "count in kernel-space bit" (0x20000). These bits are managed by the LIKWID library and are not required when using the GENERIC_EVENT. Be aware that LIKWID by default activates only the "count in user-space bit". If you want to count kernel-space events as well, use the KERNEL option. To decode which parts of these whole-register specifications are required, refer to the vendor documentation or (partly) see /sys/devices/<unit>/format/*.

If you know the settings, the GENERIC_EVENT can be used:

GENERIC_EVENT:<COUNTER>:CONFIG=0x05:UMASK=0x78

The <COUNTER> tells LIKWID which unit and counter to use and how to write the settings to the configuration register. If the <COUNTER> provides additional options like THRESHOLD or EDGEDETECT, you can use them as well.
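The translation from a whole-register value to a GENERIC_EVENT definition is plain bit arithmetic. The Python sketch below decodes the 0x437805 example; the helper names are hypothetical, and the kernel-space bit is taken as 0x20000, the value for which the three flag bits add up to 0x430000. Other units and vendors may lay out their registers differently.

```python
USER_BIT = 0x10000     # "count in user-space" bit
KERNEL_BIT = 0x20000   # "count in kernel-space" bit
ENABLE_BIT = 0x400000  # enable bit


def decode_register(value):
    """Split an x86 core event select value into its fields."""
    return {
        "config": value & 0xFF,        # bits 0-7: event code
        "umask": (value >> 8) & 0xFF,  # bits 8-15: unit mask
        "user": bool(value & USER_BIT),
        "kernel": bool(value & KERNEL_BIT),
        "enable": bool(value & ENABLE_BIT),
    }


def generic_event(value, counter):
    """Build the GENERIC_EVENT string; the flag bits are left to LIKWID."""
    fields = decode_register(value)
    return "GENERIC_EVENT:{}:CONFIG=0x{:02X}:UMASK=0x{:02X}".format(
        counter, fields["config"], fields["umask"])


decoded = decode_register(0x437805)
event = generic_event(0x437805, "PMC0")
```

Here event is the same definition as shown above with PMC0 as the counter; the kernel flag in decoded only tells you whether the original setting counted kernel space, in which case the KERNEL option would be needed to reproduce it.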

Using custom event or counter-related options

Some events or counters provide additional configuration options which can be used to filter or extend the measurement. Some events require specific options to be set by default which is handled by LIKWID. The options are listed in the event and counter list likwid-perfctr -e and if you search for a substring in an event likwid-perfctr -E <str>:

The options for the counters can be used with every event that is programmable on the counter, although setting an option might not make sense for a given event. The following counters and events are for Intel Skylake Desktop:

FIXC0, Fixed counters, KERNEL|ANYTHREAD
FIXC1, Fixed counters, KERNEL|ANYTHREAD
FIXC2, Fixed counters, KERNEL|ANYTHREAD
PMC0, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC1, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC2, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION|IN_TRANSACTION_ABORTED
PMC3, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION

Some events require a default configuration for a set of the options. This is shown here by the set of UOPS_EXECUTED events:

UOPS_EXECUTED_THREAD, 0xB1, 0x1, PMC
UOPS_EXECUTED_USED_CYCLES, 0xB1, 0x1, PMC, THRESHOLD=0x1
UOPS_EXECUTED_STALL_CYCLES, 0xB1, 0x1, PMC, THRESHOLD=0x1|INVERT=0x1
UOPS_EXECUTED_TOTAL_CYCLES, 0xB1, 0x1, PMC, THRESHOLD=0xA|INVERT=0x1
UOPS_EXECUTED_CYCLES_GE_1_UOPS_EXEC, 0xB1, 0x1, PMC, THRESHOLD=0x1
UOPS_EXECUTED_CYCLES_GE_2_UOPS_EXEC, 0xB1, 0x1, PMC, THRESHOLD=0x2
UOPS_EXECUTED_CYCLES_GE_3_UOPS_EXEC, 0xB1, 0x1, PMC, THRESHOLD=0x3
UOPS_EXECUTED_CYCLES_GE_4_UOPS_EXEC, 0xB1, 0x1, PMC, THRESHOLD=0x4

The event UOPS_EXECUTED_THREAD "Counts the number of uops to be executed per-thread each cycle" (Intel documentation). In this case the THRESHOLD (Intel calls it "cmask") defines a >= condition and INVERT changes it to <. There are other events with default options which use the THRESHOLD with a different outcome. The USED_CYCLES event increments only in cycles with at least 0x1 uop to be executed. It is consequently equal to the UOPS_EXECUTED_CYCLES_GE_1_UOPS_EXEC event (GE for Greater-or-Equal). In general, the UOPS_EXECUTED_CYCLES_GE_*_UOPS_EXEC events count only cycles in which at least as many uops are dispatched as specified in THRESHOLD. The STALL_CYCLES event "Counts cycles during which no uops were dispatched from the Reservation Station (RS) per thread" (Intel documentation), so it increments in cycles where uops < 0x1, that is, only in cycles with 0 dispatched uops. The TOTAL_CYCLES event uses the same method but increments if uops < 10, which is true in all cycles because the hardware always dispatches fewer than 10 uops per cycle.

You can define your own counter options by adding them to the event on the command line:
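The effect of THRESHOLD and INVERT can be modelled on a made-up trace of uops per cycle. This Python sketch is a simplified model of the counting rule described above, not of the hardware; the trace values are invented for illustration.

```python
def count_cycles(uops_per_cycle, threshold=0, invert=False):
    """Model a counter with optional THRESHOLD (cmask) and INVERT."""
    if threshold == 0:
        # No threshold: the counter adds the event count of every cycle.
        return sum(uops_per_cycle)
    if invert:
        # INVERT turns the >= comparison into <.
        return sum(1 for uops in uops_per_cycle if uops < threshold)
    return sum(1 for uops in uops_per_cycle if uops >= threshold)


# Hypothetical trace: uops executed in eight consecutive cycles.
trace = [0, 3, 1, 0, 4, 2, 0, 0]

thread = count_cycles(trace)                               # UOPS_EXECUTED_THREAD
used = count_cycles(trace, threshold=0x1)                  # ..._USED_CYCLES
stall = count_cycles(trace, threshold=0x1, invert=True)    # ..._STALL_CYCLES
total = count_cycles(trace, threshold=0xA, invert=True)    # ..._TOTAL_CYCLES
ge_2 = count_cycles(trace, threshold=0x2)                  # ..._CYCLES_GE_2_UOPS_EXEC
```

In this model, used and stall always add up to total, which matches the interpretation of TOTAL_CYCLES as a count of all cycles.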

likwid-perfctr -C 0 -g ARITH_DIVIDER_ACTIVE:PMC0:EDGEDETECT=0x1:THRESHOLD=0x1 ./app

The event ARITH_DIVIDER_ACTIVE:PMC0:EDGEDETECT=0x1:THRESHOLD=0x1 counts each activation of the divide unit: it "detects the edges", but only the "start edge" (THRESHOLD=0x1), since after the "end edge" the increment would be zero. If this event is not documented by Intel, I often add the event+option combination as ARITH_DIVIDER_COUNT for convenience.
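The difference between counting active cycles and counting activations can be shown with a small model. This Python sketch treats EDGEDETECT as counting transitions from "threshold not met" to "threshold met"; it is a simplified illustration with an invented activity trace, not a description of the hardware implementation.

```python
def count_activations(active_per_cycle, threshold=1):
    """Count rising edges: cycles where the threshold becomes satisfied."""
    count = 0
    previous = False
    for value in active_per_cycle:
        current = value >= threshold
        if current and not previous:
            count += 1  # "start edge" of an active phase
        previous = current
    return count


# Hypothetical trace: 1 while the divide unit is busy, 0 otherwise.
divider_active = [0, 1, 1, 1, 0, 0, 1, 1, 0, 1]

active_cycles = sum(divider_active)              # plain ARITH_DIVIDER_ACTIVE
activations = count_activations(divider_active)  # with EDGEDETECT and THRESHOLD
```

Dividing active_cycles by activations gives the average length of one busy phase in cycles, which is a typical use of such an event+option combination.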

In performance groups, you can add options like this:

[...]
PMC0:EDGEDETECT=0x1:THRESHOLD=0x1 ARITH_DIVIDER_ACTIVE
[...]
Divide unit activations per second PMC0:EDGEDETECT=0x1:THRESHOLD=0x1/time

To check what the options mean for a specific architecture and what happens if you use them, see the "Architectures" section in the Wiki.
