Native Image JFR Implementation (WIP)


Overview

This document describes how and why various JFR components are implemented in SubstrateVM. Readers should already be somewhat familiar with JFR and can use this document to answer specific questions about the implementation. The content will be updated and expanded over time as the code changes. Each section may contain an overview which, unless otherwise stated, applies to both the Java and native mode implementations. All other content, unless otherwise stated and apart from the occasional "OpenJDK" subsection, applies only to SubstrateVM. Sections may have subsections, which are written with respect to their parent section; subsection titles are not necessarily unique across parent sections. Parent sections are unique and ordered alphabetically. A section may state the date of its last edit.


Chunks and file format

Edited March 20 2024.

Overview

Event and constant pool data are written to disk (and out of process memory) in discrete units called chunks. Chunks are independent: two JFR files (which consist of chunks) can be concatenated to form a new, valid JFR file. Most of the information in this section applies to both Java and native mode.

Chunk File Format

Each chunk is fully self-contained and has four sections: the chunk header, the event data, the constant pools, and the metadata. The event data section holds the core event data, the constant pools hold the constants referenced by events, and the metadata section describes how the event data section is laid out. The constant pool and metadata sections are required in order to interpret the event data section.

Chunk storage

Chunks are essentially just compact binary files containing recording data. Chunks are kept in a JFR disk repository while the program is running; on Linux, the repository is in the /tmp directory. Chunk filenames are kept in a linked list. Operations such as accessing chunks in the disk repository, determining new chunk filenames, and dumping snapshots are implemented in Java-level JDK code, which is reused in Native Image.

Chunk rotation

Chunk rotation is managed by the JFR periodic thread: the periodic task rotates the chunk if a previous evaluation determined that rotation is required. The evaluation essentially checks whether the amount of data written to the current chunk exceeds the maximum chunk size (12 MB by default); a sketch of this check follows the list below. Chunks are written to disk via the JfrRepository system. Chunk rotation occurs during a safepoint. A safepoint is necessary for multiple reasons:

  1. Thread local JFR buffers must be iterated and flushed to disk
  2. The epoch must advance without the risk of other threads attempting to emit events. Such events would have their data corrupted or divided across epochs.
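As a rough, self-contained illustration of the size check described above (all names and types are invented for this sketch and are not the actual SubstrateVM identifiers):

```java
/**
 * Minimal sketch of the chunk-size evaluation performed by the periodic task.
 * Names here are illustrative only.
 */
final class ChunkRotationCheckSketch {
    static final long MAX_CHUNK_SIZE = 12L * 1024 * 1024; // 12 MB default

    private long bytesWrittenToCurrentChunk;
    private volatile boolean rotationRequested;

    void onDataWritten(long bytes) {
        bytesWrittenToCurrentChunk += bytes;
        if (bytesWrittenToCurrentChunk > MAX_CHUNK_SIZE) {
            // The rotation itself happens later, inside a VM operation (safepoint).
            rotationRequested = true;
        }
    }

    boolean isRotationRequested() {
        return rotationRequested;
    }
}
```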

Metadata

Each chunk has its own metadata event (so a JFR file consisting of multiple chunks contains multiple metadata events). Every event has an "eventID", and the metadata contains entries with matching eventIDs that describe how to read each type of event: event names, field layouts, etc.

JFR type IDs

JFR typeIDs cannot be reused, even across chunks. These IDs are the unique identifiers assigned to constants in a constant pool (type repository, symbol repository, etc).

Dumping snapshots (.jfr files)

Snapshot dumps take all the chunk files available in the disk repository and copy them into a single file the user can access. When a user requests a dump, a chunk rotation is performed first so that everything in flight is written to disk; the linked list of chunks is then traversed and written to the snapshot.
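From the application's point of view, a dump can be requested through the standard JFR API; the snippet below is a generic, hypothetical usage example (the file name is arbitrary) that simply shows what triggers the behavior described above.

```java
import java.nio.file.Path;
import jdk.jfr.Recording;

public class DumpExample {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.start();
            // ... application work that emits JFR events ...

            // Requesting a dump forces a chunk rotation, then the chunks in the
            // disk repository are copied into a single .jfr snapshot file.
            recording.dump(Path.of("snapshot.jfr"));
        }
    }
}
```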


Constant pool Repositories

Edited March 20 2024

Overview

SubstrateVM stores JFR constant pool data in various “repositories”. There is a method, thread, symbol, type, and stacktrace repository.

Synchronization

Write and read access is protected using a special kind of mutex called a VMMutex. A VMMutex is essentially a pre-allocated mutex that does not allow safepoints while locking and unlocking. It is created during the image build process and it does not allocate. This is important because much of the JFR code must be uninterruptible, and uninterruptible code does not permit allocation.
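A self-contained illustration of this locking discipline is shown below. Note that ReentrantLock and a plain array are only stand-ins for the purposes of the sketch; the real code uses a VMMutex and an unmanaged buffer.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustration of how a constant pool repository guards its buffer.
 * In SubstrateVM the lock is a VMMutex (pre-allocated at image build time,
 * no safepoints, no allocation) and writes go into unmanaged memory.
 */
final class RepositorySyncSketch {
    private final ReentrantLock mutex = new ReentrantLock(); // stand-in for a VMMutex
    private final long[] buffer = new long[1024];            // stand-in for the unmanaged buffer
    private int writePos;

    void add(long constant) {
        mutex.lock(); // the real lock/unlock must not allow safepoints or allocate
        try {
            buffer[writePos++] = constant;
        } finally {
            mutex.unlock();
        }
    }
}
```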

Data Storage and Unique ID Management

The constant pool data is kept in an unmanaged memory buffer. This buffer has the same structure as the JFR event data buffers; only its usage differs. As with the event data buffers, a pointer tracks the last written position and a counter tracks the number of unflushed entries. This counter also provides the JFR ID for the next constant added to the pool. Some constant types do not need such a counter because they already have an intrinsic ID (for example, the thread ID). When a JFR flush operation or chunk rotation occurs, the buffer's unflushed data is written to the JFR disk repository and the buffer is reinitialized. Constant data is added to the repository by writing it into the buffer in much the same way that event data is written into the event data buffers.
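A simplified, self-contained sketch of the ID and counter handling, under assumptions consistent with the description above (the real buffers are unmanaged memory written via raw pointers, and all names here are invented):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

/** Illustrative model of a constant pool repository's buffer and ID counter. */
final class ConstantPoolSketch {
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
    private long nextId; // doubles as the count of constants added during this epoch

    long add(byte[] serializedConstant) {
        buffer.put(serializedConstant); // write at the current position
        return nextId++;                // the counter provides the JFR ID
    }

    /** Flush point: persist unflushed bytes, keep IDs (the epoch is unchanged). */
    void flush(WritableByteChannel chunkFile) throws IOException {
        buffer.flip();
        chunkFile.write(buffer);
        buffer.clear();
    }

    /** Chunk rotation: the epoch advances, so IDs may start over for the new chunk. */
    void rotate(WritableByteChannel chunkFile) throws IOException {
        flush(chunkFile);
        nextId = 0;
    }
}
```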

Deduplication of Constants

Deduplication in SubstrateVM

The symbol, stacktrace, and method repositories also perform deduplication. This is accomplished using an unmanaged memory hashtable, protected by the same mutex that protects the constant pool data buffer described above. The thread and type repositories get deduplication for free, so no table is kept for them: threads already have an inherent unique identifier (the thread ID), and types are managed separately using a map built at image build time.

After a chunk rotation the table must be cleared and reset in preparation for its next use. However, the tables must persist across flush operations because flushes are done with respect to the current epoch, not the previous one. If the table data were cleared after a flush operation, successive flushes or rotations could reuse JFR IDs already assigned during the current epoch, destroying their uniqueness and corrupting the constant pool data.
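To make the flush-versus-rotation distinction concrete, here is an illustrative deduplication sketch for the symbol repository (a managed HashMap stands in for the unmanaged-memory hashtable, and all names are invented):

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative deduplication for the symbol repository. */
final class SymbolDedupSketch {
    private final Map<String, Long> seenThisEpoch = new HashMap<>();
    private long nextId;

    long getOrAssignId(String symbol) {
        // Existing entry: reuse the ID already written during this epoch.
        // New entry: assign the next ID (the real code also serializes the symbol).
        return seenThisEpoch.computeIfAbsent(symbol, s -> nextId++);
    }

    /** Cleared at chunk rotation only; clearing at a flush point would allow ID reuse. */
    void onChunkRotation() {
        seenThisEpoch.clear();
        nextId = 0;
    }
}
```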

Deduplication in OpenJDK

In HotSpot, deduplication is performed by setting JFR-specific bits in the metaspace data of the various constant types (Klass, Method, class loaders, packages). The bit acts as a boolean flag indicating that some event data in a JFR buffer references this constant via its ID. When it is time to persist JFR data to disk, only the constants that are actually referenced are written. Even if multiple events reference the same constant, the bit only needs to be flipped once, and the constant is written only once.

Some constant types have no metaspace data and so cannot take advantage of the bit-flipping scheme; stack traces are one example. In such cases a look-up table is used instead to track and deduplicate usages of each constant.

SubstrateVM differs from HotSpot in that it has no metaspace, so the metaspace bit-tagging scheme used by HotSpot does not translate to Native Image. Instead, only the look-up table approach can be used for the constant pools.

Thread Constant Pool Repository

SubstrateVM Implementation

Threads are registered when a recording begins, when a thread starts, and when virtual threads are mounted. Threads must be re-registered when a chunk rotation happens (inside a VM operation) so that thread data is available in the new self-contained chunk. Particular care is taken with virtual threads: existing virtual threads may or may not be mounted at the time of a chunk rotation, so we must do more than simply re-register all mounted threads.

The thread constant pool is handled specially at chunk rotations and flush points: it is always written at a chunk rotation (like the other pools), but at a flush point it is only written if it is dirty (contains new thread data). This is done for simplicity.

OpenJDK Implementation

The thread constant pool is written immediately after a new thread is started or any thread-related data changes. This takes the form of a thread checkpoint event, which may only contain data pertaining to a single thread. This differs from SubstrateVM, where thread constant pool data is bundled into a single large write operation at chunk rotations and flush points.


Event Streaming

Edited March 20 2024

Overview

Event streaming can be thought of as having a "producer" and a "consumer" component. The producer is the code that periodically flushes the JFR buffers and constant pools to the JFR chunk repository on disk. It is this flushing of data to open/active chunks that makes event streaming possible: before event streaming, only completed chunks were written to or read from the JFR disk repository, which meant a JFR snapshot file had to be assembled before any recording data could be accessed. That can be slow and has overhead. The consumer is the code, invoked from the application level, that periodically parses the fresh chunk data in the disk repository and triggers callbacks. The consumer code runs in a separate thread.

In SubstrateVM, the producer component has been re-implemented. This is necessary because in OpenJDK, the producer code is tightly coupled to Hotspot. The consumer code is at the Java-level in the JDK, and is reused by Native Image.
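On the consumer side, the reused JDK-level API can be exercised directly from application code. The example below is generic JFR event-streaming usage (the event name and period are chosen arbitrarily), not SubstrateVM-specific code.

```java
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

public class StreamingExample {
    public static void main(String[] args) {
        // The consumer runs in a separate thread and periodically parses the
        // fresh chunk data that the producer flushed to the disk repository.
        try (RecordingStream rs = new RecordingStream()) {
            rs.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
            rs.onEvent("jdk.ExecutionSample", System.out::println);
            rs.start(); // blocks; use startAsync() to stream in the background
        }
    }
}
```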

A note on vocabulary: in the context of JFR data, "flush" can mean two different things: flushing from a local event buffer to the global ring buffer, or flushing from in-memory buffers (local or global) to the disk repository. In this section, "flush" means flushing/writing to disk.


Flushing (Producer Component)

There are two parts of the “producer” component that will be explained below: handling constant pools and handling JFR thread local buffers.

Flushing Constant pools

Similar to a chunk rotation, flushing constant pools involves writing buffered data to the JFR disk repository from each of the different constant pool repositories. For each constant pool, the entire unmanaged memory buffer containing constant data is dumped and the buffer is reset. The only difference compared to a chunk rotation is that the deduplication table is not reset. This is because after a flush point, the epoch remains the same so we cannot destroy the record of constants we have already written.

Additionally, we must lock each constant pool before we begin writing its data to prevent races with other threads. At chunk rotations we don't need to worry about concurrent modifications because chunk rotations happen in a VM operation.

Flushing JFR Thread Local Buffers

The event data in the JFR thread-local buffers (TLBs) must be written to disk during the periodic flush operation. In both HotSpot and SubstrateVM, writing the TLBs happens outside of a safepoint. To write out all the TLBs, the flushing thread must access the TLBs of other threads. Accessing another thread's thread locals outside a safepoint is dangerous because that thread may die at any time, resulting in a segfault when its thread locals are accessed. This is why special care must be taken when flushing the event buffers.

The solution we chose was to add a global linked list of thread-specific event buffers (distinct from the global ring buffer to which events may later be promoted). This list solves the problem because each list node can be locked, which prevents races on the buffer. Why not simply lock around the existing thread-local buffer? That would be circular: the lock itself would then be a thread local, which would be unsafe to access. We need to decouple the lifetime of the lock from the lifetime of the thread.

During a flushing operation, the flushing thread iterates through the list and handles each thread's event buffer, locking each list node before checking its buffer. When the owning thread exits, it cleans up its event buffer while holding the same lock on the node that holds the pointer to that buffer. During a chunk rotation (during a safepoint), the buffers are written out the same way, by traversing this list. The difference is that during a chunk rotation the buffers are written directly to disk, while during a flushing operation they are first written to the global buffers, and the global buffers are then written to disk. This minimizes the amount of disk IO performed while holding a buffer lock, which the owning thread may otherwise be left waiting on for too long.

Data Structures

There is a linked list in which each node is "owned" by a single thread and contains a pointer to that thread's local event buffer, plus a field used for locking. Nodes are added at the head of the list, and traversal during flushing proceeds from head to tail. This means additions and traversal can happen concurrently, as long as the lock on the head node is acquired before additions are made.
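The shape of these nodes and of the flush-side traversal looks roughly like the sketch below (all names are invented; the real nodes live in unmanaged memory and use a VMMutex-style lock rather than ReentrantLock):

```java
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative shape of the global list of per-thread event buffer nodes. */
final class JfrBufferListSketch {
    static final class Node {
        volatile long bufferAddress; // pointer to the owner's event buffer; 0 once the thread exits
        final ReentrantLock lock = new ReentrantLock();
        Node next;                   // singly linked; new nodes are pushed at the head
    }

    private volatile Node head;

    /** Flush-side traversal: lock each node before touching its buffer. */
    void flushAll() {
        for (Node node = head; node != null; node = node.next) {
            node.lock.lock();
            try {
                if (node.bufferAddress != 0) {
                    // Write the buffer's unflushed event data: to the global
                    // buffers at a flush point, directly to disk at a rotation.
                } else {
                    // Owning thread has exited: unlink this node and free it
                    // (unlinking needs the previous node; omitted for brevity).
                }
            } finally {
                node.lock.unlock();
            }
        }
    }
}
```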

Cleanup

When a thread exits, its buffer must be cleaned up. This cleanup was automatic when the buffers were stored as thread locals; it now happens during a flush or chunk rotation while the linked list is being traversed. At each node, the status of the owning thread is checked; if the thread is no longer alive, the node is unlinked and its memory is freed. When a thread exits, it signals that its node is ready to be cleaned up by unsetting the node's JFR buffer pointer (the buffer itself is also freed at this point). The unset pointer acts as a dirty flag indicating that the node should be unlinked on the next list traversal, whether that happens during a flush or a chunk rotation.

Concurrency Summary

These are the cases where local buffer nodes need to be locked:

  1. Flushing to disk
  2. Thread has died and is cleaning up its buffer
  3. Thread is attempting to promote data in its local buffer to globals

Flushing JFR Global Buffers

Global event buffers are simpler to write to disk since they don’t belong to specific threads. Similar to the local buffers, there is a linked list for global buffers which contains list nodes of the same format. During a flush or chunk rotation this list is iterated and unflushed data is written to disk. The lock must be acquired during a flush to avoid races with working threads that may be trying to promote data from their local buffers to the global buffers.

Concurrency Summary

These are the cases where a global buffer node needs to be locked:

  1. Flushing to disk during a flush point
  2. JFR background recorder thread has woken and is trying to write global buffers to disk
  3. Thread is attempting to promote data in its local buffer to a global buffer

Execution Sampler (method profiling)

Edited March 20 2024.

Overview

There are two execution samplers available in Native Image JFR: the recurring callback sampler and the async profiler. Prior to the 23.2 release the recurring callback sampler was the default; since 23.2 the async profiler is the default (commit 8e58767f58860fdc73e42ecb23bb3c2e2a8be94e). The async profiler uses the SIGPROF signal. Both eventually report their stack traces through the jdk.ExecutionSample event.

Async Execution profiler

This CPU profiler is enabled by default. At program startup, a signal handler is registered for the SIGPROF signal and setitimer is configured to deliver the signal at regular intervals. When a running thread catches SIGPROF, it is interrupted and enters the signal handler, where its stack trace is walked and stored in a pre-allocated sample buffer taken off a buffer queue. The buffers are processed later, in batches, at chunk rotations. Native frames from C++ libraries are skipped.

Recurring Callback Samplers

These samplers use safepoint checks to decrement a counter that approximates a period of time. This is currently a problem for the JFR recurring callback method sampler because it results in safepoint bias. The SIGPROF-based sampler does not have this issue.

JFR Execution Sampler in OpenJDK

The OpenJDK JFR CPU profiler does not suffer from safepoint bias. However, it does not capture native stack frames, so whenever time is spent in VM-level code it is ignored, which results in a significant sampling bias. Although Native Image also skips native frames, it does not suffer as much from this problem because its VM-level code is written in Java.


Java-level Event Writers

Valid as of jdk21+32 and SVM 23.0.

Overview

In both SVM and OpenJDK, each carrier thread has its own Java-level EventWriter, which is shared among all the virtual threads mounted on it. When a virtual thread uses the Java-level EventWriter, it must therefore update the appropriate members with its own data. This is the same situation as in HotSpot.

When the Java-level JFR event buffers run out of space, they are flushed, just like their native counterparts. Unlike for native events, however, SVM reuses the code in jdk.jfr.internal.EventWriter that decides when to flush. The flushing itself takes place below the Java level, so SVM has its own implementation that does roughly the same thing as HotSpot's.

If the flush was successful, the current position, max position, and committed start position pointers in the Java-level EventWriter must all be reset. This is partly because a new buffer may now be in use (though the committed position pointer would need to be reset regardless). The validity flag must also be set based on whether the flush freed enough space. These operations happen in both SVM and HotSpot.

Substitutions

The Java-level event writer methods are partially substituted. These methods have gone out of date in the past and caused issues. SVM reuses the OpenJDK constructor method but substitutes all the Java event writer code in HotSpot.

OpenJDK

In HotSpot, new event writers are created in C++ code. Creation is lazy: it happens when the current thread's event writer is requested. If one does not exist, a new one is created; if the TIDs do not match (as when a different virtual thread is now mounted), the existing one is updated.


Mirror / Java Level events

Edited March 22 2024.

Overview

Most of these events we get "for free" from OpenJDK. At image build time, we manually trigger class re-transformation by reflectively calling JVM#retransformClasses. This is done on all reachable Java-level event types that subclass jdk.internal.event.Event.

Old object sampler

//TODO


Periodic Events

Valid as of JDK 21+34 (Aug 22 2023)

Overview

Periodic events are implemented at the Java level by defining two custom events, EndChunkNativePeriodicEvents and EveryChunkNativePeriodicEvents. These events are never themselves written to buffers or disk; they are used to group together multiple events that do get emitted. They are registered in a start-up hook using FlightRecorder#addPeriodicEvent.

Periodic events can be of four types, specifiable via the JFR API annotation, e.g. @Period(value = "endChunk"). The types are "everyChunk", "beginChunk", "endChunk", and "period" (a fixed time interval). Under the hood, "everyChunk" is simply the union of "beginChunk" and "endChunk".

The Java-level OpenJDK infrastructure is reused to periodically emit these events during chunk rotations and recording state changes; see PeriodicEvents#doChunkBegin and PeriodicEvents#doChunkEnd.
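For reference, this is what registering a periodic event through the public JFR API looks like. The event class and values below are purely hypothetical, but the same mechanism (FlightRecorder#addPeriodicEvent) is what the SubstrateVM startup hook uses for EndChunkNativePeriodicEvents and EveryChunkNativePeriodicEvents.

```java
import jdk.jfr.Event;
import jdk.jfr.FlightRecorder;
import jdk.jfr.Name;
import jdk.jfr.Period;

// Hypothetical periodic event emitted once at the end of every chunk.
@Name("example.EndChunkStats")
@Period("endChunk") // or "everyChunk", "beginChunk", or a duration such as "10 s"
class EndChunkStats extends Event {
    long value;
}

class PeriodicRegistration {
    static void register() {
        FlightRecorder.addPeriodicEvent(EndChunkStats.class, () -> {
            EndChunkStats event = new EndChunkStats();
            event.value = 42; // gather whatever data the event reports
            event.commit();
        });
    }
}
```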


Thread locals

Valid as of GraalVM 23.0

Virtual Threads

Only carrier threads have fast thread locals. Virtual threads share their carrier's thread locals, so careful handling is needed to keep the data fresh (for example, when dealing with the Java-level EventWriter).

JFR Buffers

There are JFR hooks that run at the start and end of a thread's lifetime to set up and tear down its thread locals. Each platform thread has its own Java event buffer and native event buffer. They must be separate because the Java-level infrastructure (EventWriter) is reused, and concurrent writes at the Java and native level could otherwise destroy each other. Virtual threads can use the JFR buffers of their carrier threads without issue.

Buffer structure and concurrency

The JFR event buffers are structured so that flushing of committed data and writing of new data can safely happen concurrently. Committed data is flushed from the flushed position up to the committed position, while new data is only written at the write position (which is always at or after the committed position).
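An illustrative model of these positions and of the rule that flushers only touch the committed range (names are invented; the real buffers are unmanaged memory manipulated with raw pointers):

```java
/** Illustrative model of the buffer positions described above. */
final class JfrBufferPositionsSketch {
    private long flushedPos;            // data before this has already been flushed
    private volatile long committedPos; // data before this is complete and may be flushed
    private long writePos;              // next byte the owning thread will write

    /** Owning thread: write an event, then publish it by advancing committedPos. */
    void commitEvent(long eventSize) {
        writePos += eventSize;     // write the event payload first
        committedPos = writePos;   // publishing makes it visible to flushers
    }

    /** Flushing thread: only reads the range [flushedPos, committedPos). */
    long takeUnflushed() {
        long end = committedPos;
        long start = flushedPos;
        flushedPos = end;          // mark the range as flushed
        return end - start;        // number of bytes to persist
    }
}
```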


Startup

Edited March 27 2024

JFR initialization begins when the first Recording is created (not necessarily started) at runtime. This triggers the lazy creation and initialization of the PlatformRecorder singleton in OpenJDK Java-level code. When PlatformRecorder is initialized, createJFR() is called (the method has the same name in OpenJDK and SubstrateVM). The call propagates via substitutions to SubstrateVM-internal JFR code, where it is handled: this is where the recorder thread is started, the chunk writer is initialized, and the JFR in-flight buffers are set up.
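For example, simply creating a Recording through the public API at run time is enough to trigger this initialization path (generic JFR usage, not SubstrateVM-specific code):

```java
import jdk.jfr.Recording;

public class StartupExample {
    public static void main(String[] args) {
        // Creating the first Recording lazily initializes PlatformRecorder and,
        // through substitutions, calls createJFR() in SubstrateVM: the recorder
        // thread, chunk writer and in-flight buffers are set up at this point.
        Recording recording = new Recording();
        recording.start();
        // ... application work ...
        recording.stop();
        recording.close();
    }
}
```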

During the native image build, a startup hook is registered to handle some JFR-related bootstrapping tasks, such as registering periodic events via the JFR API, handling JFR command line arguments, and creating/starting a JFR recording.

Recordings can be created in ways other than via the command line at startup (e.g. via JMX or the JFR API). However, a recording created via the SubstrateVM startup hook will always be the first recording to be created and will therefore result in createJFR() being called.

Once the first recording is started, meta-events are created (events related to JFR itself), such as jdk.ActiveSetting and jdk.ActiveRecording. This happens in the OpenJDK Java-level method writeMetaEvents().

It is possible for JFR to miss some events because it is slow to start up. For example, in some contrived cases JFR only begins recording at a GC ID greater than 0, so it misses some early GC events. The primary goal of JFR is continuous monitoring and providing information in the event of a crash; it is aimed more at longer-running applications (at least in OpenJDK) and is not meant to capture data immediately at start-up. This is likely why it is so lenient about start-up time.


Throttler

Valid as of JDK 21+31

Overview

Each event that supports sampling is meant to have its own sampler instance. This instance is shared between all threads and must therefore be thread safe. For now, only the jdk.ObjectAllocationSample event supports sampling, but this may change in the future.

Implementation

Throttling is implemented in a very similar way to HotSpot, with some simplifications. This sampler is distinct from the method-profiling sampler, although some of the naming is similar.

OpenJDK

In HotSpot, the code resides in the JfrThrottler and AdaptiveSampler classes and is written in C++. It uses an adaptive-sampling approach built around the concept of "sampling debt". The user specifies a rate; the sampler tries to match that rate and never exceeds it. A windowing scheme with two alternating windows is used: if fewer samples were taken in the previous window than needed to match the specified rate, debt is accrued, and the sampler tries to pay it off during the current window by taking more samples.
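A toy, self-contained version of the sampling-debt idea is sketched below. It deliberately omits the randomization, window sizing, and averaging of the real JfrThrottler/AdaptiveSampler, and all names are invented.

```java
/** Toy illustration of window-based sampling debt. */
final class SamplingDebtSketch {
    private final long samplesPerWindow; // derived from the user-specified rate
    private long debt;                   // unmet quota carried from the previous window
    private long takenThisWindow;

    SamplingDebtSketch(long samplesPerWindow) {
        this.samplesPerWindow = samplesPerWindow;
    }

    /** Called for each event occurrence; returns whether to sample it. */
    synchronized boolean shouldSample() {
        if (takenThisWindow < samplesPerWindow + debt) {
            takenThisWindow++;
            return true;
        }
        return false;
    }

    /** Called when the window rotates. */
    synchronized void rotateWindow() {
        long missed = (samplesPerWindow + debt) - takenThisWindow;
        debt = Math.max(0, missed); // carry the unmet quota into the next window
        takenThisWindow = 0;
    }
}
```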


TraceID

Implementation

The JFR TraceID map contains all the JFR trace IDs, which are really just the class type IDs reused. The map size is determined at image build time and is set to the highest class type ID in DynamicHub. Each entry in the map is a set of flag bits describing whether the class was used in the previous and/or current epoch. The "used in current epoch" bit is set when an event emission puts a class into the event data.
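An illustrative sketch of such a flag map (bit names, masks, and methods are invented for this example; the real map is sized from the highest class type ID in DynamicHub at image build time):

```java
/** Illustrative per-class epoch flags indexed by class type ID. */
final class TraceIdMapSketch {
    static final byte USED_PREVIOUS_EPOCH = 0b01;
    static final byte USED_CURRENT_EPOCH  = 0b10;

    private final byte[] flags; // one entry per class type ID

    TraceIdMapSketch(int highestTypeId) {
        this.flags = new byte[highestTypeId + 1];
    }

    /** Called when an event emission writes this class into the event data. */
    void markUsedInCurrentEpoch(int classTypeId) {
        flags[classTypeId] |= USED_CURRENT_EPOCH;
    }

    /** Called at chunk rotation, when the epoch advances. */
    void shiftEpoch(int classTypeId) {
        byte usedNow = (byte) (flags[classTypeId] & USED_CURRENT_EPOCH);
        flags[classTypeId] = (usedNow != 0) ? USED_PREVIOUS_EPOCH : 0;
    }
}
```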