AMD / ROCm Integration

GPUFlight supports AMD GPUs via ROCm, including HIP kernel tracing, system telemetry, occupancy analysis, and ISA disassembly.

Prerequisites

ROCm 6.x or later
HIP runtime
ROCm SMI library
rocprofiler-sdk
CMake 3.28+

Build Setup

include(FetchContent)

FetchContent_Declare(
    gpufl
    GIT_REPOSITORY https://github.com/gpu-flight/gpufl-client.git
    GIT_TAG        main
)

set(GPUFL_ENABLE_AMD ON CACHE BOOL "" FORCE)
set(GPUFL_ENABLE_NVIDIA OFF CACHE BOOL "" FORCE)

FetchContent_MakeAvailable(gpufl)

hip_add_executable(my_app my_app.cpp)
target_link_libraries(my_app PRIVATE gpufl::gpufl hip::host)

HIP Example

#include <gpufl/gpufl.hpp>
#include <hip/hip_runtime.h>

__global__ void scaleKernel(int* data, int scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= scale;
}

int main() {
    gpufl::InitOptions opts;
    opts.app_name = "hip_demo";
    opts.sampling_auto_start = true;
    opts.system_sample_rate_ms = 50;
    opts.enable_kernel_details = true;
    gpufl::init(opts);

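    constexpr int N = 1 << 20;  // element count; a multiple of the 256-thread block size
    int* d_data = nullptr;
    hipMalloc(&d_data, N * sizeof(int));
    hipMemset(d_data, 0, N * sizeof(int));  // avoid scaling uninitialized memory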

    GFL_SCOPE("scale_loop") {
        for (int i = 0; i < 50; ++i) {
            scaleKernel<<<N/256, 256>>>(d_data, 2, N);
        }
        hipDeviceSynchronize();
    }

    hipFree(d_data);
    gpufl::shutdown();
    gpufl::generateReport();
}

Extended AMD Metrics

GPUFlight collects additional metrics on AMD GPUs via ROCm SMI:

Metric	Description
Junction Temperature	GPU junction (hotspot) temperature
Memory Temperature	VRAM temperature
Fan Speed	Fan speed percentage
Voltage	GFX voltage in millivolts
Energy	Cumulative energy consumption
PCIe Bandwidth	Combined PCIe read+write throughput
ECC Errors	Correctable and uncorrectable error counts

These appear automatically in the system metrics section of the report when available.

Occupancy on AMD

GPUFlight computes theoretical occupancy for AMD kernels using:

Wavefront size (typically 32 for RDNA, 64 for CDNA)
Max wavefronts per CU from the GPU architecture
VGPR usage per kernel (from rocprofiler code object metadata)
LDS (shared memory) usage per workgroup

The limiting resource is identified as "waves", "registers", or "shared_mem".

Note: AMD occupancy uses the architecture VGPR count only (not combined SGPR+VGPR). SGPRs have a separate allocation pool and don't limit VGPR occupancy.
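The calculation described above can be sketched as a minimum over three per-CU wave limits. This is an illustrative model, not GPUFlight's implementation: the `CuLimits` and `KernelUsage` structs, their field names, and the allocation-granule rounding are assumptions, and real hardware limits vary by architecture.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical per-CU hardware limits; real values depend on the GPU.
struct CuLimits {
    std::uint32_t simds_per_cu;
    std::uint32_t max_waves_per_simd;
    std::uint32_t vgprs_per_simd;   // VGPRs available per lane on one SIMD
    std::uint32_t vgpr_granule;     // VGPR allocation granularity (must be > 0)
    std::uint32_t lds_bytes_per_cu;
};

// Hypothetical per-kernel resource usage.
struct KernelUsage {
    std::uint32_t wavefront_size;
    std::uint32_t workgroup_size;
    std::uint32_t vgprs_per_wave;          // architecture VGPRs only, no SGPRs
    std::uint32_t lds_bytes_per_workgroup;
};

struct Occupancy {
    std::uint32_t waves_per_cu;
    double fraction;      // active waves divided by the hardware maximum
    std::string limiter;  // "waves", "registers", or "shared_mem"
};

Occupancy theoreticalOccupancy(const CuLimits& hw, const KernelUsage& k) {
    const std::uint32_t unlimited = std::numeric_limits<std::uint32_t>::max();
    const std::uint32_t max_waves = hw.simds_per_cu * hw.max_waves_per_simd;

    // Register limit: round VGPR usage up to the allocation granule.
    std::uint32_t by_regs = unlimited;
    if (k.vgprs_per_wave > 0) {
        const std::uint32_t alloc =
            (k.vgprs_per_wave + hw.vgpr_granule - 1) / hw.vgpr_granule * hw.vgpr_granule;
        by_regs = (hw.vgprs_per_simd / alloc) * hw.simds_per_cu;
    }

    // LDS limit: whole workgroups that fit, times waves per workgroup.
    std::uint32_t by_lds = unlimited;
    if (k.lds_bytes_per_workgroup > 0) {
        const std::uint32_t waves_per_wg =
            (k.workgroup_size + k.wavefront_size - 1) / k.wavefront_size;
        by_lds = (hw.lds_bytes_per_cu / k.lds_bytes_per_workgroup) * waves_per_wg;
    }

    const std::uint32_t active = std::min({max_waves, by_regs, by_lds});
    std::string limiter = "waves";
    if (active < max_waves) {
        limiter = (by_regs <= by_lds) ? "registers" : "shared_mem";
    }
    return {active, static_cast<double>(active) / max_waves, limiter};
}
```

Under these assumed limits (4 SIMDs, 8 waves per SIMD, 512 VGPRs per SIMD), a kernel using 128 VGPRs fits 4 waves per SIMD, so registers cap occupancy at half of the hardware maximum.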

ISA Disassembly

AMD ISA disassembly is captured automatically when GPU code objects are loaded. GPUFlight:

Captures the ELF code object during the CODE_OBJECT_LOAD callback
Computes a CRC32 for deduplication
Disassembles using llvm-objdump (from the ROCm LLVM toolchain)
Emits per-function instruction listings with PC offsets

The disassembly appears in the web UI under the "ISA" column (vs "SASS" for NVIDIA).
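The deduplication step can be sketched as a checksum lookup before disassembly. The example below assumes the standard CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320); the source does not say which CRC32 variant GPUFlight uses, and `crc32Ieee` and `CodeObjectCache` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Bitwise CRC-32 (IEEE 802.3): initial value and final XOR are 0xFFFFFFFF.
std::uint32_t crc32Ieee(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical cache that remembers which code objects were already seen.
class CodeObjectCache {
public:
    // Returns true the first time a given ELF image is seen, false afterwards.
    bool shouldDisassemble(const std::vector<std::uint8_t>& elf) {
        return seen_.insert(crc32Ieee(elf.data(), elf.size())).second;
    }

private:
    std::unordered_set<std::uint32_t> seen_;
};
```

A 32-bit checksum can collide for distinct code objects, so a stricter variant of this sketch might also compare sizes or full contents before skipping a disassembly.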

Known Limitations

No PC sampling on RDNA consumer GPUs: PC sampling requires MI200+ (CDNA) hardware
No SASS-equivalent metrics: Instruction-level metric collection is not yet available via rocprofiler-sdk for RDNA
Integrated GPU filtering: Integrated GPUs on AMD APUs (Ryzen with integrated graphics) are automatically excluded from telemetry to avoid polluting the metrics
