Real-Time ROCm GPU Monitoring with Web Dashboard

The Problem

When running AI workloads on AMD GPUs with ROCm, visibility into GPU performance is essential. While rocm-smi provides command-line access to GPU metrics, we needed something more:

Real-time monitoring without constant terminal polling
Integration with existing observability stacks
A clean web interface for quick status checks

Introducing rocm_info

We built rocm_info (repository: strix-halo-rocm_info), a lightweight monitoring tool written in Go that exposes ROCm GPU metrics through three interfaces: a web dashboard, a REST API, and a Prometheus endpoint.

Architecture

┌─────────────────┐     ┌──────────────────┐
│   rocm-smi      │────▶│   rocm_info      │
│   (ROCm CLI)    │     │   (Go Service)   │
└─────────────────┘     └────────┬─────────┘
                                 │
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
              ┌─────────┐  ┌─────────┐  ┌─────────┐
              │   Web   │  │  REST   │  │Prometheus│
              │Dashboard│  │   API   │  │ Metrics  │
              └─────────┘  └─────────┘  └─────────┘
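
To make the data flow in the diagram concrete, here is a minimal sketch of the collection path: run rocm-smi, decode its JSON output, and serve it over HTTP. It is not the project's actual code. The flag selection, the `/api/gpu` route, and the function names `readSMI`, `parseSMI`, and `gpuHandler` are assumptions made for illustration, and the field names inside rocm-smi's JSON vary between ROCm releases, so the sketch keeps them as opaque keys.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
)

// readSMI runs rocm-smi and returns its raw JSON output.
// Which flags the real tool passes is an assumption here.
func readSMI() ([]byte, error) {
	return exec.Command("rocm-smi", "--showtemp", "--showuse", "--showpower", "--json").Output()
}

// parseSMI decodes rocm-smi JSON into card -> field -> value.
// Top-level entries that are not JSON objects are skipped.
func parseSMI(raw []byte) (map[string]map[string]any, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return nil, err
	}
	cards := make(map[string]map[string]any)
	for name, msg := range top {
		var fields map[string]any
		if err := json.Unmarshal(msg, &fields); err != nil || fields == nil {
			continue
		}
		cards[name] = fields
	}
	return cards, nil
}

// gpuHandler collects a fresh reading on every request and returns it as JSON.
func gpuHandler(w http.ResponseWriter, r *http.Request) {
	raw, err := readSMI()
	if err != nil {
		http.Error(w, "rocm-smi failed: "+err.Error(), http.StatusBadGateway)
		return
	}
	cards, err := parseSMI(raw)
	if err != nil {
		http.Error(w, "unexpected rocm-smi output: "+err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(cards); err != nil {
		log.Printf("encode response: %v", err)
	}
}

func main() {
	http.HandleFunc("/api/gpu", gpuHandler) // hypothetical route
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Shelling out on every request keeps the sketch short. A real service would more likely poll on a timer and serve the cached reading, so that slow rocm-smi calls do not block HTTP clients.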

Features

Feature	Description
Web Dashboard	Real-time GPU status visualization
REST API	JSON endpoints for programmatic access
Prometheus Metrics	`/metrics` endpoint for Grafana integration
Lightweight	Single Go binary, minimal resource footprint
Strix Halo Support	Tested on Ryzen AI Max+ 395 (gfx1151)

Metrics Exposed

The tool captures key GPU metrics:

Metric	Description
Temperature	GPU core temperature
Utilization	GPU compute usage percentage
Memory	VRAM usage and availability
Power	Current power draw
Clock Speeds	GPU and memory frequencies
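
As a rough illustration of how the metrics in this table could map onto the Prometheus text exposition format, the sketch below renders one gauge per metric with a `card` label per GPU. The struct, the metric names (`rocm_gpu_*`), and the units are assumptions for this example; the names rocm_info actually exports may differ, so check its `/metrics` output before building dashboards on them.

```go
package main

import (
	"fmt"
	"strings"
)

// GPUSample holds one reading for one GPU.
// Field names and units are illustrative, not taken from rocm_info.
type GPUSample struct {
	Card         string
	TempC        float64
	UtilPercent  float64
	VRAMUsedB    float64
	VRAMTotalB   float64
	PowerW       float64
	CoreClockMHz float64
	MemClockMHz  float64
}

// metricLine formats one sample line in the Prometheus text format.
func metricLine(name, card string, v float64) string {
	return fmt.Sprintf("%s{card=%q} %g", name, card, v)
}

// renderMetrics writes HELP and TYPE headers plus one line per GPU for each gauge.
func renderMetrics(samples []GPUSample) string {
	var b strings.Builder
	gauge := func(name, help string, pick func(GPUSample) float64) {
		fmt.Fprintf(&b, "# HELP %s %s\n# TYPE %s gauge\n", name, help, name)
		for _, s := range samples {
			b.WriteString(metricLine(name, s.Card, pick(s)) + "\n")
		}
	}
	gauge("rocm_gpu_temperature_celsius", "GPU core temperature.", func(s GPUSample) float64 { return s.TempC })
	gauge("rocm_gpu_utilization_percent", "GPU compute usage.", func(s GPUSample) float64 { return s.UtilPercent })
	gauge("rocm_gpu_vram_used_bytes", "VRAM in use.", func(s GPUSample) float64 { return s.VRAMUsedB })
	gauge("rocm_gpu_vram_total_bytes", "VRAM available in total.", func(s GPUSample) float64 { return s.VRAMTotalB })
	gauge("rocm_gpu_power_watts", "Current power draw.", func(s GPUSample) float64 { return s.PowerW })
	gauge("rocm_gpu_core_clock_mhz", "GPU clock frequency.", func(s GPUSample) float64 { return s.CoreClockMHz })
	gauge("rocm_gpu_memory_clock_mhz", "Memory clock frequency.", func(s GPUSample) float64 { return s.MemClockMHz })
	return b.String()
}

func main() {
	// Made-up values, for illustration only.
	fmt.Print(renderMetrics([]GPUSample{{
		Card: "card0", TempC: 45, UtilPercent: 12,
		VRAMUsedB: 2 << 30, VRAMTotalB: 8 << 30,
		PowerW: 35, CoreClockMHz: 1200, MemClockMHz: 800,
	}}))
}
```

Using one gauge per metric with a `card` label, rather than one metric name per GPU, is what lets a single Grafana panel or alert rule cover every GPU in a multi-GPU system.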

Quick Start

git clone https://github.com/smarttechlabs-projects/strix-halo-rocm_info.git
cd strix-halo-rocm_info
go build -o rocm_info
./rocm_info

Access the dashboard at http://localhost:8080 and Prometheus metrics at http://localhost:8080/metrics.
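
To confirm the service is up before pointing Prometheus at it, a small Go check along these lines can fetch the metrics endpoint and print the sample lines. This is a sketch that assumes the default address shown above; `fetchMetrics` and `sampleLines` are hypothetical helper names, not part of rocm_info.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// sampleLines returns the lines of a Prometheus text exposition that carry
// samples, dropping blank lines and "#" comment lines (HELP and TYPE).
func sampleLines(body string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		out = append(out, line)
	}
	return out
}

// fetchMetrics downloads the exposition text from the given URL.
func fetchMetrics(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchMetrics("http://localhost:8080/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, "rocm_info not reachable:", err)
		os.Exit(1)
	}
	for _, line := range sampleLines(body) {
		fmt.Println(line)
	}
}
```

If the program prints nothing, the endpoint answered but exposed no samples, which usually points at the collection side rather than the HTTP side.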

Prometheus Integration

Add a scrape job to your prometheus.yml (if the file already has a scrape_configs section, append the job to it):

scrape_configs:
  - job_name: 'rocm_gpu'
    static_configs:
      - targets: ['localhost:8080']

Then create Grafana dashboards to visualize GPU performance alongside your other infrastructure metrics.

Use Cases

AI/ML Workloads: Monitor GPU utilization during training and inference
Capacity Planning: Track memory usage patterns over time
Alerting: Set up Prometheus alerts for temperature or utilization thresholds
Multi-GPU Systems: Monitor all GPUs from a single dashboard

Get the Code

Repository: https://github.com/smarttechlabs-projects/strix-halo-rocm_info

Contributions are welcome, whether additional metrics, UI improvements, or documentation.

Need custom monitoring solutions for your AI infrastructure? Let’s talk.
