Skip to main content

Real-Time ROCm GPU Monitoring with Web Dashboard

A lightweight Go-based tool for monitoring AMD GPUs with ROCm, featuring a web dashboard, REST API, and Prometheus metrics.

S

SmartTechLabs

SmartTechLabs - Intelligent Solutions for IoT, Edge Computing & AI

2 min read
Real-Time ROCm GPU Monitoring with Web Dashboard

The Problem

When running AI workloads on AMD GPUs with ROCm, visibility into GPU performance is essential. While rocm-smi provides command-line access to GPU metrics, we needed something more:

  • Real-time monitoring without constant terminal polling
  • Integration with existing observability stacks
  • A clean web interface for quick status checks

Introducing rocm_info

We built strix-halo-rocm_info, a lightweight monitoring tool written in Go that exposes ROCm GPU metrics through multiple interfaces.

Architecture

┌─────────────────┐     ┌──────────────────┐
│   rocm-smi      │────▶│   rocm_info      │
│   (ROCm CLI)    │     │   (Go Service)   │
└─────────────────┘     └────────┬─────────┘
                                 │
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
              ┌─────────┐  ┌─────────┐  ┌─────────┐
              │   Web   │  │  REST   │  │Prometheus│
              │Dashboard│  │   API   │  │ Metrics  │
              └─────────┘  └─────────┘  └─────────┘

Features

FeatureDescription
Web DashboardReal-time GPU status visualization
REST APIJSON endpoints for programmatic access
Prometheus Metrics/metrics endpoint for Grafana integration
LightweightSingle Go binary, minimal resource footprint
Strix Halo SupportTested on Ryzen AI Max 395 (gfx1151)

Metrics Exposed

The tool captures key GPU metrics:

MetricDescription
TemperatureGPU core temperature
UtilizationGPU compute usage percentage
MemoryVRAM usage and availability
PowerCurrent power draw
Clock SpeedsGPU and memory frequencies

Quick Start

1
2
3
4
git clone https://github.com/smarttechlabs-projects/strix-halo-rocm_info.git
cd strix-halo-rocm_info
go build -o rocm_info
./rocm_info

Access the dashboard at http://localhost:8080 and Prometheus metrics at http://localhost:8080/metrics.


Prometheus Integration

Add to your prometheus.yml:

1
2
3
4
scrape_configs:
  - job_name: 'rocm_gpu'
    static_configs:
      - targets: ['localhost:8080']

Then create Grafana dashboards to visualize GPU performance alongside your other infrastructure metrics.


Use Cases

  • AI/ML Workloads: Monitor GPU utilization during training and inference
  • Capacity Planning: Track memory usage patterns over time
  • Alerting: Set up Prometheus alerts for temperature or utilization thresholds
  • Multi-GPU Systems: Monitor all GPUs from a single dashboard

Get the Code

Repository: smarttechlabs-projects/strix-halo-rocm_info

Contributions welcome—whether it’s additional metrics, UI improvements, or documentation.


Need custom monitoring solutions for your AI infrastructure? Let’s talk.

Share this article
S

SmartTechLabs

Building Intelligent Solutions: IoT, Edge Computing, AI & LLM Integration