Real-Time Hardware Observability: Monitoring GPU Temps and NVMe Wear with Prometheus & Grafana

Take full control of your bare-metal infrastructure. Learn how to scrape root-level hardware sensors, track GPU temperatures, and monitor NVMe wear levels using a custom Prometheus and Grafana stack.

When you rent a virtual machine in the public cloud, the underlying hardware is abstracted away. You have no idea if your CPU is thermal throttling, if your storage drive is nearing the end of its write-cycle lifespan, or if your GPU's memory modules are overheating. The hypervisor hides everything from you.

But when you deploy an unmanaged Bare-Metal Dedicated Server with iDatam, you own the machine down to the motherboard. You have absolute, root-level access to every physical sensor, IPMI metric, and PCIe controller. With great power comes great responsibility: if you are running massive AI training jobs or high-frequency databases, you must monitor your own hardware health.

In this tutorial, we will build an enterprise-grade hardware observability stack. We will use Prometheus to scrape metrics, Grafana to visualize them, Node Exporter for baseline CPU and memory metrics, the NVIDIA DCGM Exporter to track GPU metrics (temperature, power draw, VRAM usage), and smartctl_exporter to monitor the wear level of your PCIe Gen 5 NVMe drives.

Step 1: Install Prometheus and Grafana

For ease of management, we will deploy the core Prometheus and Grafana services using Docker Compose. Ensure Docker and Docker Compose are installed on your Ubuntu 24.04 LTS server.

Create a directory for your observability stack:

bash

mkdir -p ~/observability/prometheus
cd ~/observability
                                

Create a basic Prometheus configuration file (prometheus/prometheus.yml) so the service can start. We will add our hardware targets to this later:

bash

nano prometheus/prometheus.yml
                                

Add the following:

yaml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
                                

Now, create the docker-compose.yml file:

bash

nano docker-compose.yml
                                

Add the following configuration:

yaml

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    # Host networking lets Prometheus reach the local exporters on localhost.
    # Port mappings do not apply in this mode; Prometheus listens on host port 9090.
    network_mode: "host"

  grafana:
    image: grafana/grafana:latest
    environment:
      # Replace this with your own strong password before starting the stack.
      - GF_SECURITY_ADMIN_PASSWORD=SuperSecretGrafanaPassword
    # Host networking again; Grafana listens on host port 3000.
    network_mode: "host"
                                

Start the core stack:

bash

sudo docker compose up -d
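
Before moving on, it is worth confirming that both services answer. The following is a minimal sketch, assuming curl is installed and the default ports (9090 and 3000) are in use; check_endpoint is a hypothetical helper name, not part of Prometheus or Grafana.

```shell
# Sketch: report whether an HTTP endpoint answers with status 200.
# check_endpoint is a hypothetical helper, not part of either project.
check_endpoint() {
  # $1 = label to print, $2 = URL to probe
  # curl prints 000 when it cannot connect; fall back to 000 if curl itself fails.
  code="$(curl -s -o /dev/null --max-time 3 -w '%{http_code}' "$2")" || code="000"
  if [ "$code" = "200" ]; then
    echo "$1: up"
  else
    echo "$1: DOWN (HTTP $code)"
  fi
}
```

Run it as check_endpoint prometheus http://localhost:9090/-/healthy and check_endpoint grafana http://localhost:3000/api/health. Grafana can take a few seconds to initialize after the container starts, so a first DOWN result is worth retrying.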
                                

Step 2: Install Node Exporter (CPU & RAM)

To get baseline server metrics (CPU load, memory usage, disk I/O), we need the standard Prometheus Node Exporter. It does not need root privileges, but it reads the host's /proc and /sys directories directly, so it is simplest to install it on the host OS rather than in a container.

Download and install Node Exporter:

bash

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-*linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
                                

Create a systemd service file:

bash

sudo nano /etc/systemd/system/node_exporter.service
                                

Add:

ini

[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
                                

Enable and start the service:

bash

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
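
To confirm the exporter is serving data, you can pull a single metric out of its text output. This is a small sketch, assuming Node Exporter is listening on its default port 9100; metric_value is a hypothetical helper name, and it only handles metrics without labels, such as node_load1.

```shell
# Sketch: print the value of one unlabelled metric from Prometheus
# text-format output read on stdin. metric_value is a hypothetical helper.
metric_value() {
  # $1 = exact metric name, e.g. node_load1
  # HELP and TYPE lines start with "#", so they never match the name.
  awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, curl -s http://localhost:9100/metrics | metric_value node_load1 should print the current one-minute load average.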
                                

Step 3: Install NVIDIA DCGM Exporter (GPU Metrics)

If you are renting an iDatam GPU Server, standard tools won't capture deep GPU metrics. You need NVIDIA's Data Center GPU Manager (DCGM).

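This step assumes the NVIDIA driver and the NVIDIA Container Toolkit are already installed (covered in our previous AI tutorials), so we can run the DCGM exporter as a Docker container.

Run the exporter, giving it access to all GPUs. The SYS_ADMIN capability lets DCGM read the profiling counters, and the restart policy brings the exporter back after a reboot:

bash

sudo docker run -d --gpus all --cap-add SYS_ADMIN \
  --restart unless-stopped -p 9400:9400 \
  --name nvidia-dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04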
                                

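(This exposes detailed GPU metrics on port 9400, including temperatures, power draw, and memory usage. Profiling counters such as Tensor Core utilization and PCIe bandwidth are also available, depending on the GPU model and the counters enabled in the exporter's configuration.)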
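
As a quick check before any dashboards exist, the exporter's output can be filtered for GPUs running above a temperature limit. The sketch below assumes the exporter publishes DCGM_FI_DEV_GPU_TEMP with a gpu label holding the GPU index; hot_gpus is a hypothetical helper name, and the limit you pass is your own choice, not an NVIDIA recommendation.

```shell
# Sketch: list GPUs whose DCGM_FI_DEV_GPU_TEMP sample exceeds a limit.
# Reads DCGM exporter text output on stdin. hot_gpus is a hypothetical helper.
hot_gpus() {
  # $1 = temperature limit in degrees Celsius
  awk -v limit="$1" '
    /^DCGM_FI_DEV_GPU_TEMP[{]/ {
      temp = $NF                      # the sample value is the last field
      gpu = "?"
      if (match($0, /gpu="[0-9]+"/))  # pull the index out of the gpu label
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
      if (temp + 0 > limit + 0)
        print "GPU " gpu ": " temp " C"
    }'
}
```

For example, curl -s http://localhost:9400/metrics | hot_gpus 80 prints one line per GPU above 80 degrees and nothing when all GPUs are below the limit. The value is read from the last field because label values such as the model name can contain spaces.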

Step 4: Install smartctl_exporter (NVMe Wear Leveling)

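NVMe drives have a finite write endurance, rated in Terabytes Written (TBW). If you are running an I/O-heavy database, you must monitor the "Percentage Used" SMART attribute, which is the drive's own estimate of how much of its rated endurance has been consumed, so you can replace drives before they wear out.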

Install the smartmontools package:

bash

sudo apt install smartmontools -y
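
With smartmontools installed, you can already read the raw write counter and compare it with the TBW rating on the drive's datasheet. The sketch below converts the "Data Units Written" line of smartctl output into terabytes; one NVMe data unit is 1000 x 512 bytes, that is 512,000 bytes. nvme_tb_written is a hypothetical helper name, and the parsing assumes the English output layout with comma thousands separators.

```shell
# Sketch: convert the "Data Units Written" line of `smartctl -a` output
# (read on stdin) into terabytes written. Hypothetical helper name.
nvme_tb_written() {
  awk -F: '
    /^Data Units Written/ {
      units = $2
      sub(/\[.*/, "", units)      # drop the human-readable "[... TB]" suffix
      gsub(/[ ,]/, "", units)     # drop padding and thousands separators
      # 512,000 bytes per data unit, 10^12 bytes per terabyte
      printf "%.2f\n", units * 512000 / 1000000000000
    }'
}
```

For example, sudo smartctl -a /dev/nvme0 | nvme_tb_written, where /dev/nvme0 is an assumed device name; list yours with sudo smartctl --scan.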
                                

Download the smartctl_exporter:

bash

cd /tmp
wget https://github.com/prometheus-community/smartctl_exporter/releases/download/v0.11.0/smartctl_exporter-0.11.0.linux-amd64.tar.gz
tar xvfz smartctl_exporter-*.tar.gz
sudo mv smartctl_exporter-*/smartctl_exporter /usr/local/bin/
                                

Because this exporter requires root privileges to read raw disk data via smartctl, create a root-level systemd service:

bash

sudo nano /etc/systemd/system/smartctl_exporter.service
                                

Add:

ini

[Unit]
Description=Prometheus SMART Exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/bin/smartctl_exporter
Restart=always

[Install]
WantedBy=multi-user.target
                                

Enable and start it:

bash

sudo systemctl daemon-reload
sudo systemctl enable --now smartctl_exporter
                                

(This exposes NVMe SMART metrics on port 9633).
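
The exporter's output can be turned into a per-drive summary of remaining rated life. This is a sketch under one assumption: that the exporter publishes smartctl_device_percentage_used with a device label, which you should confirm against your own /metrics output. nvme_life_remaining is a hypothetical helper name.

```shell
# Sketch: print remaining rated life per drive from smartctl_exporter
# text output read on stdin. Hypothetical helper name.
nvme_life_remaining() {
  awk '
    /^smartctl_device_percentage_used[{]/ {
      dev = "?"
      if (match($0, /device="[^"]+"/))   # pull the name out of the device label
        dev = substr($0, RSTART + 8, RLENGTH - 9)
      left = 100 - $NF
      if (left < 0) left = 0             # the attribute can exceed 100 on worn drives
      print dev ": " left "% rated life left"
    }'
}
```

For example, curl -s http://localhost:9633/metrics | nvme_life_remaining. The figure reflects the vendor's endurance estimate, so treat it as a planning signal rather than a guarantee of when a drive will fail.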

Step 5: Update Prometheus Configuration

Now, we must tell Prometheus to scrape all three exporters we just set up.

Edit your prometheus.yml file:

bash

nano ~/observability/prometheus/prometheus.yml
                                

Append the new jobs under scrape_configs, keeping the same two-space indentation as the existing prometheus job:

yaml

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'nvidia_dcgm'
    static_configs:
      - targets: ['localhost:9400']

  - job_name: 'smartctl_nvme'
    static_configs:
      - targets: ['localhost:9633']
                                

Restart the Prometheus container to apply the changes:

bash

cd ~/observability && sudo docker compose restart prometheus
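
After the restart, you can check that every scrape target is healthy without opening the web UI. The sketch below counts targets whose health is anything other than "up" in the response of Prometheus's /api/v1/targets endpoint. It is a crude text match on the compact JSON that Prometheus returns, not a JSON parser, and count_unhealthy_targets is a hypothetical helper name.

```shell
# Sketch: count scrape targets that are not reporting "up".
# Reads the JSON body of /api/v1/targets on stdin. Hypothetical helper name.
count_unhealthy_targets() {
  grep -o '"health":"[a-z]*"' |
    awk '$0 != "\"health\":\"up\"" { bad++ } END { print bad + 0 }'
}
```

For example, curl -s http://localhost:9090/api/v1/targets | count_unhealthy_targets should print 0 once each job has completed a successful scrape. A target shows as "unknown" until its first scrape, so allow one scrape interval (15 seconds here) before checking.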
                                

Step 6: Visualize in Grafana

Your server is now successfully collecting thousands of hardware data points every 15 seconds. Let's visualize them.

  1. Open your browser and navigate to Grafana at http://YOUR_SERVER_IP:3000.

  2. Log in using admin and the password SuperSecretGrafanaPassword (configured in Step 1).

  3. Go to Connections > Data Sources, select Prometheus, and enter http://localhost:9090 as the URL. Click Save & Test.

  4. Go to Dashboards > Import.

Instead of building dashboards from scratch, you can import pre-built, community-standard dashboards using their IDs:

  • For Node Exporter (CPU/RAM): Enter ID 1860 and click Load.

  • For NVIDIA DCGM (GPUs): Enter ID 12239 and click Load.

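  • For NVMe SMART Data: Enter ID 10530 and click Load. If its panels stay empty, check that the dashboard's queries use the smartctl_ metric prefix that smartctl_exporter emits; dashboards written for other SMART collectors expect different metric names.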

You now have a mission-control command center showing real-time power draw and temperatures for your GPUs, CPU load and temperatures, and the percentage of rated endurance each of your Gen 5 NVMe drives has used.
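
If you rebuild servers often, the data source from the numbered steps can also be registered through Grafana's HTTP API instead of the UI. The following is a minimal sketch, assuming curl is installed and Grafana accepts basic authentication for the admin user (its default behavior); the function names and the GRAFANA_URL variable are hypothetical.

```shell
# Sketch: register Prometheus as a Grafana data source over the HTTP API.
# Function names and GRAFANA_URL are hypothetical, chosen for this example.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

datasource_payload() {
  # $1 = Prometheus URL as Grafana should reach it
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1"
}

add_prometheus_datasource() {
  # $1 = Grafana admin password, $2 = Prometheus URL
  datasource_payload "$2" |
    curl -s -u "admin:$1" -H 'Content-Type: application/json' \
      -X POST --data @- "$GRAFANA_URL/api/datasources"
}
```

For example, add_prometheus_datasource 'SuperSecretGrafanaPassword' http://localhost:9090 mirrors what step 3 does by hand. Passing the password on the command line leaves it in your shell history, so prefer reading it from a protected file or environment variable on shared machines.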

Conclusion: Total Hardware Transparency

Hardware failure in the enterprise space is inevitable; the key to maintaining uptime is predictability. By monitoring your own hardware sensors at the bare-metal level, you can proactively migrate workloads or swap drives before a catastrophic failure occurs, often with weeks of warning.

This level of granular observability is simply impossible on managed cloud platforms that obscure hardware realities to protect their margins.

When you deploy your infrastructure on iDatam’s Unmetered Dedicated Servers, you aren't just renting compute power—you are taking ownership of the metal. Enjoy transparent, root-level hardware access, and monitor your infrastructure exactly how your DevOps team requires.
