When you rent a virtual machine in the public cloud, the underlying hardware is abstracted away. You usually cannot tell whether your CPU is thermal throttling, whether your storage drive is nearing the end of its write endurance, or whether your GPU's memory modules are overheating. The hypervisor hides most of this from you.

But when you deploy an unmanaged Bare-Metal Dedicated Server with iDatam, you own the machine down to the motherboard. You have absolute, root-level access to every physical sensor, IPMI metric, and PCIe controller. With great power comes great responsibility: if you are running massive AI training jobs or high-frequency databases, you must monitor your own hardware health.

In this tutorial, we will build an enterprise-grade hardware observability stack. We will use Prometheus to scrape metric data, Grafana to visualize it, Node Exporter for CPU and memory metrics, the NVIDIA DCGM Exporter to track GPU metrics (temperature, power draw, VRAM usage), and smartctl_exporter to monitor the wear level of your PCIe Gen 5 NVMe drives.

What You'll Learn

Step 1: Install Prometheus and Grafana

Step 2: Install Node Exporter (CPU & RAM)

Step 3: Install NVIDIA DCGM Exporter (GPU Metrics)

Step 4: Install smartctl_exporter (NVMe Wear Leveling)

Step 5: Update Prometheus Configuration

Step 6: Visualize in Grafana

Conclusion: Total Hardware Transparency

Step 1: Install Prometheus and Grafana

For ease of management, we will deploy the core Prometheus and Grafana services using Docker Compose. Ensure Docker Engine and the Docker Compose plugin (the docker compose command) are installed on your Ubuntu 24.04 LTS server.

Create a directory for your observability stack:

bash


mkdir -p ~/observability/prometheus
cd ~/observability

Create a basic Prometheus configuration file (prometheus/prometheus.yml) so the service can start. We will add our hardware targets to this later:

bash


nano prometheus/prometheus.yml

Add the following:

yaml


global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Now, create the docker-compose.yml file:

bash


nano docker-compose.yml

Add the following configuration:

yaml


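services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    # Host networking lets Prometheus reach the local exporters on localhost.
    # Port mappings do not apply in host mode; Prometheus listens on host port 9090 directly.
    network_mode: "host"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    environment:
      # Example password only; replace it before exposing the server.
      - GF_SECURITY_ADMIN_PASSWORD=SuperSecretGrafanaPassword
    # Host mode: Grafana listens on host port 3000 directly.
    network_mode: "host"
    restart: unless-stopped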
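The compose file above hard-codes the Grafana admin password. One alternative, sketched here, is to generate a random password into a .env file next to docker-compose.yml. This assumes you change the environment line to GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}, since Docker Compose substitutes variables from a .env file in the project directory. The variable name GRAFANA_ADMIN_PASSWORD and the function name are made up for this example.

```bash
# Write a random Grafana admin password into <dir>/.env (default: current directory).
# GRAFANA_ADMIN_PASSWORD is a hypothetical variable name chosen for this sketch.
write_grafana_env() {
  _dir="${1:-.}"
  # 24 random bytes encode to 32 base64 characters.
  _pw="$(head -c 24 /dev/urandom | base64 | tr -d '\n')"
  # umask 077 makes the file readable by its owner only.
  ( umask 077; printf 'GRAFANA_ADMIN_PASSWORD=%s\n' "$_pw" > "$_dir/.env" )
}

# Example (run from ~/observability before starting the stack):
#   write_grafana_env .
#   cat .env    # note the password for the Grafana login
```

If you go this route, log in to Grafana with the generated password instead of the example one.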

Start the core stack:

bash


sudo docker compose up -d
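The containers can take a few seconds to start serving requests. As a rough sketch, a small retry helper can wait until each service answers its health endpoint. The helper name wait_for and the WAIT_DELAY variable are invented for this example; /-/ready (Prometheus) and /api/health (Grafana) are the services' own health endpoints.

```bash
# Retry a probe command until it succeeds or the attempt budget runs out.
# Usage: wait_for <max_attempts> <command> [args...]
# WAIT_DELAY (seconds between attempts, default 2) is a knob invented for this sketch.
wait_for() {
  _attempts="$1"; shift
  _delay="${WAIT_DELAY:-2}"
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$_i" -lt "$_attempts" ]; then
      sleep "$_delay"
    fi
    _i=$((_i + 1))
  done
  return 1
}

# Example (run on the server after `docker compose up -d`):
#   wait_for 15 curl -fsS http://localhost:9090/-/ready    && echo "Prometheus ready"
#   wait_for 15 curl -fsS http://localhost:3000/api/health && echo "Grafana ready"
```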

Step 2: Install Node Exporter (CPU & RAM)

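To get baseline server metrics (CPU load, memory usage, disk I/O), we need the standard Prometheus Node Exporter. Because it reads the host's /proc and /sys filesystems, it is simplest to install it directly on the host OS rather than in a container. It does not need root for these metrics, which is why the service below runs as the unprivileged nobody user.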

Download and install Node Exporter:

bash


cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-*linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
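Before moving a downloaded binary into /usr/local/bin, it is worth checking the tarball against the release checksums. The sketch below assumes the release publishes a sha256sums.txt file alongside the tarball and that you have downloaded it to the same directory; the function name verify_sha256 is invented for this example.

```bash
# Compare a file's SHA-256 digest with the entry listed for it in a checksum file.
# Usage: verify_sha256 <sums_file> <file>; returns 0 only if the digests match.
verify_sha256() {
  _sums="$1"
  _file="$2"
  _name="$(basename "$_file")"
  # Checksum lines look like "<hex digest>  <filename>" (a leading "*" marks binary mode).
  _expected="$(awk -v n="$_name" '{ f = $2; sub(/^\*/, "", f); if (f == n) print $1 }' "$_sums")"
  if [ -z "$_expected" ]; then
    echo "no checksum listed for $_name" >&2
    return 1
  fi
  _actual="$(sha256sum "$_file" | awk '{ print $1 }')"
  [ "$_expected" = "$_actual" ]
}

# Example (assumes both files are in /tmp):
#   verify_sha256 /tmp/sha256sums.txt /tmp/node_exporter-1.7.0.linux-amd64.tar.gz \
#     && echo "checksum OK"
```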

Create a systemd service file:

bash


sudo nano /etc/systemd/system/node_exporter.service

Add:

ini


[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

Enable and start the service:

bash


sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
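To confirm the exporter is serving data, you can read a single metric straight from its /metrics endpoint on port 9100. The helper below is a minimal sketch: metric_value is an invented name, and it assumes the exporter emits samples without trailing timestamps, so the value is the last field on each line.

```bash
# Print the value of every sample of one metric from Prometheus text
# exposition format read on stdin. Comment lines (# HELP, # TYPE) are skipped.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }
    {
      # The metric name ends at the first "{" (labels) or at the first space.
      metric = $1
      sub(/[{].*/, "", metric)
      if (metric == name) print $NF
    }'
}

# Example (on the server):
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```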

Step 3: Install NVIDIA DCGM Exporter (GPU Metrics)

If you are renting an iDatam GPU Server, standard tools won't capture deep GPU metrics. You need NVIDIA's Data Center GPU Manager (DCGM).

Since we already have the NVIDIA Container Toolkit installed (from our previous AI tutorials), we can run the DCGM exporter as a Docker container.

Run the exporter, giving it access to all GPUs:

bash


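# --restart unless-stopped keeps the exporter running across reboots
# (swapped in for --rm, which would discard the container when it stops).
sudo docker run -d --gpus all --restart unless-stopped -p 9400:9400 \
  --name nvidia-dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04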

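(This exposes per-GPU metrics such as temperature, power draw, and framebuffer memory usage on port 9400. Profiling counters such as Tensor Core activity and PCIe throughput depend on the GPU model and on the exporter's counters file.)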
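As a quick check that GPU metrics are flowing, the sketch below scans the exporter's output for GPUs whose core temperature exceeds a limit. It assumes the DCGM_FI_DEV_GPU_TEMP metric and its gpu label are present in your exporter's output; the function name hot_gpus and the 80-degree limit are only illustrative.

```bash
# List GPUs whose DCGM_FI_DEV_GPU_TEMP sample exceeds a limit (degrees Celsius).
# Reads dcgm-exporter text output on stdin. Usage: hot_gpus <limit>
hot_gpus() {
  awk -v limit="$1" '
    /^DCGM_FI_DEV_GPU_TEMP[{]/ {
      temp = $NF
      gpu = "?"
      # Pull the GPU index out of the gpu="N" label.
      if (match($0, /gpu="[^"]*"/)) {
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
      }
      if (temp + 0 > limit + 0) print "gpu=" gpu " temp=" temp
    }'
}

# Example (on the server):
#   curl -s http://localhost:9400/metrics | hot_gpus 80
```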

Step 4: Install smartctl_exporter (NVMe Wear Leveling)

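NVMe drives have a finite write endurance, which vendors rate in Terabytes Written (TBW). If you are running an I/O-heavy database, monitor the "Percentage Used" SMART attribute, the drive's own estimate of how much of its rated endurance has been consumed, so you can replace drives before they wear out.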

Install the smartmontools package:

bash


sudo apt update && sudo apt install -y smartmontools
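With smartmontools installed, you can read the wear figures by hand before wiring up the exporter. The helpers below are a sketch that parses the text output of smartctl -a for an NVMe device. They assume the usual "Percentage Used" and "Data Units Written" lines with comma thousands separators, and the function names are invented for this example. The NVMe specification defines one data unit as 1000 blocks of 512 bytes, so units x 512,000 bytes gives total bytes written.

```bash
# Extract "Percentage Used" (without the % sign) from `smartctl -a` output on stdin.
percentage_used() {
  awk -F: '/^Percentage Used:/ { gsub(/[ %]/, "", $2); print $2 }'
}

# Extract the raw "Data Units Written" counter from `smartctl -a` output on stdin.
data_units_written() {
  awk -F: '/^Data Units Written:/ {
    v = $2
    sub(/\[.*/, "", v)      # drop the human-readable "[... TB]" suffix
    gsub(/[ ,]/, "", v)     # drop padding and thousands separators
    print v
  }'
}

# Convert NVMe data units (512,000 bytes each) to decimal terabytes.
data_units_to_tb() {
  awk -v u="$1" 'BEGIN { printf "%.2f\n", u * 512000 / 1e12 }'
}

# Example (on the server; /dev/nvme0 is an assumed device name):
#   sudo smartctl -a /dev/nvme0 | percentage_used
#   data_units_to_tb "$(sudo smartctl -a /dev/nvme0 | data_units_written)"
```

Comparing the terabytes-written figure against the drive's rated TBW gives a second view of wear alongside the drive's own percentage estimate.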

Download the smartctl_exporter:

bash


cd /tmp
wget https://github.com/prometheus-community/smartctl_exporter/releases/download/v0.11.0/smartctl_exporter-0.11.0.linux-amd64.tar.gz
tar xvfz smartctl_exporter-*.tar.gz
sudo mv smartctl_exporter-*/smartctl_exporter /usr/local/bin/

Because this exporter requires root privileges to read raw disk data via smartctl, create a root-level systemd service:

bash


sudo nano /etc/systemd/system/smartctl_exporter.service

Add:

ini


[Unit]
Description=Prometheus SMART Exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/bin/smartctl_exporter
Restart=always

[Install]
WantedBy=multi-user.target

Enable and start it:

bash


sudo systemctl daemon-reload
sudo systemctl enable --now smartctl_exporter

(This exposes NVMe SMART metrics on port 9633).
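To check the exporter from the command line, the sketch below flags drives whose reported wear exceeds a limit. It assumes the exporter publishes a smartctl_device_percentage_used metric with a device label; verify the name in your own /metrics output. The function name worn_drives and the 80 percent limit are illustrative.

```bash
# List drives whose smartctl_device_percentage_used sample exceeds a limit.
# Reads smartctl_exporter text output on stdin. Usage: worn_drives <limit_percent>
worn_drives() {
  awk -v limit="$1" '
    /^smartctl_device_percentage_used[{]/ {
      used = $NF
      dev = "?"
      # Pull the device name out of the device="..." label.
      if (match($0, /device="[^"]*"/)) {
        dev = substr($0, RSTART + 8, RLENGTH - 9)
      }
      if (used + 0 > limit + 0) print dev " " used "%"
    }'
}

# Example (on the server):
#   curl -s http://localhost:9633/metrics | worn_drives 80
```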

Step 5: Update Prometheus Configuration

Now, we must tell Prometheus to scrape all three exporters we just set up.

Edit your prometheus.yml file:

bash


nano ~/observability/prometheus/prometheus.yml

Append the new jobs under scrape_configs, keeping them at the same indentation as the existing prometheus job:

yaml


  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'nvidia_dcgm'
    static_configs:
      - targets: ['localhost:9400']

  - job_name: 'smartctl_nvme'
    static_configs:
      - targets: ['localhost:9633']
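YAML indentation mistakes are easy to make when appending by hand. As a sketch, the helper below writes the complete merged file (the Step 1 contents plus the three new jobs) in one go. The function name write_prometheus_config is invented for this example, and running it overwrites whatever is at the target path.

```bash
# Write the full Prometheus configuration (Step 1 job plus the three exporter jobs).
# Usage: write_prometheus_config <output_path>   (overwrites the file)
write_prometheus_config() {
  cat > "$1" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'nvidia_dcgm'
    static_configs:
      - targets: ['localhost:9400']

  - job_name: 'smartctl_nvme'
    static_configs:
      - targets: ['localhost:9633']
EOF
}

# Example (on the server):
#   write_prometheus_config ~/observability/prometheus/prometheus.yml
#   cd ~/observability && sudo docker compose exec prometheus \
#     promtool check config /etc/prometheus/prometheus.yml
```

The promtool binary ships in the prom/prometheus image, so the second command can validate the mounted file from inside the running container before you restart it.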

Restart the Prometheus container to apply the changes:

bash


cd ~/observability && sudo docker compose restart prometheus
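After the restart, you can ask Prometheus which targets it considers healthy through its /api/v1/targets endpoint. The sketch below counts targets by health state without needing a JSON parser. The function name count_targets is invented, and it assumes the compact JSON that the API returns.

```bash
# Count scrape targets in a given health state (up, down, or unknown).
# Reads the JSON body of /api/v1/targets on stdin. Usage: count_targets <state>
count_targets() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Example (on the server); with all four jobs healthy this should report 4:
#   curl -s http://localhost:9090/api/v1/targets | count_targets up
```

A non-zero count for down usually means an exporter is not running or is listening on a different port than the one in prometheus.yml.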

Step 6: Visualize in Grafana

Prometheus is now scraping all three exporters every 15 seconds. Let's visualize the data.

Open your browser and navigate to Grafana at http://YOUR_SERVER_IP:3000.
Log in using admin and the password SuperSecretGrafanaPassword (configured in Step 1).
Go to Connections > Data sources, click Add new data source, select Prometheus, and enter http://localhost:9090 as the URL (this works because Grafana shares the host network). Click Save & test.
Go to Dashboards > Import.

Instead of building dashboards from scratch, you can import pre-built, community-standard dashboards using their IDs:

For Node Exporter (CPU/RAM): Enter ID 1860 and click Load.
For NVIDIA DCGM (GPUs): Enter ID 12239 and click Load.
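For NVMe SMART Data: Enter ID 10530 and click Load. If its panels stay empty, check that the dashboard's queries use the smartctl_* metric names that smartctl_exporter exposes.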

You now have a mission-control view of your server: real-time GPU power draw, CPU temperature and frequency data that can reveal thermal throttling, and the "Percentage Used" wear estimate for each of your Gen 5 NVMe drives.

Conclusion: Total Hardware Transparency

Hardware failure in the enterprise space is inevitable; the key to maintaining high uptime is predictability. By monitoring your own hardware sensors at the bare-metal level, you can often spot wear and thermal trends early enough to migrate workloads or swap drives before a failure occurs.

This level of granular observability is generally not available on managed cloud platforms, where the hypervisor hides the underlying hardware.

When you deploy your infrastructure on iDatam’s Unmetered Dedicated Servers, you aren't just renting compute power—you are taking ownership of the metal. Enjoy transparent, root-level hardware access, and monitor your infrastructure exactly how your DevOps team requires.