Homelab Observability: Grafana + Prometheus + Loki

2 minute read

“If it isn’t observable, it’s a rumor.”

Today’s homelab sprint wasn’t about new apps. It was about confidence. I wanted a single place to answer: What’s healthy, what’s noisy, and what’s about to fail? So I built a full observability stack around Grafana + Prometheus + Loki and wired in node, container, log, and probe signals.

Note: This post does not expose internal IPs or Tailscale DNS names. Any references to hosts are intentionally generic.

✅ What I Shipped

Metrics pipeline

Prometheus scrapes host and container metrics.
Node Exporter feeds CPU, memory, disk, network, uptime.
cAdvisor adds per-container metrics.
Blackbox Exporter probes HTTP service health.

Logs pipeline

Promtail ships container + system logs.
Loki stores and indexes logs.
Grafana overlays metrics + logs in one view.

Dashboards

Homelab Overview (status, CPU/memory by node, top containers, errors)
Services Overview (CPU, memory, disk, network, uptime, status per node/service)
TrueNAS Netdata Overview (Netdata metrics)
Router Overview (Wi-Fi + router stats)
Jellyfin Overview (service health + host metrics)

🧭 Architecture (ASCII)

                   +--------------------+
                   |      Grafana       |
                   |  Dashboards + UX   |
                   +---------+----------+
                             |
                             | queries
          +------------------+------------------+
          |                                     |
  +-------+-------+                     +-------+-------+
  |  Prometheus   |                     |     Loki      |
  |  Metrics DB   |                     |   Logs DB     |
  +-------+-------+                     +-------+-------+
          |                                     |
  +-------+-------+                     +-------+-------+
  | Node Exporter |                     |   Promtail    |
  | Host Metrics  |                     | Log Shipper   |
  +-------+-------+                     +-------+-------+
          |
  +-------+-------+     +------------------------+
  |  cAdvisor     |     | Blackbox Exporter      |
  | Container CPU |     | HTTP/TCP Probes        |
  +---------------+     +------------------------+

In short: metrics flow into Prometheus, logs flow into Loki, and Grafana sits above both.

🖼️ Snapshot (Dashboard excerpt)

Homelab observability dashboard excerpt

⚙️ Implementation Highlights

1) Service labels for readable graphs

I added service labels in Prometheus targets so dashboards show friendly names rather than raw URLs. That one change made every legend and table readable at a glance.

2) Netdata for TrueNAS

TrueNAS didn’t expose node-exporter easily, so I deployed Netdata and scraped its Prometheus output instead. It cleanly feeds CPU, RAM, disk, and network with zero hacks.

3) Router metrics (OpenWrt)

OpenWrt node-exporter Lua modules give me Wi-Fi signal, bitrate, and stations. I kept those on a dedicated Router Overview dashboard to avoid noise.

4) Kiosk-style Overview embedded in Homepage

I enabled Grafana anonymous viewer mode and embedded the Homelab Overview via an iframe. That makes the homepage feel like a real NOC wall without needing a login.

🧨 Gotchas (And Fixes That Worked)

Grafana iframe login loop
- Fix: enable GF_SECURITY_ALLOW_EMBEDDING, anonymous viewer, and proper GF_SERVER_ROOT_URL with SameSite=None + Secure cookies.
cAdvisor couldn’t talk to Docker
- Fix: bump cAdvisor to a newer version that supports the current Docker API.
Prometheus/Loki permissions
- Fix: set ownership of data directories to the UID used by the container images.
Router exporter only bound to loopback
- Fix: change the OpenWrt exporter to listen on LAN (not 127.0.0.1).
Homepage bound to LAN only
- Fix: ensure Tailscale Serve proxies to the LAN IP, not localhost.

These are the little operational details that turn dashboards from “demo” into “reliable.”

✅ Results

What I can answer instantly now:

Are core services responding? (blackbox)
Which node is hot or memory-bound? (node-exporter)
Which container is noisy? (cAdvisor)
What’s spamming logs? (Loki)

That’s the difference between reactive firefighting and proactive ops.

🔜 Next Up

Add alerting channels (email or chat)
Optional: add Proxmox export metrics
Add top slow endpoints for services with APIs

📚 References

🧠 Meta Description

Built a homelab observability stack with Grafana, Prometheus, Loki, and Promtail. This post documents the architecture, dashboards, and real-world fixes needed to make the system reliable.

Share on

LinkedIn
X.com (Twitter) Reddit

Mat Leszkiewicz