Note: This is a living document. The homelab evolves constantly. Last major update: March 2026.
Every project I build needs somewhere to run. For years I rented VPSs, wrestled with cloud credits, and accepted that "free tier" actually meant "free until you need persistence." Then I acquired a retired Threadripper workstation and realized I could stop paying rent on other people's computers.
This post documents how a single bare-metal machine running Kubernetes became the infrastructure layer for my entire digital life — from music streaming to AI inference to this very website.
Why One Node?
The homelab community loves high availability. Three-node clusters, automatic failover, geographic distribution. I respect it, but it's not what I need right now.
I chose a single-node architecture for three reasons:
1. Simplicity is availability. A distributed system has more failure modes than a single machine. With one node, when something breaks I know exactly where to look. No network partitions, no split-brain, no consensus algorithms to debug at 2 AM.
2. Cost efficiency. One Threadripper 3970X (32 cores), 256GB ECC RAM, and ~337TB of ZFS storage costs less than three modest nodes with equivalent total resources. The hardware was acquired over time — NVMe boot drives first, then spinning rust in batches as deals appeared.
3. Learning focus. I wanted to learn Kubernetes application patterns, not Kubernetes cluster administration. Running a single control plane lets me focus on the workloads rather than etcd backup strategies.
The tradeoff is acceptable downtime. If the machine dies, services go offline until I fix it. Since this is personal infrastructure serving primarily me (and occasionally family), that's fine. The system has been surprisingly stable — 99.9% uptime over the past year with most downtime being planned maintenance.
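For scale, 99.9% is less forgiving than it sounds, but still plenty for personal infrastructure. It works out to under nine hours of allowable downtime per year:

```shell
# Downtime budget for 99.9% uptime, in hours per year
awk 'BEGIN { printf "%.2f\n", (1 - 0.999) * 365 * 24 }'
# → 8.76
```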
The Hardware
| Component | Specification | Purpose |
|---|---|---|
| CPU | AMD Threadripper 3970X (32c/64t) | VM workloads, compilation, AI inference |
| RAM | 256GB ECC DDR4-3200 | ZFS ARC cache, VM memory, general buffer |
| Boot | 2× 2TB NVMe (mirror) | OS, container images, databases |
| Storage | 337TB ZFS (2× raidz3 vdevs) | Media, backups, bulk storage |
| SLOG/L2ARC | NVMe + 9.2TB SSD tier | Sync writes, hot cache |
| Network | Intel I211 1GbE + Tailscale | LAN + mesh VPN |
The storage architecture is the interesting part. Two RAID-Z3 vdevs with 10 drives each provide redundancy for the media collection while keeping usable capacity high. The drives are a mix of sizes acquired opportunistically — newer 20TB drives in one vdev, older 8-14TB in the other. ZFS handles this gracefully.
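As a back-of-envelope check with purely illustrative drive sizes (not my exact inventory): a 10-wide RAID-Z3 vdev stores data on 7 drives, and ZFS sizes every member of a raidz vdev to its smallest drive, so a mixed-size vdev is bounded by its smallest members:

```shell
# Rough usable capacity for two hypothetical 10-wide raidz3 vdevs.
# raidz3 spends 3 drives per vdev on parity; a mixed-size vdev is
# limited to the size of its smallest member.
awk 'BEGIN {
  vdev1 = (10 - 3) * 20   # 10x 20TB drives (hypothetical)
  vdev2 = (10 - 3) * 8    # 10x mixed 8-14TB drives, capped at 8TB each
  printf "usable ~ %dTB\n", vdev1 + vdev2
}'
# → usable ~ 196TB
```

Headline raw capacity and usable capacity are different numbers; parity and the smallest-drive rule eat the difference.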
An NVMe SLOG (separate intent log) was critical. Without it, synchronous writes — which databases and many applications require — would bottleneck on spinning rust. The SLOG absorbs the sync traffic and acknowledges it quickly; the data still reaches the main pool through ZFS's normal transaction groups, and the log itself is only replayed after a crash.
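The whole layout can be sketched in three zpool commands. This is a shape, not my actual build — device names here are placeholders:

```shell
# Hypothetical pool layout sketch; device paths are placeholders.
zpool create tank \
  raidz3 sda sdb sdc sdd sde sdf sdg sdh sdi sdj \
  raidz3 sdk sdl sdm sdn sdo sdp sdq sdr sds sdt

# Mirrored NVMe devices as the SLOG for synchronous writes
zpool add tank log mirror nvme2n1 nvme3n1

# SSD as L2ARC read cache
zpool add tank cache sdu
```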
Software Stack
Everything runs on Kubernetes. Not because it's trendy, but because it's the right abstraction for my brain.
Operating System: Ubuntu LTS on bare metal. I tried Proxmox and TrueNAS Scale but found them opinionated in ways that got in my way. Plain Ubuntu with K3s installed directly gives me full control.
Container Orchestration: K3s in single-node mode. It's a lightweight, CNCF-certified Kubernetes distribution that runs on containerd. The entire Kubernetes datastore lives on the mirrored NVMe boot drives.
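"Installed directly" really is that direct: the upstream installer is a one-liner, K3s defaults to server (single-node) mode, and the kubeconfig lands at /etc/rancher/k3s/k3s.yaml:

```shell
# Install K3s in single-node (server) mode via the official script
curl -sfL https://get.k3s.io | sh -

# One node runs the control plane and all workloads
sudo k3s kubectl get nodes
```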
Storage:
- ZFS for host-level storage management
- democratic-csi to expose ZFS datasets as Kubernetes PVCs
- hostPath mounts for shared media directories that multiple pods need
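With democratic-csi wired up, a workload requests a ZFS-backed dataset like any other CSI volume. A sketch of what that looks like — the claim name is hypothetical and the StorageClass name is whatever the driver was configured to register:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: navidrome-data      # hypothetical claim
  namespace: media
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: zfs     # assumed democratic-csi StorageClass name
  resources:
    requests:
      storage: 20Gi
```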
Infrastructure as Code: OpenTofu (Terraform fork). Every service is defined in .tf files. A single tofu apply can recreate the entire cluster state. Secrets live in .tfvars files that are gitignored but backed up separately.
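A typical service definition is only a few lines of HCL. This is a hypothetical example in that shape — the chart repository URL and values path are assumptions, not copied from my repo:

```hcl
# Hypothetical helm_release following the pattern described above
resource "helm_release" "navidrome" {
  name       = "navidrome"
  namespace  = "media"
  repository = "https://bjw-s.github.io/helm-charts"  # assumed repo URL
  chart      = "app-template"
  values     = [file("${path.module}/values/navidrome.yaml")]
}
```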
Ingress: ingress-nginx in hostPort mode. This preserves real client IPs (important for services like Navidrome and Authentik) while allowing the pfSense router to forward traffic directly to the node.
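In chart terms, hostPort mode is roughly the following ingress-nginx values. This is a sketch; verify the exact keys against your chart version:

```yaml
controller:
  kind: DaemonSet          # one replica per node; only one node here anyway
  hostPort:
    enabled: true          # bind 80/443 directly on the host
  service:
    enabled: false         # assumed: no LoadBalancer needed with hostPort
```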
Secret Management: Kubernetes Secrets created by Terraform, referencing variables from .tfvars. No Vault, no external secret store. For a single-node homelab, this is sufficient.
Service Topology
Services are organized by namespace:
| Namespace | Purpose | Example Services |
|---|---|---|
| media | Entertainment stack | Jellyfin, Navidrome, qBittorrent/VPN, *arr apps |
| apps | Custom applications | g2-bridge, shoparr, health tracker |
| identity | Authentication | Authentik (server, worker, postgres, redis) |
| ingress | Network entry | ingress-nginx controller |
| storage | CSI driver | democratic-csi for ZFS provisioning |
| backups | Data protection | Restic REST server, Backrest |
| knowledge | Wiki | MediaWiki mirror of English Wikipedia |
| monitoring | Dashboard | Homepage (service dashboard) |
| ai | LLM inference | Open WebUI (frontend for Ollama) |
| memory | Knowledge graph | Memory MCP + Neo4j |
Each namespace has appropriate NetworkPolicy rules. Most deny ingress by default and explicitly allow only the required traffic (ingress-nginx → service, intra-namespace for related pods). Egress is generally unrestricted, except for www, which is locked down tightly.
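The deny-by-default pattern is two small manifests per namespace. Namespace and label values here are illustrative:

```yaml
# Default-deny: with an empty podSelector, this matches every pod
# in the namespace and allows no ingress at all.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: media
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
---
# Then explicitly allow traffic from the ingress controller's namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-nginx
  namespace: media
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress
  policyTypes: ["Ingress"]
```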
A Day in the Life
Morning: Check the Homepage dashboard over coffee. All green. Music starts playing from Navidrome (currently: Fred again.. - ten days). The status bar on kennysworld.xyz shows this track to anyone visiting my site — a small ambient detail I enjoy.
Working: VS Code connects to a devcontainer running on the homelab via Tailscale. The g2-bridge project lets me voice-control OpenCode from my smart glasses, with the entire pipeline running locally — voice transcription (Whisper), AI processing (OpenCode CLI), and display output all stay within the LAN.
Evening: Jellyfin streams 4K HDR content to the living room TV. The *arr stack (Sonarr, Radarr, Lidarr, Prowlarr, Readarr) has been busy — new episodes appeared automatically, downloaded via qBittorrent through a VPN tunnel (gluetun sidecar), and organized into the ZFS pool. The RAK4631 air quality sensor in my office publishes MQTT readings that Home Assistant tracks.
Night: Backrest runs incremental backups to the Restic REST server. If the house burns down, I can restore from the offsite copy. Meanwhile, Ollama keeps a 176GB model loaded in RAM for late-night coding questions.
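The restic side of that is roughly the following; the repository URL and paths are placeholders, and restic expects credentials via its usual environment variables:

```shell
# Assumes RESTIC_PASSWORD (or RESTIC_PASSWORD_FILE) is set.
# Incremental backup to a Restic REST server (placeholder host/path).
restic -r rest:http://backup-host:8000/homelab backup /tank/important \
  --exclude-caches

# Periodically verify repository integrity
restic -r rest:http://backup-host:8000/homelab check
```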
What I Learned
ZFS is non-negotiable. The combination of copy-on-write, checksums, snapshots, and flexible vdev layouts makes it perfect for long-term data storage. I've caught silent corruption that would have gone unnoticed on lesser filesystems.
GitOps is worth the overhead. Having every service defined in Terraform means I can reason about the entire infrastructure by reading files. When something breaks, I can see exactly what changed. The "Landing the Plane" protocol in my AGENTS.md ensures I never leave uncommitted changes sitting on the server.
Kubernetes complexity is manageable at small scale. The learning curve is steep, but once you internalize the abstractions (Pods, Services, PVCs, ConfigMaps), adding a new service is copy-paste-modify. The bjw-s app-template Helm chart provides a consistent structure.
Tailscale is magic. WireGuard mesh networking with zero configuration. I can access any service from anywhere without opening firewall holes or remembering IP addresses. The only wrinkle: MagicDNS complicates LAN-only services if you're not careful.
Current Obsessions
Memory MCP: Building a knowledge graph that stores context from my OpenCode sessions and Obsidian notes. Running Qwen3.5-35B on an external GPU via vLLM proxy. The graph queries use both vector similarity and graph traversal — feels like the future of personal knowledge management.
g2-bridge: The smart glasses project consumes most of my creative energy right now. Making it truly useful requires solving hard problems in voice UX, display constraints, and latency. But the feeling of asking Claude to refactor code while walking around the house is worth it.
Wikipedia Mirror: Weekly imports of the English Wikipedia dump into a self-hosted MediaWiki. 107GB of compressed XML becomes browsable knowledge. TemplateStyles and QuickInstantCommons make it usable. The import takes 6-8 hours and runs Sunday mornings.
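The import itself is the stock MediaWiki maintenance flow, sketched here with a placeholder dump path:

```shell
# Import the dump with link-table updates deferred (much faster),
# then rebuild the affected tables afterwards.
php maintenance/importDump.php --no-updates \
  /data/enwiki-latest-pages-articles.xml.bz2
php maintenance/rebuildrecentchanges.php
```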
What's Next
The homelab isn't finished — it's never finished. Some upcoming projects:
- Distributed storage: Evaluating Ceph or Longhorn to eventually move beyond single-node ZFS
- Backup automation: Better alerting when backups fail, not just when they succeed
- Power monitoring: Track actual power consumption per workload
- Public services: Selectively exposing more services to the internet with proper security
Numbers That Might Interest You
- Services running: 33+ Helm releases
- Containers: ~60 pods
- Storage utilization: ~40% of 337TB
- Monthly power: ~$85 (estimated)
- Uptime: 99.9% over past year
- Network traffic: ~2TB/month, mostly media streaming
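For what it's worth, the power estimate implies a plausible draw. At a hypothetical $0.15/kWh (your rate will differ), $85/month works out to:

```shell
# Back-of-envelope: what $85/month implies at a hypothetical $0.15/kWh
awk 'BEGIN {
  kwh = 85 / 0.15                 # kWh per month
  watts = kwh * 1000 / (30 * 24)  # average draw over a 30-day month
  printf "%.0f kWh/month, ~%.0f W average\n", kwh, watts
}'
# → 567 kWh/month, ~787 W average
```

Several hundred watts around the clock is in the right ballpark for a loaded Threadripper plus twenty spinning drives.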
Conclusion
Building this infrastructure taught me more about systems engineering than any job ever could. When you own the entire stack — from bare metal to application code — you develop intuition for where problems originate that you simply can't get from managed services.
The homelab is also a statement of values. I believe in owning my data, in building things that last, in the dignity of self-sufficiency. Every service I host myself is a small act of resistance against the enclosure of the digital commons.
If you're thinking about building your own: start smaller than this. A Raspberry Pi running Docker is enough to learn. Scale up when you hit real limits, not imagined ones. The Threadripper was overkill for my first two years of homelabbing. But now that it's here, I'm glad I have the headroom.
Questions welcome. Find me on the site status bar or via the usual channels.
Hardware notes: Threadripper 3970X, 256GB ECC, ZFS pool "tank" with 2× RAID-Z3 vdevs + NVMe SLOG/L2ARC. Full specs in the architecture docs.
Software notes: Ubuntu LTS, K3s single-node, OpenTofu IaC, ~33 Helm releases across 10 namespaces.
Published: March 28, 2026 | Tags: homelab, k3s, infrastructure, self-hosting, kubernetes, zfs