What I Learned Running a Homelab Like a Startup

Why Running a Homelab Beats Any Course

Running a homelab taught me more about when infrastructure tools become necessary than any certification course. Here’s what hands-on infrastructure ownership looks like, and why it’s the difference between knowing what tools do and knowing when you’d actually need them.

Most engineers know what tools like Kubernetes do. They’ve learned one from a tutorial and added it to a resume. Ask when, specifically, you’d actually reach for one, and the answer gets vague.

Most people will say “scale.” Fair enough. But how much scale? Ten services? A hundred? At what point does Docker Compose stop being enough? The honest answer for most engineers is: I don’t know.

That’s not an insult. It’s a structural problem with how most of us learn. We study tools in the abstract, or implement them because a ticket says to. We learn that Kubernetes exists and what it does. We rarely learn why someone would replace Docker Compose with Kubernetes, or what their stack looked like the day they did.

What’s missing is the experience of actually needing the tool: personally running into the problem it solves, getting burned by it, and then going looking for an answer.

My homelab, a second-hand Intel NUC I bought on Facebook Marketplace, gave me that moment. Repeatedly. Often at inconvenient times. And in doing so, it taught me why these tools exist and when you’d actually need any of them.

Remote Access Without Opening Ports

The first thing I wanted after setting up my home server was remote access: the ability to reach my services when I wasn’t home.

The obvious answer is a VPN. The obvious implementation, OpenVPN, requires you to forward a port from your router to your server. Open a hole in your firewall and let the outside world knock on it. Simple enough.

Except: a home router is not a managed firewall. It runs software that rarely gets patched. Every port you expose is a permanent, 24/7 attack surface on a device you can’t harden to production standards. Home IPs do get scanned; automated scanners hit every address. And if something did get in, I’d find out from a service going dark, not a log. All of this for a convenience feature didn’t feel like a reasonable trade.

What I found instead was Tailscale, a mesh VPN that lets devices reach each other directly without anyone opening inbound ports. Nothing is exposed. Nothing is open.

Making a Home Server Self-Healing

Once Tailscale was running, I depended on the server being reachable. Which meant I needed to think through what happens when it isn’t.

Power cut: machine off, no way to turn it back on remotely. Container crash: doesn’t restart by default. Kernel panic: machine hangs without rebooting. Tailscale frozen: Restart=on-failure won’t catch a process that’s technically alive but unresponsive.

The fixes:

auto power-on after AC restore in the BIOS
restart: unless-stopped on every container
kernel panic auto-reboot via sysctl (kernel.panic set to a nonzero number of seconds)
systemd watchdog timer monitoring Tailscale’s health check endpoint
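
The last fix is the least obvious, so here is a minimal Python sketch of the watchdog idea, not the actual systemd configuration. It treats a health command that hangs as a failure, which is exactly the case Restart=on-failure misses. The names command_responds and Watchdog, the three-failure threshold, and the idea of using a CLI command as the probe are assumptions made for illustration.

```python
import subprocess
from typing import Callable, Sequence


def command_responds(cmd: Sequence[str], timeout_s: float) -> bool:
    """Return True only if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A hang (timeout) or a missing binary both count as unhealthy.
        return False
    return result.returncode == 0


class Watchdog:
    """Call restart() after max_failures consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], restart: Callable[[], None],
                 max_failures: int = 3) -> None:
        self.probe = probe
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def tick(self) -> bool:
        """Run one probe; return True if this tick triggered a restart."""
        if self.probe():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures < self.max_failures:
            return False
        self.restart()
        self.failures = 0
        return True
```

Wired up for real, the probe might be something like command_responds(["tailscale", "status"], 5.0) and the restart might run systemctl restart tailscaled, with tick() called from a timer. Requiring several consecutive failures keeps one slow response from bouncing the VPN.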

Every one of them exists because I thought through the failure mode and closed it. That’s the thought process behind every resilience pattern in enterprise infrastructure: HA pairs, auto-scaling, runbooks, on-call rotations. All invented by someone who thought through a failure mode.

Managing Docker Containers with Portainer

A few weeks in, I had no idea what was running.

Services started with docker run, flags copied from documentation, some with compose files and some without. Nothing consistent. docker ps | grep told me what was up at that moment; it told me nothing about why memory was climbing or which container had died overnight.

Portainer fixed this: one screen, every container, its status, logs, and resource usage. Stack-based management instead of per-container commands. The lesson wasn’t a better workflow for running Docker. It was that visibility is a first-class concern at any scale. Portainer handles mine. Kubernetes exists for what Portainer can’t: many hosts, hundreds of containers, rolling deploys. Not my problem today.
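
The smallest version of that visibility is answering "what is not running right now?" in one step. The sketch below assumes the JSON-lines output that docker ps -a --format '{{json .}}' produces, with one object per container carrying Names and State fields. The function names and the sample output are made up for illustration.

```python
import json
from typing import Dict, List


def parse_ps(output: str) -> List[Dict[str, str]]:
    """Parse JSON-lines output from docker ps into one dict per container."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def not_running(containers: List[Dict[str, str]]) -> List[str]:
    """Return the sorted names of containers whose state is not 'running'."""
    return sorted(
        c.get("Names", "?") for c in containers if c.get("State") != "running"
    )


# Hypothetical sample output; real output carries more fields per container.
SAMPLE = "\n".join([
    '{"Names": "jellyfin", "State": "running", "Status": "Up 3 days"}',
    '{"Names": "vaultwarden", "State": "exited", "Status": "Exited (137) 6 hours ago"}',
])

stopped = not_running(parse_ps(SAMPLE))
```

A dashboard does the same thing continuously and keeps the history, which is the part a one-off command can’t give you.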

Reverse Proxy and HTTPS with Nginx Proxy Manager and Let’s Encrypt

Every service’s address looked like https://192.168.1.33:8096. That works. It’s not something you send to anyone. The self-signed cert means every browser greets you with a security warning, which trains you to click through warnings. That’s a problem of its own.

Nginx Proxy Manager sits in front of all services on ports 80 and 443. One wildcard certificate for *.nuxlet.com, issued by Let’s Encrypt via the DNS-01 challenge through Cloudflare: no ports exposed, just a DNS TXT record. Every service gets a subdomain: jellyfin.nuxlet.com, vault.nuxlet.com, qbt.nuxlet.com. Clean URLs, and the browser trusts them.
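
Two ideas carry that setup: the proxy picks an upstream from the Host header, and a wildcard certificate covers exactly one label to the left of the domain. Here is a rough Python sketch of both, not how Nginx Proxy Manager implements them. The hostnames come from the setup above; the upstream addresses, ports, and function names are placeholders.

```python
from typing import Dict, Optional

# Hypothetical routing table: upstream addresses and ports are placeholders.
UPSTREAMS: Dict[str, str] = {
    "jellyfin.nuxlet.com": "http://192.168.1.33:8096",
    "vault.nuxlet.com": "http://192.168.1.33:8080",
}


def wildcard_covers(pattern: str, host: str) -> bool:
    """True if a certificate name such as '*.nuxlet.com' covers host.

    The wildcard stands for exactly one leftmost label, so it matches
    neither the bare domain nor a deeper subdomain.
    """
    pattern, host = pattern.lower(), host.lower()
    if not pattern.startswith("*."):
        return pattern == host
    suffix = pattern[1:]  # e.g. ".nuxlet.com"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label


def route(host_header: str, upstreams: Dict[str, str] = UPSTREAMS) -> Optional[str]:
    """Map a Host header (optionally with a port) to an upstream, or None."""
    host = host_header.split(":", 1)[0].strip().lower()  # ignores IPv6 literals
    return upstreams.get(host)
```

The one-label rule is why every service lives directly under the domain: a name like a.b.nuxlet.com would need a second certificate.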

TLS termination and reverse proxies are not infrastructure nerdery. They’re the gap between “running a thing” and “running a product.” They’re the reason ingress controllers exist in Kubernetes, the reason there’s an entire role called platform engineer, and the reason “I just deployed the app” is the beginning of the work and not the end of it. Someone has to handle what developers don’t think about. I now understand what they’re not thinking about.

Backup Strategy for Self-Hosted Services

Every failure I had until this point was recoverable. Services went down and came back up. Configs got corrupted and I rebuilt them. Nothing was irreplaceable.

Then I added Vaultwarden, an unofficial community-maintained server that implements the Bitwarden client API. Every password, every 2FA seed, everything I rely on to access every other service. On my server. Under my control. If the disk died that night, all of it was gone.

For the first time, I had data worth protecting.

The backup setup: restic running daily at 3am, encrypting and shipping vault data to Backblaze B2. Thirty daily snapshots, twelve monthly. I tested a restore before I trusted the setup.
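
To make "thirty daily, twelve monthly" concrete, here is a simplified Python model of that kind of retention rule. It follows the general idea behind restic’s forget command with --keep-daily and --keep-monthly (keep the newest snapshot in each of the most recent N days and months that have one), but it is a sketch, not restic’s implementation. The function names are assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Set


def sample_schedule(newest: datetime, count: int) -> List[datetime]:
    """Generate count timestamps, one per day, ending at newest."""
    return [newest - timedelta(days=i) for i in range(count)]


def snapshots_to_keep(times: Iterable[datetime], keep_daily: int = 30,
                      keep_monthly: int = 12) -> Set[datetime]:
    """Return the snapshots a daily-plus-monthly retention rule would keep."""
    keep: Set[datetime] = set()
    seen_days = set()
    seen_months = set()
    # Walk newest first, so the first snapshot seen in a bucket is its latest.
    for t in sorted(times, reverse=True):
        day = t.date()
        month = (t.year, t.month)
        if day not in seen_days and len(seen_days) < keep_daily:
            seen_days.add(day)
            keep.add(t)
        if month not in seen_months and len(seen_months) < keep_monthly:
            seen_months.add(month)
            keep.add(t)
    return keep
```

With one snapshot per day, the two rules overlap on the newest snapshot, so a long-running schedule settles at 41 retained snapshots rather than 42.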

I know that if I have to restore, I’m losing at most one day of password updates. That’s my recovery point objective. I know roughly how long a restore takes. That’s my recovery time objective.
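
A recovery point objective only holds if every run actually happened. The following is a small sketch, assuming you can list snapshot timestamps, that finds the worst gap between consecutive snapshots and compares it to a one-day target. The function names and the one-day default are illustrative.

```python
from datetime import datetime, timedelta
from typing import Iterable


def worst_gap(snapshots: Iterable[datetime], now: datetime) -> timedelta:
    """Longest stretch without a snapshot, including the time since the last one."""
    points = sorted(snapshots)
    if not points:
        raise ValueError("no snapshots: the recovery point is unbounded")
    points.append(now)
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def rpo_met(snapshots: Iterable[datetime], now: datetime,
            rpo: timedelta = timedelta(days=1)) -> bool:
    """True if no gap between snapshots ever exceeded the objective."""
    return worst_gap(snapshots, now) <= rpo
```

A single skipped night doubles the worst gap to two days, which is the kind of silent drift a check like this is meant to surface.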

RPO and RTO had been words I knew from studying for certifications. Once I had Vaultwarden running, they became obvious. Of course you need to define how much data loss is acceptable before you build a backup system; otherwise how do you know if the system is good enough? Of course enterprises run DR drills, not because auditors require them, but because “we have backups” and “we can actually restore from backups” are two very different claims that need separate verification.

Backup strategy isn’t an afterthought. It’s a design decision you make when you first decide that the data matters. The question I had to be able to answer, and now can, is whether the backup actually works when you need it.

What’s Still Missing

Each layer I added revealed what it couldn’t handle. The previous layer trains your eye for the next one.

Identity and Access

A dozen services, each with its own login screen, user database, and access model. No audit trail, no single place to revoke access if a password leaks. Other people now use some of these services, which means I’m creating an account for each person separately on each service. That’s not a policy, it’s a workaround.

Keycloak and oauth2-proxy point toward the solution. This is why IAM is an entire job at any company running more than a handful of services: “just add another user” doesn’t scale, and “just reset the password” isn’t an access policy.

Monitoring and Observability

Services fail silently. I find out Vaultwarden is down when a login fails, not from an alert. I find out the disk is full when a container crashes, not when it hits 80%.

The plan: Prometheus scraping container and host metrics, Grafana surfacing them as dashboards, blackbox-exporter pinging each service endpoint on a schedule. Together they’d tell you a service is unreachable before anyone tries to use it. On-call rotations exist for the same reason: someone needed to know before the users did.
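
Before any of that is installed, the two checks that would have caught both failures fit in a few lines. This is a stand-in sketch using only the Python standard library, not a replacement for Prometheus or blackbox-exporter. The function names are assumptions, and the 80% threshold is the one mentioned above.

```python
import http.client
import shutil
import urllib.request


def endpoint_up(url: str, timeout_s: float = 5.0) -> bool:
    """True if an http(s) URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Refused, timed out, malformed URL, or an HTTP error status.
        return False


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem holding path that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(used_fraction: float, threshold: float = 0.80) -> bool:
    """True once usage reaches the threshold, before the disk is actually full."""
    return used_fraction >= threshold
```

Run on a schedule, with results sent somewhere you will actually look, this is the minimum version of knowing before the users do.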

Redundancy at the Disk Level

Data accumulates. A media library, photos, important files. The natural response is a NAS with RAID. And the moment you start thinking about that, you understand why storage engineers talk about redundancy at the disk level, not just at the backup level. RAID protects against hardware failure. Backups protect against everything else. These are separate problems.

Power flickers kill writes. A UPS is the answer. And when you’ve reasoned your way to “I need an uninterruptible power supply for my home server,” you understand, concretely, why data centers have generators. Not for comfort. For data integrity.

You Don’t Need a Homelab. You Need the Feedback Loop

Each layer was the answer to a problem the previous layer created. Reliability came after dependency. Visibility came after scale. Presentation came after URLs started mattering. Backups came after data did. That’s not a coincidence. It’s what any system looks like when it grows from real use rather than a reference architecture: each layer creates the problem the next one solves.

The documentation habit I’ve built around this, a self-hosted wiki and runbooks for the non-obvious fixes, exists for the same reason enterprise teams write runbooks. Not because the process is valuable in itself, but because complexity demands it. When a service breaks at 11pm and I haven’t touched it in three months, I want a note that explains what’s wrong and how to fix it. So does every on-call engineer on any team anywhere.

The insight here isn’t “go build a homelab.” If you can’t, or won’t, that’s a reasonable position. Hardware requires space, electricity, and a tolerance for things breaking at bad moments.

The method is: run something small that actually matters to you, where you own the full stack and broken means broken for you. A maintained VPS, a self-hosted side project, a home network you’re responsible for. The homelab is one version of that.

What you’re looking for is a system where you own the problem end to end: you build it, it breaks, you fix it, you build what stops it breaking again.

The homelab lets you hit smaller versions of the same problems the tools were built to solve. Once you have, you understand them differently. Not just what they do, but why anyone built them and when you’ll actually need them. Chances are, you’ve already read about the solution. You just haven’t run into the problem yet.

What I’m running (for those who want the specifics):

Category	Tool	Purpose
Hardware	Intel NUC (secondhand, ~$50)	Home server
Networking	Tailscale	Zero-config mesh VPN
Containers	Docker + Portainer	Container management UI
Proxy / TLS	Nginx Proxy Manager + Let’s Encrypt + Cloudflare	Reverse proxy, wildcard HTTPS via DNS challenge
Passwords	Vaultwarden	Self-hosted password manager
Backups	restic → Backblaze B2	Encrypted offsite backups
IAM (planned)	Keycloak + oauth2-proxy	Single sign-on
Monitoring (planned)	Prometheus + Grafana + blackbox-exporter	Metrics, dashboards, endpoint probing