Building a Homelab with Flux GitOps: Lessons from Three Months
How I built a home Kubernetes cluster with Flux, SOPS secrets, and a dashboard that actually works — plus the skills that helped document everything along the way.
I've been running a homelab for a few years, but this March I decided to do it properly. Talos Linux on a Mac Mini, Flux for GitOps, SOPS for secrets, and a homepage dashboard that shows everything at a glance.
Three months in, I have a working cluster with Immich, Plex, Grafana, CrowdSec, Traefik, and a handful of other services. More importantly, I have documentation that actually compounds — each problem I solved made the next one easier.
Here's what I built, what broke, and how I'd do it again.
The Stack
- Talos Linux — Immutable, API-driven Kubernetes. No SSH, no shell, no apt-get. Sounds limiting until you realize you never need to debug "what changed on the node."
- Flux — GitOps controller. Everything lives in a git repo, Flux reconciles it to the cluster. Push a change, it lands. Revert a change, it's gone.
- SOPS + Age — Encrypted secrets in git. No plaintext, no sealed-secrets complexity. Decrypt with your age key, apply, done.
- Homepage — Dashboard with widgets for every service. Internal URLs, API tokens, done.
- Traefik — Ingress with ACME DNS challenges for *.example.com certs.
- CrowdSec — Bouncer + LAPI for intrusion detection at the edge.
The Journey (aka Things That Broke)
Storage Classes Matter More Than You Think
Early mistake: I used NFS (nfs-synology) for everything. Loki crashed in a persistent loop with directory not empty errors. PostgreSQL had WAL corruption.
The problem: NFS doesn't reliably provide the POSIX fsync and file-locking semantics that databases depend on.
| Storage Class | Use For | Don't Use For |
|---|---|---|
| local-path | Databases, WAL, TSDB | — |
| nfs-synology | Photo libraries, bulk storage | Anything with strict fsync |
Lesson: local-path for databases and anything else that needs strict fsync, NFS for bulk storage. One decision fixed a week of crash loops.
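A minimal sketch of the database side of that rule, assuming a storage class named local-path is installed as in the table above. The claim name, namespace, and size are hypothetical placeholders:

```yaml
# Hypothetical PVC for a database volume; name, namespace, and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: databases
spec:
  accessModes:
    - ReadWriteOnce              # a local-path volume lives on a single node
  storageClassName: local-path   # node-local disk with real fsync semantics
  resources:
    requests:
      storage: 20Gi
```

Bulk data such as a photo library would use the same shape with storageClassName: nfs-synology.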
Node Scheduling Has Layers
The cluster went through four scheduling strategies:
- Worker node only — simple, single point of failure
- All nodes with control-plane tolerations — apps ran on control plane, messy
- Worker-only apps — clean isolation
- Monitoring on control plane — freed worker resources when I ran out
Current rule: Apps on workers. Monitoring on control plane. Documented exceptions in HelmRelease comments.
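In pod-spec terms, the "monitoring on control plane" exception can be sketched roughly like this. It assumes the control-plane nodes carry the standard node-role.kubernetes.io/control-plane label and NoSchedule taint; where the fragment goes depends on each chart's values layout.

```yaml
# Pod spec fragment, for example under spec.template.spec of a Deployment
# or the equivalent nodeSelector/tolerations keys in a chart's values.
nodeSelector:
  node-role.kubernetes.io/control-plane: ""   # only schedule onto control-plane nodes
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule                        # tolerate the control-plane taint
```

Regular apps need neither block: as long as the taint is in place and they carry no toleration, they stay on the workers.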
Helm Upgrades: The SSA Trap
Flux HelmRelease upgrades failed with:
invalid operation: cannot use force conflicts and force replace together
Turns out install.serverSideApply is a boolean (false), but upgrade.serverSideApply is an enum string ("disabled"). Same field name, different types.
spec:
  install:
    serverSideApply: false # boolean
  upgrade:
    serverSideApply: disabled # enum string
    force: true
The OpenTelemetry operator upgrade (0.109 → 0.110) removed kube-rbac-proxy, changing Service port layouts. Three-way merge created duplicate metrics ports. Fixed by pinning the version and using force: true + serverSideApply: disabled.
Lesson: Check chart changelogs for breaking infrastructure changes. Stay one version behind if unsure.
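Put together, a pinned HelmRelease might look something like the sketch below. The install/upgrade settings are the ones described above; the namespace, source name, and version string are placeholders rather than values from the real cluster.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: opentelemetry-operator
  namespace: monitoring            # assumed namespace
spec:
  interval: 30m
  chart:
    spec:
      chart: opentelemetry-operator
      version: "X.Y.Z"             # placeholder: pin an exact known-good version, not a range
      sourceRef:
        kind: HelmRepository
        name: open-telemetry       # assumed source name
        namespace: flux-system
  install:
    serverSideApply: false         # boolean
  upgrade:
    serverSideApply: disabled      # enum string
    force: true
```

Pinning an exact version means an upstream release can't land until the version line changes in Git, which leaves time to read the changelog first.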
Synology + Traefik = Special Handling
Getting Traefik to proxy Synology DSM/Drive required three specific fixes:
- No CrowdSec on Synology routes — DSM's WebSocket-based UI and asset loading break with request inspection middleware
- Disable HTTP/2 upstream — Some DSM builds have HTTP/2 bugs causing connection failures
- EndpointSlice-only — Mixing v1 Endpoints with EndpointSlice causes ~50% 502s when IPs differ
apiVersion: traefik.io/v1alpha1
kind: ServersTransport
metadata:
  name: synology-insecure
  namespace: traefik
spec:
  insecureSkipVerify: true
  disableHTTP2: true
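The transport only takes effect once a route references it. A rough sketch of such a route, assuming a Service named synology-dsm in the traefik namespace; the hostname and the port (5001, DSM's usual HTTPS port) are assumptions:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: synology-dsm
  namespace: traefik
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: "Host(`dsm.example.com`)"    # hypothetical hostname
      # Deliberately no CrowdSec middleware on this route.
      services:
        - name: synology-dsm
          port: 5001                      # assumed DSM HTTPS port
          scheme: https
          serversTransport: synology-insecure
```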
Cluster hygiene tip: If you ever have ~50% 502s on a route, check for mixed Endpoints + EndpointSlice:
kubectl get endpoints -n traefik synology-dsm # Delete if present
Manually created Endpoints spawn a mirrored EndpointSlice via endpointslicemirroring-controller. Delete the Endpoints object and rely on Git-managed EndpointSlice only.
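For reference, a Git-managed EndpointSlice for an external host can be sketched as a selector-less Service plus a slice that points at the NAS. The IP below is a documentation-range placeholder and the port is an assumption:

```yaml
# Service without a selector: Kubernetes creates no endpoints for it on its own.
apiVersion: v1
kind: Service
metadata:
  name: synology-dsm
  namespace: traefik
spec:
  ports:
    - name: https
      port: 5001
      targetPort: 5001
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: synology-dsm-1
  namespace: traefik
  labels:
    kubernetes.io/service-name: synology-dsm   # ties the slice to the Service above
addressType: IPv4
ports:
  - name: https            # must match the Service port name
    protocol: TCP
    port: 5001
endpoints:
  - addresses:
      - 192.0.2.10         # placeholder NAS IP
```

With only this slice present there is a single source of truth for the backend IP, so there is nothing for a mirrored slice to disagree with.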
The VoidAuth Hairpin (CoreDNS Split-Horizon)
Initial VoidAuth setup had 700-900ms latency on unauthenticated requests. Root cause: APP_URL=https://auth.example.com caused internal requests to hairpin NAT — out to the public IP and back.
Fix: CoreDNS split-horizon to resolve auth.example.com to the Traefik ClusterIP internally:
# infrastructure/configs/coredns.yaml
data:
  Corefile: |
    .:53 {
        # ... other config ...
        template IN A auth.example.com {
            answer "{{ .Name }} 30 IN A 10.101.208.179" # Traefik ClusterIP
        }
        forward . /etc/resolv.conf
        # ...
    }
Result: p99 latency reduced from 1.2s → ~300ms (3-4x faster).
What Actually Worked
SOPS for Secrets
SOPS with Age keys is dead simple. One gotcha: there's no sops --delete command. The workaround:
# Decrypt
SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt sops -d file.sops.yaml > /tmp/plain.yaml
# Edit
vim /tmp/plain.yaml
# Re-encrypt (filename must match .sops.yaml path regex)
cp /tmp/plain.yaml /tmp/plain.yaml.sops.yaml
SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt sops --config .sops.yaml -e /tmp/plain.yaml.sops.yaml > file.sops.yaml
Annoying but reliable. Documented it once, never had to figure it out again.
Homepage Widgets & Secret Management
Once you understand the rules, Homepage is great:
- Secret keys must be prefixed HOMEPAGE_VAR_ for template substitution
- Use internal service URLs (http://service.namespace.svc.cluster.local)
- subPath mounts don't hot-reload — Reloader handles this automatically
Gotcha: Complete secret updates — When updating Homepage secrets, don't patch individual fields. Delete and recreate the secret to ensure all credentials are present:
kubectl delete secret -n homepage homepage-secrets
kubectl apply -f new-secret.yaml # With ALL credentials from SOPS
kubectl rollout restart -n homepage deployment/homepage
This prevents the "missing credentials" issue where widgets show 401 errors because only some secrets were updated.
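To make the prefix rule concrete, here is a small sketch of a secret key and the widget that consumes it. It assumes the secret is exposed to the Homepage pod as environment variables (for example via envFrom); the ConfigMap name, external URL, and Immich service name and port are hypothetical.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: homepage-secrets
  namespace: homepage
stringData:
  HOMEPAGE_VAR_IMMICH_KEY: "replace-me"   # placeholder; the real value stays in SOPS
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: homepage-config                   # hypothetical name
  namespace: homepage
data:
  services.yaml: |
    - Media:
        - Immich:
            href: https://photos.example.com
            widget:
              type: immich
              url: http://immich-server.immich.svc.cluster.local:2283
              key: "{{HOMEPAGE_VAR_IMMICH_KEY}}"
```

Homepage substitutes the placeholder at load time, so a key that is missing from the secret shows up as a failing widget rather than a startup error.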
Widgets for Immich, Grafana, Traefik, CrowdSec, Plex, and Synology all working.
Flux Reconciliation
The workflow is muscle memory now:
# Edit the repo
# Commit and push
git add -A && git commit -m "feat: add thing" && git push
# Reconcile
flux reconcile kustomization apps --with-source
# Verify
kubectl get helmrelease -A
Flux Web UI
Flux gets a web UI through the flux-operator, which shows the status of all GitOps resources at a glance. Mine is accessible at flux.example.com and provides:
- Reconciliation status — Visual indicators for each Kustomization and HelmRelease
- Resource details — Click into any object to see its spec and status
- Error visibility — Quickly spot what's failing and why
- History & events — See recent reconciliation attempts and outcomes
The enabling change: Add these values to your flux-instance HelmRelease:
spec:
  values:
    instance:
      web:
        enabled: true
        domain: flux.example.com
This deploys the flux-operator UI service in the flux-system namespace. The service listens on the named port http-web (typically 8080), which a Traefik IngressRoute then publishes externally:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: flux-ui
  namespace: flux-system
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: "Host(`flux.example.com`)"
      services:
        - name: flux-operator
          port: http-web
The UI is particularly helpful when debugging stuck reconciliations or understanding why a HelmRelease isn't deploying. It complements the CLI workflow for day-to-day operations.
Note: In the anonymized config, this is exposed via Traefik IngressRoute to flux.example.com with CrowdSec and VoidAuth protection.
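As a sketch of what that protection looks like on the route, middlewares are attached per rule. The middleware names and namespace below are hypothetical stand-ins for whatever the CrowdSec bouncer and VoidAuth forward-auth middlewares are called in your setup:

```yaml
# Fragment of the flux-ui IngressRoute spec; middleware names are hypothetical.
routes:
  - kind: Rule
    match: "Host(`flux.example.com`)"
    middlewares:
      - name: crowdsec-bouncer
        namespace: traefik
      - name: voidauth-forward-auth
        namespace: traefik
    services:
      - name: flux-operator
        port: http-web
```

Referencing middlewares from another namespace only works if Traefik's Kubernetes CRD provider has allowCrossNamespace enabled.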
Custom Grafana Dashboards
Four custom dashboards were built to make monitoring actionable:
- Homelab Overview — Single-pane view of cluster health: CPU/memory gauges for control plane and worker nodes, disk usage, and key service status
- K8s Cluster Health — Deep dive into Kubernetes resource health, pod counts, and namespace-level metrics
- Traefik Site Health — Request rates, response times, error rates, and service-level metrics for all ingress traffic
- Immich Deep Dive — Photo library metrics: upload rates, storage usage, transcoding performance, and user activity
Each dashboard is deployed as a ConfigMap with grafana_dashboard: "1" label, auto-discovered by Grafana. They're maintained in infrastructure/configs/grafana-dashboards.yaml and traefik-dashboard.yaml.
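A stripped-down sketch of one such ConfigMap, assuming the Grafana sidecar watches for the grafana_dashboard label. The name, namespace, and the near-empty dashboard JSON are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: homelab-overview-dashboard   # hypothetical name
  namespace: monitoring              # assumed namespace
  labels:
    grafana_dashboard: "1"           # the label the sidecar discovers
data:
  homelab-overview.json: |
    {
      "title": "Homelab Overview",
      "uid": "homelab-overview",
      "panels": []
    }
```

In practice the JSON is an export from the Grafana UI, pasted in under the data key.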
Why custom dashboards? Standard Kubernetes dashboards show raw metrics; custom dashboards show your services in context. When Immich is slow, you can immediately see if it's storage, transcoding, or database queries.
More details: See the flux-homelab-skill Grafana dashboards reference for deployment patterns and maintenance.
Documentation That Compounds
The most valuable output of this project isn't the cluster; it's the docs/solutions/ directory in my infrastructure repository.
Every time I solved a non-trivial problem, I documented it: the symptoms, what didn't work, the fix, and why it works. This is the ce:compound pattern — each solution documented makes the next one faster.
The flux-homelab skill (available at github.com/marr/flux-homelab-skill) encapsulates the patterns from this homelab — Flux reconciliation, SOPS secrets, Homepage widgets, and service-specific gotchas. It's like a runbook that travels with the agent.
When I ask "why is Immich down," it knows to check storage class first, then node scheduling, then Flux kustomization health. It remembers that node names changed, that OCI chart sources have two valid patterns, and that CrowdSec blocks wget user agents.
Skill contents include:
- Cluster hygiene — Detecting mixed Endpoints + EndpointSlice issues, orphaned services, hardcoded LAN IPs
- Flux reconciliation — Proper sequence for GitRepository → Kustomization → HelmRelease
- HelmRelease patterns — upgrade.force + serverSideApply interactions, stuck kustomization recovery
- Secret management — SOPS file structure, secret synchronization, validation workflows
The ce:compound pattern (referenced from Compound Engineering) is a structured approach to documentation: researching problems, assembling solutions, writing them down, and validating that they work. It applies to any project, not just homelabs.
Getting Started with Flux Homelab Experimentation
If you want to try this yourself:
1. Pick Your Platform
Talos is great if you want an immutable, API-driven cluster. k3s is great if you want something more familiar. Either way, you need:
- A machine (old laptop, NUC, Mac Mini, VM)
- A git repo for your Flux configuration
- Time to break things
2. Bootstrap Flux
Follow the official Flux getting started guide to install Flux on your cluster, then bootstrap your GitOps repository:
flux bootstrap github \
--owner=your-username \
--repository=homelab-infra \
--branch=main \
--path=clusters/my-cluster
This creates the bootstrap kustomization that points to your infrastructure/ and apps/ directories.
Flux v2 Gotchas:
- Namespace changes — In v2, Flux components run in flux-system by default. Some older guides reference flux namespace.
- Kustomize controller — The controller now handles kustomizations natively; you don't need separate kustomize CLI workflows.
- GitHub token scope — The bootstrap command needs a token with repo and workflow permissions. Personal Access Tokens work, but GitHub Apps are recommended for production.
- Path separators — Use forward slashes even on Windows; Flux paths are Git repository paths, not filesystem paths.
- OCI chart sources — Flux 2.8+ uses source.toolkit.fluxcd.io/v1 for GitRepository/HelmRepository/OCIRepository, but helm.toolkit.fluxcd.io/v2 for HelmRelease. OCI-backed charts have two valid patterns: HelmRepository with type: oci, or OCIRepository directly. If one pattern fails, try the other.
- HelmRelease serverSideApply — The install.serverSideApply field is a boolean, but upgrade.serverSideApply is an enum string ("disabled"). Mixing them incorrectly causes validation errors. Use install.serverSideApply: false together with upgrade.serverSideApply: disabled and upgrade.force: true.
- Stuck kustomizations — If a HelmRelease health check fails, downstream kustomizations with spec.dependsOn will wait indefinitely. Check kubectl describe kustomization <name> -n flux-system for HealthCheckFailed. High helm-controller CPU often indicates a failing Helm release in a retry loop.
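The two OCI patterns from the list above can be sketched side by side. This uses the public podinfo chart purely as a stand-in; the resource names, intervals, and version range are illustrative, and you would use one pattern or the other for a given chart.

```yaml
# Pattern A: HelmRepository with type: oci, chart resolved by the HelmRelease.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: podinfo-charts
  namespace: flux-system
spec:
  type: oci
  interval: 1h
  url: oci://ghcr.io/stefanprodan/charts
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: podinfo-a
  namespace: flux-system
spec:
  interval: 30m
  chart:
    spec:
      chart: podinfo
      version: "6.x"
      sourceRef:
        kind: HelmRepository
        name: podinfo-charts
---
# Pattern B: OCIRepository pointing at the chart artifact, referenced via chartRef.
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: podinfo-chart
  namespace: flux-system
spec:
  interval: 1h
  url: oci://ghcr.io/stefanprodan/charts/podinfo
  layerSelector:
    mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
    operation: copy
  ref:
    semver: "6.x"
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: podinfo-b
  namespace: flux-system
spec:
  interval: 30m
  chartRef:
    kind: OCIRepository
    name: podinfo-chart
```

Note the difference in URLs: pattern A points at the registry path that contains charts, pattern B at the chart artifact itself.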
3. Structure Your Repo
homelab-infra/
├── clusters/            # Flux bootstrap
├── infrastructure/
│   ├── controllers/     # Traefik, Prometheus, etc.
│   └── configs/         # Routes, monitoring, alerts
└── apps/
    └── my-app/          # namespace, kustomization, helmrelease
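One way to wire that layout into Flux is a chain of Kustomizations under clusters/, each waiting for the previous layer. This is a sketch under assumptions: the Kustomization names, intervals, and timeouts are illustrative, and the GitRepository named flux-system is the one flux bootstrap creates.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controllers
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/controllers
  prune: true
  wait: true                 # block dependents until everything here is healthy
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-configs
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/configs
  prune: true
  dependsOn:
    - name: infra-controllers   # CRDs and controllers must exist first
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: true
  dependsOn:
    - name: infra-configs
  sourceRef:
    kind: GitRepository
    name: flux-system
```

The dependsOn chain is also where the "stuck kustomization" behaviour comes from: a failing health check in an earlier layer holds back everything after it.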
4. Add SOPS Early
Set up age encryption before you have any secrets:
age-keygen -o ~/.config/sops/age/keys.txt
# Add the public key to .sops.yaml
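A minimal .sops.yaml sketch, assuming secrets are stored in files ending in .sops.yaml and only the Secret payload fields should be encrypted. The recipient value is a placeholder for the public key that age-keygen prints:

```yaml
# .sops.yaml at the repository root
creation_rules:
  - path_regex: .*\.sops\.yaml$             # only files named *.sops.yaml match this rule
    encrypted_regex: ^(data|stringData)$    # leave metadata readable, encrypt the payload
    age: <your-age-public-key>              # placeholder: paste the public key here
```

Keeping metadata in plaintext means diffs and reviews still show which secret changed, just not its values.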
5. Build Incrementally
Don't try to deploy everything at once. Start with:
- Traefik (ingress)
- One app (immich, nextcloud, whatever)
- Homepage (dashboard)
- Monitoring (Prometheus + Grafana)
- Security (CrowdSec + auth)
Each step teaches you something. Document each step.
6. Document as You Go
Create a docs/solutions/ directory in your repo. When you solve something non-trivial, write it down. Use YAML frontmatter so future-you (or your AI assistant) can search by module, tags, and problem type.
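As a sketch, the frontmatter for one solution file might look like this. The field names follow the module/tags/problem-type idea above, but the exact schema and the file name are hypothetical:

```yaml
---
# Hypothetical frontmatter for docs/solutions/loki-nfs-crashloop.md
title: Loki crash loop on NFS-backed storage
module: loki
problem_type: storage
tags: [nfs, local-path, crashloop]
symptoms:
  - "directory not empty errors in a persistent restart loop"
---
```

The body of the file then covers the symptoms, what didn't work, the fix, and why it works.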
The compounding effect is real. Week 1 I spent hours debugging Loki crash loops. Week 12, I checked the documented solution and fixed a similar issue in minutes.
What's Next
The cluster keeps growing. Next up:
- Gitea Actions for CI (self-hosted runners in-cluster)
- Synology Drive integration (finally got the Traefik routing right)
- Better alerting rules in Alertmanager
- Maybe a second node for actual HA
The homelab isn't a project with an end state. It's an ongoing experiment where each iteration teaches something new. The key is capturing those lessons so they compound.
For more on AI-assisted development, see my post on Building a Knowledge Graph with Obsidian and MCP.