Building a Homelab with Flux GitOps: Lessons from Three Months

How I built a home Kubernetes cluster with Flux, SOPS secrets, and a dashboard that actually works — plus the skills that helped document everything along the way.

I've been running a homelab for a few years, but this March I decided to do it properly. Talos Linux on a Mac Mini, Flux for GitOps, SOPS for secrets, and a homepage dashboard that shows everything at a glance.

Three months in, I have a working cluster with Immich, Plex, Grafana, CrowdSec, Traefik, and a handful of other services. More importantly, I have documentation that actually compounds — each problem I solved made the next one easier.

Here's what I built, what broke, and how I'd do it again.

The Stack

  • Talos Linux — Immutable, API-driven Kubernetes. No SSH, no shell, no apt-get. Sounds limiting until you realize you never need to debug "what changed on the node."
  • Flux — GitOps controller. Everything lives in a git repo, Flux reconciles it to the cluster. Push a change, it lands. Revert a change, it's gone.
  • SOPS + Age — Encrypted secrets in git. No plaintext, no sealed-secrets complexity. Decrypt with your age key, apply, done.
  • Homepage — Dashboard with widgets for every service. Internal URLs, API tokens, done.
  • Traefik — Ingress with ACME DNS challenges for *.example.com certs.
  • CrowdSec — Bouncer + LAPI for intrusion detection at the edge.

The Journey (aka Things That Broke)

Storage Classes Matter More Than You Think

Early mistake: I used NFS (nfs-synology) for everything. Loki crash-looped with "directory not empty" errors, and PostgreSQL hit WAL corruption.

The problem: NFS doesn't reliably provide the POSIX fsync and file-locking semantics that databases depend on.

Storage Class   Use For                         Don't Use For
local-path      Databases, WAL, TSDB
nfs-synology    Photo libraries, bulk storage   Anything with strict fsync

Lesson: local-path for databases, NFS for everything else. One decision fixed a week of crash loops.
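One way to audit this rule is to list PVCs and flag database-like claims that still sit on NFS. The helper below is a hypothetical sketch, not something from my repo: the function name and the name patterns (loki, postgres, prometheus) are assumptions for illustration. It reads "namespace name storageclass" lines on stdin so it can be fed from kubectl.

```shell
#!/bin/sh
# Hypothetical audit helper: flag PVCs whose names look like databases/TSDBs
# but that are provisioned from the NFS storage class.
# Input (stdin): one "namespace name storageclass" triple per line.
# The name patterns are illustrative assumptions; adjust them to your apps.
flag_nfs_databases() {
  awk '$3 == "nfs-synology" && tolower($2) ~ /(loki|postgres|prometheus)/ {
         print $1 "/" $2 " should move to local-path"
       }'
}

# Example against a live cluster (needs kubectl access):
#   kubectl get pvc -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,SC:.spec.storageClassName \
#     | flag_nfs_databases
```

Matching on PVC names is only a heuristic; a claim with an unusual name will slip through, so treat an empty result as a hint rather than proof.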

Node Scheduling Has Layers

The cluster went through four scheduling strategies:

  1. Worker node only — simple, single point of failure
  2. All nodes with control-plane tolerations — apps ran on control plane, messy
  3. Worker-only apps — clean isolation
  4. Monitoring on control plane — freed worker resources when I ran out

Current rule: Apps on workers. Monitoring on control plane. Documented exceptions in HelmRelease comments.

Helm Upgrades: The SSA Trap

Flux HelmRelease upgrades failed with:

invalid operation: cannot use force conflicts and force replace together

Turns out install.serverSideApply is a boolean (false), but upgrade.serverSideApply is an enum string ("disabled"). Same field name, different types.

spec:
  install:
    serverSideApply: false   # boolean
  upgrade:
    serverSideApply: disabled  # enum string
    force: true

The OpenTelemetry operator upgrade (0.109 → 0.110) removed kube-rbac-proxy, changing Service port layouts. Three-way merge created duplicate metrics ports. Fixed by pinning the version and using force: true + serverSideApply: disabled.

Lesson: Check chart changelogs for breaking infrastructure changes. Stay one version behind if unsure.
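Because the two serverSideApply fields share a name but not a type, a small pre-commit check can catch a mix-up before Flux does. This is a minimal sketch under stated assumptions: lint_ssa_types is a hypothetical name, and the check is a line-based heuristic rather than a real YAML parser. It only encodes the rule described above (boolean under install, non-boolean string under upgrade).

```shell
#!/bin/sh
# Heuristic, line-based check (not a YAML parser) for the type mismatch
# described above. It remembers the most recent bare "key:" mapping header
# and inspects serverSideApply values that appear while that header is
# install or upgrade. Nested blocks in between can hide a later field.
lint_ssa_types() {
  awk '
    /^[ \t]*[A-Za-z]+:[ \t]*(#.*)?$/ {
      section = $1; sub(/:$/, "", section); next
    }
    $1 == "serverSideApply:" {
      value = $2; gsub(/"/, "", value)
      is_bool = (value == "true" || value == "false")
      if (section == "install" && !is_bool) {
        print "install.serverSideApply should be a boolean, got: " value; bad = 1
      }
      if (section == "upgrade" && is_bool) {
        print "upgrade.serverSideApply should be an enum string, got: " value; bad = 1
      }
    }
    END { exit bad + 0 }
  ' "$@"
}

# Example: lint_ssa_types apps/my-app/helmrelease.yaml
```

The function exits non-zero when it finds a mismatch, so it can gate a commit hook or CI step. The file path in the example comment is a placeholder.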

Synology + Traefik = Special Handling

Getting Traefik to proxy Synology DSM/Drive required three specific fixes:

  1. No CrowdSec on Synology routes: DSM's WebSocket-based UI and asset loading break with request inspection middleware
  2. Disable HTTP/2 upstream: some DSM builds have HTTP/2 bugs causing connection failures
  3. EndpointSlice-only: mixing v1 Endpoints with EndpointSlice causes ~50% 502s when IPs differ

The ServersTransport below covers fix 2 and also skips verification of DSM's certificate:

apiVersion: traefik.io/v1alpha1
kind: ServersTransport
metadata:
  name: synology-insecure
  namespace: traefik
spec:
  insecureSkipVerify: true
  disableHTTP2: true

Cluster hygiene tip: If you ever have ~50% 502s on a route, check for mixed Endpoints + EndpointSlice:

kubectl get endpoints -n traefik synology-dsm  # Delete if present

Manually created Endpoints spawn a mirrored EndpointSlice via endpointslicemirroring-controller. Delete the Endpoints object and rely on Git-managed EndpointSlice only.
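To confirm whether a mirrored slice exists, you can list the EndpointSlices for the Service with the managed-by label as an extra column and filter on the mirroring controller. The sketch below assumes the mirrored slice carries the label value endpointslicemirroring-controller.k8s.io; find_mirrored_slices is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of EndpointSlices created by the mirroring controller.
# Input (stdin): "kubectl get endpointslices ... --no-headers -L <label>" rows,
# where the label value (if set) is the last column.
find_mirrored_slices() {
  awk '$NF == "endpointslicemirroring-controller.k8s.io" { print $1 }'
}

# Example against a live cluster (service name taken from the text above):
#   kubectl get endpointslices -n traefik --no-headers \
#     -l kubernetes.io/service-name=synology-dsm \
#     -L endpointslice.kubernetes.io/managed-by \
#     | find_mirrored_slices
```

Any name this prints is a slice that exists only because a v1 Endpoints object is still around. A Git-managed slice without that label value is ignored.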

The VoidAuth Hairpin (CoreDNS Split-Horizon)

Initial VoidAuth setup had 700-900ms latency on unauthenticated requests. Root cause: APP_URL=https://auth.example.com caused internal requests to hairpin NAT — out to the public IP and back.

Fix: CoreDNS split-horizon to resolve auth.example.com to the Traefik ClusterIP internally:

# infrastructure/configs/coredns.yaml
data:
  Corefile: |
    .:53 {
        # ... other config ...
        template IN A auth.example.com {
            answer "{{ .Name }} 30 IN A 10.101.208.179"  # Traefik ClusterIP
        }
        forward . /etc/resolv.conf
        # ...
    }

Result: p99 latency dropped from 1.2s to ~300ms (about 4x faster).

What Actually Worked

SOPS for Secrets

SOPS with Age keys is dead simple. One gotcha: there's no sops --delete command. The workaround:

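# Point sops at the age private key once for this shell session
export SOPS_AGE_KEY_FILE="$HOME/.config/sops/age/keys.txt"
# Decrypt to a scratch file (this is plaintext on disk, keep it out of the repo)
sops -d file.sops.yaml > /tmp/plain.yaml
# Edit: remove or change keys as needed
vim /tmp/plain.yaml
# Re-encrypt (filename must match the path_regex in .sops.yaml)
cp /tmp/plain.yaml /tmp/plain.yaml.sops.yaml
sops --config .sops.yaml -e /tmp/plain.yaml.sops.yaml > file.sops.yaml
# Remove the plaintext copies
rm -f /tmp/plain.yaml /tmp/plain.yaml.sops.yaml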

Annoying but reliable. Documented it once, never had to figure it out again.

Homepage Widgets & Secret Management

Once you understand the rules, Homepage is great:

  • Secret keys must be prefixed HOMEPAGE_VAR_ for template substitution
  • Use internal service URLs (http://service.namespace.svc.cluster.local)
  • subPath mounts don't hot-reload — Reloader handles this automatically
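The prefix rule is easy to get wrong when adding a new widget, so it can help to list any secret keys that lack it. A minimal sketch, assuming the key names arrive one per line on stdin; unprefixed_keys is a hypothetical helper, and the jq pipeline in the comment assumes jq is installed.

```shell
#!/bin/sh
# Print secret keys that Homepage will NOT substitute because they are
# missing the HOMEPAGE_VAR_ prefix. Input (stdin): one key name per line.
unprefixed_keys() {
  # grep exits 1 when nothing is printed; treat "all keys are fine" as success
  grep -v '^HOMEPAGE_VAR_' || true
}

# Example against a live cluster (secret name taken from this post):
#   kubectl get secret -n homepage homepage-secrets -o json \
#     | jq -r '.data | keys[]' | unprefixed_keys
```

Empty output means every key follows the convention.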

Gotcha: Complete secret updates — When updating Homepage secrets, don't patch individual fields. Delete and recreate the secret to ensure all credentials are present:

kubectl delete secret -n homepage homepage-secrets
kubectl apply -f new-secret.yaml  # With ALL credentials from SOPS
kubectl rollout restart -n homepage deployment/homepage

This prevents the "missing credentials" issue where widgets show 401 errors because only some secrets were updated.
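Before deleting the old secret, it is worth checking that the replacement really contains every credential the widgets expect. The sketch below compares two plain-text key lists; missing_secret_keys and both file names are hypothetical, and how you produce the lists (from the decrypted SOPS file, for example) is up to you.

```shell
#!/bin/sh
# Print every expected key that is absent from the new secret.
# $1: file with one expected key per line
# $2: file with the keys actually present in the new secret, one per line
missing_secret_keys() {
  # -F fixed strings, -x whole-line match, -v keep lines with no match
  grep -Fxv -f "$2" "$1" || true
}

# Example:
#   missing_secret_keys expected-keys.txt new-secret-keys.txt
```

If this prints anything, recreating the secret would reproduce the 401 symptom for the widgets that depend on those keys.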

Widgets for Immich, Grafana, Traefik, CrowdSec, Plex, and Synology all working.

Flux Reconciliation

The workflow is muscle memory now:

# Edit the repo
# Commit and push
git add -A && git commit -m "feat: add thing" && git push

# Reconcile
flux reconcile kustomization apps --with-source

# Verify
kubectl get helmrelease -A
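With more than a handful of releases, the verify step is easier to read if it only shows what is not ready. A small sketch, assuming the default column order of kubectl get helmrelease -A (NAMESPACE, NAME, AGE, READY, STATUS); not_ready_helmreleases is a hypothetical helper name.

```shell
#!/bin/sh
# Print "namespace/name ready=<value>" for every HelmRelease whose READY
# column is not "True". Input (stdin): rows from
#   kubectl get helmrelease -A --no-headers
# Assumes the column order NAMESPACE NAME AGE READY STATUS.
not_ready_helmreleases() {
  awk '$4 != "True" { print $1 "/" $2 " ready=" $4 }'
}

# Example against a live cluster:
#   kubectl get helmrelease -A --no-headers | not_ready_helmreleases
```

Empty output means everything reconciled; anything listed is a candidate for kubectl describe.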

Flux Web UI

The flux-operator provides a web UI for Flux that shows the status of all GitOps resources at a glance. In my setup it's accessible at flux.example.com and provides:

  • Reconciliation status — Visual indicators for each Kustomization and HelmRelease
  • Resource details — Click into any object to see its spec and status
  • Error visibility — Quickly spot what's failing and why
  • History & events — See recent reconciliation attempts and outcomes

The enabling change: Add these values to your flux-instance HelmRelease:

spec:
  values:
    instance:
      web:
        enabled: true
        domain: flux.example.com

This deploys the flux-operator UI service in the flux-system namespace. The service exposes a port named http-web (typically 8080), which a Traefik IngressRoute then publishes externally:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: flux-ui
  namespace: flux-system
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: "Host(`flux.example.com`)"
      services:
        - name: flux-operator
          port: http-web

The UI is particularly helpful when debugging stuck reconciliations or understanding why a HelmRelease isn't deploying. It complements the CLI workflow for day-to-day operations.

Note: In the anonymized config, the flux.example.com route is also protected by CrowdSec and VoidAuth.

Custom Grafana Dashboards

Four custom dashboards were built to make monitoring actionable:

  • Homelab Overview — Single-pane view of cluster health: CPU/memory gauges for control plane and worker nodes, disk usage, and key service status
  • K8s Cluster Health — Deep dive into Kubernetes resource health, pod counts, and namespace-level metrics
  • Traefik Site Health — Request rates, response times, error rates, and service-level metrics for all ingress traffic
  • Immich Deep Dive — Photo library metrics: upload rates, storage usage, transcoding performance, and user activity

Each dashboard is deployed as a ConfigMap labeled grafana_dashboard: "1" so that Grafana discovers it automatically. They're maintained in infrastructure/configs/grafana-dashboards.yaml and traefik-dashboard.yaml.

Why custom dashboards? Standard Kubernetes dashboards show raw metrics; custom dashboards show your services in context. When Immich is slow, you can immediately see if it's storage, transcoding, or database queries.

More details: See the flux-homelab-skill Grafana dashboards reference for deployment patterns and maintenance.

Documentation That Compounds

The most valuable output of this project isn't the cluster. It's the docs/solutions/ directory in my infrastructure repository.

Every time I solved a non-trivial problem, I documented it: the symptoms, what didn't work, the fix, and why it works. This is the ce:compound pattern — each solution documented makes the next one faster.

The flux-homelab skill (available at github.com/marr/flux-homelab-skill) encapsulates the patterns from this homelab: Flux reconciliation, SOPS secrets, Homepage widgets, and service-specific gotchas. It's like a runbook that travels with the AI agent.

When I ask "why is Immich down," it knows to check storage class first, then node scheduling, then Flux kustomization health. It remembers that node names changed, that OCI chart sources have two valid patterns, and that CrowdSec blocks wget user agents.

Skill contents include:

  • Cluster hygiene: detecting mixed Endpoints + EndpointSlice issues, orphaned services, hardcoded LAN IPs
  • Flux reconciliation: proper sequence for GitRepository → Kustomization → HelmRelease
  • HelmRelease patterns: upgrade.force + serverSideApply interactions, stuck kustomization recovery
  • Secret management: SOPS file structure, secret synchronization, validation workflows

The ce:compound pattern (referenced from Compound Engineering) is a structured approach to documentation: research the problem, assemble the solution, write it down, and validate that it works. It applies to any project, not just homelabs.

Getting Started with Flux Homelab Experimentation

If you want to try this yourself:

1. Pick Your Platform

Talos is great if you want an immutable, API-driven cluster. k3s is great if you want something more familiar. Either way, you need:

  • A machine (old laptop, NUC, Mac Mini, VM)
  • A git repo for your Flux configuration
  • Time to break things

2. Bootstrap Flux

Follow the official Flux getting started guide to install Flux on your cluster, then bootstrap your GitOps repository:

flux bootstrap github \
  --owner=your-username \
  --repository=homelab-infra \
  --branch=main \
  --path=clusters/my-cluster

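This commits the Flux manifests under clusters/my-cluster and creates the flux-system Kustomization that syncs that path. The Kustomization objects you add there are what point Flux at your infrastructure/ and apps/ directories.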

Flux v2 Gotchas:

  • Namespace changes — In v2, Flux components run in flux-system by default. Some older guides reference flux namespace.
  • Kustomize controller — The controller now handles kustomizations natively; you don't need separate kustomize CLI workflows.
  • GitHub token scope — The bootstrap command needs a token with repo and workflow permissions. Personal Access Tokens work, but GitHub Apps are recommended for production.
  • Path separators — Use forward slashes even on Windows; Flux paths are Git repository paths, not filesystem paths.
  • OCI chart sources — Flux 2.8+ uses source.toolkit.fluxcd.io/v1 for GitRepository/HelmRepository/OCIRepository, but helm.toolkit.fluxcd.io/v2 for HelmRelease. OCI-backed charts have two valid patterns: HelmRepository with type: oci, or OCIRepository directly. If one pattern fails, try the other.
  • HelmRelease serverSideApply — The install.serverSideApply field is a boolean, but upgrade.serverSideApply is an enum string ("disabled"). Mixing them incorrectly causes validation errors. Use:
    install:
      serverSideApply: false
    upgrade:
      serverSideApply: disabled
      force: true
    
  • Stuck kustomizations — If a HelmRelease health check fails, downstream kustomizations with spec.dependsOn will wait indefinitely. Check kubectl describe kustomization <name> -n flux-system for HealthCheckFailed. High helm-controller CPU often indicates a failing Helm release in a retry loop.

3. Structure Your Repo

homelab-infra/
├── clusters/           # Flux bootstrap
├── infrastructure/
│   ├── controllers/    # Traefik, Prometheus, etc.
│   └── configs/        # Routes, monitoring, alerts
└── apps/
    └── my-app/         # namespace, kustomization, helmrelease

4. Add SOPS Early

Set up age encryption before you have any secrets:

mkdir -p ~/.config/sops/age   # age-keygen does not create the directory
age-keygen -o ~/.config/sops/age/keys.txt
# Add the public key to .sops.yaml
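For reference, a .sops.yaml for this layout can be very small. The sketch below writes one from a shell function; write_sops_config is a hypothetical helper, and the two regexes are assumptions that match the file.sops.yaml naming used in this post and encrypt only the data and stringData fields of Kubernetes Secrets.

```shell
#!/bin/sh
# Write a minimal .sops.yaml with a single creation rule.
# $1: age public key (the "age1..." recipient)
# $2: output path (defaults to .sops.yaml in the current directory)
write_sops_config() {
  cat > "${2:-.sops.yaml}" <<EOF
creation_rules:
  - path_regex: .*\.sops\.yaml\$
    encrypted_regex: ^(data|stringData)\$
    age: $1
EOF
}

# Example: derive the public key from the private key file, then write the config
#   write_sops_config "$(age-keygen -y ~/.config/sops/age/keys.txt)"
```

Restricting encryption to data and stringData keeps the rest of each manifest readable in diffs.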

5. Build Incrementally

Don't try to deploy everything at once. Start with:

  1. Traefik (ingress)
  2. One app (Immich, Nextcloud, whatever)
  3. Homepage (dashboard)
  4. Monitoring (Prometheus + Grafana)
  5. Security (CrowdSec + auth)

Each step teaches you something. Document each step.

6. Document as You Go

Create a docs/solutions/ directory in your repo. When you solve something non-trivial, write it down. Use YAML frontmatter so future-you (or your AI assistant) can search by module, tags, and problem type.
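Frontmatter only pays off if it is easy to query. Here is a minimal sketch of a tag search, assuming each solution is a Markdown file that starts with a "---" delimited block containing an inline tags: line such as tags: [loki, nfs, storage]. The field layout and the find_solutions_by_tag name are assumptions for illustration, not the format my repo necessarily uses.

```shell
#!/bin/sh
# List solution docs whose frontmatter "tags:" line mentions a tag.
# $1: tag to look for (substring match)
# $2: docs directory (defaults to docs/solutions)
find_solutions_by_tag() {
  for f in "${2:-docs/solutions}"/*.md; do
    [ -f "$f" ] || continue
    if awk -v tag="$1" '
         NR == 1 { if ($0 != "---") exit; next }   # no frontmatter: never matches
         $0 == "---" { exit }                      # closing delimiter: stop reading
         /^tags:/ && index($0, tag) > 0 { found = 1 }
         END { exit (found ? 0 : 1) }
       ' "$f"; then
      printf '%s\n' "$f"
    fi
  done
  return 0
}

# Example: find_solutions_by_tag nfs
```

Only the frontmatter is scanned, so a tag mentioned in the body of a document does not produce a match. Because the match is a substring test, searching for nfs would also match a tag like nfs-synology.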

The compounding effect is real. In week 1, I spent hours debugging Loki crash loops. In week 12, I checked the documented solution and fixed a similar issue in minutes.

What's Next

The cluster keeps growing. Next up:

  • Gitea Actions for CI (self-hosted runners in-cluster)
  • Synology Drive integration (finally got the Traefik routing right)
  • Better alerting rules in Alertmanager
  • Maybe a second node for actual HA

The homelab isn't a project with an end state. It's an ongoing experiment where each iteration teaches something new. The key is capturing those lessons so they compound.


For more on AI-assisted development, see my post on Building a Knowledge Graph with Obsidian and MCP.

© David Marr