Projects

Real infrastructure, real automation, real results. Each project solves a specific problem with measurable outcomes.

Flagship Case Studies

Deep dives into the engineering behind the most complex systems in the lab.

Flagship · Live Infrastructure

4-Node AI Inference Cluster

Problem: Cloud AI APIs are expensive, rate-limited, and create vendor lock-in for serious inference workloads.

Approach: Built a local 4-node cluster with GPU inference on an RTX 4090, centralized storage, and AI agent orchestration across all nodes.

Results: Running 3 LLMs locally (16B, 32B, and 70B parameters), zero cloud inference cost, sub-second response times, and a 99.9% gateway availability target.

What I Built
  • Jasper (i9-13900K, 64GB, RTX 4090) — GPU inference gateway running Ollama
  • Nova (N305, 32GB) — Proxmox + TrueNAS + Ansible controller
  • Mira (i7-2600K, 16GB) + Orin (Dual Xeon E5-2667v4, 16GB) — compute fleet
  • OpenClaw AI agent orchestration with multi-model routing
  • 52 cores / 80 threads / 128 GB RAM / 24 GB VRAM total cluster resources
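The multi-model routing above can be sketched as a cheapest-fit policy: send each request to the smallest model whose context window and quality tier cover it. This is a hypothetical sketch; the model names, token limits, and tier numbers are illustrative, not OpenClaw's actual routing table.

```python
# Hypothetical multi-model routing sketch: prefer the cheapest model
# that fits the prompt, escalating to larger models for harder tasks.
# Model names, context limits, and tiers are illustrative only.
MODELS = [
    {"name": "16b", "max_tokens": 8192, "tier": 1},
    {"name": "32b", "max_tokens": 16384, "tier": 2},
    {"name": "70b", "max_tokens": 32768, "tier": 3},
]

def route(prompt_tokens: int, min_tier: int = 1) -> str:
    """Return the smallest model that fits the prompt and quality tier."""
    for m in MODELS:  # MODELS is ordered smallest to largest
        if m["max_tokens"] >= prompt_tokens and m["tier"] >= min_tier:
            return m["name"]
    return MODELS[-1]["name"]  # fall back to the largest model
```

The two-axis check (context fit plus quality tier) is what keeps routine requests off the 70B model, which is the main lever for sub-second latency on shared GPU hardware.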
Python · Proxmox · Ansible · Ollama · CUDA · OpenClaw
Recruiter TL;DR: Built and operate a production-grade AI cluster from bare metal — GPU inference, config management, monitoring, and documentation that rivals what you'd see at a mid-size SaaS company.

SRE Automation Pipeline

Problem: Manual monitoring and ad-hoc incident response don't scale, even for a homelab. Needed the same reliability practices used at Google/Netflix.

Approach: Built a complete SRE pipeline from scratch: SLO tracking, error budgets, burn rate alerting, incident management, and postmortem generation.

Results: 6 SLOs tracked across 5 rolling windows, automated incident creation on burn rate spikes, recruiter-grade postmortems auto-generated on resolution.

What I Built
  • SLO engine: 6 objectives (gateway 99.9%, inference 99.5%, dashboard 99%)
  • Error budget tracking with 5 rolling windows (1h, 6h, 24h, 7d, 28d)
  • Multi-window burn rate alerting (fast/medium/slow thresholds)
  • Incident Commander: auto-trigger, timeline, evidence packs, postmortem markdown
  • Gatekeeper safety gates: deny risky actions when burn rate is high
  • 100+ tests across the full pipeline
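The multi-window burn-rate alerting above follows the pattern popularized by the Google SRE Workbook: a burn rate of 1.0 means the error budget is being consumed exactly at the SLO's allowed pace, and an alert fires only when both a short and a long window exceed a threshold. A minimal sketch, with illustrative thresholds rather than the pipeline's actual values:

```python
# Minimal multi-window burn-rate sketch (Google SRE Workbook style).
# Thresholds and window pairings are illustrative, not the pipeline's
# exact configuration.
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Budget consumption speed: 1.0 = exactly on pace for the SLO."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)

def should_alert(short_burn: float, long_burn: float, threshold: float) -> bool:
    """Fire only when BOTH windows exceed the threshold, which
    suppresses pages for short bursts that have already recovered."""
    return short_burn >= threshold and long_burn >= threshold
```

Requiring both windows is what separates "fast" alerts (high threshold, short windows, pages immediately) from "slow" alerts (low threshold, long windows, catches gradual budget leaks).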
Python · SRE · Bash · Jinja2
Recruiter TL;DR: Implemented Google-style SLOs and incident management for a homelab — the same practices used by SRE teams at scale, applied to real infrastructure.

GitOps Backup & Secret Scanning

Problem: Infrastructure configs scattered across nodes with no audit trail, backup, or secret leak prevention.

Approach: Automated daily state collection from every node, sanitized before commit, with CI-based secret scanning as a hard gate.

Results: 4 nodes auto-committing daily, 11 secret patterns blocked by CI, zero credential leaks since deployment.

What I Built
  • Per-node backup scripts: systemd timers (Linux) + scheduled tasks (Windows)
  • Sanitization pipeline that strips API keys, tokens, emails, and phone numbers before commit
  • GitHub Actions workflow scanning every push for 11 secret patterns
  • Credential policy documentation and incident response runbook
  • Segregated repo structure: each node owns its folder
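The CI secret gate above boils down to regex matching with a hard fail on any hit. A hedged sketch: the real pipeline checks 11 patterns, and the three shown here are common examples rather than the actual list.

```python
import re

# Illustrative secret-scanning sketch. The real CI gate checks 11
# patterns; these three are common examples, not the exact set.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan(text: str) -> list[str]:
    """Return the names of every pattern matched in `text`.
    A non-empty result fails the build (hard gate)."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]
```

Running the scan in CI rather than only in the sanitization step means a leak that slips past the pre-commit stripping still cannot land on the default branch.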
Bash · PowerShell · GitHub Actions · systemd
Recruiter TL;DR: Automated compliance-grade backups with secret scanning — the kind of secrets hygiene you'd expect from a security-conscious engineering team.

Proxmox Virtualization Cluster

Problem: Needed isolated environments for services, testing, and development without buying more hardware.

Approach: 3-node Proxmox cluster with centralized TrueNAS storage, Ansible-managed VM provisioning, and automated health checks.

Results: Running 10+ VMs/containers across heterogeneous hardware (N305, i7-2600K, Dual Xeon), managed via IaC.

What I Built
  • 3-node Proxmox VE cluster (Nova + Mira + Orin)
  • TrueNAS on Nova: NFS/SMB shares for centralized VM storage
  • Ansible playbooks for VM provisioning and configuration
  • Automated health checks every 6 hours
  • Mixed hardware fleet: consumer to enterprise (Dell PowerEdge R630)
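The 6-hourly health checks above reduce to a simple pattern: run a set of named probes and mark the whole run unhealthy if any fail. In this illustrative sketch the probes are stubs; the real checks query the Proxmox API and verify NFS mounts.

```python
from typing import Callable

# Illustrative health-check runner. In the real setup each probe
# queries the Proxmox API or checks an NFS mount; here they are
# stand-in callables returning True (healthy) or False.
def run_health_checks(checks: dict[str, Callable[[], bool]]) -> dict:
    """Run every probe and aggregate: one failure fails the run."""
    results = {name: check() for name, check in checks.items()}
    return {"healthy": all(results.values()), "results": results}
```

Keeping per-probe results alongside the aggregate verdict means an alert can say which plane (storage, API, a specific node) failed, not just that something did.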
Proxmox · Ansible · TrueNAS · NFS
Recruiter TL;DR: Enterprise virtualization on heterogeneous hardware — from mini-PCs to rackmount servers — all managed as code.

Network Security & Segmentation

Problem: A flat network with mixed workloads (AI inference, storage, management) creates unnecessary blast radius for security incidents.

Approach: Deployed OPNsense on dedicated hardware with VLAN segmentation, firewall rules, and Tailscale overlay for secure remote access.

Results: Traffic isolated between management, storage, and inference planes. SSH key-only auth, no password-based access to any system.

What I Built
  • OPNsense firewall on Qotom Q20342G9 dedicated appliance
  • VLAN segmentation: management, storage, inference, IoT
  • UniFi U7 Pro XG access point (WiFi 7)
  • Tailscale mesh VPN for zero-trust remote access
  • 2.5GbE + 10GbE network segments for performance-sensitive traffic
  • SSH key-only authentication across all nodes
OPNsense · VLANs · Tailscale · UniFi
Recruiter TL;DR: Enterprise-grade network security at home — segmentation, zero-trust access, and dedicated firewall hardware. Not a consumer router with port forwarding.

Chaos Testing & Resilience Engineering

Problem: You can't trust your infrastructure until you've intentionally broken it and watched it recover.

Approach: Built a chaos injection framework that kills services, saturates resources, and partitions networks — then validates recovery automatically.

Results: 95% chaos test success rate target, resilience score tracking, regression gates that block deploys if reliability drops.

What I Built
  • Chaos injection framework: service kills, resource saturation, network partitions
  • Resilience score engine: quantified reliability with regression gates
  • Automatic recovery validation after each chaos experiment
  • Integration with SLO pipeline — chaos failures trigger incident management
  • Evidence pack collection for post-chaos analysis
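The automatic recovery validation above can be sketched as inject-then-poll: trigger the fault, then poll a health probe until it passes or a deadline expires. Function names and the result shape here are hypothetical, not the framework's real API.

```python
import time

# Hypothetical chaos-experiment sketch: inject a fault, then poll a
# health probe until recovery or timeout. Names are illustrative.
def run_experiment(inject, probe, timeout_s=60.0, interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Inject a fault, then report whether `probe` recovered in time."""
    inject()
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():  # probe returns True once the service is healthy
            return {"recovered": True}
        sleep(interval_s)
    return {"recovered": False}
```

A failed experiment (no recovery before the deadline) is exactly the event that feeds the SLO pipeline: the chaos run opens an incident and the evidence pack captures what the probe saw.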
Python · Chaos Engineering · Bash · Ansible
Recruiter TL;DR: Netflix-style chaos engineering applied to a homelab. If you want someone who tests failure scenarios before they happen in production, this is the proof.

Interested in working together?

Get in Touch