Projects
Real infrastructure, real automation, real results. Each project solves a specific problem with measurable outcomes.
Flagship Case Studies
Deep dives into the engineering behind the most complex systems in the lab.
4-Node AI Inference Cluster
Problem: Cloud AI APIs are expensive, rate-limited, and create vendor lock-in for serious inference workloads.
Approach: Built a local 4-node cluster with GPU inference on an RTX 4090, centralized storage, and AI agent orchestration across all nodes.
Results: Running 3 LLMs locally (16B, 32B, and 70B parameters), zero cloud inference cost, sub-second response times, 99.9% gateway availability target.
What I Built
- Jasper (i9-13900K, 64GB, RTX 4090) — GPU inference gateway running Ollama
- Nova (N305, 32GB) — Proxmox + TrueNAS + Ansible controller
- Mira (i7-2600K, 16GB) + Orin (Dual Xeon E5-2667v4, 16GB) — compute fleet
- OpenClaw AI agent orchestration with multi-model routing
- 52 cores / 80 threads / 128 GB RAM / 24 GB VRAM total cluster resources
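The multi-model routing idea can be sketched as a tiered dispatcher: prefer the smallest model that can handle a request, and escalate to the 70B only when needed. This is an illustrative sketch, not the cluster's actual registry; the model tags, complexity scale, and `route` helper are assumptions.

```python
# Hypothetical sketch of multi-model routing: pick the cheapest capable
# model for each request. Model tags and complexity ceilings are
# illustrative assumptions, not the cluster's real configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str            # Ollama model tag (assumed name)
    params_b: int        # parameter count in billions
    max_complexity: int  # highest request complexity this tier accepts

# Ordered smallest-first so routing prefers the cheapest capable model.
TIERS = [
    ModelTier("deepseek-r1:16b", 16, 3),
    ModelTier("qwen2.5:32b", 32, 6),
    ModelTier("llama3.3:70b", 70, 10),
]

def route(complexity: int) -> str:
    """Return the model tag of the first tier whose ceiling covers the request."""
    for tier in TIERS:
        if complexity <= tier.max_complexity:
            return tier.name
    return TIERS[-1].name  # clamp anything off the scale to the largest model

# A gateway would then forward the request to Ollama on Jasper, e.g.:
#   POST http://jasper:11434/api/generate  {"model": route(c), "prompt": ...}
```

The escalation order matters: routing cheap requests to the 16B model keeps the RTX 4090 free for the queries that genuinely need the 70B.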
SRE Automation Pipeline
Problem: Manual monitoring and ad-hoc incident response don't scale, even for a homelab. I wanted the same reliability practices used at Google and Netflix.
Approach: Built a complete SRE pipeline from scratch: SLO tracking, error budgets, burn rate alerting, incident management, and postmortem generation.
Results: 6 SLOs tracked across 5 rolling windows, automated incident creation on burn rate spikes, recruiter-grade postmortems auto-generated on resolution.
What I Built
- SLO engine: 6 objectives, including gateway availability 99.9%, inference 99.5%, dashboard 99%
- Error budget tracking with 5 rolling windows (1h, 6h, 24h, 7d, 28d)
- Multi-window burn rate alerting (fast/medium/slow thresholds)
- Incident Commander: auto-trigger, timeline, evidence packs, postmortem markdown
- Gatekeeper safety gates: deny risky actions when burn rate is high
- 100+ tests across the full pipeline
GitOps Backup & Secret Scanning
Problem: Infrastructure configs scattered across nodes with no audit trail, backup, or secret leak prevention.
Approach: Automated daily state collection from every node, sanitized before commit, with CI-based secret scanning as a hard gate.
Results: 4 nodes auto-committing daily, 11 secret patterns blocked by CI, zero credential leaks since deployment.
What I Built
- Per-node backup scripts: systemd timers (Linux) + scheduled tasks (Windows)
- Sanitize pipeline stripping API keys, tokens, emails, phone numbers before commit
- GitHub Actions workflow scanning every push for 11 secret patterns
- Credential policy documentation and incident response runbook
- Segregated repo structure: each node owns its folder
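The core of the sanitize step is regex-based redaction applied before anything is committed. The sketch below shows the idea with a sample of three patterns; the real gate enforces 11, and the exact expressions here are assumptions for illustration.

```python
# Illustrative sketch of the sanitize pipeline: redact secret-shaped
# strings before configs are committed. These three patterns are a
# sample, not the 11 patterns the CI gate actually enforces.
import re

API_KEY = re.compile(r"(?i)(api[_-]?key\s*[:=]\s*)\S+")
TOKEN   = re.compile(r"(?i)(token\s*[:=]\s*)\S+")
EMAIL   = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(text: str) -> str:
    """Replace secret-shaped values with a placeholder, keeping the key names."""
    out = API_KEY.sub(r"\1<REDACTED>", text)   # keep "api_key: ", drop the value
    out = TOKEN.sub(r"\1<REDACTED>", out)
    out = EMAIL.sub("<REDACTED>", out)         # emails are dropped entirely
    return out
```

Running the same patterns in a GitHub Actions job turns redaction into a hard gate: a push that still matches any pattern after sanitization fails CI instead of landing in history.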
Proxmox Virtualization Cluster
Problem: I needed isolated environments for services, testing, and development without buying more hardware.
Approach: 3-node Proxmox cluster with centralized TrueNAS storage, Ansible-managed VM provisioning, and automated health checks.
Results: Running 10+ VMs/containers across heterogeneous hardware (N305, i7-2600K, Dual Xeon), managed via IaC.
What I Built
- 3-node Proxmox VE cluster (Nova + Mira + Orin)
- TrueNAS on Nova: NFS/SMB shares for centralized VM storage
- Ansible playbooks for VM provisioning and configuration
- Automated health checks every 6 hours
- Mixed hardware fleet: consumer to enterprise (Dell PowerEdge R630)
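The 6-hourly health check reduces to a simple probe loop: try to reach each node's service port and report what failed. This is a hedged sketch of that idea; the hostnames are assumptions, though 8006 is Proxmox VE's standard web UI port.

```python
# Sketch of the periodic cluster health check: probe each node's service
# port over TCP and collect pass/fail results. Hostnames are assumed;
# 8006 is Proxmox VE's standard web UI port. In the lab this style of
# check would run from a systemd timer every 6 hours.
import socket

NODES = {
    "nova": ("nova.lan", 8006),
    "mira": ("mira.lan", 8006),
    "orin": ("orin.lan", 8006),
}

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_checks(nodes=NODES) -> dict:
    """Probe every node and map its name to a healthy/unhealthy flag."""
    return {name: check_tcp(host, port) for name, (host, port) in nodes.items()}
```

Keeping the probe this dumb is deliberate: a TCP connect catches dead VMs, crashed services, and broken VLAN routes alike, and anything subtler is the SLO pipeline's job.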
Network Security & Segmentation
Problem: A flat network with mixed workloads (AI inference, storage, management) creates unnecessary blast radius for security incidents.
Approach: Deployed OPNsense on dedicated hardware with VLAN segmentation, firewall rules, and Tailscale overlay for secure remote access.
Results: Traffic isolated between management, storage, and inference planes. SSH key-only auth, no password-based access to any system.
What I Built
- OPNsense firewall on Qotom Q20342G9 dedicated appliance
- VLAN segmentation: management, storage, inference, IoT
- UniFi U7 Pro XG access point (WiFi 7)
- Tailscale mesh VPN for zero-trust remote access
- 2.5GbE + 10GbE network segments for performance-sensitive traffic
- SSH key-only authentication across all nodes
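The "SSH key-only" policy is enforceable mechanically: parse each node's `sshd_config` and flag any directive that would re-enable password logins. The sketch below shows one way to audit that, assuming a plain config string as input; the directive names are real OpenSSH options, but the audit function itself is illustrative.

```python
# Illustrative audit of the key-only SSH policy: parse an sshd_config
# and report directives that drift from the required values. The parsing
# helper is a sketch; the directive names are real OpenSSH options.
def audit_sshd(config_text: str) -> list:
    """Return a list of policy violations found in an sshd_config."""
    required = {
        "passwordauthentication": "no",
        "permitrootlogin": "prohibit-password",
        "pubkeyauthentication": "yes",
    }
    seen = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)           # "Directive value" pairs
        if len(parts) == 2:
            seen[parts[0].lower()] = parts[1].strip().lower()
    violations = []
    for key, want in required.items():
        got = seen.get(key)
        if got != want:
            violations.append(f"{key}: expected {want!r}, found {got!r}")
    return violations
```

An empty result means the node complies; in this setup the check would run as part of the Ansible plays so drift gets flagged, not just configured once and forgotten.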
Chaos Testing & Resilience Engineering
Problem: You can't trust your infrastructure until you've intentionally broken it and watched it recover.
Approach: Built a chaos injection framework that kills services, saturates resources, and partitions networks — then validates recovery automatically.
Results: 95% chaos test success rate target, resilience score tracking, regression gates that block deploys if reliability drops.
What I Built
- Chaos injection framework: service kills, resource saturation, network partitions
- Resilience score engine: quantified reliability with regression gates
- Automatic recovery validation after each chaos experiment
- Integration with SLO pipeline — chaos failures trigger incident management
- Evidence pack collection for post-chaos analysis
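The chaos loop above boils down to: inject a fault, poll for recovery, and fold the outcome into a score that can gate deploys. Here is a toy sketch of that loop; the injector and health-probe callables are stand-ins for the real service-kill and health-check plumbing, and the function names are assumptions.

```python
# Toy sketch of the chaos loop: inject a fault, poll until the system
# recovers or a deadline passes, and aggregate outcomes into a resilience
# score. The inject/healthy callables stand in for real fault injection
# (service kills, saturation, partitions) and real health probes.
import time
from typing import Callable

def run_experiment(inject: Callable[[], None],
                   healthy: Callable[[], bool],
                   recovery_timeout: float = 30.0,
                   poll: float = 0.01) -> bool:
    """Inject a fault, then poll the health probe until recovery or timeout."""
    inject()
    deadline = time.monotonic() + recovery_timeout
    while time.monotonic() < deadline:
        if healthy():
            return True
        time.sleep(poll)
    return False

def resilience_score(results: list) -> float:
    """Fraction of chaos experiments that recovered; gate deploys below target."""
    return sum(results) / len(results) if results else 0.0
```

Wiring the score into the deploy path is what makes it a regression gate: if `resilience_score` drops below the 95% target, the pipeline blocks the release and a failed experiment opens an incident through the same SLO machinery described above.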