Projects
Real infrastructure, real automation, real results. Each project solves a specific problem with measurable outcomes.
Flagship Case Studies
Deep dives into the engineering behind the most complex systems in the lab.
4-Node AI Inference Cluster
Problem: Cloud AI APIs are expensive, rate-limited, and create vendor lock-in for serious inference workloads.
Approach: Built a local 4-node cluster with GPU inference on an RTX 4090, centralized storage, and AI agent orchestration across all nodes.
Results: Running 3 LLMs locally (16B, 32B, and 70B parameters), zero cloud inference cost, sub-second response times, 99.9% gateway availability target.
What I Built
- Jasper (i9-13900K, 64GB, RTX 4090) — GPU inference gateway running Ollama
- Nova (N305, 32GB) — Proxmox + TrueNAS + Ansible controller
- Mira (i7-2600K, 16GB) + Orin (Dual Xeon E5-2667v4, 16GB) — compute fleet
- OpenClaw AI agent orchestration with multi-model routing
- 52 cores / 80 threads / 128 GB RAM / 24 GB VRAM total cluster resources
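The multi-model routing idea can be sketched as a tiered dispatcher: prefer the smallest model that can handle a request, and escalate to the 70B only when needed. This is an illustrative sketch, not the cluster's actual registry; the model tags, complexity scale, and `route` helper are assumptions.

```python
# Hypothetical sketch of multi-model routing: pick the cheapest capable
# model for each request. Model tags and complexity ceilings are
# illustrative assumptions, not the cluster's real configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str            # Ollama model tag (assumed name)
    params_b: int        # parameter count in billions
    max_complexity: int  # highest request complexity this tier accepts

# Ordered smallest-first so routing prefers the cheapest capable model.
TIERS = [
    ModelTier("deepseek-r1:16b", 16, 3),
    ModelTier("qwen2.5:32b", 32, 6),
    ModelTier("llama3.3:70b", 70, 10),
]

def route(complexity: int) -> str:
    """Return the model tag of the first tier whose ceiling covers the request."""
    for tier in TIERS:
        if complexity <= tier.max_complexity:
            return tier.name
    return TIERS[-1].name  # clamp anything off the scale to the largest model

# A gateway would then forward the request to Ollama on Jasper, e.g.:
#   POST http://jasper:11434/api/generate  {"model": route(c), "prompt": ...}
```

The escalation order matters: routing cheap requests to the 16B model keeps the RTX 4090 free for the queries that genuinely need the 70B.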
SRE Automation Pipeline
Problem: Manual monitoring and ad-hoc incident response don't scale, even for a homelab. I wanted the same reliability practices used at Google and Netflix.
Approach: Built a complete SRE pipeline from scratch: SLO tracking, error budgets, burn rate alerting, incident management, and postmortem generation.
Results: 6 SLOs tracked across 5 rolling windows, automated incident creation on burn rate spikes, recruiter-grade postmortems auto-generated on resolution.
What I Built
- SLO engine: 6 objectives, including gateway availability 99.9%, inference 99.5%, dashboard 99%
- Error budget tracking with 5 rolling windows (1h, 6h, 24h, 7d, 28d)
- Multi-window burn rate alerting (fast/medium/slow thresholds)
- Incident Commander: auto-trigger, timeline, evidence packs, postmortem markdown
- Gatekeeper safety gates: deny risky actions when burn rate is high
- 100+ tests across the full pipeline
GitOps Backup & Secret Scanning
Problem: Infrastructure configs scattered across nodes with no audit trail, backup, or secret leak prevention.
Approach: Automated daily state collection from every node, sanitized before commit, with CI-based secret scanning as a hard gate.
Results: 4 nodes auto-committing daily, 11 secret patterns blocked by CI, zero credential leaks since deployment.
What I Built
- Per-node backup scripts: systemd timers (Linux) + scheduled tasks (Windows)
- Sanitize pipeline stripping API keys, tokens, emails, phone numbers before commit
- GitHub Actions workflow scanning every push for 11 secret patterns
- Credential policy documentation and incident response runbook
- Segregated repo structure: each node owns its folder
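The core of the sanitize step is regex-based redaction applied before anything is committed. The sketch below shows the idea with a sample of three patterns; the real gate enforces 11, and the exact expressions here are assumptions for illustration.

```python
# Illustrative sketch of the sanitize pipeline: redact secret-shaped
# strings before configs are committed. These three patterns are a
# sample, not the 11 patterns the CI gate actually enforces.
import re

API_KEY = re.compile(r"(?i)(api[_-]?key\s*[:=]\s*)\S+")
TOKEN   = re.compile(r"(?i)(token\s*[:=]\s*)\S+")
EMAIL   = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(text: str) -> str:
    """Replace secret-shaped values with a placeholder, keeping the key names."""
    out = API_KEY.sub(r"\1<REDACTED>", text)   # keep "api_key: ", drop the value
    out = TOKEN.sub(r"\1<REDACTED>", out)
    out = EMAIL.sub("<REDACTED>", out)         # emails are dropped entirely
    return out
```

Running the same patterns in a GitHub Actions job turns redaction into a hard gate: a push that still matches any pattern after sanitization fails CI instead of landing in history.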
Proxmox Virtualization Cluster
Problem: I needed isolated environments for services, testing, and development without buying more hardware.
Approach: 3-node Proxmox cluster with centralized TrueNAS storage, Ansible-managed VM provisioning, and automated health checks.
Results: Running 10+ VMs/containers across heterogeneous hardware (N305, i7-2600K, Dual Xeon), managed via IaC.
What I Built
- 3-node Proxmox VE cluster (Nova + Mira + Orin)
- TrueNAS on Nova: NFS/SMB shares for centralized VM storage
- Ansible playbooks for VM provisioning and configuration
- Automated health checks every 6 hours
- Mixed hardware fleet: consumer to enterprise (Dell PowerEdge R630)
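The 6-hourly health check reduces to a simple probe loop: try to reach each node's service port and report what failed. This is a hedged sketch of that idea; the hostnames are assumptions, though 8006 is Proxmox VE's standard web UI port.

```python
# Sketch of the periodic cluster health check: probe each node's service
# port over TCP and collect pass/fail results. Hostnames are assumed;
# 8006 is Proxmox VE's standard web UI port. In the lab this style of
# check would run from a systemd timer every 6 hours.
import socket

NODES = {
    "nova": ("nova.lan", 8006),
    "mira": ("mira.lan", 8006),
    "orin": ("orin.lan", 8006),
}

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_checks(nodes=NODES) -> dict:
    """Probe every node and map its name to a healthy/unhealthy flag."""
    return {name: check_tcp(host, port) for name, (host, port) in nodes.items()}
```

Keeping the probe this dumb is deliberate: a TCP connect catches dead VMs, crashed services, and broken VLAN routes alike, and anything subtler is the SLO pipeline's job.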
Network Security & Segmentation
Problem: A flat network with mixed workloads (AI inference, storage, management) creates unnecessary blast radius for security incidents.
Approach: Deployed OPNsense on dedicated hardware with VLAN segmentation, firewall rules, and Tailscale overlay for secure remote access.
Results: Traffic isolated between management, storage, and inference planes. SSH key-only auth, no password-based access to any system.
What I Built
- OPNsense firewall on Qotom Q20342G9 dedicated appliance
- VLAN segmentation: management, storage, inference, IoT
- UniFi U7 Pro XG access point (WiFi 7)
- Tailscale mesh VPN for zero-trust remote access
- 2.5GbE + 10GbE network segments for performance-sensitive traffic
- SSH key-only authentication across all nodes
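The "SSH key-only" policy is enforceable mechanically: parse each node's `sshd_config` and flag any directive that would re-enable password logins. The sketch below shows one way to audit that, assuming a plain config string as input; the directive names are real OpenSSH options, but the audit function itself is illustrative.

```python
# Illustrative audit of the key-only SSH policy: parse an sshd_config
# and report directives that drift from the required values. The parsing
# helper is a sketch; the directive names are real OpenSSH options.
def audit_sshd(config_text: str) -> list:
    """Return a list of policy violations found in an sshd_config."""
    required = {
        "passwordauthentication": "no",
        "permitrootlogin": "prohibit-password",
        "pubkeyauthentication": "yes",
    }
    seen = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)           # "Directive value" pairs
        if len(parts) == 2:
            seen[parts[0].lower()] = parts[1].strip().lower()
    violations = []
    for key, want in required.items():
        got = seen.get(key)
        if got != want:
            violations.append(f"{key}: expected {want!r}, found {got!r}")
    return violations
```

An empty result means the node complies; in this setup the check would run as part of the Ansible plays so drift gets flagged, not just configured once and forgotten.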
Chaos Testing & Resilience Engineering
Problem: You can't trust your infrastructure until you've intentionally broken it and watched it recover.
Approach: Built a chaos injection framework that kills services, saturates resources, and partitions networks — then validates recovery automatically.
Results: 95% chaos test success rate target, resilience score tracking, regression gates that block deploys if reliability drops.
What I Built
- Chaos injection framework: service kills, resource saturation, network partitions
- Resilience score engine: quantified reliability with regression gates
- Automatic recovery validation after each chaos experiment
- Integration with SLO pipeline — chaos failures trigger incident management
- Evidence pack collection for post-chaos analysis
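The chaos loop above boils down to: inject a fault, poll for recovery, and fold the outcome into a score that can gate deploys. Here is a toy sketch of that loop; the injector and health-probe callables are stand-ins for the real service-kill and health-check plumbing, and the function names are assumptions.

```python
# Toy sketch of the chaos loop: inject a fault, poll until the system
# recovers or a deadline passes, and aggregate outcomes into a resilience
# score. The inject/healthy callables stand in for real fault injection
# (service kills, saturation, partitions) and real health probes.
import time
from typing import Callable

def run_experiment(inject: Callable[[], None],
                   healthy: Callable[[], bool],
                   recovery_timeout: float = 30.0,
                   poll: float = 0.01) -> bool:
    """Inject a fault, then poll the health probe until recovery or timeout."""
    inject()
    deadline = time.monotonic() + recovery_timeout
    while time.monotonic() < deadline:
        if healthy():
            return True
        time.sleep(poll)
    return False

def resilience_score(results: list) -> float:
    """Fraction of chaos experiments that recovered; gate deploys below target."""
    return sum(results) / len(results) if results else 0.0
```

Wiring the score into the deploy path is what makes it a regression gate: if `resilience_score` drops below the 95% target, the pipeline blocks the release and a failed experiment opens an incident through the same SLO machinery described above.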