Josh Kappler

I build autonomous AI agents.

I build production AI agents from scratch. I write the orchestration layer myself: tool loops, state machines, memory, and multi-provider routing. No LangChain, no CrewAI. Before engineering I grew a YouTube channel (Boffy) to 2.1M subscribers, so I know how to see long projects through to the end and how to explain technical things in a way people actually want to watch.

Open to founding, forward-deployed, applied-AI, and DevRel roles · San Francisco or remote

01 / Projects

What I have built

Everything here was built from scratch, orchestration layer included: no LangChain, no CrewAI, no agent frameworks.

01 Deal Analysis + Investment Memo Platform

memo-engine

Live demo: the AMC Entertainment run, end to end

memo-engine is an AI deal-analysis and investment-memo platform built for a private credit investment firm. It takes a messy deal data room (PDFs, Excel models, Word drafts, Outlook emails, scans) and produces an institutional-format credit memo where every claim is cited back to the exact source page or cell. Reasoning passes run on Claude Fable 5 over agentic RAG with pgvector; parsing, extraction, drafting, and export run as durable workflow steps. The client build is under NDA. The public demo is the same system pointed at public data: an 80-file SEC data room for AMC Entertainment, ingested and analyzed end to end, browsable down to each citation.
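The citation guarantee described above can be pictured as a constraint on the shape of the data. The following is a minimal sketch only, not memo-engine's code: the names `Citation`, `Claim`, `findUncitedClaims`, and `formatCitation` are hypothetical, and the real schema is not public. It assumes each claim carries at least one citation pointing to either a document page or a spreadsheet cell, and that a draft can be checked for claims that arrive without one.

```typescript
// Hypothetical types for illustration; the real schema is not shown in the source.
type Citation =
  | { kind: "page"; file: string; page: number } // e.g. a PDF page
  | { kind: "cell"; file: string; sheet: string; cell: string }; // e.g. an Excel cell

interface Claim {
  text: string;
  citations: Citation[];
}

// A citation is usable only if it names a file and a plausible location.
function isWellFormed(c: Citation): boolean {
  if (c.file.trim() === "") return false;
  if (c.kind === "page") return Number.isInteger(c.page) && c.page >= 1;
  // Assumed A1-style cell reference such as "B7" or "AA12".
  return c.sheet.trim() !== "" && /^[A-Z]+[1-9][0-9]*$/.test(c.cell);
}

// Returns the claims with no usable citation, so a draft could be
// rejected or sent back for another pass before export.
function findUncitedClaims(claims: Claim[]): Claim[] {
  return claims.filter((claim) => !claim.citations.some(isWellFormed));
}

// Formats a citation the way a footnote might display it.
function formatCitation(c: Citation): string {
  return c.kind === "page"
    ? `${c.file}, p. ${c.page}`
    : `${c.file}, ${c.sheet}!${c.cell}`;
}
```

Modeling the citation as a discriminated union means a page reference and a cell reference cannot be mixed up at compile time, which is one plausible way to keep "cited to the exact page or cell" enforceable rather than aspirational.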

Next.js 16 · TypeScript · Anthropic SDK · Claude Fable 5 · Postgres · pgvector · Voyage AI · Vercel Workflow DevKit

Technical Details

  • Contextual retrieval: Sonnet 4.6 writes a context prefix for each chunk against the full document, with the first 400K chars cached via ephemeral prompt caching so every call after the first reads them at the $0.30/M cached rate
  • Voyage AI voyage-3 embeddings (1024-dim) batched by byte budget (≤400KB, ≤96 items) to respect the 320K-token-per-batch cap on dense financial text
  • Forced tool_use with Zod-to-JSON-schema for ~40-field structured extraction: credit snapshot, capital structure, financials, covenants, management, comps, scenarios
  • Reasoning and extraction split by API constraint: Fable 5 thinks through the deal (thinking is always on), then Sonnet runs the forced tool_use extraction, because the API rejects thinking combined with forced tool choice
  • Durable pipeline orchestration via Vercel Workflow DevKit: parse, analysis, research, internal memo, and external memo each run as a step with its own 800s budget
  • Multi-format export: PDF via @sparticuz/chromium + puppeteer-core (Vercel-compatible headless Chromium), Excel with ExcelJS formulas and sensitivity tables, DOCX, ZIP bundle
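The byte-budget batching mentioned in the list above can be sketched as a greedy packer. This is an illustration under assumptions, not the project's code: the function names are hypothetical, and only the two limits (400KB and 96 items) come from the text. A byte count is used here as a cheap stand-in for a token count, since it needs no tokenizer.

```typescript
// Limits quoted in the list above; everything else here is illustrative.
const MAX_BATCH_BYTES = 400_000; // at most 400KB per batch
const MAX_BATCH_ITEMS = 96; // at most 96 chunks per batch

// UTF-8 size of a string, assuming well-formed UTF-16 input.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++; // skip the low surrogate
    } else bytes += 3;
  }
  return bytes;
}

// Greedily packs chunks into batches without exceeding either limit.
// A single chunk larger than the byte budget still gets its own batch;
// a caller would need to split it further upstream.
function batchByByteBudget(
  chunks: string[],
  maxBytes: number = MAX_BATCH_BYTES,
  maxItems: number = MAX_BATCH_ITEMS,
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentBytes = 0;
  for (const chunk of chunks) {
    const size = utf8ByteLength(chunk);
    const full =
      current.length > 0 &&
      (currentBytes + size > maxBytes || current.length >= maxItems);
    if (full) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(chunk);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each returned batch would then be sent as one embeddings request. Order is preserved, so embeddings can be zipped back onto their chunks by position.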

02 Full-Stack Claim Adjudication · Built in 36 Hours

claim-wright

The claim workspace, batch history, and calibration workbench (sample data)

claim-wright is a fully working full-stack claim-adjudication system built end to end in a single 36-hour sprint. It reads the documents behind a security-deposit insurance claim (lease, tenant ledger, deposit-waiver addendum, move-out itemization, repair invoices) and recommends a payout capped at the policy benefit, or a decline, with a line-by-line audit trail behind every dollar. The split is the whole point: Claude Opus 4.8 does the reading and extracts structured facts, then a pure-Python deterministic engine applies the caps, rules, and eligibility gates, so a payout can never be a number the model invented. On a held-out test split it lands within $250 of the human adjudicator on 91% of claims with a median error of $0, at about $0.33 per claim.
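The read/decide split described above can be illustrated with a small pure function. This is a sketch under stated assumptions: the engine in the text is pure Python, while this version is written in TypeScript purely for illustration, and every type name, field, and rule shape below is hypothetical. Amounts are kept in integer cents to avoid floating-point drift.

```typescript
// All names and rule shapes are hypothetical; amounts are integer cents.
interface Charge { label: string; amountCents: number; sourceLine: string }

interface ExtractedFacts {
  eligible: boolean; // read by the model; the model never supplies a payout
  ledgerBalanceCents: number; // unpaid balance read from the tenant ledger
  charges: Charge[];
}

interface Rulebook {
  policyBenefitCents: number; // hard ceiling on any payout
  chargeCapsCents: Record<string, number>; // optional per-label caps
}

interface AuditLine { item: string; claimedCents: number; allowedCents: number; note: string }

interface Decision { outcome: "pay" | "decline"; payoutCents: number; audit: AuditLine[] }

// Pure function: the same facts and rulebook always give the same decision.
function adjudicate(facts: ExtractedFacts, rules: Rulebook): Decision {
  if (!facts.eligible) {
    const gate: AuditLine = { item: "eligibility", claimedCents: 0, allowedCents: 0, note: "eligibility gate failed" };
    return { outcome: "decline", payoutCents: 0, audit: [gate] };
  }
  const audit: AuditLine[] = [{
    item: "ledger balance",
    claimedCents: facts.ledgerBalanceCents,
    allowedCents: facts.ledgerBalanceCents,
    note: "tenant ledger",
  }];
  for (const charge of facts.charges) {
    const cap = rules.chargeCapsCents[charge.label];
    const allowed = cap === undefined ? charge.amountCents : Math.min(charge.amountCents, cap);
    audit.push({
      item: charge.label,
      claimedCents: charge.amountCents,
      allowedCents: allowed,
      note: allowed < charge.amountCents ? `capped; see ${charge.sourceLine}` : charge.sourceLine,
    });
  }
  let subtotal = 0;
  for (const line of audit) subtotal += line.allowedCents;
  const payoutCents = Math.min(subtotal, rules.policyBenefitCents);
  if (payoutCents < subtotal) {
    audit.push({ item: "policy benefit cap", claimedCents: subtotal, allowedCents: payoutCents, note: "capped at policy benefit" });
  }
  return { outcome: payoutCents > 0 ? "pay" : "decline", payoutCents, audit };
}
```

Because the payout is computed only from extracted facts and rulebook values, every dollar in the result appears in an audit line, and no figure is ever taken directly from model output.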

Python 3.13 · Anthropic SDK · Claude Opus 4.8 · Pydantic · Django-Ninja · React 19 · Vite · SQLite

Technical Details

  • Full stack in a weekend: a Python adjudication core, a Django-Ninja API, a React 19 single-page app, and a packaged desktop build, all shipped end to end in 36 hours
  • Model reads, engine decides: forced tool_use extraction pulls charges, ledger balance, and eligibility, then a pure-Python function computes the payout, so every dollar traces back to code and a document line and nothing is hallucinated
  • Accuracy on the held-out test split: 91% of claims within $250 of the human decision, median error of $0, mean absolute error of $62, at about $0.33 per claim
  • Multi-user with per-tenant SQLite databases and workspace sharing: a run can be snapshotted and shared read-only into a space, copied on share so a viewer never touches the originator's live data
  • Security hardening throughout: master-approved signup, PBKDF2 passwords, session tokens stored only as SHA-256 hashes, and an allow-list column projection that structurally blocks the human-answer fields from ever reaching the model
  • Built-in white-hat security pass: the code is reviewed by autohack, my own autonomous bug-hunter, which traces user input to sinks and has a second model try to disprove each finding
  • Hybrid document reading routes each PDF page by text density: about 75% of pages are read at no API cost with pure-Python pdfminer, scanned pages go to vision, and with no poppler or tesseract binaries the same code runs everywhere, including the desktop build
  • Calibration with zero API calls: extractions are stored and the engine is a pure function, so a candidate rulebook (JSON, not code) re-scores against the human decisions by replaying stored reads
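The replay-based calibration in the last item can be sketched as a scoring loop over stored extractions. This is an assumption-laden illustration in TypeScript (the described engine is Python): `StoredCase`, `CalibrationReport`, and `calibrate` are hypothetical names, the error is taken as an absolute dollar difference, and "within $250" is treated as inclusive.

```typescript
// Hypothetical shapes: a stored extraction paired with the human payout.
interface StoredCase<Facts> {
  facts: Facts; // model output saved from an earlier run
  humanPayout: number; // the adjudicator's decision, in dollars
}

interface CalibrationReport {
  withinTolerance: number; // share of cases within the tolerance, 0..1
  medianAbsError: number;
  meanAbsError: number;
}

// Re-scores a candidate rulebook against stored reads. No model is called:
// `engine` is a pure function over facts that were extracted earlier.
function calibrate<Facts, Rules>(
  cases: StoredCase<Facts>[],
  rulebook: Rules,
  engine: (facts: Facts, rulebook: Rules) => number,
  toleranceDollars: number = 250,
): CalibrationReport {
  if (cases.length === 0) throw new Error("no stored cases to replay");
  const errors: number[] = [];
  for (const c of cases) {
    errors.push(Math.abs(engine(c.facts, rulebook) - c.humanPayout));
  }
  const sorted = errors.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  let sum = 0;
  let within = 0;
  for (const e of errors) {
    sum += e;
    if (e <= toleranceDollars) within += 1;
  }
  return {
    withinTolerance: within / errors.length,
    medianAbsError: median,
    meanAbsError: sum / errors.length,
  };
}
```

Since the rulebook is plain data, comparing two candidates amounts to calling `calibrate` twice over the same stored cases and comparing the reports.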

03 Autonomous Security Agent

autohack

The real-time hunt dashboard (sample data)

autohack is a 5-package TypeScript monorepo that polls four bounty platforms, spawns hour-long Claude sessions to hunt for vulnerabilities, validates its own findings through adversarial review, and submits reports without human intervention. A separate Sonnet pass compresses verbose findings before submission. The system writes hunt outcomes, near-misses, and triager feedback to a JSON memory store so every future session starts with context from every past one. The same harness also runs a bounty agent on the Algora platform: it spawns Claude Code sessions for long autonomous runs, executes the test suite, opens PRs, and addresses review feedback on its own.
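The "validates its own findings" step can be pictured as a gate on the reviewer model's output. This sketch assumes the reviewer answers a fixed list of 15 yes/no rubric questions as a JSON array and that the score is the count of "yes" answers; that reading of the 0-15 scale, the JSON format, and all function names are assumptions. Only the scale and the threshold of 8 come from the project description.

```typescript
const RUBRIC_SIZE = 15; // 0-15 scale from the project description
const PASS_THRESHOLD = 8; // anything below 8 is rejected

interface ReviewVerdict {
  score: number;
  accepted: boolean;
}

// Parses the reviewer's raw reply, assumed here to be a JSON array of
// 15 booleans. Model output is untrusted, so every element is checked.
function parseRubricAnswers(raw: string): boolean[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed) || parsed.length !== RUBRIC_SIZE) {
    throw new Error(`expected a JSON array of ${RUBRIC_SIZE} booleans`);
  }
  const answers: boolean[] = [];
  for (let i = 0; i < parsed.length; i++) {
    const value: unknown = parsed[i];
    if (typeof value !== "boolean") {
      throw new Error(`rubric item ${i + 1} is not a boolean`);
    }
    answers.push(value);
  }
  return answers;
}

// Score is the number of rubric items answered "yes".
function scoreFinding(answers: boolean[]): ReviewVerdict {
  if (answers.length !== RUBRIC_SIZE) {
    throw new Error(`expected ${RUBRIC_SIZE} rubric answers, got ${answers.length}`);
  }
  let score = 0;
  for (const answer of answers) if (answer) score += 1;
  return { score, accepted: score >= PASS_THRESHOLD };
}
```

Failing closed on a malformed reply (throwing rather than guessing) keeps an unparseable review from being mistaken for an approval.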

TypeScript · Anthropic SDK · Next.js 15 · SQLite · Drizzle · tRPC

Technical Details

  • 12-state finding lifecycle from discovery through submission across HackerOne, Immunefi, Huntr, and an aggregator covering Bugcrowd, Intigriti, and YesWeHack
  • Adversarial review: a separate Claude instance scores each finding on a 0-15 rubric of binary checks, and anything below 8 is rejected before it reaches a triager
  • Ephemeral prompt caching cuts input tokens by roughly 90% across repeated hunt sessions, with a local backend fallback for development
  • Cross-process coordination via lock files, shared runtime-override JSON with a 2-second TTL cache, and stale-PID detection on startup
  • Error classification (transient, permanent, validation, timeout) decides whether to retry, skip, or kill the hunt
  • Real-time tRPC dashboard with xterm.js terminal streaming live Claude tool calls and reasoning
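The error-classification item above names four classes and three outcomes but not the mapping between them. The sketch below fills that gap with assumed heuristics: the status codes, the retry limit of 3, and which class leads to which action are all illustrative choices, not autohack's actual policy.

```typescript
type ErrorClass = "transient" | "permanent" | "validation" | "timeout";
type HuntAction = "retry" | "skip" | "kill";

// Assumed heuristics: classify by HTTP-style status code or error name.
function classifyError(err: { status?: number; name?: string }): ErrorClass {
  if (err.name === "TimeoutError" || err.status === 408 || err.status === 504) {
    return "timeout";
  }
  if (err.status === 429 || (err.status !== undefined && err.status >= 500)) {
    return "transient";
  }
  if (err.status === 400 || err.status === 422) return "validation";
  return "permanent";
}

// Assumed policy: retry what may succeed later, skip bad input,
// and kill the hunt when nothing can change the outcome.
function decideAction(
  kind: ErrorClass,
  attempt: number,
  maxAttempts: number = 3,
): HuntAction {
  switch (kind) {
    case "transient":
      return attempt < maxAttempts ? "retry" : "skip";
    case "timeout":
      return attempt < maxAttempts ? "retry" : "kill";
    case "validation":
      return "skip";
    case "permanent":
      return "kill";
  }
}
```

Keeping classification and decision as two separate pure functions makes each one easy to test, and lets the retry policy change without touching how errors are recognized.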

04 Claude Code, Driven From an Apple Watch

pinch

Real screens from the watchOS app

pinch drives a real Claude Code session from an Apple Watch over cellular. A native watchOS SwiftUI app is the thin client; a Node and TypeScript backend runs the Claude Agent SDK against the live repos on my Mac, and a tunnel exposes it to the wrist. watchOS refuses WebSockets on the watch's network path, so the transport is plain HTTP request/response with a short-poll loop instead of a socket. Prompts go through a durable on-device outbox that retries until a confirmed 2xx, and the backend dedups by client prompt id so an at-least-once retry never double-runs a turn. Session state is recorded durably, so a backend restart or idle sweep revives the same conversation with full context through the SDK's resume.
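The at-least-once delivery plus backend dedup described above can be sketched as an idempotent submit handler. This is a minimal in-memory illustration with hypothetical names (`PromptDeduper`, `runTurn`); a real backend would presumably persist the seen ids alongside its durable session record.

```typescript
interface SubmitResult {
  promptId: string;
  status: "accepted" | "duplicate";
}

// In-memory stand-in for the backend's dedup table.
class PromptDeduper {
  private readonly seen = new Set<string>();

  // `runTurn` is a hypothetical callback that starts the agent turn.
  // A duplicate is still reported as a success, so the client's outbox
  // can treat either status as delivered and remove the prompt.
  submit(
    promptId: string,
    text: string,
    runTurn: (text: string) => void,
  ): SubmitResult {
    if (this.seen.has(promptId)) {
      return { promptId, status: "duplicate" };
    }
    // Record the id before running, so a retry that arrives while the
    // turn is in progress cannot start it a second time.
    this.seen.add(promptId);
    runTurn(text);
    return { promptId, status: "accepted" };
  }
}
```

The client generates the id once, when the prompt is first written to the outbox, and reuses it on every retry. That is what lets the backend tell a retry apart from a new prompt with the same text.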

Swift · SwiftUI · watchOS · TypeScript · Claude Agent SDK · Node.js · ngrok

Technical Details

  • HTTP request/response with a short-poll loop instead of a socket: watchOS refuses URLSessionWebSocketTask on the watch's cellular path, so the watch polls /api/* while the browser simulator keeps the WebSocket, both driving one shared session lifecycle
  • Durable persisted outbox on the watch: a prompt is removed only on a confirmed 2xx, drained FIFO single-flight with Sending / Sent / Not sent states, and the backend dedups by client prompt id so a retry can never double-run a turn
  • Session resume across restarts: the app's own session id maps to the SDK session id in a durable record, so an idle-swept or restarted backend rebuilds the conversation with options.resume and Claude keeps full context
  • Poll-cursor invariant kills duplicate replies: a resumed session continues its persisted cursor while a revived session resets to zero on a backend reset signal, so the event log never re-delivers history
  • Watch-aware output: a cached system-prompt append tells the model it is speaking to a wrist screen with text-to-speech, so replies stay plain-text and brief without touching tools, edits, or rigor
  • Stable ngrok static domain with bearer-token auth on every request; the watch can restart the backend from Settings, which builds first and only swaps the process if the build succeeds
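The poll-cursor invariant in the list above can be sketched with a cursor into an append-only event log. The source says only that a revived session resets to zero on a backend reset signal; modeling that signal as an `epoch` number that changes whenever the log is rebuilt is an assumption, as are the names `EventLog` and `PollClient`.

```typescript
interface PollResponse {
  epoch: number; // identifies which incarnation of the log answered
  reset: boolean; // true when the client's cursor belonged to an older log
  events: string[];
  nextCursor: number;
}

class EventLog {
  readonly epoch: number;
  private readonly events: string[] = [];

  constructor(epoch: number) {
    this.epoch = epoch;
  }

  append(event: string): void {
    this.events.push(event);
  }

  // Returns only the events after the client's cursor. If the client's
  // epoch is stale, its cursor is meaningless, so reading restarts at zero.
  poll(clientEpoch: number, cursor: number): PollResponse {
    const reset = clientEpoch !== this.epoch;
    const from = reset ? 0 : Math.min(Math.max(cursor, 0), this.events.length);
    return {
      epoch: this.epoch,
      reset,
      events: this.events.slice(from),
      nextCursor: this.events.length,
    };
  }
}

class PollClient {
  epoch = 0;
  cursor = 0;
  readonly received: string[] = [];

  // One short-poll round trip; a real client would repeat this on a timer.
  pollOnce(log: EventLog): void {
    const res = log.poll(this.epoch, this.cursor);
    for (const event of res.events) this.received.push(event);
    this.epoch = res.epoch;
    this.cursor = res.nextCursor;
  }
}
```

The cursor advances only to what the server reports as `nextCursor`, so a poll that returns nothing leaves the client exactly where it was and the next poll cannot replay earlier events.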

02 / YouTube

2.1M subscribers on YouTube · 270M+ total views · 136+ videos · 7+ years

I have been creating content on YouTube for over seven years under the name Boffy. I grew the channel from zero to 2.1 million subscribers, mostly gaming. No team at first. A lot of the videos were technical: game modding, and how PC parts like graphics cards change the way a game runs. I learned how to make that watchable for a huge audience.

Eventually I hired editors and designers, negotiated sponsorships with RedMagic, Wargaming, GeoGuessr, and others, and spent a lot of time in analytics trying to figure out what was actually working.

Running a YouTube channel at this scale is mostly a feedback loop. You put something out, look at how people respond, and adjust. It is the same instinct I bring to shipping and explaining software.

Brand Partnerships

RedMagic · Wargaming · GeoGuessr · YouTooz · Factor · GamerSupps · Ellify

03 / About

How I got here

I build AI agents from scratch. I write the orchestration layer myself: tool-use loops, state machines, memory management, and multi-provider routing. Every system in the project list was built solo, with no LangChain, no CrewAI, and no agent frameworks.

Before this I spent seven years growing a YouTube channel from zero to 2.1 million subscribers, mostly gaming. A lot of it meant taking something complicated, like a modding workflow or why one graphics card beats another, and making a general audience actually want to sit through it. That is the same skill developer advocacy runs on, which is why it interests me as much as engineering does.

What I work with

TypeScript · Python · Next.js · PostgreSQL · SQLite · Zod · Pydantic · Anthropic SDK · Claude Fable 5 · Groq · OpenRouter

How I build

  • Hand-rolled orchestration, no LangChain, no CrewAI
  • Claude Code as primary dev tool
  • Model tiering per step: Fable 5 reasons, Sonnet extracts, Haiku routes
  • Multi-provider LLM routing (Claude, Groq, OpenRouter, Ollama)
  • Full-stack: backend, frontend, dashboards, deployment
  • State machines for agent lifecycle management
  • Recording outcomes and feeding them back into future runs
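As an illustration of the state-machine item in the list above, a lifecycle can be written as a typed transition table. The states and events below are hypothetical; none of the projects' actual lifecycles (such as the 12-state finding lifecycle) are enumerated in the source, so this is a generic sketch of the technique only.

```typescript
// Hypothetical states and events, chosen only to show the pattern.
type AgentState = "idle" | "running" | "reviewing" | "done" | "failed";
type AgentEvent = "start" | "finding" | "approve" | "reject" | "error";

// Every legal move is listed here; anything absent is illegal.
const TRANSITIONS: Record<AgentState, Partial<Record<AgentEvent, AgentState>>> = {
  idle: { start: "running" },
  running: { finding: "reviewing", error: "failed" },
  reviewing: { approve: "done", reject: "running", error: "failed" },
  done: {},
  failed: {},
};

// Applies one event, throwing on a transition the table does not allow.
function transition(state: AgentState, event: AgentEvent): AgentState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`illegal transition: ${event} from ${state}`);
  }
  return next;
}

// Replays a sequence of events from an initial state.
function replay(events: AgentEvent[], initial: AgentState = "idle"): AgentState {
  let state = initial;
  for (const event of events) state = transition(state, event);
  return state;
}
```

With the table as the single source of truth, an agent cannot drift into an undefined state, and adding a state forces the compiler to ask for its row.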

04 / Contact

Get in touch.

I'm looking for AI engineering or developer-advocacy roles at early-stage startups in the Bay Area or remote. I spent seven years making gaming videos for a 2.1M-subscriber audience, a lot of it explaining game modding and PC hardware, and now I build AI agents from scratch. If you're building something interesting, I want to hear about it.

Josh Kappler · 2026