SPEAKER NOTES: SLIDE 1 (10 seconds)

Just the title. Let it breathe.

SAY: "Hi, I'm [name]. I built gpumod because I got tired of fighting my GPU every time I wanted to try a new model. Let me show you what I mean."
SPEAKER NOTES: SLIDE 2 (40 seconds)

SAY: "Raise your hand if you've ever OOM-crashed trying to load a model. Yeah. We've all been there. You see a post on r/LocalLLaMA, and someone says this new model is amazing. But you have 24 gigs. There are 15 different quants. Q4_K_M? Q5_K_S? Q6_K? Which one actually fits? And if it fits, can you still run your embedding service next to it? You end up doing VRAM math in a spreadsheet. Or worse, you YOLO it and get a CUDA OOM."
SPEAKER NOTES: SLIDE 3 (40 seconds)

SAY: "So I built gpumod. It's an MCP server, meaning your AI assistant can directly search models, check VRAM, and manage GPU services. Instead of writing systemd units and doing VRAM math, you just ask in natural language. Let me show you exactly what this looks like. Live."

TRANSITION: Switch to terminal with Claude Code open. Make sure gpumod MCP server is connected. Make sure current mode is "blank" before starting.

PRE-DEMO CHECKLIST:
  gpumod mode switch blank
  # Verify: gpu_status should show ~24GB free, blank mode
SPEAKER NOTES: SLIDE 4 (~3 minutes)

This is the core. You type prompts into Claude Code. Claude calls gpumod MCP tools automatically.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 1: SEARCH (20 seconds)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TYPE: "Search for Qwen3-Coder GGUF models"

Claude calls: search_hf_models(search="Qwen3-Coder", task="code")
→ Shows list of repos including unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

SAY: "There it is. 30B parameters, Mixture of Experts. Let's see which quants fit my 24 gigs."

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 2: FILTER BY VRAM (30 seconds) ← THE MONEY SHOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TYPE: "List GGUF files for unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF that fit my machine"

Claude calls: list_gguf_files(
    repo_id="unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    vram_budget_mb=24000
)
→ Filters out quants that won't fit, shows only viable ones

SAY: "Out of all those quants, only these fit my card. No spreadsheet. No guessing. It just tells me."

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 3: CHECK CURRENT STATE (20 seconds)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TYPE: "What's my GPU status?"

Claude calls: gpu_status()
→ Shows: RTX 4090, 24GB total, blank mode, ~24GB free

SAY: "24 gigs free. But I don't just want the coder model. I also need my embedding service for code search. Will both fit?"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 4: SIMULATE (40 seconds)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TYPE: "Simulate code mode"

Claude calls: simulate_mode(mode_id="code")
→ Shows: qwen3-coder 20GB + vllm-embedding-code 2.5GB = 22.5GB
→ fits: true, headroom: ~2GB

SAY: "22.5 out of 24. It fits with 2 gigs to spare. No OOM crash. I know before I deploy."
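BACKUP (not spoken, only if someone asks how the fit check works): the Step 4 result comes down to summing per-service VRAM estimates and comparing the total against a budget. The sketch below is a minimal illustration of that arithmetic, not gpumod's actual implementation. The function name `simulate`, the dictionary layout, and the flat 24,000 MB budget (borrowed from the vram_budget_mb value in Step 2) are assumptions made for this example.

```python
# Hypothetical sketch of the fit check behind "Simulate code mode".
# Not gpumod's real code: names and structure are illustrative only.

GPU_BUDGET_MB = 24_000  # assumed budget, same figure passed in Step 2


def simulate(services: dict[str, int], budget_mb: int = GPU_BUDGET_MB) -> dict:
    """Sum per-service VRAM estimates (in MB) and report whether they fit."""
    required_mb = sum(services.values())
    return {
        "required_mb": required_mb,
        "fits": required_mb <= budget_mb,
        "headroom_mb": budget_mb - required_mb,
    }


# Estimates quoted in Step 4: 20GB coder model + 2.5GB embedding service.
code_mode = {"qwen3-coder": 20_000, "vllm-embedding-code": 2_500}

result = simulate(code_mode)
print(result)
```

The headroom in this sketch depends entirely on the assumed budget. The live tool measures against whatever capacity the card actually reports, so the figure on screen can differ from this one.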
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 5: SHOW ALL MODES (30 seconds)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TYPE: "Show me all available modes"

Claude calls: list_modes()
→ Shows 10 modes: blank, code, rag, nemotron, speak, multi-agent-qwen3-coder, multi-agent-qwen3-next, etc.

SAY: "I've built up 10 different modes over time. Code mode for development. RAG mode for search. Multi-agent modes that run 5 parallel sessions. Even a speak mode with ASR and TTS. One command to switch between any of them."

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 6: SWITCH MODE (30 seconds, optional if time allows)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TYPE: "Switch to code mode"

Claude calls: switch_mode(mode_id="code")
→ Starts vllm-embedding-code + qwen3-coder
→ Shows services starting up

SAY: "Done. Embedding service is up, coder model is loading. One command, and gpumod handles the rest."

NOTE: qwen3-coder takes ~30-60s to fully load. If it's still loading, say: "The model is loading now. Large models take 30-60 seconds. But the point is: I didn't write a single config file."

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
IF SOMETHING GOES WRONG:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
- Tool call slow → "Real GPU operations take a few seconds"
- Simulate says won't fit → "See? It just saved me from an OOM crash"
- Service fails to start → "This is why we simulate first"
- Anything weird → TYPE: "Switch to blank mode" (always works as recovery)

TRANSITION: Switch back to slides.
SPEAKER NOTES: CHEAT SHEET (not shown to audience)

Keep this slide on your second monitor / presenter view. The audience sees the previous slide (LIVE DEMO). If you're using Marp presenter mode, this slide is your safety net. Copy-paste each prompt in order.

TIMING GUIDE:
1. Search → 20s (wait for results, point out the model)
2. List GGUFs → 30s (THE MONEY SHOT: show filtering)
3. GPU status → 20s (show 24GB free, blank mode)
4. Simulate → 40s (show VRAM breakdown, fits: true)
5. List modes → 30s (show 10 modes, narrate variety)
6. Switch mode → 30s (optional; skip if behind on time)

AFTER STEP 6 (or 5 if skipping): Switch back to slides → next slide is CTA

IF STUCK: "Switch to blank mode" always resets everything.
SPEAKER NOTES: SLIDE 6 (30 seconds)

SAY: "That's gpumod. From Reddit post to running model: search, filter, simulate, deploy. All through natural language. It's open source, Apache 2.0, two commands to install. Link is on the screen. Questions?"