Metadata-Version: 2.4
Name: verd
Version: 0.2.4
Summary: Multi-LLM debate engine — verdicts everywhere
Author: Manas Karra
License: MIT
Keywords: llm,debate,verdict,ai,multi-model,code-review
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: openai>=1.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: mcp>=1.0.0
Provides-Extra: slack
Requires-Dist: slack-bolt>=1.18.0; extra == "slack"
Requires-Dist: httpx>=0.27.0; extra == "slack"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"

# verd

**Five minds enter. They argue, challenge, cross-examine. Only the truth walks out.**

verd spawns multiple AI models from different families, each with a specialized role, and has them debate your question over several rounds. A stronger judge model then delivers the final verdict with strengths, issues, and actionable fixes.

Use it everywhere: **CLI** for code reviews, **MCP** inside Claude Code and Cursor, **Slack** as `@verd` in any conversation, or **pipe** anything into it.

## Install

```bash
pip install verd
```

## Setup

verd talks to any OpenAI-compatible API. Set two env vars (or put them in a `.env` file):

```bash
export OPENAI_API_KEY=your-key
export OPENAI_BASE_URL=https://openrouter.ai/api/v1
```

verd runs multiple models in parallel (Claude, Gemini, GPT) so it needs a provider that routes to all of them. [OpenRouter](https://openrouter.ai) is the easiest — one key, all models. [LiteLLM proxy](https://docs.litellm.ai/) works too.

> **Note:** Model names must match your provider's IDs. The defaults use OpenRouter-style names (`claude-sonnet-4-6`, `gemini-2.5-flash`, etc.). If your LiteLLM proxy uses different names (e.g. `anthropic/claude-sonnet-4-6`), override them in `~/.verd.yaml`.
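
A quick preflight check can catch a missing variable before a debate starts. The sketch below is not part of verd: `missing_settings` is a hypothetical helper that only inspects the two variables named above, and it does not read a `.env` file.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_BASE_URL")


def missing_settings(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("provider settings look complete")
```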

## Usage

```bash
verd "Kafka or RabbitMQ for our event pipeline?" -f architecture.md
verd "can this auth middleware be bypassed?" -f auth.py middleware.py
verdh "should we merge this?" -gb main
verdl "is O(n^2) acceptable for n=1000?"
cat deploy.yaml | verd "any misconfigs that could expose prod?"
```

## Output

```
FAIL  77%  In-memory rate limiter is unsafe for production

claude:FAIL  gpt:FAIL  gemini:FAIL  gpt:FAIL  (FULL)

+ Conceptually correct sliding-window logic
- Global dict is unsynchronized — race conditions in multi-thread servers
- Per-user lists grow without bounds — memory leak / DoS vector
! gpt-5-mini caught the risk of system clock jumps with time.time()
→ Move state to Redis with atomic operations

completed in 69.3s • 22,449 tokens • ~$0.07
```

Vote breakdown, unique catches (`!`), dissent, strengths, issues, and actionable fixes — all in one view.

## Modes

| Command | Debaters | Roles | Rounds | Speed | Cost |
|---------|----------|-------|--------|-------|------|
| `verdl` | 2 + judge | analyst, devils_advocate | 1 | ~15-30s | ~$0.02 |
| `verd` | 4 + judge | analyst, devils_advocate, logic_checker, pragmatist | 2 | ~30-60s | ~$0.15+ |
| `verdh` | 5 + judge + web | analyst, devils_advocate, logic_checker, fact_checker, pragmatist | 3 | ~60-120s | ~$0.40+ |
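
Cost and latency grow with both panel size and round count. As a rough sketch, assuming every debater is called once per round and the judge once at the end (verd's real call pattern may differ, for example with web search in `verdh`), the number of model calls per mode can be estimated like this. `Mode` and `estimated_calls` are illustrative names, not verd APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    debaters: int
    rounds: int


# Debater and round counts copied from the table above.
MODES = {"verdl": Mode(2, 1), "verd": Mode(4, 2), "verdh": Mode(5, 3)}


def estimated_calls(command: str) -> int:
    # Assumption: one call per debater per round, plus one judge call.
    mode = MODES[command]
    return mode.debaters * mode.rounds + 1


for name in MODES:
    print(name, estimated_calls(name))
# verdl 3
# verd 9
# verdh 16
```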

## Benchmark

Tested on the [Martian Code Review Benchmark](https://codereview.withmartian.com) — 50 real PRs from Cal.com, Discourse, Grafana, Keycloak, and Sentry with expert-labeled golden comments. No code-review-specific tuning.

| System | Precision | Recall | F1 Score | Avg Issues |
|------|-----------|--------|----------|------------|
| GPT-5.4 (alone) | 13.0% | 70.6% | 21.9% | 14.6 |
| Claude Opus 4.6 (alone) | 18.5% | 69.9% | 29.2% | 10.1 |
| **verdh (5-model debate)** | **29.1%** | **64.0%** | **40.0%** | **5.9** |

**+37% F1 over Claude solo. 57% more precise. 42% fewer issues flagged per PR.** Fewer issues, more of them real.
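
The headline numbers can be re-derived from the table. F1 is the harmonic mean of precision and recall, and each comparison is a relative change against the Claude Opus 4.6 row. The 42% figure matches the drop in average issues per PR (10.1 to 5.9). The helper names below are only for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    return 2 * precision * recall / (precision + recall)


def relative_change(new: float, old: float) -> float:
    """Change from old to new, as a percentage of old."""
    return (new - old) / old * 100


# Values copied from the table above.
print(round(f1(29.1, 64.0), 1))            # 40.0  verdh F1 from its precision and recall
print(round(relative_change(40.0, 29.2)))  # 37    F1 gain over Claude solo
print(round(relative_change(29.1, 18.5)))  # 57    precision gain
print(round(relative_change(5.9, 10.1)))   # -42   change in average issues per PR
```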

## How it works

1. Your question and content are sent to multiple AI models in parallel
2. Each model has a **specialized role** (analyst, devils_advocate, logic_checker, fact_checker, pragmatist)
3. Models see each other's responses and cross-examine for 1-3 rounds
4. **Anti-groupthink prompts** ensure models hold their ground when they have evidence — consensus without new evidence is rejected
5. A stronger judge model synthesizes the debate, weighting each reviewer by their role
6. **Confidence is calculated from vote distribution** — a fact_checker's dissent lowers confidence more than a devils_advocate's expected pushback
7. You get: verdict, vote breakdown, strengths, issues, unique catches, dissent, and actionable fixes

The key insight: different model families have different blind spots and training biases. Claude spots nuance GPT misses. Gemini catches logic errors DeepSeek overlooks. More importantly — if the same model writes the review and judges its quality, it's likely to agree with itself. Cross-model diversity means the judge is a genuine quality gate, not a model grading its own homework. The debate surfaces what each model uniquely caught and tells you exactly which model caught what.
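
The round structure described above can be sketched with `asyncio`. This is a toy illustration, not verd's implementation: `fake_model` stands in for a real API call, the `debate` and `judge` functions are hypothetical, and the judge here is a simple majority vote rather than a model.

```python
import asyncio


async def fake_model(name, role, question, transcript):
    # Stand-in for a real API call; returns a canned vote and note.
    await asyncio.sleep(0)
    return {"model": name, "role": role, "vote": "FAIL",
            "note": f"{role} view on: {question}"}


async def debate(question, panel, rounds, ask=fake_model):
    transcript = []  # every reply so far, shown to all debaters next round
    for _ in range(rounds):
        seen = list(transcript)  # snapshot: everyone in a round sees the same history
        replies = await asyncio.gather(
            *(ask(name, role, question, seen) for name, role in panel)
        )
        transcript.extend(replies)
    return transcript


def judge(transcript, panel_size):
    # Toy judge: majority of the final round's votes.
    votes = [reply["vote"] for reply in transcript[-panel_size:]]
    return max(set(votes), key=votes.count)


panel = [("model-a", "analyst"), ("model-b", "devils_advocate")]
transcript = asyncio.run(debate("is this safe?", panel, rounds=2))
print(len(transcript), judge(transcript, len(panel)))  # 4 FAIL
```

Running the debaters with `asyncio.gather` keeps each round's latency close to that of the slowest model rather than the sum of all of them.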

## Roles

| Role | Job | Example catch |
|------|-----|---------------|
| **analyst** | Balanced initial assessment, main arguments for and against | "The architecture is sound but the auth flow has a gap" |
| **devils_advocate** | Find what others miss — edge cases, hidden assumptions, failure modes | "What happens when the token expires mid-transaction?" |
| **logic_checker** | Verify reasoning quality — fallacies, off-by-one, race conditions | "The pagination math is wrong: total_pages needs ceil division" |
| **fact_checker** | Web-grounded verification — do these APIs/libraries actually work? | "That library was deprecated in v3, use the new API" |
| **pragmatist** | Real-world practicality — will this ship? What's the ops burden? | "This works but needs 3 new infra dependencies your team doesn't know" |

The judge weighs each reviewer's input by role — a fact_checker citing sources carries more weight than a devils_advocate pushing back.
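
One way to picture role weighting is a weighted vote share. The weights below are invented for illustration (verd's actual weights and formula are not documented here); they only encode the ordering described above, where a fact_checker's dissent counts for more than a devils_advocate's.

```python
# Hypothetical weights: higher means a dissent from this role costs more confidence.
ROLE_WEIGHTS = {
    "analyst": 1.0,
    "logic_checker": 1.0,
    "pragmatist": 1.0,
    "fact_checker": 1.5,
    "devils_advocate": 0.5,
}


def confidence(verdict: str, votes: dict[str, str]) -> float:
    """Weighted share of reviewers whose vote matches the verdict, from 0 to 100."""
    total = sum(ROLE_WEIGHTS[role] for role in votes)
    agreeing = sum(ROLE_WEIGHTS[role] for role, vote in votes.items() if vote == verdict)
    return 100 * agreeing / total


base = {"analyst": "FAIL", "logic_checker": "FAIL",
        "fact_checker": "FAIL", "devils_advocate": "FAIL"}
print(confidence("FAIL", base))                                 # 100.0
print(confidence("FAIL", {**base, "devils_advocate": "PASS"}))  # 87.5
print(confidence("FAIL", {**base, "fact_checker": "PASS"}))     # 62.5
```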

## Config

Customize models via `~/.verd.yaml`, env vars (`VERD_JUDGE`, `VERD_DEBATERS`, `VERD_BUDGET`, `VERD_TIMEOUT`), or CLI flags. Precedence: CLI > env > file > defaults.

```yaml
# ~/.verd.yaml
judge: gpt-5.4
debaters: claude-sonnet-4-6, gpt-4.1, gemini-2.5-flash
budget: 1.00
```
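
The precedence rule (CLI > env > file > defaults) amounts to a layered lookup. A minimal sketch using `collections.ChainMap` follows; the `resolve` function, the default values, and the way each layer is loaded are assumptions, not verd internals.

```python
from collections import ChainMap

# Placeholder defaults for illustration only.
DEFAULTS = {"judge": "default-judge", "budget": 1.00, "timeout": 60}


def resolve(cli: dict, env: dict, file: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge settings so CLI beats env, env beats file, and file beats defaults."""
    def present(layer):
        # Drop unset options so they do not shadow lower-priority layers.
        return {key: value for key, value in layer.items() if value is not None}

    return dict(ChainMap(present(cli), present(env), present(file), defaults))


settings = resolve(
    cli={"judge": None, "budget": 0.25},            # --budget given, --judge omitted
    env={"judge": "gpt-5.4"},                       # e.g. from VERD_JUDGE
    file={"judge": "other-judge", "timeout": 120},  # e.g. from ~/.verd.yaml
)
print(settings)
```

Here the judge comes from the environment, the budget from the CLI, and the timeout from the file, because each key is taken from the highest-priority layer that sets it.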

## Flags

```
-f FILE [FILE ...]    files to review         -g / -gs / -gb REF   git diffs
-d [DIR]              scan directory           -a / --ext / --exclude   filters
-q                    verdict only             --json                raw JSON
--judge MODEL         override judge           --debaters MODEL ...  override debaters
--budget USD          cost limit               --timeout SECONDS     per-call timeout
```

## MCP — Claude Code / Cursor

Add to `~/.claude.json` or `~/.cursor/mcp.json`:

```json
{
  "mcpServers": {
    "verd": {
      "command": "verd-mcp",
      "env": {
        "OPENAI_API_KEY": "your-openrouter-key",
        "OPENAI_BASE_URL": "https://openrouter.ai/api/v1"
      }
    }
  }
}
```

Then use `verd`, `verdl`, or `verdh` as tools directly in chat. Ask a question, paste code, then say "use verd to check this."
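
If the config file already lists other MCP servers, the `verd` entry has to be merged in rather than pasted over them. A small sketch of that merge is below; `add_verd_server` is a hypothetical helper, and reading and writing the actual file is left out.

```python
import json


def add_verd_server(config: dict, api_key: str, base_url: str) -> dict:
    """Return a copy of an MCP client config with a "verd" server entry added."""
    servers = dict(config.get("mcpServers", {}))
    servers["verd"] = {
        "command": "verd-mcp",
        "env": {"OPENAI_API_KEY": api_key, "OPENAI_BASE_URL": base_url},
    }
    return {**config, "mcpServers": servers}


# "other" is a made-up existing entry to show that it survives the merge.
existing = {"mcpServers": {"other": {"command": "other-mcp"}}}
updated = add_verd_server(existing, "your-openrouter-key", "https://openrouter.ai/api/v1")
print(json.dumps(updated, indent=2))
```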

## Slack

Install with Slack dependencies:

```bash
pip install "verd[slack]"
```

Create a Slack app with Socket Mode enabled, add bot scopes (`app_mentions:read`, `channels:history`, `groups:history`, `chat:write`, `reactions:write`, `im:history`, `im:write`, `users:read`), then:

```bash
export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
export SLACK_SIGNING_SECRET=...
verd-slack
```

Usage in Slack:
- `@verd what do you think?` — reads thread or last 20 channel messages, debates, replies in thread
- `@verd deep is this secure?` — uses verdh (5 models + web search)
- `@verd quick is this right?` — uses verdl (fast, 2 models)
- `@verd last 50 what's the consensus?` — reads last 50 messages as context
- `/verd should we use Kafka?` — slash command with live progress updates
- `/verdl is this correct?` — quick slash command
- `/verdh any security issues?` — deep slash command
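
The mention keywords above (`deep`, `quick`, `last N`) could be parsed along these lines. This is an illustrative sketch, not verd's parser: `parse_mention` is a hypothetical function, and it assumes the mention has already been normalized to a literal `@verd` prefix.

```python
import re

MODE_KEYWORDS = {"deep": "verdh", "quick": "verdl"}


def parse_mention(text: str, default_context: int = 20):
    """Split '@verd ...' text into (mode, context_size, question)."""
    words = re.sub(r"^@verd\s*", "", text.strip()).split()
    mode, context = "verd", default_context
    # An optional leading keyword selects the deep or quick mode.
    if words and words[0] in MODE_KEYWORDS:
        mode = MODE_KEYWORDS[words.pop(0)]
    # An optional "last N" prefix sets how many messages to read as context.
    if len(words) >= 2 and words[0] == "last" and words[1].isdigit():
        context = int(words[1])
        words = words[2:]
    return mode, context, " ".join(words)


print(parse_mention("@verd deep is this secure?"))
print(parse_mention("@verd last 50 what's the consensus?"))
```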

Optional: restrict access via environment variables:

```bash
export VERD_ALLOWED_CHANNELS=C123,C456    # empty = all channels
export VERD_ALLOWED_USERS=U123,U456       # empty = all users
```
