Metadata-Version: 2.4
Name: agentfit
Version: 0.1.0
Summary: AI-Readiness Auditor: audit how well LLMs can work with a codebase
Project-URL: Homepage, https://github.com/voicutomut/AgentFit
Project-URL: Repository, https://github.com/voicutomut/AgentFit
Project-URL: Bug Tracker, https://github.com/voicutomut/AgentFit/issues
Project-URL: Changelog, https://github.com/voicutomut/AgentFit/releases
License: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: agents,ai,benchmarking,code-quality,devtools,llm,static-analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: typer[all]>=0.12
Provides-Extra: all
Requires-Dist: anthropic>=0.25; extra == 'all'
Requires-Dist: ollama>=0.1; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.25; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: mypy>=1.9; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.14; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: ollama
Requires-Dist: ollama>=0.1; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Description-Content-Type: text/markdown

```
█████╗  ██████╗ ███████╗███╗   ██╗████████╗    ███████╗██╗████████╗
██╔══██╗██╔════╝ ██╔════╝████╗  ██║╚══██╔══╝    ██╔════╝██║╚══██╔══╝
███████║██║  ███╗█████╗  ██╔██╗ ██║   ██║       █████╗  ██║   ██║
██╔══██║██║   ██║██╔══╝  ██║╚██╗██║   ██║       ██╔══╝  ██║   ██║
██║  ██║╚██████╔╝███████╗██║ ╚████║   ██║       ██║     ██║   ██║
╚═╝  ╚═╝ ╚═════╝ ╚══════╝╚═╝  ╚═══╝   ╚═╝       ╚═╝     ╚═╝   ╚═╝
                
```

# AgentFit

**Does your codebase speak LLM?**

AgentFit audits how well AI models can actually *work* with your Python code — not just read it, but complete functions, fix bugs, navigate a live repo with tools, and explain architecture. It scores five static AI-readiness metrics, then verifies them by benchmarking real LLMs against auto-generated challenges.

> AgentFit eats its own dog food — it scores ≥ 80/100 on its own metrics.

---

## What it does

```
agentfit benchmark ./src
```

**Stage 1 — Static Analysis** scores your codebase on five dimensions that predict how well LLMs will perform on it:

| Metric | What it measures |
|--------|-----------------|
| Schema Density | How many data-passing functions use Pydantic / TypedDict / dataclasses |
| DRYness | Absence of duplicated function bodies |
| Docstring Richness | Presence of `>>>` usage examples in public docstrings |
| Test Coverage Structural | Ratio of test files to source files |
| Import Clarity | Absence of circular imports and dependency tangles |
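
To make one of these metrics concrete, here is a minimal sketch of how a Docstring Richness style score could be computed with the standard `ast` module. It is an illustration under assumptions, not AgentFit's actual implementation: the function name `docstring_richness`, the rule that "public" means "does not start with an underscore", and the choice to score an empty module as 100 are all hypothetical.

```python
import ast


def docstring_richness(source: str) -> float:
    """Return the percentage of public functions whose docstring has a doctest prompt."""
    tree = ast.parse(source)
    # Assumption: a function is "public" when its name has no leading underscore.
    public = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.name.startswith("_")
    ]
    if not public:
        return 100.0  # assumption: nothing to document counts as fully rich
    with_examples = sum(
        1 for node in public if ">>>" in (ast.get_docstring(node) or "")
    )
    return 100.0 * with_examples / len(public)


SAMPLE = '''
def add(a, b):
    """Add two numbers.

    >>> add(1, 2)
    3
    """
    return a + b

def sub(a, b):
    """Subtract b from a."""
    return a - b

def _helper():
    pass
'''

print(docstring_richness(SAMPLE))  # 50.0
```

Only `add` carries a `>>>` example, `sub` does not, and `_helper` is ignored as private, so one of two public functions qualifies.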

**Stage 2 — LLM Benchmarking** auto-generates coding challenges (completion, debugging, explanation, refactoring) from your source tree, sends them to every configured provider, and judges responses with a second LLM.

**Stage 3 — Agentic Benchmarking** introduces a real mutation into a copy of your codebase and lets the model use filesystem + test tools over multiple turns to find and fix the bug — just like a developer would.

**Stage 4 — Correlated Reporting** finds which static metrics actually correlate with LLM performance on your specific codebase and surfaces prioritised, actionable recommendations.

---

## Install

```bash
pip install agentfit
```

Requires Python 3.11+. Optional provider SDKs:

```bash
pip install agentfit[anthropic]   # Anthropic Claude
pip install agentfit[openai]      # OpenAI / any OpenAI-compatible endpoint
pip install agentfit[ollama]      # Ollama
pip install agentfit[all]         # everything
```

---

## Quick start

```bash
# 1. Scaffold a config file
agentfit init

# 2. Full audit — static analysis + LLM benchmarking
agentfit benchmark ./src

# 3. Static analysis only (no API keys needed)
agentfit benchmark ./src --no-llm

# 4. Gate CI — exit code 1 if score drops below 60
agentfit benchmark ./src --fail-below 60

# 5. Save full results to JSON
agentfit benchmark ./src --save-results
```

---

## Try it without any API key

`--no-llm` runs the full static analysis pipeline and gives you a scored report — no LLM provider, no API key, no cost:

```bash
pip install agentfit
agentfit init
agentfit benchmark ./src --no-llm
```

You get scores for all five metrics (Schema Density, DRYness, Docstring Richness, Test Coverage Structural, Import Clarity) plus ranked recommendations. The LLM benchmarking and correlation stages are skipped — those need a provider configured in `ai-bench.yml`.

---

## Sample output

```
╭──────────────────────── AgentFit Report ─────────────────────────╮
│ Source: ./src                                                      │
│ Generated: 2026-03-26T14:00:00+00:00                               │
│ Overall Score: 71.4   Threshold: 60.0  ✓ PASS                      │
╰────────────────────────────────────────────────────────────────────╯

Static Analysis
 Metric                    Score   Bar          Correlation
 Schema Density             82.0   ████████░░   strong ↑ (r=0.81)
 DRYness                    71.0   ███████░░░   —
 Docstring Richness         43.0   ████░░░░░░   strong ↑ (r=0.74)
 Test Coverage Structural   55.0   █████░░░░░   —
 Import Clarity             89.0   ████████░░   —

LLM Benchmark Results
 Provider    Model              Attempted  Passed  Mean Score  P50ms  P95ms
 anthropic   claude-sonnet-4-6       15      12        74.2    1203   2847
 qwen        local                   15       9        61.3    3100   5200

Recommendations
  1. [HIGH] Add usage examples to public functions
     Docstring Richness is 43.0/100. Strong positive correlation with LLM
     scores (r=0.74). Adding >>> examples significantly improves LLM performance.
  2. [MEDIUM] Increase structural test coverage
     ...
```

Use `--verbose` to also print per-metric warnings.

---

## Configuration

`agentfit init` writes an `ai-bench.yml` to the current directory:

```yaml
version: "1"

analysis:
  source_path: "."
  languages:
    - python
  metric_weights:
    schema_density: 1.0    # set to 0 to exclude from overall score
    dryness: 1.0
    docstring_richness: 1.0
    test_coverage: 1.0
    import_clarity: 1.0

providers:
  anthropic:
    enabled: true
    model: "claude-sonnet-4-6"
  openai:
    enabled: false
    model: "gpt-4o"
    # base_url: "https://your-local-endpoint/v1"   # any OpenAI-compatible API
    # name: "my-provider"                          # display name in reports

benchmarking:
  challenges_per_module: 3
  max_concurrent_requests: 5
  max_tool_rounds: 10        # agentic mode: max turns per challenge

scoring:
  judge_model: "claude-sonnet-4-6"
  judge_provider: "anthropic"

reporting:
  output_format: "text"
  fail_below: null
```
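
As a rough illustration of how `metric_weights` could feed into a score, the sketch below computes a weighted mean in which a weight of 0 drops a metric entirely. The function `overall_static_score` is hypothetical and the formula is an assumption. The sample report above shows an overall score of 71.4 for these same metric values, while their plain mean is 68.0, so the real calculation likely includes other inputs as well.

```python
def overall_static_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of metric scores; metrics with weight 0 are excluded."""
    # Assumption: a metric without an explicit weight defaults to 1.0.
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        raise ValueError("at least one metric needs a non-zero weight")
    weighted = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return weighted / total_weight


# Metric values copied from the sample report.
scores = {
    "schema_density": 82.0,
    "dryness": 71.0,
    "docstring_richness": 43.0,
    "test_coverage": 55.0,
    "import_clarity": 89.0,
}

equal = {name: 1.0 for name in scores}
print(overall_static_score(scores, equal))  # 68.0

# Setting a weight to 0 removes that metric from the average.
no_docs = {**equal, "docstring_richness": 0.0}
print(overall_static_score(scores, no_docs))  # 74.25
```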

---

## Agentic benchmarking

AgentFit automatically generates `agentic_debugging` challenges for any source file that has a matching test file. Each challenge:

1. Introduces one mutation into a copy of your source tree (e.g. flips `==` → `!=`)
2. Gives the model access to five tools: `read_file`, `list_files`, `search_code`, `write_file`, `run_tests`
3. Runs a multi-turn loop until the model fixes the bug or `max_tool_rounds` is reached
4. Scores the result on correctness, fix quality, test verification, and round efficiency
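
The mutation in step 1 can be pictured with a small `ast`-based sketch. This is one possible way to flip a single `==` into `!=`, written for illustration only. The class `FlipOneEq` and the function `mutate` are hypothetical names, and AgentFit's own mutation logic may work differently.

```python
import ast


class FlipOneEq(ast.NodeTransformer):
    """Replace one `==` comparison with `!=` and leave everything else alone."""

    def __init__(self) -> None:
        self.mutated = False

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)  # handle nested comparisons first
        if not self.mutated:
            for i, op in enumerate(node.ops):
                if isinstance(op, ast.Eq):
                    node.ops[i] = ast.NotEq()
                    self.mutated = True
                    break
        return node


def mutate(source: str) -> str:
    """Return the source with exactly one `==` flipped, if any exists."""
    tree = FlipOneEq().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


original = "def is_empty(items):\n    return len(items) == 0\n"
print(mutate(original))
```

Note that `ast.unparse` discards comments and original formatting, so a production mutator would more likely patch the source text at the node's recorded line and column offsets.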

**Supported providers:** Anthropic and any OpenAI-compatible endpoint. Ollama is not supported (no tool-use API).

You can also force any manual challenge through the agentic loop with `agentic: true`:

```yaml
# challenges.yml
- id: "explain-scoring"
  source_module: "agentfit.scoring"
  challenge_type: "explanation"
  agentic: true          # model reads the real codebase before answering
  prompt: |
    Read the source files and explain how challenge scoring works end-to-end.
  context_code: ""
  expected_behavior: |
    A detailed explanation covering ChallengeGenerator, JudgeLLM, and Scorer.
```

Run with manual challenges:

```bash
agentfit benchmark ./src --challenges challenges.yml --save-results
```

---

## Local / self-hosted LLM endpoints

Any OpenAI-compatible API works — Ollama, LM Studio, ngrok tunnels, Qwen, Mistral, etc.:

```yaml
providers:
  openai:
    enabled: true
    model: "qwen2.5-coder:14b"
    base_url: "https://xxxx.ngrok-free.app/v1"
    name: "qwen"    # shows as "qwen" in the report table
```

No API key required when `base_url` is set.

---

## CLI reference

```
agentfit init [--output PATH]
    Scaffold ai-bench.yml in the current directory.

agentfit benchmark SOURCE_PATH
    [--config PATH]          Override config file location
    [--fail-below SCORE]     Exit 1 if overall score < SCORE
    [--no-llm]               Static analysis only (no API keys needed)
    [--verbose, -v]          Show per-metric warnings
    [--challenges PATH]      YAML file of manually authored challenges
    [--max-challenges N]     Cap auto-generated challenges (manual always included)
    [--save-results]         Write full report to agentfit-results.json
    [--manual-eval]          Export challenge/response/verdict triples to JSONL
    [--load-evals PATH]      Merge a manual eval JSONL into auto scores
    [--output-format FORMAT] 'text' (default) or 'html'
    [--output-file PATH]     Write HTML report to file
    [--badge]                Write agentfit-badge.json
```

---

## Multi-language support

AgentFit analyses Python natively and has regex-based analysers for the other languages below:

| Language | Schema Density | DRYness | Docstring Richness | Import Clarity | Test Coverage |
|----------|---------------|---------|-------------------|----------------|---------------|
| Python | ✓ | ✓ | ✓ | ✓ | ✓ |
| TypeScript / JS | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Rust | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Go | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Java | ✓ | ✓ | ✓ | ✓ | needs tests¹ |

> ¹ **Test Coverage Structural** works by pairing source files with test files (e.g. `engine.go` → `engine_test.go`). For non-Python languages the metric will score 0 if your project has no test files alongside the source — add tests to your project to get a meaningful score.
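
The pairing idea in the footnote can be sketched as follows. The naming conventions listed here (`test_*.py`, `*_test.go`, `*.test.ts`, `*.spec.ts`) are common ones chosen for illustration, and the helper names are hypothetical. The rules AgentFit actually applies may differ.

```python
from pathlib import PurePosixPath


def is_test_file(file: str) -> bool:
    """Heuristic: treat common test naming patterns as test files."""
    stem = PurePosixPath(file).stem
    return stem.startswith("test_") or stem.endswith(("_test", ".test", ".spec"))


def candidate_test_names(source_file: str) -> set[str]:
    """Return plausible test file names for a source file."""
    path = PurePosixPath(source_file)
    stem, suffix = path.stem, path.suffix
    return {
        f"test_{stem}{suffix}",   # e.g. test_engine.py
        f"{stem}_test{suffix}",   # e.g. engine_test.go
        f"{stem}.test{suffix}",   # e.g. engine.test.ts
        f"{stem}.spec{suffix}",   # e.g. engine.spec.ts
    }


def structural_coverage(files: list[str]) -> float:
    """Percentage of source files that have at least one matching test file."""
    names = {PurePosixPath(f).name for f in files}
    sources = [f for f in files if not is_test_file(f)]
    if not sources:
        return 0.0
    paired = sum(1 for f in sources if candidate_test_names(f) & names)
    return 100.0 * paired / len(sources)


files = [
    "engine.go", "engine_test.go",
    "parser.go",
    "src/app.ts", "src/app.test.ts",
    "utils.py",
]
print(structural_coverage(files))  # 50.0
```

Here `engine.go` and `src/app.ts` each have a matching test file, while `parser.go` and `utils.py` do not, so two of four source files are covered.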

```yaml
analysis:
  languages:
    - python
    - typescript
    - rust
```

---

## Roadmap

| Version | Theme | Status |
|---------|-------|--------|
| v0.1 | Python static analysis + Anthropic/OpenAI/Ollama runners | ✓ Done |
| v0.2 | TypeScript/JavaScript AST analysis | ✓ Done |
| v0.3 | Rust/C/C++ + Go + Java support | Partially done |
| v0.4 | Agentic tool harness + multi-turn debugging challenges | ✓ Done |
| v0.5 | HTML report export + CI badge generation | ✓ Done |
| v1.0 | Real pytest-cov integration + VS Code extension | Planned |

### What's left (v1.0)

- [ ] **Real `pytest-cov` integration** — blend runtime branch coverage with the structural score
- [ ] **VS Code extension** — inline metric decorations, status bar score, WebView report panel
- [ ] **CI/GitHub Actions** — self-audit job (`agentfit benchmark ./agentfit --fail-below 90`) on every PR
- [ ] **`mypy --strict`** — full type-checking across all modules
- [ ] **PyPI publish** — `pip install agentfit` from the public registry

---

## Development

```bash
git clone https://github.com/voicutomut/AgentFit
cd AgentFit
pip install -e ".[dev]"
pytest                     # 739 tests
ruff check agentfit/       # lint
```

See [ROADMAP.md](ROADMAP.md) for the full phased implementation plan.

---

## Community suggestions — help shape AgentFit

AgentFit is at an early stage, and the five metrics are our first take on what makes a codebase LLM-friendly. We want your input.

**Open an issue or start a discussion if you have thoughts on any of these:**

### Are the current metrics the right ones?

The five we picked:

| Metric | Our hypothesis |
|--------|---------------|
| **Schema Density** | Typed data structures give LLMs clear contracts to reason about |
| **DRYness** | Duplicated logic confuses context windows and wastes tokens |
| **Docstring Richness** | `>>>` examples are the most information-dense context you can give a model |
| **Test Coverage Structural** | Tests tell the model what "correct" looks like |
| **Import Clarity** | Circular deps and star imports obscure the dependency graph |

Do these match your experience? Have you noticed other code properties that seem to help or hurt LLM performance on your projects?

### What metrics are we missing?

Some candidates we're considering — tell us which matter most to you:

- **Naming consistency** — do identifiers follow a single convention? Does the model have to context-switch between styles?
- **Function length / cyclomatic complexity** — do shorter, focused functions produce better LLM completions?
- **Comment density** — inline comments vs. docstrings, which helps more?
- **Dependency freshness** — does using up-to-date libraries (in the model's training data) improve results?
- **Magic number / constant density** — does replacing raw literals with named constants help?
- **Error handling coverage** — does consistent exception handling improve LLM-generated patches?

### Other ways to contribute

- **Share a benchmark result** — run `agentfit benchmark ./your-repo --save-results` and share the JSON output. Real data helps us validate which metrics actually correlate with LLM performance.
- **Propose a new challenge type** — beyond completion, debugging, refactoring, and explanation, what coding tasks should we be measuring?
- **Report false positives** — if a metric scores your codebase unfairly, open an issue with a minimal example.

[Open an issue](https://github.com/voicutomut/AgentFit/issues) · [Start a discussion](https://github.com/voicutomut/AgentFit/discussions)
