Metadata-Version: 2.4
Name: getbrief
Version: 0.7.0
Summary: Content compression for AI agents. Extract once, render per query.
License: MIT
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: fastapi>=0.115.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: openai>=1.55.0
Requires-Dist: pydantic>=2.8.0
Requires-Dist: typer>=0.12.0
Requires-Dist: uvicorn>=0.30.0
Requires-Dist: trafilatura>=2.0.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: mcp>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.2.0; extra == "dev"
Provides-Extra: playwright
Requires-Dist: playwright>=1.40.0; extra == "playwright"
Provides-Extra: transcribe
Requires-Dist: faster-whisper>=1.0.0; extra == "transcribe"

<div align="center">
  <img src="assets/logo.png" alt="Brief Logo" width="300" />

  [![PyPI](https://img.shields.io/pypi/v/getbrief?color=blue)](https://pypi.org/project/getbrief/)
  [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
  [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
</div>

# Brief

Reading the web is the most expensive thing your agent does, and the thing it is least good at.

Brief reads content so your agent doesn't have to. Give it a URL and a question, and it extracts, summarizes, and caches the answer. Webpages, videos, PDFs, Reddit, GitHub, all through one interface.

```python
from brief import brief

# Is this page even relevant? (~1 sentence)
brief("https://fastapi.tiangolo.com/", "async support", depth=0)

# What do I need to know? (summary + key points)
brief("https://fastapi.tiangolo.com/", "async support", depth=1)

# Give me everything. (detailed analysis with examples)
brief("https://fastapi.tiangolo.com/", "async support", depth=2)

# Works with local projects too
brief("./my-project", "how does authentication work", depth=1)
```

Every answer is cached as a plain `.brief` file. Ask once, reuse forever. A team of agents can share research without repeating work. One agent investigates, another reasons, another writes, and nobody re-reads the same source.

The more your system runs, the more it already knows.

```python
# Agent A researches FastAPI (extraction + LLM call)
brief("https://fastapi.tiangolo.com/", "async support", depth=2)

# Minutes later, Agent B is writing a comparison doc.
# Same URL, same question. Instant. No fetch, no LLM, no tokens.
check_brief("https://fastapi.tiangolo.com/")
# → "async-support-deep.brief: FastAPI handles async natively..."

# Agent B goes deeper with a new question. One LLM call, no re-extraction.
brief("https://fastapi.tiangolo.com/", "error handling", depth=1)
```

Agent A paid the cost. Agent B got it free. Agents should spend tokens on reasoning, not searching.

## Install

Requires Python 3.12+.

```bash
pip install getbrief
```

Brief uses any OpenAI-compatible LLM for summarization. Create a `.env` file:

```bash
BRIEF_LLM_API_KEY=sk-or-v1-your-key
BRIEF_LLM_BASE_URL=https://openrouter.ai/api/v1
# Any free or cheap OpenRouter model works
BRIEF_LLM_MODEL=google/gemma-3-4b-it:free
```

Free models work well for summarization. Brief also works with OpenAI, Ollama (local), and Groq. See [.env.example](.env.example) for all options.

## Common patterns

### Triage many URLs, then go deep on what matters

```python
from brief import brief_batch, brief

# Scan a batch of URLs for pennies. Which ones are relevant?
headlines = brief_batch([
    "https://docs.python.org/3/library/asyncio.html",
    "https://fastapi.tiangolo.com/",
    "https://flask.palletsprojects.com/",
], query="python async web framework", depth=0)

# Go deep on the one that matters
detail = brief("https://fastapi.tiangolo.com/", "async support", depth=2)
```

### Compare sources

```python
from brief import compare

# Briefs each source, then synthesizes a comparison
result = compare(
    ["https://fastapi.tiangolo.com/", "https://flask.palletsprojects.com/"],
    query="how do they handle middleware",
    depth=1,
)
```

### Check what's already been researched

```python
from brief import check_brief

# Overview of all sources
check_brief()

# Detail for a specific URL
check_brief("https://fastapi.tiangolo.com/")

# Search by topic across all briefs
check_brief("authentication")
# → matching briefs from any source that mentions authentication
```

## Depth levels

```
depth=0   headline    one sentence, is this worth reading?
depth=1   summary     2-3 sentences + key points (default)
depth=2   deep dive   detailed analysis with specifics, examples, trade-offs
```

Each (query, depth) pair produces its own `.brief` file.
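
To make the naming concrete, here is a minimal sketch of how a (query, depth) pair could map to a file name. It is not Brief's actual code: `brief_filename` and `DEPTH_SUFFIX` are hypothetical names, and only the depth=1 (no suffix) and depth=2 (`-deep`) patterns appear in this README. The depth=0 suffix is a guess.

```python
import re

# Hypothetical suffix table. Only "" (depth=1) and "-deep" (depth=2) are
# shown in this README; "-headline" for depth=0 is an assumption.
DEPTH_SUFFIX = {0: "-headline", 1: "", 2: "-deep"}


def brief_filename(query: str, depth: int = 1) -> str:
    """Build a name such as 'async-support-deep.brief' from a query and depth."""
    # Lowercase the query and collapse every non-alphanumeric run into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", query.lower()).strip("-")
    return f"{slug}{DEPTH_SUFFIX[depth]}.brief"


print(brief_filename("async support", depth=1))  # async-support.brief
print(brief_filename("async support", depth=2))  # async-support-deep.brief
```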

## Content types

Brief handles six content types with the same interface:

- **Webpages** — [trafilatura](https://trafilatura.readthedocs.io/) strips navigation, ads, and scripts. Falls back to [httpx](https://www.python-httpx.org/), then optionally [Playwright](https://playwright.dev/) for bot-protected sites.
- **Videos** — [yt-dlp](https://github.com/yt-dlp/yt-dlp) fetches captions. If none exist, [faster-whisper](https://github.com/SYSTRAN/faster-whisper) transcribes audio locally.
- **PDFs** — [pymupdf](https://pymupdf.readthedocs.io/) extracts text page by page.
- **Reddit** — fetches post content and top comments via Reddit's JSON API.
- **GitHub** — repo metadata, README, file tree, docstrings, and issues. Ask a specific question and Brief reads the source files that answer it — not just the README.
- **Local paths** — project directories and files from disk. Ask about caching and Brief finds every file that mentions it, even if no file is named `cache`.
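
One interface over six content types implies some routing step that decides which extractor handles a URI. The sketch below shows one plausible way to do that with the standard library. `classify_uri` and its rules are assumptions for illustration, not Brief's real detection logic (which may also inspect response headers or support more video hosts).

```python
from urllib.parse import urlparse


def classify_uri(uri: str) -> str:
    """Guess a content type from a URI. Hypothetical rules, for illustration only."""
    parsed = urlparse(uri)
    # Anything that is not an http(s) URL is treated as a path on disk.
    if parsed.scheme not in ("http", "https"):
        return "local"
    host = parsed.netloc.lower().removeprefix("www.")
    if host in ("youtube.com", "youtu.be"):
        return "video"
    if host == "reddit.com" or host.endswith(".reddit.com"):
        return "reddit"
    if host == "github.com":
        return "github"
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"
    # Default: treat it as an ordinary webpage.
    return "webpage"


for uri in ("./my-project", "https://fastapi.tiangolo.com/", "https://github.com/yt-dlp/yt-dlp"):
    print(uri, "->", classify_uri(uri))
```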

## Interfaces

### Python

```python
from brief import brief, brief_batch, compare, check_brief
```

### CLI

```bash
brief --uri "https://example.com" --query "key takeaways"
brief --uri "https://example.com" --depth 0
brief --compare --batch "https://url1.com" --batch "https://url2.com" --query "compare"
brief --list
```

### MCP

```json
{
  "mcpServers": {
    "brief": {
      "command": "uvx",
      "args": ["--from", "getbrief", "brief-mcp"],
      "env": {
        "BRIEF_LLM_API_KEY": "sk-or-v1-your-key",
        "BRIEF_LLM_BASE_URL": "https://openrouter.ai/api/v1",
        "BRIEF_LLM_MODEL": "google/gemma-3-4b-it:free"
      }
    }
  }
}
```

This gives your agent three tools:

- **brief_content** — brief a URL with a query at depth 0–2
- **check_existing_brief** — no URI = overview, URL = what's been asked, topic = search across all briefs
- **compare_sources** — compare multiple URLs with synthesis + TRAIL breadcrumbs

### HTTP API

```bash
uvicorn brief.api:app --port 8080
```

```bash
# Brief a URL
curl -X POST http://localhost:8080/brief \
  -H "Content-Type: application/json" \
  -d '{"uri": "https://fastapi.tiangolo.com/", "query": "async support", "depth": 1}'

# List all briefs
curl http://localhost:8080/briefs

# Health check
curl http://localhost:8080/health
```
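
The same `POST /brief` call can be made from Python without extra dependencies. This sketch only builds the request shown in the curl example. `build_brief_request` is a hypothetical helper, and the response format is not documented here, so the code prints the raw body rather than assuming its fields.

```python
import json
import urllib.request


def build_brief_request(
    uri: str,
    query: str,
    depth: int = 1,
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build the same POST /brief request as the curl example."""
    payload = json.dumps({"uri": uri, "query": query, "depth": depth}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/brief",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_brief_request("https://fastapi.tiangolo.com/", "async support")

# Sending it requires the uvicorn server above to be running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```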

## The `.briefs/` folder

Every URL gets its own subdirectory. Each (query, depth) adds a new `.brief` file:

```
.briefs/
├── fastapi-tiangolo-com/
│   ├── _source.json                 raw extraction, no LLM output
│   ├── _files/                      query-fetched source code (GitHub)
│   ├── async-support.brief          depth=1 answer
│   └── async-support-deep.brief     depth=2 answer, same query, richer
├── _comparisons/                    cached cross-source comparisons
└── _index.sqlite3                   fast lookups + full-text search
```
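
Because the cache is plain files, it can be inspected without Brief at all. The sketch below walks the layout shown above with `pathlib`. `list_briefs` is a hypothetical helper, and it assumes that every entry whose name starts with `_` is bookkeeping rather than a source folder.

```python
from pathlib import Path


def list_briefs(root: str = ".briefs") -> dict[str, list[str]]:
    """Map each source directory to its sorted .brief file names."""
    result: dict[str, list[str]] = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return result
    for source_dir in sorted(root_path.iterdir()):
        # Skip _comparisons/ and _index.sqlite3, which are not per-source folders.
        if not source_dir.is_dir() or source_dir.name.startswith("_"):
            continue
        result[source_dir.name] = sorted(p.name for p in source_dir.glob("*.brief"))
    return result


for source, files in list_briefs().items():
    print(source, files)
```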

Each `.brief` file includes a TRAIL section at the bottom, listing sibling briefs for the same source:

```
─── TRAIL ──────────────────────────────────────
→ async-support.brief
→ error-handling-deep.brief
→ _source.json
```

When an agent opens any `.brief` file, it instantly sees what else has already been asked about that source. No API call, no index lookup, just read the file. This means agents can build on each other's research naturally.
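
Reading the trail takes only a few lines. This is a minimal sketch that assumes the TRAIL block looks exactly like the example above: a marker line containing `TRAIL`, followed by one `→` line per sibling. `read_trail` is a hypothetical helper, and the body of the sample file is made up for illustration.

```python
def read_trail(brief_text: str) -> list[str]:
    """Return the entries listed under the TRAIL marker of a .brief file."""
    entries: list[str] = []
    in_trail = False
    for line in brief_text.splitlines():
        # The marker line is drawn with box characters around the word TRAIL.
        if "TRAIL" in line and line.lstrip().startswith("─"):
            in_trail = True
            continue
        if in_trail and line.startswith("→"):
            entries.append(line.lstrip("→").strip())
    return entries


sample = """FastAPI handles async natively...

─── TRAIL ──────────────────────────────────────
→ async-support.brief
→ error-handling-deep.brief
→ _source.json
"""
print(read_trail(sample))
```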

Sources are re-extracted automatically when they go stale — GitHub repos after 7 days, webpages after 3, Reddit threads after 1 day, videos and PDFs never (they're immutable).

## Configuration

Brief uses any OpenAI-compatible provider. `OPENAI_API_KEY` also works as a fallback if `BRIEF_LLM_API_KEY` is not set.

For video transcription without captions:

```bash
pip install "getbrief[transcribe]"  # installs faster-whisper; quotes keep zsh from globbing the brackets
```

For bot-protected sites (Cloudflare, etc.):

```bash
pip install "getbrief[playwright]"
playwright install chromium
```

For GitHub repos, the public API is rate-limited to 60 requests/hour. Set a token for higher limits:

```bash
GITHUB_TOKEN=ghp_your-token
```

## Troubleshooting

- **Paywalled / auth-protected content** — Brief returns a clear error for 401/403/429 responses. It cannot extract content behind logins or paywalls.
- **Bot protection (Cloudflare, etc.)** — Install Playwright: `pip install "getbrief[playwright]" && playwright install chromium`
- **Stale data** — Brief auto-refreshes stale sources, but you can force re-extraction: `brief --uri <URL> --force`
- **Clear all cached data** — Delete the `.briefs/` folder.
- **LLM not responding** — Check your `.env` file has valid API keys. Brief falls back to a heuristic summary if the LLM is unavailable.

## Contributing

Brief is designed to be easy to extend. New extractors live in `brief/extractors/` and each one is a single file implementing one function:

```python
from typing import Any

def extract(uri: str) -> list[dict[str, Any]]:
    """Return a list of chunks with 'text' and 'start_sec' keys."""
```
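
As an illustration of that contract, here is a hedged sketch of an extractor for plain-text files. It is not one of Brief's bundled extractors. Splitting on blank lines and using `0.0` as `start_sec` for content without timestamps are both assumptions.

```python
from pathlib import Path
from typing import Any


def extract(uri: str) -> list[dict[str, Any]]:
    """Split a local text file into paragraph chunks."""
    text = Path(uri).read_text(encoding="utf-8")
    # Treat each blank-line-separated paragraph as one chunk.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # start_sec has no natural meaning for text files; 0.0 is a placeholder.
    return [{"text": p, "start_sec": 0.0} for p in paragraphs]
```

Calling `extract("notes.txt")` on a file with two paragraphs would return two chunks, each with the same two keys as the signature above describes.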

Contributions welcome: new content types, better summarization, CLI improvements, or API enhancements.
