Metadata-Version: 2.4
Name: broken-link-finder
Version: 0.1.1
Summary: Find broken outbound links on websites and discover contact emails for outreach. Usable as a CLI, library, or MCP server for Claude Code.
Requires-Python: >=3.10
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: mcp[cli]>=1.0.0
Description-Content-Type: text/markdown

# broken-link-finder

Find broken outbound links on any website and discover contact emails for outreach. Works as a **CLI tool**, **Python library**, or **MCP server for Claude Code**.

## Install

```bash
pip install broken-link-finder
```

## Usage

### CLI

```bash
# Check a CSV of target URLs for broken links
broken-link-finder --input targets.csv --output results.csv

# Limit to 5 pages for a test run
broken-link-finder --input targets.csv --limit 5

# Custom delay between requests (default: 1 second)
broken-link-finder --input targets.csv --delay 2.0

# Re-process already-crawled URLs
broken-link-finder --input targets.csv --force
```

**Input CSV format:**

| University | Tier | Category | Target URL | Notes |
|---|---|---|---|---|
| Example University | 1 | Career Center | https://example.edu/careers | |
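If you generate the target list programmatically, the standard-library `csv` module can produce a file with these columns. This is a minimal sketch: the column names come from the table above, while the helper name `build_targets_csv` is hypothetical and not part of this package. It also assumes the CLI reads the header row by these exact names.

```python
import csv
import io

# Column names copied from the input table above.
FIELDNAMES = ["University", "Tier", "Category", "Target URL", "Notes"]


def build_targets_csv(rows):
    """Return CSV text: the header line followed by one line per row dict."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing keys (for example an empty Notes cell) are written as "".
        writer.writerow({name: row.get(name, "") for name in FIELDNAMES})
    return buffer.getvalue()


if __name__ == "__main__":
    example = {
        "University": "Example University",
        "Tier": 1,
        "Category": "Career Center",
        "Target URL": "https://example.edu/careers",
    }
    print(build_targets_csv([example]), end="")
```

Save the returned text as `targets.csv` and pass it to `--input`.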

### MCP Server (Claude Code)

Add to your project's `.mcp.json` or `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "broken-links": {
      "command": "uvx",
      "args": ["broken-link-finder-mcp"]
    }
  }
}
```

This gives Claude three tools:

| Tool | Description |
|---|---|
| `find_broken_links` | Crawl a page, check all outbound links, return broken ones with contact emails |
| `check_url` | Quickly check whether a single URL is reachable |
| `find_contact_emails` | Find contact emails for any website |

### Python Library

```python
import asyncio
from broken_link_finder import BrokenLinkFinder, ContactFinder

async def main():
    # delay is the pause between requests, in seconds (the CLI default is 1)
    async with BrokenLinkFinder(delay=1.0) as finder:
        # Crawl one page and check every outbound link found on it
        result = await finder.crawl_page("https://example.com/resources")
        for broken in result.broken_links:
            print(f"{broken.broken_url} -> {broken.status_code}")

    # Look up contact emails for the site that hosts the broken links
    cf = ContactFinder()
    emails = await cf.find_emails("https://example.com")
    print(emails)

asyncio.run(main())
```

## Contact Email Discovery

The `ContactFinder` uses three strategies to find the right person to contact:

1. **Page scraping** — extracts `mailto:` links and email patterns from the crawled page
2. **Contact page discovery** — checks common paths (`/contact`, `/about`, `/contact-us`, etc.)
3. **Hunter.io API** — optional domain-level email lookup

Set `HUNTER_API_KEY` as an environment variable to enable Hunter.io lookups.
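The first two strategies can be illustrated with the standard library alone. The sketch below is not the `ContactFinder` implementation: the regular expressions, the function names `extract_emails` and `candidate_contact_urls`, and the exact path list are assumptions chosen for illustration (only `/contact`, `/about`, and `/contact-us` are named above).

```python
import re
from urllib.parse import unquote, urljoin

# Simplified email pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Captures the address in href="mailto:...", stopping before any ?subject= part.
MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)
# Paths named in the list above; the package may check more.
CONTACT_PATHS = ["/contact", "/about", "/contact-us"]


def extract_emails(html):
    """Collect addresses from mailto: links first, then from plain text."""
    found = []
    for address in MAILTO_RE.findall(html):
        found.append(unquote(address).strip().lower())
    for address in EMAIL_RE.findall(html):
        found.append(address.lower())
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


def candidate_contact_urls(base_url):
    """Build absolute URLs for the common contact paths on the same host."""
    return [urljoin(base_url, path) for path in CONTACT_PATHS]


if __name__ == "__main__":
    page = '<a href="mailto:Webmaster@example.edu?subject=Hi">Mail us</a>'
    print(extract_emails(page))
    print(candidate_contact_urls("https://example.edu/careers"))
```

Matching on raw HTML keeps the sketch short; parsing the page with `beautifulsoup4`, which this package already depends on, would be more robust against unusual markup.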

Emails are scored by outreach relevance: career, webmaster, and editor addresses rank highest, followed by named-person addresses, then generic addresses. Addresses at free providers (Gmail, etc.) rank lowest.
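One way to picture that ordering is a small tiered scoring function. This is an illustrative sketch only: the numeric scores, the keyword and provider lists, and the names `score_email` and `rank_emails` are assumptions, and the package's actual scoring may differ. It also assumes that a free-provider domain outweighs any role keyword in the local part.

```python
# Hypothetical lists for illustration; the real ones are not documented here.
ROLE_KEYWORDS = ("career", "webmaster", "editor")
GENERIC_LOCAL_PARTS = {"info", "contact", "hello", "admin", "support", "office"}
FREE_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def score_email(address):
    """Return a tier from 0 (lowest relevance) to 3 (highest)."""
    local, _, domain = address.lower().partition("@")
    if domain in FREE_PROVIDERS:
        return 0  # free providers rank lowest
    if any(keyword in local for keyword in ROLE_KEYWORDS):
        return 3  # career/webmaster/editor addresses rank highest
    if local in GENERIC_LOCAL_PARTS:
        return 1  # generic mailboxes rank below named people
    return 2  # anything else is treated as a named person


def rank_emails(addresses):
    """Sort addresses from most to least relevant; ties keep input order."""
    return sorted(addresses, key=score_email, reverse=True)


if __name__ == "__main__":
    print(rank_emails([
        "jane.doe@gmail.com",
        "info@example.edu",
        "jane.doe@example.edu",
        "careers@example.edu",
    ]))
```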

## Incremental Processing

The CLI tracks processed URLs in `output/processed-urls.json`. On subsequent runs, already-processed URLs are automatically skipped. Use `--force` to re-process everything.
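The skip logic can be sketched as follows. The layout of `processed-urls.json` is not documented here, so this example assumes a flat JSON array of URL strings; the function names are hypothetical and the CLI's own implementation may differ.

```python
import json
from pathlib import Path

# Path taken from the description above.
STATE_FILE = Path("output/processed-urls.json")


def load_processed(path=STATE_FILE):
    """Return the set of URLs recorded by earlier runs (empty on first run)."""
    if not path.exists():
        return set()
    # Assumption: the file holds a JSON array of URL strings.
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_processed(urls, path=STATE_FILE):
    """Write the processed URLs back as a sorted JSON array."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(sorted(urls), indent=2), encoding="utf-8")


def select_pending(targets, processed, force=False):
    """Drop already-processed targets unless force is set (like --force)."""
    if force:
        return list(targets)
    return [url for url in targets if url not in processed]


if __name__ == "__main__":
    done = {"https://example.edu/careers"}
    targets = ["https://example.edu/careers", "https://example.edu/alumni"]
    print(select_pending(targets, done))
    print(select_pending(targets, done, force=True))
```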

## Docker

```bash
# Test run limited to 5 pages
docker compose run --rm broken-link-finder python main.py --limit 5

# Full run
docker compose up
```
