Metadata-Version: 2.4
Name: nordcrawl
Version: 0.1.1
Summary: LLM-ready batch crawling for Nordic company data
Home-page: https://github.com/gracestack/nordcrawl
Author: Gracestack AB
Author-email: info@gracestack.se
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: crawl4ai>=0.8.6
Requires-Dist: supabase>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: aiofiles>=24.1.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=13.9.4
Requires-Dist: typer>=0.12.0
Requires-Dist: pytest>=8.0.0
Requires-Dist: pytest-asyncio>=0.23.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# NordCrawl

![Python](https://img.shields.io/badge/python-3.10%2B-blue)
![Supabase](https://img.shields.io/badge/Supabase-ready-3ECF8E)
![License](https://img.shields.io/badge/license-Apache_2.0-brightgreen)

**LLM-ready batch crawling for Nordic company data.**

NordCrawl crawls URL lists from CSV/JSON, extracts structured Swedish company data, and can upsert the results directly into Supabase.

![NordCrawl Rich demo](assets/rich-demo.svg)

## Live demo

The screenshot above is based on a real crawl run with Rich output.

- Demo input: `examples/urls.json`
- Verified output: `examples/results.json`
- More details: `examples/README.md`

## Why NordCrawl

- **Fast to try**: one command from file or CLI
- **Useful immediately**: real data extraction for Swedish company pages
- **Practical output**: JSON, CSV, and Supabase sync
- **Production-minded**: retries, dedupe, rate limiting, and validation

## Features

- Batch crawling with configurable concurrency
- Per-domain rate limiting
- Retry logic with exponential backoff
- Swedish company data extraction
  - company name
  - org number
  - email
  - phone
  - address
- Consistent JSON schema via Pydantic
- CSV/JSON import and export
- Supabase upserts with dedupe by URL
- CLI-first workflow with a clean Python API
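Per-domain rate limiting and retries with exponential backoff can be pictured roughly as follows. This is a minimal sketch in plain `asyncio`, not NordCrawl's actual implementation; the names `DomainRateLimiter` and `retry_with_backoff` and their parameters are hypothetical.

```python
import asyncio
import time
from collections import defaultdict
from typing import Awaitable, Callable, TypeVar
from urllib.parse import urlparse

T = TypeVar("T")


class DomainRateLimiter:
    """Hypothetical helper: enforce a minimum delay between requests per domain."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self._last_request: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        # One lock per domain, so unrelated domains never block each other.
        async with self._locks[domain]:
            last = self._last_request.get(domain)
            if last is not None:
                remaining = self.delay - (time.monotonic() - last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_request[domain] = time.monotonic()


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Hypothetical helper: retry an async call, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

With the defaults above, a failing call would be retried after 1 second and then after 2 seconds before the error is raised to the caller.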

## PyPI

The package name is **`nordcrawl`**.

After publishing, install it with:

```bash
pip install nordcrawl
```

## Quick start

### 1) Install dependencies

```bash
pip install -r requirements.txt
```

### 2) Configure environment

Copy the example file and fill in your Supabase credentials (`SUPABASE_URL` and `SUPABASE_KEY`):

```bash
cp .env.example .env
```

### 3) Run a crawl

```bash
python3 main.py crawl-urls \
  https://www.ikea.com/se/sv/ \
  https://www.hemkop.se/ \
  https://www.ica.se/ \
  --output results.json
```

## CLI

### Crawl from file

```bash
python3 main.py crawl data/test_urls.json --output results.json --format json
```

### Save directly to Supabase

```bash
python3 main.py crawl data/test_urls.json --save-db
```

### Show stats

```bash
python3 main.py stats --limit 50
```

### Search stored companies

```bash
python3 main.py search "IKEA" --limit 20
```

### Export from Supabase

```bash
python3 main.py export company_data.csv --format csv
```

## Python API

```python
import asyncio

from src import BatchCrawler


async def main() -> None:
    # Configure concurrency and the rate-limit delay.
    crawler = BatchCrawler(max_concurrent=5, rate_limit_delay=2.0)

    # Crawl all URLs in one batch and collect the extracted records.
    results = await crawler.crawl_urls([
        "https://www.ikea.com/se/sv/",
        "https://www.hemkop.se/",
        "https://www.ica.se/",
    ])

    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

## Input formats

### CSV

```csv
url
https://www.example.se
https://www.example2.se
```

### JSON

```json
[
  "https://www.example.se",
  {"url": "https://www.example2.se"}
]
```
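Both input formats reduce to a flat list of URL strings. The sketch below shows one way to normalize them using only the standard library; `load_urls` is a hypothetical helper, not part of NordCrawl's API, and it assumes the file extension identifies the format.

```python
import csv
import json
from pathlib import Path


def load_urls(path: str) -> list[str]:
    """Hypothetical loader: read URLs from a CSV `url` column or a JSON array."""
    source = Path(path)

    if source.suffix.lower() == ".csv":
        with source.open(newline="", encoding="utf-8") as handle:
            # Skip rows where the url column is empty or missing.
            return [row["url"].strip() for row in csv.DictReader(handle) if row.get("url")]

    entries = json.loads(source.read_text(encoding="utf-8"))
    urls: list[str] = []
    for entry in entries:
        # JSON entries may be bare strings or objects with a "url" key.
        url = entry if isinstance(entry, str) else entry.get("url")
        if url:
            urls.append(url.strip())
    return urls
```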

## Supabase table

NordCrawl expects a table named `company_data` with these columns:

- `url`
- `company_name`
- `org_number`
- `email`
- `phone`
- `address`
- `postal_code`
- `city`
- `website`
- `crawled_at`
- `status`
- `raw_html_hash`
- `extraction_confidence`
- `created_at`
- `updated_at`

The migration is included here:

```text
supabase/migrations/20240325000001_create_company_data_table.sql
```
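Since rows are deduplicated by URL, it can help to shape a batch before writing it: drop keys that are not table columns and keep one row per URL. The following is a sketch under those assumptions; `prepare_rows` is a hypothetical helper, and the choice to keep the last record seen for each URL is an illustration, not documented NordCrawl behavior.

```python
from typing import Any

# Column names taken from the table description above.
COMPANY_DATA_COLUMNS = (
    "url", "company_name", "org_number", "email", "phone", "address",
    "postal_code", "city", "website", "crawled_at", "status",
    "raw_html_hash", "extraction_confidence", "created_at", "updated_at",
)


def prepare_rows(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: keep known columns and one row per URL (last one wins)."""
    rows_by_url: dict[str, dict[str, Any]] = {}
    for record in records:
        url = record.get("url")
        if not url:
            continue  # a row without a URL cannot be deduplicated
        rows_by_url[url] = {
            key: value for key, value in record.items() if key in COMPANY_DATA_COLUMNS
        }
    return list(rows_by_url.values())
```

With the `supabase` Python client, a batch prepared this way could then be written with something like `client.table("company_data").upsert(rows, on_conflict="url").execute()`, assuming `url` carries a unique constraint in the migration.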

## Example output

```json
{
  "url": "https://www.example.se",
  "company_name": "Exempel AB",
  "org_number": "556123-4567",
  "email": "info@example.se",
  "phone": "+46701234567",
  "address": "Storgatan 1",
  "postal_code": "123 45",
  "city": "Stockholm",
  "status": "completed",
  "extraction_confidence": 0.85
}
```
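A record like this can be sanity-checked before it is stored. Swedish organisation numbers have ten digits, the last of which is a Luhn check digit, so a format-plus-checksum test is one plausible check. The sketch below is illustrative only; `is_valid_org_number` is a hypothetical function, and NordCrawl's own validation rules are not described here.

```python
import re

# Ten digits, optionally written as NNNNNN-NNNN.
ORG_NUMBER_PATTERN = re.compile(r"^(\d{6})-?(\d{4})$")


def is_valid_org_number(value: str) -> bool:
    """Hypothetical check: format plus Luhn checksum for a Swedish org number."""
    match = ORG_NUMBER_PATTERN.match(value.strip())
    if not match:
        return False

    digits = [int(char) for char in match.group(1) + match.group(2)]
    total = 0
    for index, digit in enumerate(digits):
        # Luhn over ten digits: double the 1st, 3rd, 5th, ... digit.
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

The sample value `556123-4567` from the output above passes this check, while changing its last digit makes it fail.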

## Docker / Supabase note

Local `supabase start` requires Docker access. If you do not want to run Supabase locally:

1. Apply the migration in your remote Supabase project.
2. Set `SUPABASE_URL` and `SUPABASE_KEY` in `.env`.
3. Run the crawler with `--save-db`.
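Before running with `--save-db`, it can be useful to confirm that both variables are actually set. A minimal sketch, using only the standard library; `load_supabase_settings` is a hypothetical helper and not part of NordCrawl.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY")


def load_supabase_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Hypothetical helper: return the Supabase settings or fail with a clear message."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

If the values live in `.env`, calling `load_dotenv()` from `python-dotenv` first would copy them into `os.environ` before this check runs.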

## Project structure

```text
.
├── main.py
├── requirements.txt
├── src/
│   ├── api/
│   ├── crawlers/
│   ├── database/
│   ├── extractors/
│   └── models/
├── supabase/
│   └── migrations/
└── tests/
```

## Roadmap

- Better selector coverage for Swedish business websites
- Richer address parsing
- Supabase auth-aware dashboard
- Web UI for crawl jobs
- More sample datasets and benchmarks

## Contributing

PRs are welcome.

1. Fork the repo
2. Create a branch
3. Add tests
4. Open a pull request

## License

Apache-2.0

---

If you like the project, **star it on GitHub** to help it grow.
