Metadata-Version: 2.1
Name: robust_crawl
Version: 0.1
Summary: A library for robust crawling based on a proxy pool and a token bucket, with support for browser and requests
Home-page: https://github.com/R0k1e/robust_crawl.git
Author: Haoyu Wang
Author-email: Haoyu_Wang_1103@outlook.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: gevent
Requires-Dist: playwright
Requires-Dist: tenacity
Requires-Dist: brotli
Requires-Dist: numpy
Requires-Dist: requests
Requires-Dist: pyyaml
Requires-Dist: bs4
Requires-Dist: fake-useragent
Requires-Dist: openai

# RobustCrawl
A library for robust crawling based on a proxy pool and a token bucket, with support for both browser-based and `requests`-based fetching.

# Install
``` 
pip install robust_crawl
playwright install chrome
export OPENAI_API_KEY="yourkey"  
export OPENAI_API_BASE="your base" # optional
```

To use the proxy pool, install Go and mihomo, for example with Homebrew:

`brew install go`

`brew install mihomo`

Then place the imported proxy file (.yml / .yaml) in ./config.
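As a quick sanity check before filling in the config, the following sketch lists the proxy files found in `./config`. The helper name `find_proxy_files` is hypothetical and is not part of robust_crawl; it only illustrates how the `.yml` / `.yaml` files could be collected into candidate `config_paths` entries.

```python
from pathlib import Path


def find_proxy_files(config_dir="./config"):
    """Return sorted .yml/.yaml paths found directly inside config_dir.

    Hypothetical helper for illustration; not part of robust_crawl.
    """
    root = Path(config_dir)
    if not root.is_dir():
        # No config directory yet: nothing to list.
        return []
    return sorted(
        p.as_posix()
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in {".yml", ".yaml"}
    )


if __name__ == "__main__":
    # Print each candidate proxy file, one per line.
    for path in find_proxy_files():
        print(path)
```

An empty result suggests the proxy file has not been placed in `./config` yet.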

# Config
Save the following configuration as `./config/robust_crawl_config.json`:

```json
{
        "max_concurrent_requests": 500,
        "GPT": {
            "model_type": "gpt-3.5-turbo"
        },
        "TokenBucket": {
            "tokens_per_minute": 20,
            "bucket_capacity": 5,
            "url_specific_tokens": {
                "export.arxiv": {
                    "tokens_per_minute": 19,
                    "bucket_capacity": 1
                }
            }
        },
        "Proxy": {
            "is_enabled": true,
            "core_type": "mihomo", 
            "start_port": 33333,
            "config_paths": [
                "the relative path to the proxy file, imported by clash-verge core",
                "./config/proxy.yaml"
            ]
        },
        "ContextPool": {
            "num_contexts": 5,
            "work_contexts": 15,
            "have_proxy": true,
            "duplicate_proxies": false,
            "ensure_none_proxies":  true,
            "download_pdf": false,
            "downloads_path": "./output/browser_downloads",
            "preference_path": "./output/broswer_config",
            "context_lifetime": 60,
            "context_cooling_time":1
        }
}
```
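To make the `TokenBucket` settings more concrete, here is a minimal sketch of how `tokens_per_minute` and `bucket_capacity` could interact. It is not robust_crawl's actual implementation: the class `TokenBucket`, the helpers `build_buckets` and `bucket_for`, and the rule that a `url_specific_tokens` key applies when it is a substring of the URL are all assumptions made for illustration. `TOKEN_BUCKET_CFG` simply mirrors the `"TokenBucket"` section above.

```python
import time

# Mirrors the "TokenBucket" section of the config shown above.
TOKEN_BUCKET_CFG = {
    "tokens_per_minute": 20,
    "bucket_capacity": 5,
    "url_specific_tokens": {
        "export.arxiv": {"tokens_per_minute": 19, "bucket_capacity": 1},
    },
}


class TokenBucket:
    """Illustrative token bucket: refills continuously, capped at `capacity`."""

    def __init__(self, tokens_per_minute, bucket_capacity, clock=time.monotonic):
        self.rate = tokens_per_minute / 60.0  # tokens added per second
        self.capacity = bucket_capacity
        self.tokens = float(bucket_capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return True on success, else False."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def build_buckets(cfg):
    """Build the default bucket and one bucket per URL-specific entry."""
    default = TokenBucket(cfg["tokens_per_minute"], cfg["bucket_capacity"])
    specific = {
        key: TokenBucket(v["tokens_per_minute"], v["bucket_capacity"])
        for key, v in cfg.get("url_specific_tokens", {}).items()
    }
    return default, specific


def bucket_for(url, default, specific):
    """Assumption: a URL-specific key matches when it is a substring of the URL."""
    for key, bucket in specific.items():
        if key in url:
            return bucket
    return default


if __name__ == "__main__":
    default, specific = build_buckets(TOKEN_BUCKET_CFG)
    bucket = bucket_for("http://export.arxiv.org/api/query", default, specific)
    # With capacity 1, only the first of two back-to-back requests gets a token.
    print(bucket.try_acquire(), bucket.try_acquire())
```

Under this reading, `bucket_capacity` bounds the size of a burst, while `tokens_per_minute` bounds the sustained request rate; the `export.arxiv` entry therefore allows no bursts at all and roughly one request every three seconds.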
