Metadata-Version: 2.1
Name: web-raider
Version: 0.0.2
Summary: Web Raider
Home-page: https://github.com/ThePyProgrammer/web-raider
Author: Prannaya
Author-email: prannayagupta@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: boto3 (>=1.35.22,<2.0.0)
Requires-Dist: bs4 (>=0.0.2,<0.0.3)
Requires-Dist: fastapi (>=0.115.0,<0.116.0)
Requires-Dist: googlesearch-python (>=1.2.5,<2.0.0)
Requires-Dist: litellm (>=1.46.6,<2.0.0)
Requires-Dist: lxml[html-clean] (>=5.3.0,<6.0.0)
Requires-Dist: newspaper3k (>=0.2.8,<0.3.0)
Requires-Dist: pandas (>=2.2.2,<3.0.0)
Requires-Dist: requests (>=2.32.3,<3.0.0)
Requires-Dist: scikit-learn (>=1.5.1,<2.0.0)
Requires-Dist: uvicorn (>=0.30.6,<0.31.0)
Requires-Dist: websockets (>=13.1,<14.0)
Project-URL: Repository, https://github.com/ThePyProgrammer/web-raider
Description-Content-Type: text/markdown

# web-raider

## Overview

Web Raider is a web scraping and data extraction tool for gathering information from websites. It provides a simple interface for configuring and running scraping tasks, so you can collect and process data for your projects.

## Setup Guide

1. Clone this repository from GitHub.
2. Open a terminal, change into the repository directory, and run the following commands:

    - `pip install poetry` (do not create a virtual environment manually through Python; it does not work well with this setup)
    - `poetry lock` (resolves the dependencies; Poetry creates the virtual environment for you)
    - `poetry install`

### Setup for Raider Backend

Run `pip install -e .` from the git root directory. Raider Backend will call Web Raider using `pipeline_main(user_query: str)` from `web_raider/pipeline.py`.
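
As a rough illustration of that call, here is a minimal sketch of how a backend might wrap `pipeline_main`. The wrapper name `run_query` and its `pipeline` parameter are hypothetical and exist only to make the example testable without the package installed. The return type of `pipeline_main` is not documented here, so the result is passed through unchanged.

```python
from typing import Any, Callable, Optional


def run_query(user_query: str, pipeline: Optional[Callable[[str], Any]] = None) -> Any:
    """Forward a query to Web Raider's pipeline and return whatever it returns.

    `run_query` and `pipeline` are hypothetical names used for this sketch;
    only `pipeline_main(user_query: str)` comes from the README.
    """
    if not user_query.strip():
        raise ValueError("user_query must not be empty")
    if pipeline is None:
        # Imported lazily so this module can be loaded (and tested)
        # even before `pip install -e .` has been run.
        from web_raider.pipeline import pipeline_main
        pipeline = pipeline_main
    return pipeline(user_query)
```

With the package installed, `run_query("python library for parsing PDFs")` would call `pipeline_main` directly; passing a stand-in callable as `pipeline` lets the wrapper be exercised in isolation.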

## Usage

1. Configure your scraping tasks by editing the configuration files in the `config` directory.
2. Run the scraper using the command: `poetry run python main.py`
3. The scraped data will be saved in the `output` directory.

## How the Repository Works

- **web_raider/**: Contains the core logic of the application.
  - **article.py**: Handles the extraction of codebase URLs from articles.
  - **codebase.py**: Defines the `Codebase` class and its subclasses for different code hosting platforms.
  - **connection_manager.py**: Manages WebSocket connections and message buffering.
  - **evaluate.py**: Evaluates codebases based on a query.
  - **model_calls.py**: Handles calls to external models for query simplification, relevance, scoring, and ranking.
  - **pipeline.py**: Defines the main pipeline for processing user queries.
  - **search.py**: Handles Google search queries and filters results.
  - **shortlist.py**: Shortlists codebases based on a query.
  - **url_classifier.py**: Classifies URLs into different categories.
  - **utils.py**: Contains utility functions.
  - **constants.py**: Defines constants used across the application.
  - **__init__.py**: Initializes the `web_raider` package.
- **assets/**: Contains auxiliary files and configurations.
  - **key_import.py**: Handles the import of API keys.
  - **prompts.py**: Defines various prompts used in model calls.
  - **__init__.py**: Initializes the assets package.
- **tests/**: Contains unit tests for the application. Run the tests using `pytest` to ensure everything is working correctly.
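
To make the role of a module like `url_classifier.py` more concrete, the sketch below shows one simple way URLs could be sorted into categories by host name. It is not the repository's implementation: the function name `classify_url`, the host lists, and the fallback to `"article"` are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical host lists for illustration; not taken from url_classifier.py.
CODEBASE_HOSTS = {"github.com", "gitlab.com", "bitbucket.org"}
FORUM_HOSTS = {"stackoverflow.com", "reddit.com"}


def classify_url(url: str) -> str:
    """Return 'codebase', 'forum', or 'article' based on the URL's host."""
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if host.startswith("www."):
        host = host[4:]
    if host in CODEBASE_HOSTS:
        return "codebase"
    if host in FORUM_HOSTS:
        return "forum"
    # Anything unrecognised is treated as an article in this sketch.
    return "article"
```

A rule-based lookup like this is easy to test and extend, but it only recognises hosts that are listed explicitly.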

## Tasklist to Complete Before Wallaby

1. Fix the relative/absolute import problem so that running the code does not rely on the `-m` flag.
2. Make the code runnable from any directory.
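
One common reason code only runs from a particular directory is that file paths are resolved against the current working directory. Whether that applies to this repository is an assumption. The sketch below shows the usual remedy of anchoring paths to a module's own location instead; the helper name `resolve_from` is hypothetical.

```python
from pathlib import Path


def resolve_from(anchor_file: str, *parts: str) -> Path:
    """Build an absolute path relative to the directory containing anchor_file.

    Hypothetical helper for this sketch: pass a module's `__file__` as
    anchor_file so the result does not depend on the working directory.
    """
    return Path(anchor_file).resolve().parent.joinpath(*parts)
```

Inside a module, a call such as `resolve_from(__file__, "data", "example.json")` would give the same absolute path regardless of where the process was started; the `data/example.json` path here is purely illustrative.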

## Future Implementations/Improvements

- Use machine learning classification algorithms to classify URLs by type (Codebase, Article, Forum).
- Find a way to handle Forum URLs (currently they are not processed).
- Find a way to scrape code directly from Article and Forum URLs (currently only links are scraped).
- Implement a proper breakdown of the main query instead of relying solely on an LLM call.

