Metadata-Version: 2.1
Name: html-login-field-detector
Version: 0.1.2
Summary: A library for detecting login fields in HTML using DistilBERT.
Author-email: Victor Delaplaine <vdelaplainess@gmail.com>
License: MIT
Project-URL: homepage, https://github.com/ByVictorrr
Project-URL: repository, https://github.com/ByVictorrr/LoginFieldDetector
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: transformers>=4.33.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: diskcache>=5.6.0
Requires-Dist: huggingface_hub>=0.15.0
Requires-Dist: scikit-learn>=1.1.0
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: lxml>=4.8.0
Requires-Dist: lxml_html_clean>=0.4.0
Requires-Dist: fake-useragent>=1.5.0
Requires-Dist: certifi>=2024.6.0
Requires-Dist: cloudscraper>=1.2.0
Requires-Dist: tensorboard>=2.17.0
Requires-Dist: babel>=2.8.0
Provides-Extra: cpu
Requires-Dist: torch>=2.0.0; extra == "cpu"
Requires-Dist: torchvision>=0.15.0; extra == "cpu"
Requires-Dist: torchaudio>=2.0.0; extra == "cpu"
Provides-Extra: gpu
Requires-Dist: torch==2.5.1+cu118; extra == "gpu"
Requires-Dist: torchvision==0.20.1+cu118; extra == "gpu"
Requires-Dist: torchaudio==2.5.1+cu118; extra == "gpu"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"

# HTML Login Field Detector

`html-login-field-detector` is a Python library that detects login fields in HTML documents. It uses a DistilBERT model for detection and is meant to fit into web scraping workflows that need automated login-form detection.

## Features
- Detects login forms in HTML documents.
- Uses a DistilBERT model through Hugging Face `transformers`.
- Accepts raw HTML as a string, so it fits into existing Python web scraping workflows.
- Supports GPU acceleration through the optional `gpu` extra (CUDA 11.8 builds of PyTorch).

## Installation

### Using pip
To install the library along with the CPU-compatible dependencies:
```bash
pip install "html-login-field-detector[cpu]"
```

For GPU support (CUDA 11.8 builds of PyTorch, pulled from the PyTorch wheel index):
```bash
pip install "html-login-field-detector[gpu]" --extra-index-url https://download.pytorch.org/whl/cu118
```
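
After installing either extra, it can help to confirm which PyTorch build is active. The snippet below is a minimal sketch and not part of this library: `describe_torch_device` is a hypothetical helper name, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
def describe_torch_device() -> str:
    """Return 'cuda', 'cpu', or 'missing' for the local PyTorch install.

    Hypothetical helper for illustration; not provided by this package.
    """
    try:
        import torch  # installed by the [cpu] or [gpu] extra
    except ImportError:
        # Neither extra was installed, so PyTorch is not importable.
        return "missing"
    # True only when a CUDA-enabled build finds a usable GPU.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this reports `cpu` after installing the `gpu` extra, the likely causes are a missing `--extra-index-url` flag or a driver that does not support CUDA 11.8.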

## Usage
```python
from login_field_detector import LoginFieldDetector

# Initialize the detector
detector = LoginFieldDetector()

# Detect login fields in an HTML document
html_source = "<html>...</html>"  # Your HTML content
result = detector.detect(html_source)

print(result)  # Output details of detected login fields
```

## Dataset
This project includes a dataset of login page URLs for training and testing purposes, located at `dataset/training_urls.json`. The dataset can be extended or updated as needed.
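
One way to extend the dataset is sketched below. It assumes, without confirmation from the repository, that `training_urls.json` holds a flat JSON array of URL strings; adjust the loading step if the real file is structured differently. `add_urls` is a hypothetical helper name, not part of the library.

```python
import json
from pathlib import Path


def add_urls(dataset_path: str, new_urls: list[str]) -> list[str]:
    """Append URLs not already in the dataset and write the file back.

    ASSUMPTION: the file is a flat JSON array of URL strings.
    """
    path = Path(dataset_path)
    # Start from the existing list, or an empty one if the file is absent.
    urls = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    seen = set(urls)
    for url in new_urls:
        url = url.strip()
        # Skip blanks and duplicates; keep the original order otherwise.
        if url and url not in seen:
            urls.append(url)
            seen.add(url)
    path.write_text(json.dumps(urls, indent=2) + "\n", encoding="utf-8")
    return urls
```

For example, `add_urls("dataset/training_urls.json", ["https://example.com/login"])` would add the placeholder URL once and leave existing entries in their original order.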

## Development
Clone the repository and install the dependencies locally:
```bash
git clone https://github.com/ByVictorrr/LoginFieldDetector.git
cd LoginFieldDetector

# Editable install with the CPU and test extras
pip install -e ".[cpu,test]"
```

### Running Tests
Run the tests using `pytest`:
```bash
pytest
```

## License
This project is licensed under the [MIT License](LICENSE).

## Contributing
We welcome contributions! Please fork the repository, make changes, and submit a pull request.

## Links
- **Homepage**: [ByVictorrr on GitHub](https://github.com/ByVictorrr)
- **Repository**: [LoginFieldDetector](https://github.com/ByVictorrr/LoginFieldDetector)
- **Dataset**: `dataset/training_urls.json`

