Metadata-Version: 2.1
Name: jwsoup
Version: 0.0.2
Summary: This package is designed to scrape Bible data from JW.org for NLP and generative AI tasks.
Author: Salif SAWADOGO
Author-email: salif.sawadogo.pro@gmail.com
Keywords: jwsoup
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.0.0
Requires-Dist: beautifulsoup4>=4.9.0
Requires-Dist: loguru>=0.5.0
Requires-Dist: pandas>=1.0.0
Requires-Dist: pyarrow>=10.0.0
Requires-Dist: beartype>=0.17.0

# JW Scraper

`jwsoup` is a simple Python package that scrapes Bible data from the JW.org website. The package provides functionality for scraping Bible verses and saving them in a structured format. It supports scraping data from one or multiple pages, handling paginated content, and storing the results in a Parquet file.

## Features

- Scrape Bible verses from individual or multiple pages.
- Clean the scraped verse text to remove unwanted characters.
- Store the scraped data in a Parquet file for further analysis.
- Simple interface with reusable functions.
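
As a rough illustration of the cleaning step, here is a minimal sketch using only the standard library. It is not the package's actual implementation: `clean_verse_text` is a hypothetical name, and the choice of "unwanted characters" (the `*` and `+` markers, non-breaking spaces, and repeated whitespace) is an assumption made for the example.

```python
import re


def clean_verse_text(raw: str) -> str:
    """Hypothetical cleaner; the characters removed are assumptions, not jwsoup's rules."""
    # Assumed footnote / cross-reference markers
    text = re.sub(r"[*+]", "", raw)
    # \s matches Unicode whitespace, so non-breaking spaces and newlines collapse too
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_verse_text("  In the beginning*\u00a0God  created+ the heavens \n"))
```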

## Installation

To install `jwsoup`, you can use pip from PyPI:

```bash
pip install jwsoup
```

Alternatively, if you want to install it locally from the source, clone the repository and run the following commands:

```bash
git clone https://github.com/sawadogosalif/jwsoup.git
cd jwsoup
pip install .
```

## Usage

### Scrape a Single Page

You can scrape a single page of Bible verses using the `scrape_single_page` function. This function returns a list of verses and the URL for the next page (if available).

```python
from jwsoup.text import scrape_single_page

url = "https://www.jw.org/fr/biblioth%C3%A8que/bible/bible-d-etude/livres/Gen%C3%A8se/1/"
verses, next_url = scrape_single_page(url)

# Print the scraped verses
for verse in verses:
    print(f"{verse[0]}: {verse[1]}")

# Print the next URL
print(f"Next page URL: {next_url}")
```
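
Because `scrape_single_page` returns the next URL alongside the verses, pagination can also be followed by hand. The sketch below is not part of `jwsoup`: `follow_pages` is a hypothetical helper, and it assumes only what is described above, plus that the next URL is falsy (`None` or an empty string) on the last page. The page function is passed in as an argument so the loop can be tried without network access.

```python
from typing import Callable, List, Optional, Tuple

Verse = Tuple[str, str]  # assumed shape: (verse reference, verse text)
PageFetcher = Callable[[str], Tuple[List[Verse], Optional[str]]]


def follow_pages(start_url: str, fetch_page: PageFetcher, max_pages: int = 5) -> List[Verse]:
    """Collect verses page by page until there is no next URL or max_pages is reached."""
    verses: List[Verse] = []
    seen = set()  # guards against a page that links back to an earlier one
    url: Optional[str] = start_url
    pages_done = 0
    while url and url not in seen and pages_done < max_pages:
        seen.add(url)
        page_verses, url = fetch_page(url)
        verses.extend(page_verses)
        pages_done += 1
    return verses
```

With the package installed, `follow_pages(url, scrape_single_page, max_pages=3)` would pass the real scraper as `fetch_page`.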

### Scrape Multiple Pages

To scrape multiple pages starting from a given URL, use the `scrape_multi_page` function. This function will follow pagination and save the scraped data in a Parquet file.

```python
from jwsoup.text import scrape_multi_page

start_url = "https://www.jw.org/mos/d-s%E1%BA%BDn-yiisi/biible/nwt/books/S%C9%A9ngre/1/"
output_dir = "bible_data_moore.parquet"
res = scrape_multi_page(start_url, output_dir=output_dir, max_pages=5, page_sep="books")
```
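
The URLs in these examples are percent-encoded because parts of the path, such as the book name, contain non-ASCII characters. As a minimal sketch, the standard library's `urllib.parse.quote` can produce that encoding. `build_chapter_url` is a hypothetical helper, not part of `jwsoup`, and the `<base>/<book>/<chapter>/` layout is only inferred from the example URLs in this README.

```python
from urllib.parse import quote


def build_chapter_url(base_url: str, book: str, chapter: int) -> str:
    """Append a percent-encoded book name and a chapter number to a base URL."""
    # quote() encodes the name as UTF-8 and escapes every non-ASCII byte
    return f"{base_url.rstrip('/')}/{quote(book)}/{chapter}/"


# Base path copied from the multi-page example; it is already percent-encoded
base = "https://www.jw.org/mos/d-s%E1%BA%BDn-yiisi/biible/nwt/books"
print(build_chapter_url(base, "S\u0269ngre", 1))
```

Only the book name goes through `quote`, so an already-encoded base URL is not encoded a second time.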

### Save Data to Parquet

The scraped data is stored in a Parquet file for efficient storage and querying. You can specify the output file and partition the data by page.

```python
import pandas as pd

# Load the Parquet output written by scrape_multi_page and preview the first rows
pd.read_parquet(output_dir).head()
```
![Output of the multi-page scrape](assets/output_multi_page.PNG)


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Author

- **Salif Sawadogo**  
  Email: [salif.sawadogo.pro@gmail.com](mailto:salif.sawadogo.pro@gmail.com)

## Acknowledgments

- Thanks to the `requests`, `beautifulsoup4`, `pandas`, `loguru`, and `pyarrow` libraries for making scraping and data handling easier.
- Thanks to [JW](https://www.jw.org/) for providing an accessible and rich resource of Bible texts in multiple languages.

# Changelog

## [0.0.1] - 2024-11-23
### Added
- Initial release of `jwsoup`.
- Supports scraping of text-based Bible verses from JW.org.
- Extracts individual verses and saves them to parquet files using `pyarrow`.
- Includes basic error handling and logging with `loguru`.

### Known Limitations
- Only supports scraping textual data.
- Does not handle multimedia content (audio/video).
- Limited testing for edge cases (e.g., malformed HTML or network interruptions).


## [0.0.2] - 2024-11-23
### Added
- Typo correction in the package description.
