Metadata-Version: 2.4
Name: web-novel-scraper
Version: 2.1.4
Summary: Python tool that allows you to scrape web novels from various sources and save them to more readable formats like EPUB.
Project-URL: Homepage, https://github.com/ImagineBrkr/web-novel-scraper
Project-URL: Documentation, https://web-novel-scraper.readthedocs.io
Project-URL: Repository, https://github.com/ImagineBrkr/web-novel-scraper.git
Author-email: ImagineBrkr <salvattore_25@hotmail.com>
Keywords: Novel Downloader,Scraper,Web Novel,Web Novel Downloader,Web Novel Scraper
Requires-Python: >=3.10
Requires-Dist: bs4>=0.0.2
Requires-Dist: click<9,>=8.0
Requires-Dist: dataclasses-json<1,>=0.6.7
Requires-Dist: ebooklib<1,>=0.18
Requires-Dist: platformdirs
Requires-Dist: python-dotenv
Requires-Dist: requests
Description-Content-Type: text/markdown

# Web Novel Scraper CLI

## Table of Contents
- [Introduction](#introduction)
- [Installation](#installation)
- [Basic Concepts](#basic-concepts)
- [Commands](#commands)
- [Basic Examples](#basic-examples)
- [Configuration](#configuration)


## Introduction
This tool allows you to scrape web novels from various sources. I made it because my hands hurt from scrolling too much.

## Installation
To install the Web Novel Scraper CLI with pip:

```bash
pip install web-novel-scraper
```

Or install it manually from source:

1. Clone the repository:
    ```bash
    git clone https://github.com/ImagineBrkr/web-novel-scraper.git
    ```
2. Navigate to the project directory:
    ```bash
    cd web-novel-scraper
    ```
3. Install the project:
    ```bash
    python -m pip install .
    ```
4. Run the CLI tool:
    ```bash
    web-novel-scraper
    ```

## Basic Concepts
### Novel
A novel has at least one Table of Contents and a set of chapters.
It also carries metadata that can be saved, such as author, language, tags, and creation or end date.

### Table of Contents (TOC)
The source of truth for all the chapters of a novel. It can come from a main URL (the page is requested and saved; if the TOC spans more than one page, the remaining pages are requested and saved too), or from HTML files that you add directly. All chapters are generated automatically from the TOC.
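
As an illustration of how a paginated TOC can be collected, here is a minimal sketch. It is not the tool's actual implementation: `collect_toc_pages`, `fetch`, and `find_next_url` are hypothetical names, and the page limit is an assumed safety guard.

```python
from typing import Callable, Optional


def collect_toc_pages(
    main_url: str,
    fetch: Callable[[str], str],
    find_next_url: Callable[[str], Optional[str]],
    max_pages: int = 500,
) -> list[str]:
    """Return the HTML of every TOC page, starting from main_url."""
    pages: list[str] = []
    seen: set[str] = set()
    url: Optional[str] = main_url
    # Stop when there is no next page, when a page repeats, or at the limit.
    while url is not None and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)          # request the page
        pages.append(html)         # keep it so it can be saved locally
        url = find_next_url(html)  # None when there is no next page
    return pages
```

Passing `fetch` and `find_next_url` as callables keeps the loop independent of how pages are requested or parsed.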

### Chapters
A chapter comes from a URL. It is requested once and saved as a file on your local machine, so it does not need to be requested again.
The title and the chapter content are extracted from this file.
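
The "request once, then reuse the saved file" behavior can be pictured with the sketch below. It only illustrates the idea and assumes nothing about the tool's internals: `load_chapter_html`, `cache_file`, and `fetch` are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def load_chapter_html(url: str, cache_file: Path, fetch: Callable[[str], str]) -> str:
    """Return the chapter HTML, requesting it only if it is not saved yet."""
    if cache_file.exists():
        # Already saved: read the local copy instead of requesting again.
        return cache_file.read_text(encoding="utf-8")
    html = fetch(url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html
```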

### Decoder
A set of rules used to extract information from a page, such as links, content, and title.
The host identifies which set of rules is used. The host can be set manually or derived from a TOC URL.
Example:
```json
{
    "host": "novelbin.me",
    "has_pagination": false,
    "title": {
        "element": "h2 a.chr-title",
        "id": null,
        "class": null,
        "selector": null,
        "attributes": null,
        "array": false,
        "extract": {
            "type": "attr",
            "key": "title"
        }
    },
    "content": {
        "element": "div#chr-content",
        "id": null,
        "class": null,
        "selector": null,
        "attributes": null,
        "array": true
    },
    "index": {
        "element": null,
        "id": null,
        "class": null,
        "selector": "ul.list-chapter li a",
        "attributes": null,
        "array": true
    },
    "next_page": {
        "element": null,
        "id": null,
        "class": null,
        "selector": null,
        "attributes": null,
        "array": true
    }
}
```
Decoders use BeautifulSoup selectors for more flexibility. For each rule you can specify the element, id, class, selector, and whether multiple tags are expected (`array`).

- `has_pagination`: Used when there is a `toc_main_url`. If set, the URL of the next TOC page is found with `next_page`.
- `index`: Gets the `href` of all tags found when searching the TOC.
- `title` and `content`: The title and content of the chapter, respectively.

In the example above:
- The title is in an `a` tag with class `chr-title` inside an `h2` tag, and it is read from the `title` attribute:
    ```html
    <h2><a class="chr-title" href="https://url-of-chapter" title="Chapter 1"><span class="chr-text">Chapter 1</span></a></h2>
    ```
- The content is in a `div` with id `chr-content`:
    ```html
    <div id="chr-content" class="chr-c" style="font-family: Arial, sans-serif, serif; font-size: 18px; line-height: 160%; margin-top: 15px;">Content...</div>
    ```
- The URL of each chapter is in the `href` of an `a` tag within an `li` tag, which is within a `ul` tag with class `list-chapter`:
    ```html
    <ul class="list-chapter">
        <li><span class="glyphicon glyphicon-certificate"></span>&nbsp;<a href="https://url-of-chapter-1" title="Chapter 1"><span class="nchr-text chapter-title">Chapter 1</span></a></li>
    </ul>
    ```
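
To make the host-based lookup described at the start of this section concrete, here is a minimal sketch. It assumes decoders are available as a list of dictionaries shaped like the JSON example; `DECODERS` and `find_decoder` are hypothetical names, not part of the tool's API.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical in-memory list of decoders, shaped like the JSON example.
DECODERS = [
    {"host": "novelbin.me", "has_pagination": False},
]


def find_decoder(url: str, decoders: list[dict]) -> Optional[dict]:
    """Return the decoder whose host matches the URL, or None if there is none."""
    host = urlparse(url).hostname  # None if the URL has no host part
    for decoder in decoders:
        if decoder["host"] == host:
            return decoder
    return None
```

For example, `find_decoder("https://novelbin.me/some-novel", DECODERS)` returns the `novelbin.me` entry, while a URL from an unknown host returns `None`.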

## Commands
The following commands are available in the Web Novel Scraper CLI:

```bash
Usage: web-novel-scraper [OPTIONS] COMMAND [ARGS]...

  CLI Tool for web novel scraping.

Options:
  --help  Show this message and exit.

Commands:
  add-tags                Add tags to a novel.
  add-toc-html            Add TOC HTML to a novel.
  clean-files             Clean files of a novel.
  create-novel            Create a new novel.
  delete-toc              Delete the TOC of a novel.
  remove-tags             Remove tags from a novel.
  request-all-chapters    Request all chapters of a novel.
  save-novel-to-epub      Save the novel to EPUB format.
  scrap-chapter           Scrap a chapter of a novel.
  set-cover-image         Set the cover image for a novel.
  set-host                Set the host for a novel.
  set-metadata            Set metadata for a novel.
  set-scraper-behavior    Set scraper behavior for a novel.
  set-toc-main-url        Set the main URL for the TOC of a novel.
  show-chapters           Show chapters of a novel.
  show-metadata           Show metadata of a novel.
  show-novel-info         Show information about a novel.
  show-scraper-behavior   Show scraper behavior of a novel.
  show-tags               Show tags of a novel.
  show-toc                Show the TOC of a novel.
  sync-toc                Sync the TOC of a novel.
  version                 Show program version.
```

## Basic Examples
Here are some basic examples:

### Example 1: Creating a Novel using a main URL
```bash
web-novel-scraper create-novel --title 'Novel 1' --author 'ImagineBrkr' --toc-main-url 'https://page.me/Novel-1/toc' --cover 'cover.jpg'
```
Some pages depend heavily on JavaScript. In that case, you can copy the HTML manually to a file and create the novel from it:
```bash
web-novel-scraper create-novel --title 'Novel 1' --author 'ImagineBrkr' --toc-html 'toc.html' --host 'page.me' --cover 'cover.jpg'
```
If there is more than one page for the TOC, you can add them:
```bash
web-novel-scraper add-toc-html --title 'Novel 1' --toc-html 'toc2.html'
```
You can then create the chapters from the TOC, or synchronize them if they already exist and the TOC has new chapters:
```bash
web-novel-scraper sync-toc --title 'Novel 1'
```
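
One way to picture synchronization is the sketch below. It is an assumption about the general idea, not the tool's actual logic: `sync_chapters` and the `url` and `file` fields are hypothetical.

```python
def sync_chapters(existing: list[dict], toc_urls: list[str]) -> list[dict]:
    """Rebuild the chapter list in TOC order, reusing chapters that already exist."""
    by_url = {chapter["url"]: chapter for chapter in existing}
    synced = []
    for url in toc_urls:
        # Reuse the existing record (and its saved file); otherwise create a new one.
        synced.append(by_url.get(url, {"url": url, "file": None}))
    return synced
```

Chapters that were already created keep their saved file, and only the new URLs produce new entries.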
On Windows, all files are saved by default in `%APPDATA%/ImagineBrkr/web-novel-scraper`. You can change this directory with the `SCRAPER_BASE_DATA_DIR` environment variable (see [Configuration](#configuration)).

### Example 2: Requesting files
Now you can download all the chapters:
```bash
web-novel-scraper request-all-chapters --title 'Novel 1'
```

### Example 3: Saving to EPUB
Save the novel to an EPUB file with:
```bash
web-novel-scraper save-novel-to-epub --title 'Novel 1'
```

For more detailed usage and options, use `--help` with each command.

## Configuration
### Environment Variables

The Web Novel Scraper CLI uses the following environment variables for configuration:

- `SCRAPER_LOGGING_LEVEL`: Sets the logging level for the application. By default, no logs are written. Accepted levels: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
    ```bash
    export SCRAPER_LOGGING_LEVEL=INFO
    ```

- `SCRAPER_LOGGING_FILE`: Specifies the file where logs are written. By default, logs are written to the terminal.
    ```bash
    export SCRAPER_LOGGING_FILE=/path/to/logfile.log
    ```

- `SCRAPER_BASE_DATA_DIR`: Defines the base directory for storing novel data. Default is the user data directory.
    ```bash
    export SCRAPER_BASE_DATA_DIR=/path/to/data/dir
    ```

- `SCRAPER_FLARESOLVER_URL`: URL for the FlareSolverr service. Default is `http://localhost:8191/v1`.
    ```bash
    export SCRAPER_FLARESOLVER_URL=http://localhost:8191/v1
    ```
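
The variables and defaults listed above can be summarized with the following sketch. It only restates the documented defaults in code: `read_scraper_config` and the dictionary keys are hypothetical, and the way the tool actually reads its configuration may differ.

```python
import os
from typing import Mapping

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def read_scraper_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the scraper settings from environment variables."""
    level = env.get("SCRAPER_LOGGING_LEVEL")
    if level is not None and level not in VALID_LEVELS:
        raise ValueError(f"Unsupported logging level: {level}")
    return {
        "logging_level": level,                             # None: no logs are written
        "logging_file": env.get("SCRAPER_LOGGING_FILE"),    # None: log to the terminal
        "base_data_dir": env.get("SCRAPER_BASE_DATA_DIR"),  # None: user data directory
        "flaresolver_url": env.get(
            "SCRAPER_FLARESOLVER_URL", "http://localhost:8191/v1"
        ),
    }
```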
