Metadata-Version: 2.4
Name: colusa
Version: 0.18.0
Summary: Render website to ebook to make it easier to read on devices
Author-email: Huu Hoa NGUYEN <huuhoa@gmail.com>
Maintainer-email: Huu Hoa NGUYEN <huuhoa@gmail.com>
License-Expression: MIT
Project-URL: Homepage, http://github.com/huuhoa/colusa
Project-URL: Repository, http://github.com/huuhoa/colusa
Project-URL: Bug Tracker, http://github.com/huuhoa/colusa/issues
Keywords: ebook,web,converter,asciidoc,epub
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4>=4.9
Requires-Dist: certifi>=2025.8.3
Requires-Dist: chardet>=5.2.0
Requires-Dist: idna==3.10
Requires-Dist: requests>=2.32.4
Requires-Dist: soupsieve>=2.7
Requires-Dist: urllib3>=2.5.0
Requires-Dist: python-dateutil~=2.9.0
Requires-Dist: PyYAML>=6.0.1
Requires-Dist: setuptools>=80.3.1
Requires-Dist: importlib-metadata; python_version < "3.10"
Provides-Extra: tor
Requires-Dist: torpy~=1.1.6; extra == "tor"
Provides-Extra: dev
Requires-Dist: bump2version>=1.0.1; extra == "dev"
Requires-Dist: gitchangelog>=3.0.4; extra == "dev"
Requires-Dist: readme-renderer>=44.0; extra == "dev"
Requires-Dist: twine>=6.1.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Provides-Extra: all
Requires-Dist: bump2version>=1.0.1; extra == "all"
Requires-Dist: gitchangelog>=3.0.4; extra == "all"
Requires-Dist: readme-renderer>=44.0; extra == "all"
Requires-Dist: twine>=6.1.0; extra == "all"
Requires-Dist: pytest; extra == "all"
Requires-Dist: pytest-cov; extra == "all"
Dynamic: license-file

# colusa

Render websites as ebooks to make them easier to read on devices.

## Installation

```sh
pip install colusa
```

For development:

```sh
pip install -e ".[all]"
```

## Usage

### Start from scratch

First, generate a configuration file for `colusa` to work with.
`colusa` has a built-in command that generates a template configuration as a starting point.
Run the following command to generate the configuration file:

```bash
$ colusa init new_ebook.json
```

`colusa` generates a configuration file like the one below:

```json
{
    "title": "__fill the title__",
    "author": "__fill the author__",
    "version": "v1.0",
    "homepage": "__fill url to home page__",
    "output_dir": "__fill output dir__",
    "urls": []
}
```

Edit the configuration file and replace each placeholder with a valid value.
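
Before generating, it can help to confirm that no placeholder was left behind. The following is a minimal sketch, not part of `colusa`: the helper names `unfilled_fields` and `check_config_file` are hypothetical, and the check assumes that every placeholder starts with `__fill`, as in the template above.

```python
import json

# Assumption: placeholders look like the ones in the generated template.
PLACEHOLDER_PREFIX = "__fill"


def unfilled_fields(config: dict) -> list:
    """Return the top-level string fields that still hold a template placeholder."""
    return [
        key
        for key, value in config.items()
        if isinstance(value, str) and value.startswith(PLACEHOLDER_PREFIX)
    ]


def check_config_file(path: str) -> list:
    """Load a JSON config from `path` and list its unfilled fields."""
    with open(path, encoding="utf-8") as fh:
        return unfilled_fields(json.load(fh))


if __name__ == "__main__":
    partially_filled = {
        "title": "__fill the title__",
        "author": "Farnam Street Media Inc",
        "version": "v1.0",
        "homepage": "__fill url to home page__",
        "output_dir": "fsblog",
        "urls": [],
    }
    # Lists the fields that still need editing.
    print(unfilled_fields(partially_filled))
```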

### Add content to ebook

You can add URLs to the config manually, or use the `add-url` command to append them from the terminal:

```bash
# Append a plain URL
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/

# Auto-fetch the page title
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/ --fetch-title

# Supply metadata explicitly
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/ \
    --title "First Principles" --author "Farnam Street" --published "2018-04-01"

# For multi-part books, specify the target part
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/ --part "Part 1"
```

Alternatively, edit the config file directly and add URLs to the `urls` field.

Example of a final configuration:

```json
{
    "title": "The Great Mental Models",
    "author": "Farnam Street Media Inc",
    "version": "v1.0",
    "homepage": "https://fs.blog",
    "output_dir": "fsblog",
    "urls": [
        "https://fs.blog/2018/04/first-principles/",
        "https://fs.blog/2016/04/second-order-thinking/",
        "https://fs.blog/2017/06/thought-experiment/",
        "https://fs.blog/2018/05/probabilistic-thinking/",
        "https://fs.blog/2019/12/survivorship-bias/"
    ]
}
```

### Update ebook content

To update the ebook's content, modify the `urls` field: adding or removing a URL changes the resulting ebook.
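
If you prefer to script such edits, the sketch below shows one way to do it with the standard library. It is an illustration rather than a `colusa` feature: `update_urls` and `update_config_file` are hypothetical helpers, and the sketch assumes a flat `urls` list like the example configuration (multi-part books are not handled).

```python
import json


def update_urls(config: dict, add=(), remove=()) -> dict:
    """Return a copy of `config` with URLs removed, then appended without duplicates."""
    to_remove = list(remove)
    urls = [u for u in config.get("urls", []) if u not in to_remove]
    for url in add:
        if url not in urls:
            urls.append(url)
    updated = dict(config)
    updated["urls"] = urls
    return updated


def update_config_file(path, add=(), remove=()):
    """Apply `update_urls` to the JSON config stored at `path`."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(update_urls(config, add, remove), fh, indent=4, ensure_ascii=False)
        fh.write("\n")


if __name__ == "__main__":
    update_config_file(
        "new_ebook.json",
        add=["https://fs.blog/2019/12/survivorship-bias/"],
        remove=["https://fs.blog/2017/06/thought-experiment/"],
    )
```

The change only takes effect in the ebook once the content is regenerated.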

### Generate ebook content

After adding or removing URLs in `urls`, invoke `colusa` to regenerate the ebook content. Run the following command in a terminal:

```bash
$ colusa generate new_ebook.json
```

This command makes `colusa` download the webpages listed in `urls`, parse them, transform them to AsciiDoc format, and save the results to `output_dir`. `colusa` also creates the information needed to compile the ebook in later steps.
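
To give a feel for the "parse and transform" step, here is a toy HTML-to-AsciiDoc converter built on the standard library. It is not how `colusa` is implemented; its extractors and transformers handle far more markup. The class `TinyAsciiDoc`, the function `html_to_asciidoc`, and the choice to map `<h1>` to a level-1 section (`==`) are all assumptions made for this sketch.

```python
from html.parser import HTMLParser


class TinyAsciiDoc(HTMLParser):
    """Convert <h1>-<h3> and <p> elements to AsciiDoc; ignore all other markup."""

    # Assumption: <h1> becomes a level-1 section, leaving "=" for the book title.
    HEADINGS = {"h1": "==", "h2": "===", "h3": "===="}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished AsciiDoc blocks
        self._tag = None   # block-level tag currently being collected
        self._text = []    # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <b> do not end the block
        text = " ".join("".join(self._text).split())  # collapse whitespace
        if text:
            prefix = self.HEADINGS.get(tag)
            self.blocks.append(f"{prefix} {text}" if prefix else text)
        self._tag = None


def html_to_asciidoc(html: str) -> str:
    """Return the headings and paragraphs of `html` as AsciiDoc text."""
    parser = TinyAsciiDoc()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks) + "\n"


if __name__ == "__main__":
    print(html_to_asciidoc("<h1>First Principles</h1><p>Think <b>clearly</b>.</p>"))
```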

### Compile directly from colusa

If asciidoctor tools are installed, you can compile the ebook in one step:

```bash
# Generate AsciiDoc and immediately compile to EPUB
$ colusa generate new_ebook.json --build epub

# Build multiple formats at once
$ colusa generate new_ebook.json --build epub --build html

# Compile an already-generated book (without re-downloading)
$ colusa build new_ebook.json --format epub
$ colusa build new_ebook.json           # builds html, epub, and pdf
```

colusa uses `asciidoctor`, `asciidoctor-epub3`, and `asciidoctor-pdf`. Install them from:
- HTML: https://asciidoctor.org
- EPUB: https://asciidoctor.org/docs/asciidoctor-epub3/
- PDF: https://asciidoctor.org/docs/asciidoctor-pdf/
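
A quick way to see which of these tools are available is to look them up on `PATH`. The sketch below uses `shutil.which` from the standard library; the helper name `missing_tools` and the format-to-executable mapping are assumptions based on the tool names listed above, not `colusa` internals.

```python
import shutil

# Assumption: one executable per output format, named as listed above.
TOOLS = {
    "html": "asciidoctor",
    "epub": "asciidoctor-epub3",
    "pdf": "asciidoctor-pdf",
}


def missing_tools(formats=("html", "epub", "pdf"), which=shutil.which):
    """Return {format: executable} for each requested format whose tool is not on PATH."""
    return {fmt: TOOLS[fmt] for fmt in formats if which(TOOLS[fmt]) is None}


if __name__ == "__main__":
    missing = missing_tools()
    for fmt, exe in missing.items():
        print(f"{fmt}: '{exe}' not found on PATH")
    if not missing:
        print("All asciidoctor tools found.")
```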

### Preview before generating

Use `--dry-run` to preview which extractor and transformer would be selected for each URL, without downloading anything or writing any files:

```bash
$ colusa generate new_ebook.json --dry-run
```

Example output:

```
[dry-run] Config: new_ebook.json
[dry-run] Output dir: fsblog/
[dry-run] Total URLs: 3

[1/3] https://staffeng.com/guides/overview
      Extractor  : StaffEng (plugin)
      Transformer: StaffEng (plugin)

[2/3] https://medium.com/@user/some-article
      Extractor  : Medium (plugin)
      Transformer: Transformer (base)

[3/3] https://unknown-site.com/article
      Extractor  : Extractor (base)
      Transformer: Transformer (base)
```

## Supporting Unsupported Websites

If a website is not in the supported list, you can define CSS-selector-based parsing rules directly in your book config without touching colusa's code. Dynamic rules are evaluated before built-in plugins; the first matching rule wins.

Add a `site_rules` list to your config:

```json
{
    "title": "My Ebook",
    "author": "Me",
    "version": "v1.0",
    "homepage": "https://example.com",
    "output_dir": "my_ebook",
    "site_rules": [
        {
            "pattern": "//example.com",
            "content": "article.post-body",
            "title": "h1.article-title",
            "author": ".author-name",
            "published": "time.publish-date",
            "cleanup": ["div.ads", "nav.sidebar"]
        }
    ],
    "urls": [
        "https://example.com/some-article"
    ]
}
```

| Field | Description |
|-------|-------------|
| `pattern` | Regex matched against the full URL (e.g. `//example.com`) |
| `content` | CSS selector for the article body. Falls back to built-in detection if omitted or not found |
| `title` | CSS selector for the article title. Falls back to built-in defaults if omitted or not found |
| `author` | CSS selector for the author name. Falls back to built-in defaults if omitted or not found |
| `published` | CSS selector for the publish date. Falls back to built-in defaults if omitted or not found |
| `cleanup` | List of CSS selectors — matching elements are removed from the extracted content |
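
The "first matching rule wins" behavior can be sketched as follows. This is an illustration under assumptions, not `colusa`'s actual code: `find_rule` is a hypothetical helper, and it uses `re.search` because a pattern such as `//example.com` could not match at the very start of a full URL.

```python
import re


def find_rule(url, rules):
    """Return the first rule whose `pattern` regex matches `url`, or None."""
    for rule in rules:
        # Assumption: the pattern may match anywhere in the URL (re.search).
        if re.search(rule["pattern"], url):
            return rule
    return None


rules = [
    {"pattern": "//example.com/blog/", "content": "div.blog-post"},
    {"pattern": "//example.com", "content": "article.post-body"},
]

# The more specific rule is listed first, so it wins for blog URLs.
print(find_rule("https://example.com/blog/hello", rules)["content"])    # div.blog-post
print(find_rule("https://example.com/some-article", rules)["content"])  # article.post-body
print(find_rule("https://other.org/page", rules))                       # None
```

Because `pattern` is a regular expression, an unescaped `.` matches any character; write `example\.com` if you need a strict match.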

### External rules file

Rules can also be kept in a separate YAML or JSON file and shared across multiple book configs:

```json
{
    "site_rules_file": "./my-sites.yml"
}
```

`my-sites.yml`:

```yaml
- pattern: "//example.com"
  content: article.post-body
  title: h1.article-title
  cleanup:
    - div.ads
    - nav.sidebar
```

Inline `site_rules` and `site_rules_file` are merged; inline rules are checked first. Relative paths in `site_rules_file` are resolved from the directory containing the config file.
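
The merge order and path resolution described above can be sketched like this. `load_site_rules` is a hypothetical helper, not `colusa`'s loader, and to stay within the standard library the sketch assumes the external rules file is JSON rather than YAML.

```python
import json
from pathlib import Path


def load_site_rules(config_path):
    """Return inline rules followed by the rules from `site_rules_file`, if set."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Inline rules come first, so they are checked before external ones.
    rules = list(config.get("site_rules", []))

    rules_file = config.get("site_rules_file")
    if rules_file:
        # A relative path is resolved from the directory containing the config;
        # an absolute path is used as-is.
        rules_path = config_path.parent / rules_file
        # Assumption: the rules file is JSON (a YAML file would need PyYAML).
        rules.extend(json.loads(rules_path.read_text(encoding="utf-8")))
    return rules
```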

## Compile the ebook for reading

### Prerequisites
Before generating an ebook, install the asciidoctor tools. Follow the installation guides on the following websites:

* for generating html: https://asciidoctor.org
* for generating epub: https://asciidoctor.org/docs/asciidoctor-epub3/
* for generating pdf: https://asciidoctor.org/docs/asciidoctor-pdf/

### Generating ebooks
To help with generating the ebook, `colusa` also creates a `Makefile` in the root folder of the ebook. The `Makefile` has three common targets for generating the ebook in HTML, EPUB, and PDF formats.

```bash
# to generate html
$ make html

# to generate epub
$ make epub

# to generate pdf
$ make pdf
```

Generated ebooks are saved to the `./output` folder.

```
user:output/ $ ls
total 3056
drwxr-xr-x  10 320B images
-rw-r--r--   1 699K index.epub
-rw-r--r--   1 131K index.html
-rw-r--r--@  1 694K index.pdf
```

## List of Supported Websites

Currently `colusa` has built-in support for the following websites. Any other site can be handled using [dynamic site rules](#supporting-unsupported-websites).

* https://untools.co
* https://unintendedconsequenc.es
* https://blog.acolyer.org
* https://fs.blog
* https://increment.com
* https://slack.engineering
* https://medium.com
* https://www.cs.rutgers.edu/~pxk/
* https://www.preethikasireddy.com
* https://engineering.atspotify.com
* https://truyenfull.vn
* https://avikdas.com
* https://www.infoq.com

## Contribution

Contributions are welcome. You can open issues to request support for more websites, open PRs to help with those issues, or contribute in other ways, such as documentation or code.
