Metadata-Version: 2.1
Name: ebook2text
Version: 1.1.0
Summary: Convert common book file types to text for machine learning
Author: Ashlynn Antrobus
Author-email: Ashlynn Antrobus <ashlynn@prosepal.io>
License: MIT
Project-URL: Repository, https://github.com/ashrobertsdragon/Ebook-conversion-to-Text-for-Machine-Learning
Classifier: License :: OSI Approved :: MIT License
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: annotated-types>=0.4.0
Requires-Dist: anyio<5,>3.5
Requires-Dist: beautifulsoup4>=4.12.3
Requires-Dist: certifi>=2024.2.2
Requires-Dist: cffi>=1.12
Requires-Dist: colorama>=0.4.6
Requires-Dist: cryptography>=42.0.7
Requires-Dist: distro<2,>=1.7
Requires-Dist: EbookLib==0.18
Requires-Dist: h11<=0.15,>=0.13
Requires-Dist: httpcore>1
Requires-Dist: httpx<1,>=0.23.0
Requires-Dist: idna<4,>2.8.0
Requires-Dist: lxml>3.1.0
Requires-Dist: openai>=1.30.1
Requires-Dist: pillow>=10.2.0
Requires-Dist: pycparser==2.22
Requires-Dist: pydantic<3,>1.9
Requires-Dist: pydantic-core==2.18.2
Requires-Dist: python-docx>=1.1.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: six==1.16.0
Requires-Dist: sniffio>=1.1.0
Requires-Dist: soupsieve>1.2
Requires-Dist: tqdm>=4.0.0
Requires-Dist: typing-extensions>=4.9.0


# Convert Ebook File

## Overview

This Python package converts common ebook file formats (EPUB, DOCX, PDF, TXT) into a standardized plain-text format. It processes each file, identifies chapters, and replaces chapter headers with asterisks. It also uses GPT-4o to perform OCR (Optical Character Recognition) on image-based text, and it standardizes the output by "desmartenizing" punctuation, that is, replacing smart quotes and similar characters with ASCII equivalents.
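
The chapter-marking step described above can be pictured with a minimal sketch. This is not the package's actual detection logic: the `***` marker, the `mark_chapters` name, and the two header shapes matched here (a line reading "Chapter X" or a bare number) are all assumptions chosen for illustration.

```python
import re

# Hypothetical marker: the README only says headers are replaced with asterisks.
CHAPTER_MARKER = "***"

# Assumed header shapes: a line that is just "Chapter <word-or-number>"
# (any capitalization) or a line that contains only digits.
_HEADER = re.compile(r"^\s*(?:chapter\s+\w+|\d+)\s*$", re.IGNORECASE)


def mark_chapters(text: str) -> str:
    """Replace every line that looks like a chapter header with CHAPTER_MARKER."""
    lines = text.splitlines()
    marked = [CHAPTER_MARKER if _HEADER.match(line) else line for line in lines]
    return "\n".join(marked)


sample = "Chapter 1\nIt was dark.\n2\nMorning came."
print(mark_chapters(sample))
```

A real detector would need to handle titled chapters, prologues, and numbers that appear on their own line for other reasons, so treat the pattern as a starting point only.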

## Features

- **File Format Support**: Handles EPUB, DOCX, PDF, and TXT formats.
- **Chapter Identification**: Detects and marks chapter breaks.
- **OCR Capability**: Extracts text from images using GPT-4o-based OCR.
- **Text Standardization**: Replaces smart punctuation with ASCII equivalents.
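
To illustrate the text standardization feature, here is a minimal sketch of replacing smart punctuation with ASCII. The `desmarten` name and the character mapping are assumptions for this example; the package's own mapping may cover more or different characters.

```python
# Illustrative mapping of common "smart" characters to ASCII (not exhaustive).
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(SMART_TO_ASCII)


def desmarten(text: str) -> str:
    """Return text with smart punctuation replaced by ASCII equivalents."""
    return text.translate(_TABLE)


print(desmarten("\u201cIt\u2019s fine\u2026\u201d"))
```

Using a translation table keeps the replacement to a single pass over the text, which matters for book-length input.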

## Requirements

To run this script, you need Python 3.8 or above and the following packages:

- `python-docx`
- `ebooklib`
- `openai`
- `python-dotenv`
- `beautifulsoup4`
- `pdfminer.six`
- `pillow`

## Usage

1. Ensure all dependencies are installed.
2. Set your environment variable for the OpenAI API key.
3. Place your ebook files in a known directory.
4. Run the conversion, passing the path to the ebook file and a metadata dictionary with the keys 'title' and 'author' as arguments.
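
The steps above might look like the following sketch. Several details are assumptions and should be checked against the installed package: the import path `ebook2text.convert_file`, the environment variable name `OPENAI_API_KEY` (the default read by the `openai` client), and the `run_conversion` wrapper itself, which is hypothetical.

```python
import os
from typing import Callable, Dict, Optional


def run_conversion(
    file_path: str,
    title: str,
    author: str,
    convert: Optional[Callable[[str, Dict[str, str]], str]] = None,
) -> str:
    """Build the metadata dictionary and hand it to the conversion function."""
    # Assumed variable name: the openai client reads OPENAI_API_KEY by default.
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")

    metadata = {"title": title, "author": author}

    if convert is None:
        # Assumed import path; adjust to match the installed package layout.
        from ebook2text import convert_file as convert

    return convert(file_path, metadata)


# Example call (requires the package and a real ebook file):
# text = run_conversion("books/my_novel.epub", "My Novel", "A. Writer")
```

The optional `convert` parameter exists only so the wrapper can be exercised without the package installed; in normal use it is left unset.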

## Functions

- `read_text_file(file: str) -> str`: Reads a text file and returns its content.
- `write_to_file(content: str, file: str)`: Writes content to a file.
- `convert_file(file_path: str, metadata: dict) -> str`: Main function to convert an ebook file to text.
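
For orientation, the two file helpers could be as simple as the sketch below. Only the signatures come from the list above; the bodies are assumptions, including UTF-8 encoding and the choice to overwrite rather than append. `demo_round_trip` is a hypothetical helper added for the example.

```python
import os
import tempfile


def read_text_file(file: str) -> str:
    """Read a text file and return its content (assumes UTF-8)."""
    with open(file, "r", encoding="utf-8") as f:
        return f.read()


def write_to_file(content: str, file: str) -> None:
    """Write content to a file, overwriting any existing file (an assumption)."""
    with open(file, "w", encoding="utf-8") as f:
        f.write(content)


def demo_round_trip() -> str:
    """Write a string to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "book.txt")
        write_to_file("Chapter text\n", path)
        return read_text_file(path)


print(repr(demo_round_trip()))
```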

## Contributing

Contributions to this project are welcome. Please ensure that your code follows the existing style for consistency.

## License

This project is licensed by ProsePal LLC under the MIT License.

## Version History

- **v0.1.0** (Release date: November 30, 2023)
  - Initial release

- **v0.1.1** (Release date: December 2, 2023)
  - Fixed false positives for `is_number`

- **v0.2.0** (Release date: December 3, 2023)
  - Conversion of docx files

- **v0.3.0** (Release date: December 8, 2023)
  - Conversion of PDF files

- **v0.3.1** (Release date: January 23, 2024)
  - Fixed concatenation of text in PDF conversion
  - Updated pillow to a secure version

- **v1.0.0** (Release date: January 23, 2024)
  - Created library instead of single module

- **v1.0.1** (Release date: March 13, 2024)
  - Fixed typos in setup.py and requirements.txt

- **v1.0.2** (Release date: May 17, 2024)
  - Added tests, fixed minor typos

- **v1.1.0** (Release date: May 30, 2024)
  - Change to abstract factory pattern
