Metadata-Version: 2.1
Name: epub_toc
Version: 1.0.0
Summary: A Python tool for extracting table of contents from EPUB files with hierarchical structure support
Home-page: https://github.com/almazilaletdinov/epub_toc
Author: Almaz Ilaletdinov
Author-email: a.ilaletdinov@yandex.ru
Project-URL: Documentation, https://github.com/almazilaletdinov/epub_toc/docs
Project-URL: Source, https://github.com/almazilaletdinov/epub_toc
Project-URL: Tracker, https://github.com/almazilaletdinov/epub_toc/issues
Keywords: epub,ebook,toc,table of contents,parser,ebook-tools
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: epub_meta>=0.0.7
Requires-Dist: lxml>=4.9.3
Requires-Dist: beautifulsoup4>=4.12.2
Requires-Dist: ebooklib>=0.18
Requires-Dist: tika>=2.6
Provides-Extra: dev
Requires-Dist: pytest>=7.4.3; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.12.0; extra == "dev"
Requires-Dist: black>=23.12.1; extra == "dev"
Requires-Dist: isort>=5.13.2; extra == "dev"
Requires-Dist: flake8>=7.0.0; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: types-beautifulsoup4>=4.12.0.7; extra == "dev"
Requires-Dist: types-lxml>=2023.10.21; extra == "dev"

# EPUB TOC

[![Python Version](https://img.shields.io/badge/python-3.7%2B-blue)](https://www.python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

A Python tool for extracting table of contents from EPUB files with hierarchical structure support.

## Features
- Multiple extraction methods support (NCX, epub_meta, OPF)
- Automatic best method selection
- Hierarchical TOC structure preservation
- Russian and English language support
- JSON output format
- Detailed logging
- EPUB file analysis reports

## Installation

```bash
pip install epub_toc
```

## Usage

### As a module

```python
from epub_toc import EPUBTOCParser

# Create parser
parser = EPUBTOCParser('path/to/book.epub')

# Extract TOC
toc = parser.extract_toc()

# Print to console
parser.print_toc()

# Save to JSON
parser.save_toc_to_json('output.json')
```

### From command line

```bash
epub-toc path/to/book.epub
```

### EPUB File Analysis

To analyze all EPUB files in tests/data/epub_samples directory:

```bash
python tests/integration/test_epub_analysis.py
```

Analysis results are saved in `reports/` directory:
- `epub_analysis_YYYYMMDD_HHMMSS.json` - detailed report in JSON format
- `epub_analysis_YYYYMMDD_HHMMSS.txt` - brief report in text format
- `toc/*.json` - extracted TOCs for each EPUB file

Report structure:
1. JSON report contains:
   - Overall statistics for all files
   - Extraction methods success rate
   - Detailed results for each file
   - Links to extracted TOC files

2. Text report includes:
   - Brief statistics
   - Information about each file
   - Paths to extracted TOCs

3. TOC files:
   - Saved in `toc/` subdirectory
   - Named as `book_name_toc.json`
   - Contain complete TOC in JSON format

## Output Format

TOC is saved in JSON format with the following structure:

```json
{
  "metadata": {
    "title": "Book Title",
    "authors": ["Author 1", "Author 2"],
    "publisher": "Publisher Name",
    "publication_date": "2024-01-01",
    "language": "en",
    "description": "Book description",
    "cover_image_path": "path/to/cover.jpg",
    "isbn": "978-3-16-148410-0",
    "rights": "Copyright information",
    "series": "Series Name",
    "series_index": 1,
    "identifiers": {
      "isbn13": "978-3-16-148410-0",
      "uuid": "550e8400-e29b-41d4-a716-446655440000"
    },
    "subjects": ["Fiction", "Adventure"],
    "file_size": 1234567,
    "file_name": "book.epub"
  },
  "toc": [
    {
      "title": "Chapter 1",
      "href": "chapter1.html",
      "level": 0,
      "children": [
        {
          "title": "Section 1.1",
          "href": "chapter1.html#section1",
          "level": 1,
          "children": []
        }
      ]
    }
  ]
}
```

All metadata fields are optional and will be omitted if not available in the EPUB file.

## Testing

The module has been successfully tested on various EPUB files:
- Russian books (NCX method)
- English books (epub_meta method)
- Files with different TOC structures
- Files of different sizes (from 400KB to 8MB)

## Requirements
- Python 3.7+
- epub_meta>=0.0.7
- lxml>=4.9.3
- beautifulsoup4>=4.12.2 

## Contributing

We welcome contributions! If you'd like to help:

1. Fork the repository
2. Create a branch for your changes
3. Make changes and add tests
4. Ensure all tests pass
5. Create a Pull Request

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

## Security

If you discover a security vulnerability, please DO NOT create a public issue.
Instead, send a report following the instructions in [SECURITY.md](SECURITY.md)

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) file for details.

## Roadmap

- [ ] Additional EPUB format support
- [ ] Improved complex hierarchical structure handling
- [ ] Integration with popular e-readers
- [ ] Web service API
- [ ] Additional language support 
