Metadata-Version: 2.4
Name: pdfx-tool
Version: 0.2.2
Summary: A local CLI tool for PDF and image manipulation - merge PDFs, split by page/range, convert images to PDF with enhancement
Home-page: https://pypi.org/project/pdfx-tool/
Author: Manoj Adhikari
Author-email: Manoj Adhikari <adhikarim@etsu.edu>
Maintainer-email: Manoj Adhikari <adhikarim@etsu.edu>
License: MIT
Project-URL: Homepage, https://pypi.org/project/pdfx-tool/
Project-URL: API Documentation, https://github.com/jonamadk/PDF-X/blob/main/API_DOCUMENTATION.md
Project-URL: Repository, https://github.com/jonamadk/PDF-X
Project-URL: Development, https://github.com/jonamadk/PDF-X/blob/main/DEVELOPMENT.md
Project-URL: Issues, https://github.com/jonamadk/PDF-X/issues
Keywords: pdf,cli,merge,split,filter,ocr,convert
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Utilities
Classifier: Topic :: Office/Business
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pypdf>=3.0.0
Requires-Dist: cryptography>=41.0.0
Provides-Extra: full
Requires-Dist: pymupdf>=1.22.0; extra == "full"
Requires-Dist: Pillow>=10.0.0; extra == "full"
Requires-Dist: pytesseract>=0.3.10; extra == "full"
Requires-Dist: pdf2image>=1.16.3; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# PdfX

PdfX is a local CLI tool for PDF and image manipulation that keeps your documents private on your own machine. No uploading to online services — everything happens locally! The "X" stands for utility functions: merge PDFs, split by pages or ranges, convert images to PDFs, and enhance scanned documents.

This README explains how to install and use the tool, with clear examples and available command-line options.

## Why use PdfX?
- **Privacy first**: Keep your documents private — everything happens locally on your machine
- **Comprehensive features**: Merge PDFs, split by pages/ranges, convert and enhance images
- **Scanned document enhancement**: Improve image and PDF quality with color enhancement filters
- **Lightweight**: Minimal dependencies for basic features, optional advanced features available
- **Script-friendly**: Perfect for automation and batch processing workflows

## Features
- **Merge PDFs**: Combine multiple PDF files from a directory into a single PDF document
- **Split PDF into pages**: Break a PDF into individual page files
- **Split PDF by page range**: Extract specific pages or page ranges (e.g., pages 1-3, 5, 7-10)
- **Convert image to PDF**: Turn a single image (JPEG, PNG, BMP, GIF, TIFF, WebP) into a PDF
- **Merge images to PDF**: Combine multiple images from a directory into a single PDF document
- **Image to PDF with enhancement**: Convert images to PDF with color enhancement for scanned document quality
- **Enhance PDF colors**: Improve existing PDFs with scanned document color enhancement
- **Organized outputs**: Results saved to `Merged_Doc/` and `Splitted_Docs/` directories for easy management
  

## Installation 🔧

### Install from PyPI (Recommended)

For basic PDF merge/split functionality:
```bash
pip install pdfx-tool
```

For full functionality (including color filtering, OCR, image processing):
```bash
pip install "pdfx-tool[full]"
```

For development:
```bash
pip install "pdfx-tool[dev]"
```

### Install from source

Clone the repository and install:
```bash
git clone https://github.com/jonamadk/PDF-X.git
cd PDF-X
pip install -e .
```

Or with full dependencies:
```bash
pip install -e ".[full]"
```

### System dependencies (optional)

For advanced features, you may need additional system tools:
- **Tesseract OCR** (for `--ocr`): `brew install tesseract` (macOS) or install from https://github.com/tesseract-ocr/tesseract
- **Poppler** (for some PDF imaging backends used by tools like `pdf2image`): `brew install poppler` (macOS)

_Note: Basic merge/split functionality works without these extras._

## Quick start ✨

After installation, use the `pdfx` command:

```bash
pdfx --help
```

Or run as a Python module:

```bash
python -m pdfx --help
```

### Command-line options (friendly explanations)
- `-m, --merge-dir PATH`  
  Directory containing PDFs to merge. If you omit this, the tool looks for `filesToMerge` next to `main.py`.

- `-o, --output NAME`  
  Output filename for the merged PDF (default: `Merged Document.pdf`). The file will be written to `<merge_dir_parent>/Merged_Doc/`.

- `--split`  
  Split each PDF located in the source directory into single-page PDFs. By default, the source directory used is `filesToSplit` next to `main.py`.

- `--split-dir PATH`  
  Use this directory as the source for `--split` instead of the default.

- `--split-file PATH`  
  Split a single PDF file into pages (or a page range). By default the output goes to `../Splitted_Docs/<file_stem>/`.

- `--split-out PATH`  
  Write split pages to this directory (overrides the default Splitted_Docs location).

- `--page-range "RANGE"`  
  When splitting, extract only the pages you want. Example: `--page-range "1-3,5"`. Ranges are 1-based and multiple segments are allowed.

- `--filter-file PATH`  
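  Input PDF for filtering. With `--color-filter`, writes a new PDF containing only the text spans that match the color (requires `pymupdf`). The same flag supplies the input file for `--recolor`, `--image-filter`, `--pdf-enhance`, `--ocr`, and `--dump-colors`.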

- `--color-filter COLOR`  
  Target color to filter. Accepts multiple formats:
  - Hex: `#RRGGBB` (e.g., `#FF0000`)
  - RGB: `R,G,B` (e.g., `255,0,0`)
  - Named: `red`, `blue`, `green`, `black`, `white`, `yellow`, `cyan`, `magenta`, `orange`, `purple`, `pink`, `brown`, `gray`, `lime`, `navy`, `teal`, `silver`, `maroon`, `olive`

- `--color-tolerance FLOAT`  
  Color distance tolerance (default `0.0`). A larger value allows fuzzy color matches.

- `--image-filter NAME`  
  Apply an image-level filter to every page in a PDF. Choices: `enhance`, `bw`, `grayscale`, `invert`, `auto`.

- `--image-strength FLOAT`  
  Strength or threshold for image filters. The meaning depends on the filter: for `enhance` it is a contrast factor (1.0 = no change); for `bw` it is the threshold (0–255, default 128).

- `--image-to-pdf`  
  Convert image file(s) to PDF. Use with `--image-file` for a single image or `--image-dir` for a directory of images.

- `--image-file PATH`  
  Single image file to convert to PDF (requires `--image-to-pdf`).

- `--image-dir PATH`  
  Directory containing images to convert to a single PDF. All supported image formats (JPEG, PNG, BMP, GIF, TIFF, WebP) are processed in sorted order (requires `--image-to-pdf`).

- `--image-out PATH`  
  Output path for the PDF generated from images (optional; defaults to `<image_name>.pdf` or `<dir_name>_to_pdf.pdf`).

- `--image-enhance TYPE`  
  Apply color enhancement to images during PDF conversion. Choices: `enhance` (contrast), `brightness`, `color` (saturation), `sharpness`, `auto` (autocontrast).

- `--image-enhance-strength FLOAT`  
  Strength of image enhancement. For contrast/brightness/color/sharpness: 1.0 = no change, < 1.0 = decrease, > 1.0 = increase. Default: 1.5. For `auto` enhancement, this parameter is ignored.
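
The default output locations described above can be sketched with `pathlib`. This is a minimal illustration, not the tool's actual code: the function names are hypothetical, and placing a converted image next to its source is an assumption.

```python
from pathlib import Path


def default_merge_output(merge_dir: Path, name: str = "Merged Document.pdf") -> Path:
    # Merged PDFs go to <merge_dir_parent>/Merged_Doc/<name>.
    return merge_dir.parent / "Merged_Doc" / name


def default_image_output(source: Path, is_dir: bool) -> Path:
    # A directory becomes <dir_name>_to_pdf.pdf; a single image becomes <image_name>.pdf.
    # Writing the result next to the source is an assumption of this sketch.
    if is_dir:
        return source.parent / f"{source.name}_to_pdf.pdf"
    return source.with_suffix(".pdf")


print(default_merge_output(Path("/path/to/pdfs"), "Final Report.pdf"))
print(default_image_output(Path("/path/to/scans"), is_dir=True))
print(default_image_output(Path("photo.jpg"), is_dir=False))
```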

## Image filter mapping (what each filter does) 🔧

- **enhance** — increases page contrast (useful for scanned docs to make text crisper). Default strength 1.5.
- **bw** — converts to pure black & white using a threshold; pass threshold via `--image-strength` (0–255).
- **grayscale** — converts pages to grayscale.
- **invert** — inverts colors (useful for turning light-on-dark into dark-on-light).
- **auto** — applies autocontrast.
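
To make the `enhance` strength and the `auto` filter more concrete, here is a pure-Python sketch that works on a short list of grayscale values. It assumes Pillow-style semantics (contrast blends each pixel away from the mean gray level; autocontrast stretches the darkest value to 0 and the brightest to 255). PdfX's internals may differ, and the helper names are illustrative only.

```python
from typing import List


def clamp(value: float) -> int:
    # Keep results inside the 8-bit range.
    return max(0, min(255, int(round(value))))


def contrast(pixels: List[int], factor: float) -> List[int]:
    # Factor 1.0 returns the input unchanged; factors above 1.0 push
    # values further away from the mean gray level.
    mean = sum(pixels) / len(pixels)
    return [clamp(mean + factor * (p - mean)) for p in pixels]


def autocontrast(pixels: List[int]) -> List[int]:
    # Stretch the darkest value to 0 and the brightest to 255.
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)
    return [clamp((p - lo) * 255 / (hi - lo)) for p in pixels]


faded = [100, 120, 140, 160]
print(contrast(faded, 1.0))   # [100, 120, 140, 160]
print(contrast(faded, 1.5))   # [85, 115, 145, 175]
print(autocontrast(faded))    # [0, 85, 170, 255]
```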

## Advanced features (recoloring, OCR, image-to-PDF, auto-enhancement, diagnostics) ✨

- **Auto PDF Enhancement (Smart Scanned Document Enhancement)** — `--pdf-enhance` automatically enhances scanned PDFs with intelligent defaults. Requires `--filter-file` for input. Choose enhancement mode with `--pdf-enhance-mode`:
  - `smart` (default, recommended): Applies contrast enhancement (1.5x) followed by autocontrast for optimal clarity
  - `contrast`: Standard contrast enhancement (1.5x) for moderately faded documents
  - `strong`: Heavy contrast enhancement (2.0x) for heavily faded or low-quality scans
  - `auto`: Autocontrast only for fine-tuning brightness levels
  - Example:
    ```bash
    pdfx --filter-file scanned.pdf --pdf-enhance
    # Creates: scanned_enhanced.pdf with smart auto-enhancement
    ```
  - Example (strong mode for heavily faded documents):
    ```bash
    pdfx --filter-file faded_scan.pdf --pdf-enhance --pdf-enhance-mode strong
    # Creates: faded_scan_enhanced.pdf with 2.0x contrast boost
    ```

- **Image-to-PDF conversion with color enhancement** — `--image-to-pdf` converts one or more images into a PDF document with optional color enhancement. Use `--image-file` for a single image or `--image-dir` to batch-convert all images in a directory. Apply enhancements with `--image-enhance` (contrast, brightness, color saturation, sharpness, or auto) and adjust strength with `--image-enhance-strength`. Perfect for scanned documents that need enhanced clarity.
  - Example (single image with contrast enhancement):
    ```bash
    pdfx --image-to-pdf --image-file scan.jpg --image-enhance enhance --image-enhance-strength 1.8 --image-out scan.pdf
    ```
  - Example (directory of scans with auto-enhancement):
    ```bash
    pdfx --image-to-pdf --image-dir /path/to/scans --image-enhance auto
    # Creates: /path/to/scans_to_pdf.pdf (with autocontrast applied to each image)
    ```
  - Example (increase color saturation):
    ```bash
    pdfx --image-to-pdf --image-file photo.jpg --image-enhance color --image-enhance-strength 1.3 --image-out photo.pdf
    ```

- **Shorthand flags** — Use `--enhance`, `--bw`, `--grayscale`, `--invert`, or `--auto` as convenient aliases for `--image-filter <name>`. When using `--bw` a sensible default threshold is set if you don't pass `--image-strength`.

- **Recolor vector text** — `--recolor COLOR` recolors matched vector text spans (requires `--color-filter` and `--filter-file`).
  - `--recolor-tolerance FLOAT` to allow fuzzy matches.
  - `--recolor-replace` will attempt to replace underlying text; use cautiously as it may affect layout.
  - Example:
    ```bash
    pdfx --filter-file filesToMerge/1.pdf --color-filter '255,0,0' --recolor '#0000FF' --filter-out out_recolor.pdf
    ```

- **OCR (make searchable)** — `--ocr` produces a searchable PDF by OCR-ing each page (requires Tesseract + `pytesseract`). Use `--ocr-lang` to set language(s).
  - Example:
    ```bash
    pdfx --filter-file filesToMerge/1.pdf --ocr --ocr-lang eng --filter-out out_ocr.pdf
    ```

- **Diagnostics** — `--dump-colors` prints detected text-span colors and samples to help you pick a `--color-filter`.
  - Example: `pdfx --filter-file filesToMerge/1.pdf --dump-colors`

_Note: Recolor overlays recolored text to preserve layout; image filters rasterize pages and may make text unsearchable. Pick the approach that suits your needs._

## Examples 

### Image to PDF Conversion

- **Convert a single image to PDF:**
  ```bash
  pdfx --image-to-pdf --image-file photo.jpg --image-out photo.pdf
  ```

- **Convert and enhance a scanned document (increase contrast):**
  ```bash
  pdfx --image-to-pdf --image-file scan.jpg --image-enhance enhance --image-enhance-strength 1.8 --image-out scan_enhanced.pdf
  ```

- **Auto-enhance multiple scans in a folder:**
  ```bash
  pdfx --image-to-pdf --image-dir /path/to/scans --image-enhance auto
  # Creates: /path/to/scans_to_pdf.pdf with autocontrast applied
  ```

- **Restore color vibrancy to a faded photo:**
  ```bash
  pdfx --image-to-pdf --image-file old_photo.jpg --image-enhance color --image-enhance-strength 1.3 --image-out photo_restored.pdf
  ```

- **Increase sharpness for blurry images:**
  ```bash
  pdfx --image-to-pdf --image-file blurry.jpg --image-enhance sharpness --image-enhance-strength 1.6 --image-out sharp.pdf
  ```

- **Brighten dark/underexposed images:**
  ```bash
  pdfx --image-to-pdf --image-file dark_photo.jpg --image-enhance brightness --image-enhance-strength 1.2 --image-out bright.pdf
  ```

- **Convert all images from a folder (no enhancement):**
  ```bash
  pdfx --image-to-pdf --image-dir /path/to/images
  # Creates: /path/to/images_to_pdf.pdf
  ```
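
For readers who want to see what a directory conversion involves, the following sketch collects the supported image formats in sorted order and writes them to one PDF with Pillow. It is an approximation under stated assumptions, not PdfX's implementation: the function names and the extension set are illustrative.

```python
from pathlib import Path
from typing import List

# Extensions for the supported formats (JPEG, PNG, BMP, GIF, TIFF, WebP).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp"}


def collect_images(image_dir: Path) -> List[Path]:
    # Supported images only, in sorted order.
    return sorted(p for p in image_dir.iterdir()
                  if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)


def images_to_pdf(image_dir: Path, output: Path) -> None:
    # Pillow is imported lazily because it belongs to the optional `full` extra.
    from PIL import Image

    paths = collect_images(image_dir)
    if not paths:
        raise FileNotFoundError(f"no supported images in {image_dir}")
    # Pillow's PDF writer rejects some modes such as RGBA, so normalize to RGB.
    pages = [Image.open(p).convert("RGB") for p in paths]
    pages[0].save(output, "PDF", save_all=True, append_images=pages[1:])
```

Calling `images_to_pdf(Path("/path/to/images"), Path("/path/to/images_to_pdf.pdf"))` would mirror the last example above.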

### Merging PDFs

- **Merge all PDFs in a directory into one file:**
  ```bash
  pdfx -m /path/to/pdfs -o "Final Report.pdf"
  # Creates: /path/to/Merged_Doc/Final Report.pdf
  ```

- **Merge using default `filesToMerge` directory:**
  ```bash
  pdfx
  # Creates: Merged_Doc/Merged Document.pdf
  ```

- **Merge and specify custom output name:**
  ```bash
  pdfx -m documents/ -o "Combined.pdf"
  ```
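
The merge step can be approximated in a few lines with `pypdf`, which is a declared dependency of the package. This is a hedged sketch rather than the tool's source: the function names are hypothetical, and processing files in sorted order is an assumption.

```python
from pathlib import Path
from typing import List


def collect_pdfs(merge_dir: Path) -> List[Path]:
    # Sorted order is an assumption of this sketch; it keeps the result predictable.
    return sorted(p for p in merge_dir.iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def merge_pdfs(merge_dir: Path, name: str = "Merged Document.pdf") -> Path:
    # Imported lazily so collect_pdfs can be used without pypdf installed.
    from pypdf import PdfWriter

    out_dir = merge_dir.resolve().parent / "Merged_Doc"
    out_dir.mkdir(parents=True, exist_ok=True)
    writer = PdfWriter()
    for pdf in collect_pdfs(merge_dir):
        writer.append(str(pdf))  # appends every page of the file
    output = out_dir / name
    with open(output, "wb") as fh:
        writer.write(fh)
    return output
```

Note that plain sorting is lexicographic, so in this sketch `10.pdf` comes before `2.pdf`; zero-padded names such as `02.pdf` avoid surprises.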

### Splitting PDFs

- **Split all PDFs in a directory into single-page files:**
  ```bash
  pdfx --split
  # Creates: Splitted_Docs/<pdf_name>/page_1.pdf, page_2.pdf, ...
  ```

- **Split a single PDF file:**
  ```bash
  pdfx --split-file report.pdf
  # Creates: Splitted_Docs/report/report_page_1.pdf, report_page_2.pdf, ...
  ```

- **Extract specific pages (pages 1-3 and 5):**
  ```bash
  pdfx --split-file document.pdf --page-range "1-3,5"
  # Creates: Splitted_Docs/document/document_page_1.pdf, document_page_2.pdf, document_page_3.pdf, document_page_5.pdf
  ```

- **Extract single page:**
  ```bash
  pdfx --split-file document.pdf --page-range 10
  # Creates: Splitted_Docs/document/document_page_10.pdf
  ```

- **Split and save to custom directory:**
  ```bash
  pdfx --split-file document.pdf --split-out /custom/output/path
  # Creates: /custom/output/path/document_page_1.pdf, ...
  ```
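
A `--page-range` value such as `"1-3,5"` expands to a list of 1-based page numbers. The sketch below shows one way to do that expansion; `parse_page_range` is a hypothetical helper, and the validation rules (rejecting out-of-bounds or reversed segments) are assumptions rather than documented PdfX behavior.

```python
from typing import List


def parse_page_range(spec: str, page_count: int) -> List[int]:
    """Expand a spec such as "1-3,5" into 1-based page numbers."""
    pages: List[int] = []
    for segment in spec.split(","):
        segment = segment.strip()
        if not segment:
            raise ValueError(f"empty segment in {spec!r}")
        start_text, sep, end_text = segment.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"segment {segment!r} is outside 1..{page_count}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_range("1-3,5", page_count=10))  # [1, 2, 3, 5]
print(parse_page_range("10", page_count=10))     # [10]
```

Overlapping segments are kept as given in this sketch, so `"1-2,2"` yields page 2 twice.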

### Color Filtering (Extract Colored Text)

- **Extract only red text from a PDF:**
  ```bash
  pdfx --filter-file document.pdf --color-filter "#FF0000" --filter-out red_text_only.pdf
  # Or use RGB: --color-filter "255,0,0"
  ```

- **Extract text with fuzzy color matching (tolerance):**
  ```bash
  pdfx --filter-file document.pdf --color-filter "#FF0000" --color-tolerance 30 --filter-out similar_red.pdf
  ```

- **Find what colors are in your PDF:**
  ```bash
  pdfx --filter-file document.pdf --dump-colors
  # Prints all detected text colors and samples
  ```
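
The three accepted color formats and the idea of a tolerance can be illustrated as follows. This is only a sketch: the helper names are made up, the named-color table is a small subset with the usual web values, and Euclidean distance on a 0-255 scale is an assumption, so the tool's actual metric and scale may differ.

```python
import math
from typing import Tuple

RGB = Tuple[int, int, int]

# A small subset of the named colors, using the usual web values.
NAMED = {"red": (255, 0, 0), "blue": (0, 0, 255), "yellow": (255, 255, 0),
         "black": (0, 0, 0), "white": (255, 255, 255)}


def parse_color(text: str) -> RGB:
    text = text.strip().lower()
    if text in NAMED:
        return NAMED[text]
    if text.startswith("#") and len(text) == 7:
        return (int(text[1:3], 16), int(text[3:5], 16), int(text[5:7], 16))
    parts = text.split(",")
    if len(parts) == 3:
        r, g, b = (int(p) for p in parts)
        if all(0 <= c <= 255 for c in (r, g, b)):
            return (r, g, b)
    raise ValueError(f"unrecognized color: {text!r}")


def matches(color: RGB, target: RGB, tolerance: float = 0.0) -> bool:
    # Tolerance 0.0 demands an exact match; larger values accept nearby colors.
    return math.dist(color, target) <= tolerance


target = parse_color("#FF0000")
print(matches((240, 10, 10), target))                # False
print(matches((240, 10, 10), target, tolerance=30))  # True
```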

### Recoloring Text

- **Recolor all red text to blue (preserves layout):**
  ```bash
  pdfx --filter-file document.pdf --color-filter "#FF0000" --recolor "#0000FF" --filter-out recolored.pdf
  ```

- **Recolor with fuzzy matching and replacement:**
  ```bash
  pdfx --filter-file document.pdf --color-filter "255,0,0" --recolor "0,0,255" --recolor-tolerance 20 --recolor-replace --filter-out recolored.pdf
  ```

### Page-Level Image Filters (applies to entire page raster)

#### Automatic PDF Enhancement (Smart Scanned Document Enhancement)

- **Auto-enhance a scanned PDF with smart defaults (recommended):**
  ```bash
  pdfx --filter-file scanned.pdf --pdf-enhance
  # Creates: scanned_enhanced.pdf
  # Applies: contrast boost (1.5x) + autocontrast for optimal clarity
  ```

- **Auto-enhance with strong mode (for heavily faded documents):**
  ```bash
  pdfx --filter-file heavily_faded.pdf --pdf-enhance --pdf-enhance-mode strong
  # Creates: heavily_faded_enhanced.pdf with 2.0x contrast boost
  ```

- **Auto-enhance with standard contrast (moderate fading):**
  ```bash
  pdfx --filter-file scanned.pdf --pdf-enhance --pdf-enhance-mode contrast
  # Creates: scanned_enhanced.pdf with 1.5x contrast only
  ```

- **Auto-enhance with autocontrast only (fine-tuning brightness):**
  ```bash
  pdfx --filter-file scanned.pdf --pdf-enhance --pdf-enhance-mode auto
  # Creates: scanned_enhanced.pdf with automatic brightness adjustment
  ```

- **Complete workflow: Auto-enhance then make searchable:**
  ```bash
  # Step 1: Auto-enhance the scanned PDF
  pdfx --filter-file raw_scan.pdf --pdf-enhance --filter-out enhanced.pdf
  
  # Step 2: Make it searchable with OCR
  pdfx --filter-file enhanced.pdf --ocr --filter-out searchable.pdf
  # Creates fully enhanced and searchable PDF
  ```
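
The four `--pdf-enhance-mode` values can be summarized as ordered processing steps. The table below restates the documented factors in code form; the names `ENHANCE_MODES`, `plan_enhancement`, and `default_enhanced_name` are hypothetical and only illustrate the mapping.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[float]]

# Each mode maps to an ordered list of (operation, factor) steps.
# A factor of None means the operation takes no strength.
ENHANCE_MODES = {
    "smart": [("contrast", 1.5), ("autocontrast", None)],
    "contrast": [("contrast", 1.5)],
    "strong": [("contrast", 2.0)],
    "auto": [("autocontrast", None)],
}


def plan_enhancement(mode: str = "smart") -> List[Step]:
    if mode not in ENHANCE_MODES:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(ENHANCE_MODES)}")
    return list(ENHANCE_MODES[mode])


def default_enhanced_name(filename: str) -> str:
    # scanned.pdf -> scanned_enhanced.pdf, matching the "Creates:" comments above.
    stem, dot, suffix = filename.rpartition(".")
    return f"{stem}_enhanced.{suffix}" if dot else f"{filename}_enhanced"


print(plan_enhancement())                    # smart: contrast 1.5, then autocontrast
print(default_enhanced_name("scanned.pdf"))  # scanned_enhanced.pdf
```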

#### PDF Enhancement & Quality Improvements (Manual Control)

- **Increase contrast for scanned PDFs (light enhancement):**
  ```bash
  pdfx --filter-file scanned.pdf --enhance --image-strength 1.3 --filter-out enhanced_light.pdf
  # 1.3 = 30% contrast increase; good for slightly faded scans
  ```

- **Increase contrast for scanned PDFs (medium enhancement):**
  ```bash
  pdfx --filter-file scanned.pdf --enhance --image-strength 1.6 --filter-out enhanced_medium.pdf
  # 1.6 = 60% contrast increase; standard for most scanned documents
  ```

- **Increase contrast for heavily faded documents (strong enhancement):**
  ```bash
  pdfx --filter-file faded.pdf --enhance --image-strength 2.0 --filter-out enhanced_strong.pdf
  # 2.0 = double the contrast; use for very faded or low-quality scans
  ```

- **Apply auto-contrast (automatic optimization):**
  ```bash
  pdfx --filter-file scanned.pdf --auto --filter-out auto_enhanced.pdf
  # Automatically adjusts contrast for optimal visibility
  ```

#### Color Processing

- **Convert to grayscale (save space, remove colors):**
  ```bash
  pdfx --filter-file colored.pdf --grayscale --filter-out gray.pdf
  # Useful for reducing file size and printing in black & white
  ```

- **Convert PDF pages to black & white (with threshold):**
  ```bash
  pdfx --filter-file document.pdf --bw --image-strength 128 --filter-out bw.pdf
  # Threshold 128 (middle gray) - adjust 0-255 for darker/lighter cutoff
  ```

- **Convert with darker threshold (more text retained):**
  ```bash
  pdfx --filter-file document.pdf --bw --image-strength 100 --filter-out bw_darker.pdf
  # Lower threshold (e.g., 100) keeps more dark areas
  ```

- **Convert with lighter threshold (cleaner appearance):**
  ```bash
  pdfx --filter-file document.pdf --bw --image-strength 150 --filter-out bw_lighter.pdf
  # Higher threshold (e.g., 150) produces cleaner white backgrounds
  ```

- **Invert colors (light → dark, dark → light):**
  ```bash
  pdfx --filter-file document.pdf --invert --filter-out inverted.pdf
  # Useful for converting dark background documents to light backgrounds
  ```

#### Enhancement Workflow Examples

- **Complete scanning workflow (enhance + OCR):**
  ```bash
  # Step 1: Enhance contrast
  pdfx --filter-file raw_scan.pdf --enhance --image-strength 1.8 --filter-out scan_enhanced.pdf
  
  # Step 2: Make searchable with OCR
  pdfx --filter-file scan_enhanced.pdf --ocr --filter-out scan_searchable.pdf
  ```

- **Batch enhance all PDFs in a folder:**
  ```bash
  # Enhance every PDF in ./documents, writing <name>_enhanced.pdf next to each
  for pdf in ./documents/*.pdf; do
    pdfx --filter-file "$pdf" --enhance --image-strength 1.5 --filter-out "${pdf%.pdf}_enhanced.pdf"
  done
  ```

- **Compare different enhancement strengths:**
  ```bash
  # Light
  pdfx --filter-file scan.pdf --enhance --image-strength 1.2 --filter-out scan_light.pdf
  
  # Medium
  pdfx --filter-file scan.pdf --enhance --image-strength 1.5 --filter-out scan_medium.pdf
  
  # Strong
  pdfx --filter-file scan.pdf --enhance --image-strength 1.8 --filter-out scan_strong.pdf
  # Then compare the three versions to find the best result
  ```

- **Convert faded document to clean black & white:**
  ```bash
  pdfx --filter-file faded_doc.pdf --bw --image-strength 120 --filter-out clean.pdf
  # Removes gray backgrounds and produces crisp text
  ```

### OCR (Make Searchable)

- **Make a scanned PDF searchable:**
  ```bash
  pdfx --filter-file scanned.pdf --ocr --filter-out searchable.pdf
  # Requires: Tesseract installed
  ```

- **OCR with specific language:**
  ```bash
  pdfx --filter-file french_scan.pdf --ocr --ocr-lang fra --filter-out searchable_fra.pdf
  ```

- **OCR with multiple languages:**
  ```bash
  pdfx --filter-file multilang.pdf --ocr --ocr-lang "eng+fra+deu" --filter-out searchable_multi.pdf
  ```

### Complex Workflows

- **Scan → Enhance → Extract Pages → Merge:**
  ```bash
  # Step 1: Convert scans to PDF with enhancement
  pdfx --image-to-pdf --image-dir ./scans --image-enhance enhance --image-enhance-strength 1.8 --image-out scans.pdf

  # Step 2: Extract pages 1-10
  pdfx --split-file scans.pdf --page-range 1-10 --split-out ./extracted_pages

  # Step 3: Merge extracted pages with other PDFs
  pdfx -m ./extracted_pages -o "Final.pdf"
  ```

- **Process batch of documents (merge → enhance → OCR):**
  ```bash
  # Merge all project PDFs
  pdfx -m ./documents -o "Project.pdf"

  # Enhance for better readability
  pdfx --filter-file Merged_Doc/Project.pdf --enhance --image-strength 1.5 --filter-out Project_Enhanced.pdf

  # Make searchable
  pdfx --filter-file Project_Enhanced.pdf --ocr --filter-out Project_Searchable.pdf
  ```


## Exit codes
- `0` — success (merge or split completed)
- `1` — directory or file not found
- `2` — invalid input or processing error
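
When driving `pdfx` from a script, these exit codes make it possible to stop a multi-step workflow at the first failure. The sketch below uses only the standard library; `describe_exit` and `run_steps` are hypothetical helper names, not part of PdfX.

```python
import subprocess
from typing import List

EXIT_MEANINGS = {
    0: "success",
    1: "directory or file not found",
    2: "invalid input or processing error",
}


def describe_exit(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_steps(steps: List[List[str]]) -> int:
    # Run each command in order and stop at the first non-zero exit code.
    for args in steps:
        code = subprocess.run(args).returncode
        if code != 0:
            print(f"{' '.join(args)} failed: {describe_exit(code)}")
            return code
    return 0


# Example (requires pdfx on PATH), mirroring the merge-then-OCR workflow:
# run_steps([
#     ["pdfx", "-m", "./documents", "-o", "Project.pdf"],
#     ["pdfx", "--filter-file", "Merged_Doc/Project.pdf", "--ocr",
#      "--filter-out", "Project_Searchable.pdf"],
# ])
```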

## Troubleshooting & tips 
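- If you see a pypdf error about AES, the `cryptography` package is missing from your environment. It is a declared dependency of `pdfx-tool`, so reinstalling the package or running `pip install cryptography` should restore it.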
- The tool will create `Merged_Doc` and `Splitted_Docs` directories as needed.
- Filenames with spaces are supported; use quotes or escape spaces in shell commands.

## Contributing & Feedback 🤝
Contributions, bug reports, and improvements are very welcome. Please open an issue or submit a pull request with tests and a short description of your change.

## License
This project is released under the MIT License; see the `LICENSE` file for the full text. Feel free to use and adapt it for your personal or internal projects.


