Metadata-Version: 2.4
Name: textextraction
Version: 0.1.3
Summary: Extract and process text from images and PDFs
Author-email: Nikhil K Singh <nsr.nikhilsingh@gmail.com>
License-Expression: MIT
Project-URL: Documentation, https://textextraction.readthedocs.io/
Project-URL: Bug Reports, https://github.com/Nikhil-K-Singh/textextraction/issues
Project-URL: Source Code, https://github.com/Nikhil-K-Singh/textextraction
Keywords: ocr,text extraction,pdf,image processing,document processing,table detection
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: nltk<4.0,>=3.9
Requires-Dist: Pillow<12.0,>=11.0
Requires-Dist: pdfminer.six>=20221105
Requires-Dist: pytesseract>=0.3.0
Requires-Dist: pdf2image<2.0,>=1.17.0
Requires-Dist: regex>=2024.0
Requires-Dist: easyocr>=1.7.0
Requires-Dist: pymupdf>=1.25.0
Requires-Dist: opencv-python>=4.9.0
Requires-Dist: numpy>=1.24.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: sphinx>=7.0.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.25.2; extra == "docs"
Requires-Dist: sphinxcontrib-napoleon>=0.7; extra == "docs"
Requires-Dist: docutils>=0.20; extra == "docs"
Dynamic: license-file

# textextraction

A Python library for extracting text from images and PDFs, with a choice of OCR engine (Tesseract or EasyOCR) and built-in table detection.

## Features

- Extract text from images and PDFs
- Support for multiple OCR engines (Tesseract and EasyOCR)
- Table detection and extraction, including from scanned documents
- Support for mixed content (text and tables) on the same page
- Markdown formatting for output
- Filtering of non-English words

## Installation

```bash
pip install textextraction
```

### Requirements

- Python 3.8+
- For Tesseract support:
  - Tesseract OCR installed on your system
  - For macOS: `brew install tesseract`
  - For Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
  - For Windows: [Download and install from here](https://github.com/UB-Mannheim/tesseract/wiki)
- For EasyOCR support:
  - No additional installation required - included in package dependencies
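- For PDF support:
  - PyMuPDF (installed automatically)
  - OpenCV (installed automatically)
  - pdf2image (installed automatically) requires the Poppler utilities to be installed on your system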

## Basic Usage

### Extracting Text from an Image

```python
from textextraction import ImageText

# Initialize with default engine (EasyOCR)
processor = ImageText()

# Process an image and save to markdown file
processor.process_image(
    image_path="path/to/your/image.jpg",
    output_path="output.md"
)

# Or just get the extracted text
text = processor.extract_from_image("path/to/your/image.jpg")
print(text)
```

### Extracting Text from a Scanned PDF

```python
from textextraction import ScannedPdfText

# Initialize with Tesseract engine (EasyOCR is default)
processor = ScannedPdfText(ocr_engine="tesseract")

# Process a scanned PDF and save to markdown file
processor.process_pdf(
    pdf_path="path/to/your/scanned.pdf",
    output_path="output.md"
)
```

### Extracting Text from a Regular (Non-Scanned) PDF

```python
from textextraction import PdfText

# Initialize
processor = PdfText()

# Process PDF and save to markdown
processor.process_pdf(
    pdf_path="path/to/your/document.pdf",
    output_path="output.md"
)
```

## Advanced Usage

### Working with Tables

The library automatically detects and processes tables in documents, converting them to Markdown format:

```python
from textextraction import ScannedPdfText

# Initialize with table detection enabled (default)
processor = ScannedPdfText(table_detection=True)

# Process PDF with tables
processor.process_pdf(
    pdf_path="path/to/pdf_with_tables.pdf",
    output_path="tables_output.md"
)
```

Table detection uses EasyOCR by default because it provides better accuracy on tabular content. Selecting Tesseract as the main OCR engine does not change this: tables are still processed with EasyOCR.
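
To make the Markdown conversion concrete, here is a minimal sketch of turning detected table cells into a Markdown table. It is not the library's implementation: the function name `rows_to_markdown` and the input shape (a list of rows, each a list of cell strings, with the first row treated as the header) are assumptions made for illustration.

```python
from typing import List, Sequence


def rows_to_markdown(rows: Sequence[Sequence[str]]) -> str:
    """Render rows of cell text as a Markdown table (hypothetical helper).

    The first row is treated as the header row.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[str]) -> str:
        # Escape pipes so cell text cannot break the table, then pad short rows.
        cells = [str(c).strip().replace("|", "\\|") for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    separator = "| " + " | ".join(["---"] * width) + " |"
    lines: List[str] = [fmt(rows[0]), separator]
    lines.extend(fmt(row) for row in rows[1:])
    return "\n".join(lines)
```

Calling `rows_to_markdown([["Item", "Qty"], ["Apple", "3"]])` yields a header line, a `---` separator line, and one data line. Padding short rows keeps the column count consistent when OCR misses a cell.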

### Processing Specific Page Ranges

```python
from textextraction import ScannedPdfText

processor = ScannedPdfText()

# Process only pages 2-4
processor.process_pdf(
    pdf_path="path/to/document.pdf",
    output_path="output.md",
    start_page=2,
    end_page=4
)
```

### Filtering Non-English Words

```python
from textextraction import ScannedPdfText

# Enable word filtering during initialization
processor = ScannedPdfText(filter_words=True)

# Alternatively, pass the flag for a single process_pdf call
processor.process_pdf(
    pdf_path="path/to/document.pdf",
    output_path="output.md",
    filter_words=True
)
```
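
This README does not describe how filtering decides which words to keep. The package depends on `nltk`, which suggests a corpus-backed word list, but that is an assumption. The sketch below illustrates the general idea with a caller-supplied vocabulary; `filter_known_words` is a hypothetical helper, not part of the `textextraction` API.

```python
import re
from typing import Iterable, Set


def filter_known_words(text: str, vocabulary: Iterable[str]) -> str:
    """Drop tokens whose alphabetic core is not in `vocabulary` (illustrative only)."""
    known: Set[str] = {w.lower() for w in vocabulary}
    kept = []
    for token in text.split():
        # Strip leading/trailing non-letters before the lookup, e.g. "cat," -> "cat".
        core = re.sub(r"^[^A-Za-z]+|[^A-Za-z]+$", "", token).lower()
        # Tokens with no letters (numbers, symbols) pass through unchanged.
        if not core or core in known:
            kept.append(token)
    return " ".join(kept)
```

With the vocabulary `{"the", "cat", "sat"}`, the input `"The qzxv cat, sat 42 xkcdq."` is reduced to `"The cat, sat 42"`: OCR noise such as `qzxv` is removed, while punctuation attached to known words and standalone numbers survive.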

### Adding Page Numbers

```python
from textextraction import ScannedPdfText

# Enable page numbering
processor = ScannedPdfText(add_page_number=True)

# Process the document
processor.process_pdf(
    pdf_path="path/to/document.pdf",
    output_path="output.md"
)
```

## OCR Engine Options

### Tesseract
- Good for general text extraction
- Faster initialization than EasyOCR
- Requires Tesseract to be installed on the system (see Requirements)

### EasyOCR (Default)
- Better multilingual support
- Superior table detection
- Works well with complex layouts
- Slower to initialize than Tesseract, in exchange for better accuracy
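
Because Tesseract needs a system installation while EasyOCR does not, one option is to select Tesseract only when its binary is actually on `PATH` and otherwise leave the constructor on its default engine. The helper below is a sketch under that assumption; `ocr_engine_kwargs` is a hypothetical name, and the only engine value it uses is `"tesseract"`, as shown in Basic Usage.

```python
import shutil
from typing import Callable, Dict, Optional


def ocr_engine_kwargs(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, str]:
    """Return constructor kwargs: Tesseract if its binary is found, else none.

    `which` is injectable so the lookup can be replaced in tests.
    """
    if which("tesseract"):
        return {"ocr_engine": "tesseract"}
    # An empty dict leaves the processor on its default engine (EasyOCR).
    return {}
```

It could then be used as `ScannedPdfText(**ocr_engine_kwargs())`. This favors Tesseract's faster startup when it is available; if table quality or complex layouts matter more, skipping the helper and keeping the EasyOCR default is the simpler choice.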

## License

This project is licensed under the MIT License - see the LICENSE file for details.
