Metadata-Version: 2.4
Name: unPDF
Version: 1.0.0rc3
Summary: Quickly extract text characters and character metadata from pdfs using pdfium.
Author-email: Laurens Janssen <digi-deity@laurens.xyz>
License: Apache-2.0
Project-URL: Repository, https://github.com/digi-deity/unPDF
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# unPDF

A lightweight Python library for extracting text characters and metadata from PDF files using PDFium. This library provides detailed character-level information including positions, fonts, colors, and transformation matrices without external dependencies.

## Features

- **Character-level extraction**: Get individual characters with precise positioning
- **Font metadata**: Extract font information, sizes, and styling
- **Color information**: Access text colors (RGBA values)
- **Transformation matrices**: Retrieve text transformation data
- **Page boundaries**: Get page dimensions and bounding boxes
- **Zero dependencies**: Uses PDFium directly with no external requirements
- **PyArrow integration**: Optional PyArrow table output for data analysis

## Installation

```bash
pip install unPDF
```

## Basic Usage

### Simple Character Extraction

```python
from unpdf import extract

# Extract data from PDF
pages, chars, text_objs, fonts = extract("document.pdf")

# Access character data
print(f"Total characters: {len(chars.arrays['char'])}")
print(f"Total text objects: {len(text_objs.arrays['txt_obj_id'])}")
print(f"Total fonts: {len(fonts.arrays['font_obj_id'])}")

# Get page information
print(f"Page width: {pages.arrays['width'][0]}")
print(f"Page height: {pages.arrays['height'][0]}")
```

### Reconstructing Text

```python
from unpdf import extract

_, chars, _, _ = extract("document.pdf")

# Convert characters to string
text = ''.join(chr(c) for c in chars.arrays['char'])
print(text)
```
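
Because `CharTable` also carries a `page` field, the same idea extends to per-page text. The following is a minimal sketch, not part of the library: `text_by_page` is a hypothetical helper, and the `sample` dictionary is made-up stand-in data that assumes `chars.arrays` behaves like a mapping from field name to equally long sequences.

```python
from collections import defaultdict


def text_by_page(arrays):
    """Group character codes by page number and join each group into a string."""
    pages = defaultdict(list)
    for page, code in zip(arrays["page"], arrays["char"]):
        pages[page].append(chr(code))
    return {page: "".join(parts) for page, parts in pages.items()}


# Hypothetical stand-in for chars.arrays (not taken from a real PDF)
sample = {"page": [0, 0, 1, 1, 1], "char": [72, 105, 66, 121, 101]}
print(text_by_page(sample))
```

With real data, the call would presumably be `text_by_page(chars.arrays)`.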

### Character Positioning

```python
from unpdf import extract

_, chars, _, _ = extract("document.pdf")

# Access character positions
for i in range(len(chars.arrays['char'])):
    char = chr(chars.arrays['char'][i])
    left = chars.arrays['left'][i]
    top = chars.arrays['top'][i]
    right = chars.arrays['right'][i]
    bottom = chars.arrays['bottom'][i]

    print(f"'{char}' at ({left:.1f}, {top:.1f}) - ({right:.1f}, {bottom:.1f})")
```
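
Bounding boxes can also be used to split the character stream into lines. The sketch below is one possible approach under stated assumptions, not the library's own method: `group_into_lines` and its `tolerance` parameter are hypothetical, the sample data is invented, and characters are assumed to arrive in reading order. A new line starts whenever the page changes or the vertical midpoint of a character moves by more than `tolerance` units.

```python
def group_into_lines(arrays, tolerance=2.0):
    """Split characters into lines based on page and vertical midpoint."""
    lines = []
    current = []          # characters of the line being built
    current_key = None    # (page, vertical midpoint) of that line's first character
    columns = zip(arrays["page"], arrays["char"], arrays["top"], arrays["bottom"])
    for page, code, top, bottom in columns:
        mid = (top + bottom) / 2
        moved = current_key is not None and (
            page != current_key[0] or abs(mid - current_key[1]) > tolerance
        )
        if moved:
            lines.append("".join(current))
            current = []
        if not current:
            current_key = (page, mid)
        current.append(chr(code))
    if current:
        lines.append("".join(current))
    return lines


# Hypothetical stand-in for chars.arrays: "Hi" on one line, "Yo" on the next
sample = {
    "page": [0, 0, 0, 0],
    "char": [72, 105, 89, 111],
    "top": [700.0, 700.0, 680.0, 680.0],
    "bottom": [690.0, 690.0, 670.0, 670.0],
}
print(group_into_lines(sample))
```

The tolerance is a judgment call: a value that is too small splits superscripts onto their own lines, while one that is too large merges tightly spaced lines.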

## PyArrow / pandas / polars Integration

For data analysis and advanced processing, it is recommended to first convert the results to PyArrow tables.
Note that PyArrow must be installed separately. From PyArrow tables, you can easily convert to pandas or polars DataFrames.

```python
from unpdf import extract

pages, chars, text_objs, fonts = extract("document.pdf")

# Convert to PyArrow tables
pages_table = pages.table
chars_table = chars.table
text_objs_table = text_objs.table
fonts_table = fonts.table

# Use PyArrow functionality
print(chars_table.schema)
print(chars_table.select(['char', 'left', 'top']))

# Export to different dataframe libraries
import polars as pl

pandas_char_table = chars_table.to_pandas()  # Convert to pandas DataFrame
polars_char_table = pl.from_arrow(chars_table)  # Convert to polars DataFrame
```

## Data Structure

The library returns four table objects:

- **PageTable**: Page dimensions and boundaries
- **CharTable**: Individual characters with positions and metadata
- **TextObjTable**: Text object properties (font size, color, transformations)
- **FontTable**: Font information and styling

### Available Fields

**PageTable fields:**
- `page`: Page number
- `width`, `height`: Page dimensions
- `left`, `right`, `top`, `bottom`: Page boundaries

**CharTable fields:**
- `page`: Page number
- `char`: Unicode character code
- `is_generated`: Whether character is generated by PDFium
- `txt_obj_id`: Reference to text object
- `left`, `right`, `top`, `bottom`: Character bounding box (precise)
- `loose_left`, `loose_right`, `loose_top`, `loose_bottom`: Loose character bounding box
- `bbox_ok`: Whether precise bounding box is valid
- `loose_bbox_ok`: Whether loose bounding box is valid
- `hyphen`: Hyphen indicator
- `has_unicode_map_error`: Whether character has unicode mapping error
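
The `bbox_ok` and `loose_bbox_ok` flags suggest a simple fallback: use the precise box when it is valid, otherwise the loose one. The following sketch illustrates that idea. `best_bbox` is a hypothetical helper, the sample values are invented, and the arrays are assumed to be indexable sequences of equal length.

```python
EDGES = ("left", "top", "right", "bottom")


def best_bbox(arrays, i):
    """Return (left, top, right, bottom) for character i, or None if no box is valid."""
    if arrays["bbox_ok"][i]:
        return tuple(arrays[edge][i] for edge in EDGES)
    if arrays["loose_bbox_ok"][i]:
        return tuple(arrays[f"loose_{edge}"][i] for edge in EDGES)
    return None


# Hypothetical stand-in for chars.arrays with three characters
sample = {
    "left": [10.0, 20.0, 30.0],
    "top": [700.0, 700.0, 700.0],
    "right": [15.0, 25.0, 35.0],
    "bottom": [690.0, 690.0, 690.0],
    "loose_left": [9.0, 19.0, 29.0],
    "loose_top": [702.0, 702.0, 702.0],
    "loose_right": [16.0, 26.0, 36.0],
    "loose_bottom": [688.0, 688.0, 688.0],
    "bbox_ok": [True, False, False],
    "loose_bbox_ok": [True, True, False],
}
for i in range(3):
    print(i, best_bbox(sample, i))
```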

**TextObjTable fields:**
- `txt_obj_id`: Unique text object identifier
- `fontsize`: Font size
- `has_transparency`: Whether text object has transparency
- `font_obj_id`: Reference to font object
- `color_R`, `color_G`, `color_B`, `color_A`: RGBA color values
- `tmatrix_a`, `tmatrix_b`, `tmatrix_c`, `tmatrix_d`, `tmatrix_e`, `tmatrix_f`: Transformation matrix components
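
The six matrix components can be turned into more familiar quantities. The sketch below assumes that `tmatrix_a` through `tmatrix_f` follow the usual PDF convention `[a b c d e f]`, where `x' = a*x + c*y + e` and `y' = b*x + d*y + f`. This README does not confirm that convention, and `describe_matrix` is a hypothetical helper.

```python
import math


def describe_matrix(a, b, c, d, e, f):
    """Decompose a PDF-style affine matrix into rotation, scale, and translation."""
    return {
        "rotation_deg": math.degrees(math.atan2(b, a)),  # direction of the x axis
        "scale_x": math.hypot(a, b),                     # length of the x basis vector
        "scale_y": math.hypot(c, d),                     # length of the y basis vector
        "translate": (e, f),
    }


# Identity matrix: no rotation, unit scale
print(describe_matrix(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))
# Text rotated 90 degrees counter-clockwise, scaled by 2, moved to (50, 100)
print(describe_matrix(0.0, 2.0, -2.0, 0.0, 50.0, 100.0))
```

This decomposition ignores skew, so it is only meaningful for matrices that combine rotation, scaling, and translation.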

**FontTable fields:**
- `font_obj_id`: Unique font identifier
- `flags`: Font flags
- `weight`: Font weight
- `italic_angle`: Italic angle
- `base_fontname`, `family_fontname`: Font names
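
The tables reference each other through `txt_obj_id` and `font_obj_id`, so per-character font information can be resolved with two lookups. The following is a minimal sketch using plain dictionaries. `font_name_per_char` is a hypothetical helper, and the sample data is invented to stand in for the `arrays` attribute of each table.

```python
def font_name_per_char(chars, text_objs, fonts):
    """Resolve each character's base font name via its text object."""
    obj_to_font = dict(zip(text_objs["txt_obj_id"], text_objs["font_obj_id"]))
    font_to_name = dict(zip(fonts["font_obj_id"], fonts["base_fontname"]))
    # Characters whose text object or font is missing resolve to None
    return [font_to_name.get(obj_to_font.get(obj_id)) for obj_id in chars["txt_obj_id"]]


# Hypothetical stand-ins for chars.arrays, text_objs.arrays, and fonts.arrays
chars = {"char": [72, 105, 33], "txt_obj_id": [0, 0, 1]}
text_objs = {"txt_obj_id": [0, 1], "font_obj_id": [7, 8]}
fonts = {"font_obj_id": [7, 8], "base_fontname": ["Helvetica", "Times-Bold"]}

for code, name in zip(chars["char"], font_name_per_char(chars, text_objs, fonts)):
    print(chr(code), name)
```

For larger documents, the same join is likely more convenient on the PyArrow tables or on pandas or polars DataFrames derived from them.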

## Requirements

- Python 3.9+
- PyArrow (optional, for table functionality)

## License

This project is licensed under the Apache 2.0 license.
