Metadata-Version: 2.4
Name: forklift-etl
Version: 0.1.3
Summary: A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support
Project-URL: Homepage, https://github.com/cornyhorse/forklift
Project-URL: Documentation, https://github.com/cornyhorse/forklift/blob/main/docs/
Project-URL: Repository, https://github.com/cornyhorse/forklift
Project-URL: Bug Tracker, https://github.com/cornyhorse/forklift/issues
Project-URL: Changelog, https://github.com/cornyhorse/forklift/blob/main/CHANGELOG.md
Author-email: Matt <matt@mattkingsbury.com>
Maintainer-email: Matt <matt@mattkingsbury.com>
License: MIT License
        
        Copyright (c) 2025 cornyhorse
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: csv,data-processing,etl,excel,parquet,pyarrow,s3,schema-generation,validation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Requires-Dist: boto3>=1.20.0
Requires-Dist: botocore>=1.23.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: pyarrow>=10.0.0
Requires-Dist: python-dateutil>=2.8.0
Requires-Dist: typing-extensions>=4.0.0; python_version < '3.10'
Provides-Extra: all
Requires-Dist: openpyxl>=3.0.0; extra == 'all'
Requires-Dist: polars>=0.18.0; extra == 'all'
Requires-Dist: pyperclip>=1.8.0; extra == 'all'
Requires-Dist: pyspark>=3.0.0; extra == 'all'
Provides-Extra: clipboard
Requires-Dist: pyperclip>=1.8.0; extra == 'clipboard'
Provides-Extra: dev
Requires-Dist: black>=22.0.0; extra == 'dev'
Requires-Dist: flake8>=5.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pre-commit>=2.20.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Provides-Extra: excel
Requires-Dist: openpyxl>=3.0.0; extra == 'excel'
Provides-Extra: polars
Requires-Dist: polars>=0.18.0; extra == 'polars'
Provides-Extra: spark
Requires-Dist: pyspark>=3.0.0; extra == 'spark'
Description-Content-Type: text/markdown

# Forklift

A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support.

![Forklift Logo](FORKLIFT.png)

## Overview

Forklift is a comprehensive data processing tool that provides:

- **High-performance data import** with PyArrow streaming for CSV, Excel, FWF, and SQL sources
- **Intelligent schema generation** that analyzes your data and creates standardized schema definitions  
- **Robust validation** with configurable error handling and constraint validation
- **S3 streaming support** for both input and output operations
- **Multiple output formats** including Parquet, with comprehensive metadata and manifests

## Key Features

### 🚀 **Data Import & Processing**
- Stream large files efficiently with PyArrow
- Support for CSV, Excel, Fixed-Width Files (FWF), and SQL sources
- Configurable batch processing with memory optimization
- Comprehensive validation with detailed error reporting
- S3 integration for cloud-native workflows
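
The batch processing idea above can be sketched with the standard library alone. This is an illustration of the general technique, not Forklift's implementation: Forklift streams with PyArrow, while the hypothetical `iter_batches` helper below uses `csv.DictReader` as a stand-in so the example runs without extra dependencies.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_batches(handle: TextIO, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `batch_size` rows (hypothetical helper, not a Forklift API)."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls only the next `batch_size` rows from the reader,
        # so memory use is bounded by the batch size, not the file size.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# In-memory sample standing in for a large file on disk.
sample = io.StringIO("id,name\n1,ada\n2,grace\n3,linus\n")
batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=2)]
print(batch_sizes)
```

Each batch can be validated and written out before the next one is read, which is what keeps memory use flat on large inputs.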

### 🔍 **Schema Generation**
- **Intelligent schema inference** from data analysis
- **Privacy-first approach** - no sensitive sample data included by default
- **Multiple file format support** - CSV, Excel, Parquet
- **Flexible output options** - stdout, file, or clipboard
- **Standards-compliant schemas** following JSON Schema with Forklift extensions
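
To make schema inference concrete, here is a minimal sketch that derives a JSON Schema style type for each CSV column. It is not Forklift's algorithm; the `infer_type` and `infer_schema` names are hypothetical, and the Forklift extensions mentioned above are not modelled. Note that the result contains only column names and types, never sample values.

```python
import csv
import io
import json
from typing import Dict, List, TextIO


def infer_type(values: List[str]) -> str:
    """Return the narrowest JSON Schema type that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the narrowest type first; fall through on the first value that fails.
    for json_type, caster in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                caster(v)
            return json_type
        except ValueError:
            continue
    return "string"


def infer_schema(handle: TextIO) -> Dict:
    """Build a small JSON Schema document from CSV columns (illustrative only)."""
    reader = csv.DictReader(handle)
    columns: Dict[str, List[str]] = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append(row[name] or "")
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {name: {"type": infer_type(vals)} for name, vals in columns.items()},
    }


sample = io.StringIO("id,price,name\n1,9.5,widget\n2,12,gadget\n")
schema = infer_schema(sample)
print(json.dumps(schema["properties"], indent=2))
```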

### 🛡️ **Validation & Quality**
- JSON Schema validation with custom extensions
- Primary key inference and enforcement
- Constraint validation (unique, not-null, primary key)
- Data type validation and conversion
- Configurable error handling modes (fail-fast, fail-complete, bad-rows)
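
The three error handling modes can be pictured with a small standalone sketch. The semantics here are assumptions based on the mode names (fail-fast stops at the first violation, fail-complete checks every row and then fails, bad-rows sets invalid rows aside and continues); the hypothetical `validate_rows` function is not part of Forklift's API.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
MODES = ("fail-fast", "fail-complete", "bad-rows")


def validate_rows(
    rows: List[Row], key: str, mode: str = "bad-rows"
) -> Tuple[List[Row], List[Tuple[Row, str]]]:
    """Check `key` for not-null and unique; mode semantics are assumed, see above."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    seen = set()
    good: List[Row] = []
    bad: List[Tuple[Row, str]] = []
    for index, row in enumerate(rows):
        value = row.get(key)
        if value in (None, ""):
            problem = f"row {index}: {key} is null"
        elif value in seen:
            problem = f"row {index}: duplicate {key}={value}"
        else:
            seen.add(value)
            good.append(row)
            continue
        if mode == "fail-fast":
            # Stop at the first violation.
            raise ValueError(problem)
        bad.append((row, problem))
    if bad and mode == "fail-complete":
        # Every row was checked; report all violations at once.
        raise ValueError("; ".join(problem for _, problem in bad))
    return good, bad


rows = [{"id": "1"}, {"id": ""}, {"id": "1"}, {"id": "2"}]
good, bad = validate_rows(rows, key="id", mode="bad-rows")
print(len(good), [problem for _, problem in bad])
```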

## Installation

```bash
pip install forklift-etl
```

### Optional Dependencies

```bash
# For Excel support
pip install "forklift-etl[excel]"

# For clipboard functionality
pip install "forklift-etl[clipboard]"
```

## Quick Start

### Data Import

```python
# Import CSV to Parquet with validation
from forklift import import_csv

results = import_csv(
    source="data.csv",
    destination="./output/",
    schema_path="schema.json"
)

print("Import completed successfully!")
```

### Schema Generation

```python
import forklift

# Generate schema from CSV (analyzes entire file by default)
schema = forklift.generate_schema_from_csv("data.csv")

# Generate with limited row analysis
schema = forklift.generate_schema_from_csv("data.csv", nrows=1000)

# Save schema to file
forklift.generate_and_save_schema(
    input_path="data.csv",
    output_path="schema.json",
    file_type="csv"
)

# Generate with primary key inference
schema = forklift.generate_schema_from_csv(
    "data.csv", 
    infer_primary_key_from_metadata=True
)
```

### Reading Data for Analysis

```python
import forklift

# Read CSV into DataFrame for analysis
df = forklift.read_csv("data.csv")

# Read Excel with specific sheet
df = forklift.read_excel("data.xlsx", sheet_name="Sheet1")

# Read Fixed-Width File with schema
df = forklift.read_fwf("data.txt", schema_path="fwf_schema.json")
```

## CLI Usage

### Data Import

```bash
# Import CSV with schema validation
forklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json

# Import from S3
forklift ingest s3://bucket/data.csv --dest s3://bucket/output/ --input-kind csv

# Import Excel file
forklift ingest data.xlsx --dest ./output/ --input-kind excel --sheet "Sheet1"

# Import Fixed-Width File
forklift ingest data.txt --dest ./output/ --input-kind fwf --fwf-spec schema.json
```

### Schema Generation

```bash
# Generate schema from CSV (analyzes entire file by default)
forklift generate-schema data.csv --file-type csv

# Generate with limited row analysis
forklift generate-schema data.csv --file-type csv --nrows 1000

# Save to file
forklift generate-schema data.csv --file-type csv --output file --output-path schema.json

# Include sample data for development (explicit opt-in)
forklift generate-schema data.csv --file-type csv --include-sample

# Copy to clipboard
forklift generate-schema data.csv --file-type csv --output clipboard

# Excel files
forklift generate-schema data.xlsx --file-type excel --sheet "Sheet1"

# Parquet files
forklift generate-schema data.parquet --file-type parquet

# With primary key inference
forklift generate-schema data.csv --file-type csv --infer-primary-key
```

## Core Components

- **Import Engine**: High-performance data processing with PyArrow
- **Schema Generator**: Intelligent schema inference and generation
- **Validation System**: Constraint validation and error handling
- **Processors**: Pluggable data transformation components
- **I/O Operations**: S3 and local file system support

## Documentation

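For detailed documentation, see the [`docs/`](https://github.com/cornyhorse/forklift/blob/main/docs/) directory:

- **[Usage Guide](https://github.com/cornyhorse/forklift/blob/main/docs/USAGE.md)** - Comprehensive usage examples and workflows
- **[Schema Standards](https://github.com/cornyhorse/forklift/blob/main/docs/SCHEMA_STANDARDS.md)** - JSON Schema format and extensions
- **[API Reference](https://github.com/cornyhorse/forklift/blob/main/docs/API_REFERENCE.md)** - Complete API documentation
- **[Constraint Validation](https://github.com/cornyhorse/forklift/blob/main/docs/CONSTRAINT_VALIDATION_IMPLEMENTATION.md)** - Validation features
- **[S3 Integration](https://github.com/cornyhorse/forklift/blob/main/docs/S3_TESTING.md)** - S3 usage and testing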

## Examples

See the [`examples/`](examples/) directory for comprehensive examples:

- **calculated_columns_demo.py** - Calculated columns functionality
- **constraint_validation_demo.py** - Constraint validation examples
- **validation_demo.py** - Data validation with bad rows handling
- **datetime_features_example.py** - Date/time processing examples
- And more...

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
