Metadata-Version: 2.4
Name: pysetl
Version: 1.1.0
Summary: A PySpark ETL Framework
Project-URL: Home, https://github.com/JhossePaul/pysetl
Project-URL: Source, https://github.com/JhossePaul/pysetl
Author-email: Jhosse Paul Marquez Ruiz <jpaul.marquez.ruiz@gmail.com>
License: Apache-2.0
License-File: LICENSE
Keywords: aws,etl,spark
Classifier: Development Status :: 5 - Production/Stable
Classifier: Framework :: MkDocs
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: <3.13,>=3.10
Requires-Dist: numpy<3.0.0,>=1.26
Requires-Dist: pyarrow<21.0.0,>=15.0.0
Requires-Dist: pydantic<3.0.0,>=2.4.2
Requires-Dist: typedspark<2.0.0,>=1.5.3
Requires-Dist: typing-extensions<5.0.0,>=4.13.0
Provides-Extra: pyspark
Requires-Dist: pyspark[sql]<4.0.0,>=3.4; extra == 'pyspark'
Description-Content-Type: text/markdown

<p align="center">
  <img src="./docs/assets/images/logo_name.png" alt="PySetl" width="200" />
</p>

[![Build Status](https://github.com/JhossePaul/pysetl/actions/workflows/build.yml/badge.svg)](https://github.com/JhossePaul/pysetl/actions/workflows/build.yml) [![Code Coverage](https://codecov.io/gh/JhossePaul/pysetl/branch/main/graph/badge.svg)](https://codecov.io/gh/JhossePaul/pysetl) [![Documentation Status](https://readthedocs.org/projects/pysetl/badge/?version=latest)](https://pysetl.readthedocs.io/en/latest/?badge=latest)

[![PyPI](https://img.shields.io/pypi/v/pysetl)](https://pypi.org/project/pysetl) [![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![PySpark](https://img.shields.io/badge/PySpark-3.4%2B-orange.svg?logo=apache-spark&logoColor=white)](https://spark.apache.org/docs/latest/) [![Downloads](https://img.shields.io/pypi/dm/pysetl.svg?color=blue&label=Installs&logo=pypi&logoColor=gold)](https://pypi.org/project/pysetl)

[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](https://github.com/JhossePaul/pysetl/blob/main/LICENSE) [![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff) [![Type checked with mypy](https://img.shields.io/badge/mypy-checked-blue.svg)](http://mypy-lang.org/) [![Pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)

## Overview

PySetl is a framework to improve the readability and structure of PySpark ETL
projects. It is designed to take advantage of Python's typing syntax to reduce
runtime errors through linting tools and verifying types at runtime, effectively
enhancing stability for large ETL pipelines.

To accomplish this, we provide some tools:

- **`pysetl.config`**: Type-safe configuration.
- **`pysetl.storage`**: Agnostic and extensible data sources connections.
- **`pysetl.workflow`**: Pipeline management and dependency injection.

PySetl is designed with Python typing syntax at its core. We strongly suggest
using [typedspark](https://typedspark.readthedocs.io/en/latest/) and
[pydantic](https://docs.pydantic.dev/latest/) for development.

## Why use PySetl?

- Model complex data pipelines.
- Reduce risks at production with type-safe development.
- Improve large project structure and readability.

## Quick Start

```python
from pysetl.config import CsvConfig
from pysetl.workflow import Factory, Stage, Pipeline
from typedspark import DataSet, Schema, Column, create_partially_filled_dataset
from pyspark.sql.types import StringType, IntegerType

# Define your data schema
class Citizen(Schema):
    name: Column[StringType]
    age: Column[IntegerType]
    city: Column[StringType]

# Create a factory
class CitizensFactory(Factory[DataSet[Citizen]]):
    def read(self):
        self.citizens = create_partially_filled_dataset(
            spark, Citizen,
            [{Citizen.name: "Alice", Citizen.age: 30, Citizen.city: "NYC"}]
        )
        return self
    def process(self): return self
    def write(self): return self
    def get(self): return self.citizens

# Build and run pipeline
stage = Stage().add_factory_from_type(CitizensFactory)
pipeline = Pipeline().add_stage(stage).run()
```

## Installation

PySetl is available on PyPI:

```bash
pip install pysetl
```

### Optional Dependencies

PySetl provides several optional dependencies for different use cases:

- **PySpark**: For local development (most production environments come with
their own Spark distribution)

  ```bash
  pip install "pysetl[pyspark]"
  ```

- **Documentation**: For building documentation locally
  ```bash
  pip install "pysetl[docs]"
  ```

## Documentation

- 📖 [User Guide](https://pysetl.readthedocs.io/en/latest/user-guide/configuration.html)
- 🔧 [API Reference](https://pysetl.readthedocs.io/en/latest/api/pysetl.html)
- 🚀 [Getting Started](https://pysetl.readthedocs.io/en/latest/getting-started.html)
- 🤝 [Contributing](https://pysetl.readthedocs.io/en/latest/contributing.html)

## Development

```bash
git clone https://github.com/JhossePaul/pysetl.git
cd pysetl
hatch shell
pre-commit install
```

### Development Commands

- **Run all checks**: `hatch run test:all`
- **Run tests only**: `hatch run test:test`
- **Lint code**: `hatch run test:lint`
- **Format code**: `hatch run test:format`
- **Type checking**: `hatch run test:type`
- **Build documentation**: `hatch run docs:docs`
- **Security checks**: `hatch run security:all`

## Contributing

We welcome contributions! Please see our
[Contributing Guide](https://pysetl.readthedocs.io/en/latest/contributing.html)
for details.

## License

This project is licensed under the Apache License 2.0 - see the
[LICENSE](https://github.com/JhossePaul/pysetl/blob/main/LICENSE) file for
details.

## Acknowledgments

PySetl is a port from [SETL](https://setl-framework.github.io/setl/). We want to
fully recognize this package is heavily inspired by the work of the SETL team.
We just adapted things to work in Python.
