Metadata-Version: 2.4
Name: better-aws
Version: 1.0.0
Summary: Minimal AWS boto3 wrapper
Author-email: Thibault Charbonnier <thibault.charbonnier@ensae.fr>
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: boto3>=1.42.48
Requires-Dist: dotenv>=0.9.9
Requires-Dist: pandas>=2.0.0
Requires-Dist: polars>=1.38.1
Requires-Dist: pyarrow>=23.0.0
Requires-Dist: rich>=14.3.2
Provides-Extra: objects
Requires-Dist: joblib>=1.5.3; extra == 'objects'
Requires-Dist: skops>=0.13.0; extra == 'objects'
Description-Content-Type: text/markdown

# better-aws

[![Python](https://img.shields.io/badge/python-3.11%2B-blue.svg)](#)
[![PyPI](https://img.shields.io/pypi/v/better-aws.svg)](https://pypi.org/project/better-aws/)

A minimal, production-minded wrapper around `boto3` focused on **S3 and tabular data (CSV/Parquet/Excel)**.

- **S3-first**: the handful of operations you use 90% of the time
- **Batch-native** and **glob-ready**: same methods for single keys, lists, or glob patterns (`*`, `**`)
- **Ergonomic I/O**: `load()` → Python objects, `download()` → local files, `transfer()` → move trees between local and S3
- **Logging-friendly**: standalone "print-like" logs or plug into your app logger
- **Auth-ready**: designed to support multiple auth modes (profile, custom files, static creds, .env)

---

## Install

```bash
pip install better-aws
```

For object serialization support (pickle/joblib/skops):

```bash
pip install better-aws[objects]
```

---

## Development (uv)

```bash
git clone https://github.com/thibault-charbonnier/better-aws.git
cd better-aws
uv sync
```

---

## Quickstart

```python
from better_aws import AWS

# 1) Create a session (boto3 will use the default credential chain unless you add other auth modes)
aws = AWS(profile="s3admin", region="eu-west-3", verbose=True)

# Optional sanity check
aws.identity(print_info=True)

# 2) Configure S3 defaults
aws.s3.config(
    bucket="my-bucket",
    key_prefix="my-project",   # optional: all keys are relative to this prefix
    output_type="pandas",      # tabular loads -> pandas (or "polars")
    file_type="parquet",       # default tabular format for dataframe uploads without extension
    overwrite=True,
)

# 3) List / load / upload
keys = aws.s3.list(prefix="raw/", limit=10)

df = aws.s3.load("raw/prices.parquet")     # -> pandas DataFrame (by config)
df["ret"] = df["close"].pct_change()

aws.s3.upload(df, "processed/prices_with_returns")  # -> parquet by default (by config)

# 4) Verify existence
print(aws.s3.exists("processed/prices_with_returns.parquet"))
```

---

## Core features

### 1) Authentication

`better-aws` is built to keep auth **clean and modular**:

- AWS profile / default chain (AWS CLI-style)
- static credentials (Python args)
- custom `credentials_file` / optional `config_file`
- `.env` (dotenv)

```python
# Static credentials
aws = AWS("s3admin", aws_access_key_id=AWS_ID_KEY, aws_secret_access_key=AWS_SECRET_KEY)

# .env config
aws = AWS("s3admin", env_file="test.env")

# Custom location for credentials files
aws = AWS("s3admin", credentials_file=r"\...\credentials")

# Classic CLI-like auth (boto3 fallback)
aws = AWS("s3admin")
```

#### Authentication priority

When creating a session, `better-aws` resolves credentials in this order (first match wins):

1. **Static credentials**: `aws_access_key_id` + `aws_secret_access_key` parameters passed directly to `AWS()`.
2. **Env file**: a `.env` file passed via `env_file=`. It must contain `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and may also contain `AWS_SESSION_TOKEN` and `AWS_REGION` / `AWS_DEFAULT_REGION`.
3. **Custom credential files**: `credentials_file` and/or `config_file` pointing to non-default AWS credential file locations.
4. **boto3 default chain**: falls back to the native boto3 credential resolution. The most common case is the credentials file generated by `aws configure` (`~/.aws/credentials` on Linux/macOS, `%USERPROFILE%\.aws\credentials` on Windows). See the [full boto3 credential chain](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for other sources (env vars, IAM roles, etc.).

**For regular use, we recommend installing the AWS CLI and running `aws configure` once. `better-aws` will then pick up your credentials automatically with no extra configuration.**
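
The "first match wins" rule can be pictured with a small standalone sketch. It is not the package's actual code: `pick_auth_mode` and the labels it returns are hypothetical names, used only to illustrate the ordering described above.

```python
from typing import Optional


def pick_auth_mode(
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
    env_file: Optional[str] = None,
    credentials_file: Optional[str] = None,
    config_file: Optional[str] = None,
) -> str:
    """Return a label for the first auth mode whose inputs are present.

    Hypothetical helper: it mirrors the documented priority order only.
    """
    if aws_access_key_id and aws_secret_access_key:
        return "static"
    if env_file:
        return "env_file"
    if credentials_file or config_file:
        return "custom_files"
    return "boto3_default_chain"


# An env file outranks custom credential files when both are given.
print(pick_auth_mode(env_file="test.env", credentials_file="creds"))  # env_file
```

Passing nothing at all falls through to the last branch, which corresponds to letting boto3 resolve credentials on its own.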

---

### 2) Configure your S3 "workspace"

Call `aws.s3.config()` once to set defaults for all subsequent operations. The main arguments:

- `bucket`: default bucket
- `key_prefix`: optional "root folder" — all keys are resolved relative to it
- `output_type`: tabular `load()` output (`"pandas"` / `"polars"`)
- `file_type`: default format for DataFrame uploads without extension (`"parquet"` / `"csv"` / `"xlsx"`)
- `overwrite`: default overwrite policy

```python
aws.s3.config(bucket="my-bucket", key_prefix="research", output_type="polars", file_type="parquet", overwrite=False)
```
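
To make "resolved relative to `key_prefix`" concrete, here is a minimal sketch of how a relative key could be joined onto the prefix. `resolve_key` is a hypothetical helper, and the slash normalization it performs is an assumption rather than documented behavior.

```python
def resolve_key(key: str, key_prefix: str = "") -> str:
    """Join a relative key onto an optional prefix with exactly one slash.

    Hypothetical helper: the real normalization rules may differ.
    """
    prefix = key_prefix.strip("/")
    relative = key.lstrip("/")
    return f"{prefix}/{relative}" if prefix else relative


print(resolve_key("raw/prices.parquet", "research"))  # research/raw/prices.parquet
print(resolve_key("raw/prices.parquet"))              # raw/prices.parquet
```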

---

### 3) Read from S3

Two ways to read from S3:

- `download()` = **S3 → local files** (returns `Path` or `List[Path]`)
- `load()` = **S3 → Python objects** (JSON → dict, tabular → DataFrame)

```python
path = aws.s3.download("reports/report.pdf", to="downloads/")

cfg = aws.s3.load("configs/pipeline.json")              # -> dict
df  = aws.s3.load("raw/prices.csv")                    # -> pandas/polars (by config)
dfs = aws.s3.load(["raw/a.parquet", "raw/b.parquet"])  # -> List[DataFrame]
```

> Batch native: `load()` and `download()` accept a single key or a list of keys.

Both methods support **glob patterns** including recursive `**`:

```python
# All CSVs directly under raw/
aws.s3.download("raw/*.csv", to="downloads/")

# All parquets recursively
dfs = aws.s3.load("data/**/*.parquet")

# Preserve the full S3 path structure locally (default: preserve relative to the glob root)
aws.s3.download("data/2023/*.csv", to="downloads/", preserve_prefix=True)
# -> downloads/data/2023/file.csv  (instead of downloads/file.csv)
```
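
The effect of `preserve_prefix` on local paths can be sketched without touching S3. The helpers below (`glob_root`, `local_target`) are hypothetical, and treating "the glob root" as the directory segments before the first wildcard is an assumption.

```python
from pathlib import PurePosixPath

WILDCARDS = "*?["


def glob_root(pattern: str) -> str:
    """Return the directory segments that precede the first wildcard."""
    parts = []
    # The last segment is a file name or file pattern, never part of the root.
    for part in pattern.split("/")[:-1]:
        if any(ch in part for ch in WILDCARDS):
            break
        parts.append(part)
    return "/".join(parts)


def local_target(key: str, pattern: str, to: str, preserve_prefix: bool = False) -> PurePosixPath:
    """Map an S3 key matched by `pattern` to a path under `to` (hypothetical)."""
    if preserve_prefix:
        relative = key                      # keep the full S3 path
    else:
        root = glob_root(pattern)
        relative = key[len(root):].lstrip("/") if root else key
    return PurePosixPath(to) / relative


key, pattern = "data/2023/file.csv", "data/2023/*.csv"
print(local_target(key, pattern, "downloads/"))                        # downloads/file.csv
print(local_target(key, pattern, "downloads/", preserve_prefix=True))  # downloads/data/2023/file.csv
```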

---

### 4) Write to S3

`upload()` supports:

- local file path or **glob pattern** → copied as-is, structure preserved
- `dict` → JSON
- `bytes` → raw payload
- pandas/polars DataFrame → CSV/Parquet/Excel (based on key extension or default `file_type`)

```python
aws.s3.upload("local/report.pdf", "reports/report.pdf")
aws.s3.upload({"run_id": 1}, "configs/run")                          # -> configs/run.json
aws.s3.upload(df, "processed/table")                                 # -> processed/table.parquet
aws.s3.upload([df, df], ["processed/a.parquet", "processed/b.parquet"])

# Glob upload: preserve local structure under a single S3 prefix
aws.s3.upload("exports/*.csv", "s3-prefix/exports/")
```

`upload()` returns the final S3 key(s) after upload.

> Batch native: `upload()` accepts a single `src` / `key` pair, or parallel lists of `src` and `key` values.
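
The extension rules shown in the comments above (`configs/run` becoming `configs/run.json`, `processed/table` becoming `processed/table.parquet`) can be sketched as follows. `final_key` and its `payload_kind` labels are hypothetical and only illustrate the inference; the library's internal logic may differ.

```python
from pathlib import PurePosixPath


def final_key(key: str, payload_kind: str, file_type: str = "parquet") -> str:
    """Append a default extension when the key has none.

    `payload_kind` is a hypothetical label: "dict", "dataframe", or "bytes".
    """
    if PurePosixPath(key).suffix:
        return key                      # an explicit extension wins
    if payload_kind == "dict":
        return f"{key}.json"
    if payload_kind == "dataframe":
        return f"{key}.{file_type}"     # falls back to the configured file_type
    return key                          # raw bytes: leave the key untouched


print(final_key("configs/run", "dict"))            # configs/run.json
print(final_key("processed/table", "dataframe"))   # processed/table.parquet
print(final_key("processed/a.csv", "dataframe"))   # processed/a.csv
```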

---

### 5) Transfer trees

`transfer()` moves or copies entire file trees between local filesystems and S3, or between two S3 locations. It auto-infers the direction from the source and destination.

```python
# Local -> S3 (move by default, deletes local files after upload)
aws.s3.transfer("exports/", "s3://my-bucket/archives/exports/")

# S3 -> local (move: deletes the S3 objects after download)
aws.s3.transfer("raw/2023/", "local/backup/2023/", move=True)

# S3 -> S3 (copy within or across buckets)
aws.s3.transfer("s3://bucket-a/data/", "s3://bucket-b/data/", move=False)

# Glob patterns are supported
aws.s3.transfer("raw/**/*.parquet", "archive/parquet/")

# Use explicit buckets when needed
aws.s3.transfer("data/", "archive/", bucket_src="prod-bucket", bucket_dst="archive-bucket")
```

`transfer()` preserves relative directory structure at the destination. Because `move=True` is the default, the source is deleted after a successful transfer; pass `move=False` to copy instead.
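
Since `transfer()` accepts both bare keys and `s3://` URIs, it helps to see how a URI carries its own bucket. The sketch below uses a hypothetical `split_location` helper; it only separates bucket and key, and does not attempt to reproduce how the library tells a local path from a bare S3 key.

```python
from typing import Optional, Tuple


def split_location(location: str, default_bucket: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Split 's3://bucket/key' into (bucket, key).

    Hypothetical helper: anything without the s3:// scheme is returned
    unchanged, paired with the default bucket.
    """
    scheme = "s3://"
    if location.startswith(scheme):
        bucket, _, key = location[len(scheme):].partition("/")
        return bucket, key
    return default_bucket, location


print(split_location("s3://my-bucket/archives/exports/"))  # ('my-bucket', 'archives/exports/')
print(split_location("raw/2023/", "my-bucket"))            # ('my-bucket', 'raw/2023/')
```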

---

### 6) Utilities

```python
# Check existence
aws.s3.exists("raw/prices.parquet")                    # -> bool

# List objects — returns List[dict] with key, size, last_modified, etag, storage_class
aws.s3.list(prefix="raw/", with_meta=True)

# List keys only
keys = aws.s3.list(prefix="raw/", with_meta=False)    # -> List[str]

# Delete (glob patterns supported, force=True required for patterns)
aws.s3.delete(["tmp/a.parquet", "tmp/b.parquet"])
aws.s3.delete("tmp/**", force=True)

# Pretty-print S3 prefix as a tree (sorted by size)
aws.s3.tree(prefix="data/", max_depth=3, folders_first=True)
```

---

### 7) Object serialization

`better-aws` can serialize arbitrary Python objects (e.g. scikit-learn models) directly to/from S3 using pickle, joblib, or skops.

> **Security:** Requires `allow_unsafe_serialization=True` in `config()`. Loading a pickle or joblib file can execute arbitrary code, so only deserialize objects from sources you trust.

```python
aws.s3.config(
    bucket="my-bucket",
    allow_unsafe_serialization=True,
    object_base_format="joblib",    # "pickle" | "joblib" | "skops"
    joblib_compress=3,
)

from sklearn.ensemble import RandomForestClassifier

# X_train / y_train stand for your own training data
model = RandomForestClassifier().fit(X_train, y_train)

aws.s3.upload(model, "models/rf_classifier")      # -> models/rf_classifier.joblib
model = aws.s3.load("models/rf_classifier.joblib")
```

Supported extensions: `.pkl` / `.pickle`, `.joblib` / `.jl`, `.skops`.
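
Why the opt-in flag exists: unpickling can call any function the payload names. The standalone sketch below uses only the standard library and a deliberately harmless expression; the `Payload` class is illustrative and has nothing to do with `better-aws` internals.

```python
import pickle


class Payload:
    """An object that controls what runs when it is unpickled."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # eval() runs here, during deserialization

print(type(result).__name__, result)  # int 42
```

The loaded value is not even a `Payload` instance: whatever the callable returns is what `pickle.loads` hands back. A hostile file could name a far less innocent callable, which is why objects should only be loaded from buckets you control.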

---

### 8) Logging

- `verbose=False` → **no package logs**
- `verbose=True` → a few `info` messages (minimal, no spam)
- Pass your own logger to unify output with your app (e.g., Rich handler)

```python
import logging
from rich.logging import RichHandler
from better_aws import AWS

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
logger.handlers = [RichHandler(rich_tracebacks=True)]
logger.propagate = False

# Custom logger
aws = AWS(profile="s3admin", region="eu-west-3", logger=logger, verbose=True)

# No logs
aws = AWS(profile="s3admin", region="eu-west-3", verbose=False)

# Minimal "print-like" logs
aws = AWS(profile="s3admin", region="eu-west-3", verbose=True)
```

---

## API reference

### `AWS`

```python
AWS(
    profile=None,               # AWS profile name
    region=None,                # AWS region
    logger=None,                # Optional logging.Logger
    verbose=False,              # Enable info-level logs
    retries=3,                  # Max retry attempts (botocore standard mode)
    connect_timeout_s=10,       # Connection timeout in seconds
    read_timeout_s=300,         # Read timeout in seconds
    *,
    credentials_file=None,      # Path to a custom credentials file
    config_file=None,           # Path to a custom config file
    env_file=None,              # Path to a .env file with AWS credentials
    aws_access_key_id=None,     # Static access key ID
    aws_secret_access_key=None, # Static secret access key
    aws_session_token=None,     # Optional session token
)
```

| Member                           | Returns | Description                                                                   |
| -------------------------------- | ------- | ----------------------------------------------------------------------------- |
| `aws.s3`                         | `S3`    | S3 service wrapper (lazy-loaded)                                              |
| `aws.identity(print_info=False)` | `dict`  | Get caller identity via STS (`Arn`, `Account`, `UserId`). Optionally logs it. |
| `aws.info(msg, *args)`           | `None`  | Log a message if `verbose=True`                                               |
| `aws.reset_session()`            | `None`  | Clear the cached boto3 session (forces re-auth on next call)                  |

---

### `S3`

#### `config()`

Sets defaults for all subsequent S3 operations. Must be called before using any S3 method that requires a bucket.

```python
aws.s3.config(
    bucket=None,                        # Default S3 bucket
    *,
    key_prefix="",                      # Prefix prepended to all keys
    output_type="pandas",               # Tabular load output: "pandas" | "polars"
    file_type="parquet",                # Default upload format: "csv" | "parquet" | "xlsx" | "xls" | serialization formats
    overwrite=True,                     # Allow overwriting existing objects
    encoding="utf-8",                   # Encoding for text-based I/O (JSON, CSV)
    csv_sep=",",                        # CSV column separator
    csv_index=False,                    # Include pandas index in CSV uploads
    parquet_index=None,                 # Include pandas index in parquet uploads (None = pandas default)
    excel_index=False,                  # Include pandas index in Excel uploads
    allow_unsafe_serialization=False,   # Enable pickle/joblib/skops serialization
    object_base_format="pickle",        # Default format for Python objects: "pickle" | "joblib" | "skops"
    pickle_protocol=pickle.HIGHEST_PROTOCOL,  # Pickle protocol version
    joblib_compress=3,                  # Joblib compression level (0–9)
    small_payload_threshold=5242880,    # Max in-memory payload size (bytes) before switching to temp-file upload
    multipart_threshold_mb=5,           # File size threshold to trigger multipart upload/download
    multipart_chunksize_mb=5,           # Chunk size for multipart transfers
    max_concurrency=8,                  # Max parallel threads for managed transfers
    use_threads=True,                   # Enable threading for managed transfers
    delete_batch_size=1000,             # Max objects per delete_objects call (S3 hard limit: 1000)
)
```

---

#### `list()`

```python
aws.s3.list(
    prefix="",          # Filter keys by prefix
    *,
    bucket=None,        # Override default bucket
    limit=None,         # Max number of objects to return
    recursive=True,     # If False, list only direct children (non-recursive)
    with_meta=True,     # Include metadata in results
) -> List[dict] | List[str]
```

Returns a list of dicts when `with_meta=True` (fields: `key`, `size`, `last_modified`, `etag`, `storage_class`), or a list of key strings when `with_meta=False`.

---

#### `exists()`

```python
aws.s3.exists(
    key,            # S3 object key
    *,
    bucket=None,    # Override default bucket
) -> bool
```

Returns `True` if the object exists, `False` otherwise.

---

#### `load()`

```python
aws.s3.load(
    key,                    # str, List[str], or glob pattern
    *,
    bucket=None,            # Override default bucket
    output_type=None,       # Override default output type: "pandas" | "polars"
) -> Any | List[Any]
```

Loads one or more S3 objects into Python objects. Format is inferred from the key extension:

| Extension                                     | Output                                                     |
| --------------------------------------------- | ---------------------------------------------------------- |
| `.json`                                       | `dict`                                                     |
| `.csv`, `.parquet`, `.xlsx`, `.xls`           | DataFrame (pandas or polars per `output_type`)             |
| `.pkl`, `.pickle`, `.joblib`, `.jl`, `.skops` | Python object (requires `allow_unsafe_serialization=True`) |
| anything else                                 | `bytes`                                                    |

Supports glob patterns (`*`, `?`, `**`). Returns a single object for a single key, a list otherwise.
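
The table above amounts to a dispatch on the key's extension. A minimal sketch, assuming case-insensitive matching and a `PermissionError` when serialization is disabled (both are assumptions; `infer_kind` is a hypothetical name):

```python
from pathlib import PurePosixPath

TABULAR = {".csv", ".parquet", ".xlsx", ".xls"}
OBJECTS = {".pkl", ".pickle", ".joblib", ".jl", ".skops"}


def infer_kind(key: str, allow_unsafe_serialization: bool = False) -> str:
    """Return the kind of Python value a key would load into (hypothetical)."""
    ext = PurePosixPath(key).suffix.lower()
    if ext == ".json":
        return "dict"
    if ext in TABULAR:
        return "dataframe"
    if ext in OBJECTS:
        if not allow_unsafe_serialization:
            # Assumed error type; the library may raise something else.
            raise PermissionError(f"{ext} requires allow_unsafe_serialization=True")
        return "object"
    return "bytes"


print(infer_kind("configs/pipeline.json"))  # dict
print(infer_kind("raw/prices.csv"))         # dataframe
print(infer_kind("reports/report.pdf"))     # bytes
```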

---

#### `download()`

```python
aws.s3.download(
    key,                    # str, List[str], or glob pattern
    to=None,                # Local destination path or directory (default: current directory)
    *,
    preserve_prefix=False,  # If True, recreate the full S3 path locally.
                            # If False, preserve structure relative to the glob root.
    bucket=None,            # Override default bucket
) -> Path | List[Path]
```

Downloads one or more S3 objects to the local filesystem. Supports glob patterns including `**` for recursive matching. Parent directories are created automatically.
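
S3 has no native glob support, so a pattern has to be matched against listed keys on the client side. The sketch below shows one way to do that with a regex translation. It assumes `*` and `?` stay within one path segment and that `**/` matches zero or more directories; `glob_to_regex` and `match_keys` are hypothetical helpers, not the package's implementation.

```python
import re
from typing import List


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob with *, ? and ** into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # anything, across slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")       # one character within a segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def match_keys(pattern: str, keys: List[str]) -> List[str]:
    """Keep the keys that match the whole pattern, in their original order."""
    regex = glob_to_regex(pattern)
    return [key for key in keys if regex.fullmatch(key)]


keys = ["raw/a.csv", "raw/sub/b.csv", "data/x.parquet", "data/2023/01/y.parquet"]
print(match_keys("raw/*.csv", keys))          # ['raw/a.csv']
print(match_keys("data/**/*.parquet", keys))  # ['data/x.parquet', 'data/2023/01/y.parquet']
```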

---

#### `upload()`

```python
aws.s3.upload(
    src,            # UploadInput or List[UploadInput]
                    # Supported types: str/Path (file or glob), dict, bytes, pd.DataFrame, pl.DataFrame
    key,            # str or List[str] — destination S3 key(s) or single prefix for glob sources
    *,
    bucket=None,    # Override default bucket
    overwrite=None, # Override default overwrite setting
) -> str | List[str]
```

Uploads one or more objects to S3. The serialization format is inferred from the key extension, or falls back to `file_type` from config. Returns the final S3 key(s).

---

#### `delete()`

```python
aws.s3.delete(
    key,            # str, List[str], or glob pattern
    *,
    force=False,    # Required when using glob patterns
    bucket=None,    # Override default bucket
) -> None
```

Deletes one or more S3 objects. Glob patterns (including `**`) are supported but require `force=True` as a safety guard. Deletions are batched in groups of up to `delete_batch_size` (default: 1000).
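
The batching can be illustrated by building the request payloads without sending them. `delete_payloads` is a hypothetical helper; the dictionary shape follows the `Delete` argument of the boto3 S3 client's `delete_objects` call, and setting `Quiet` is a choice made for this sketch.

```python
from typing import Dict, Iterator, List


def delete_payloads(keys: List[str], batch_size: int = 1000) -> Iterator[Dict]:
    """Yield one `Delete` payload per batch of at most `batch_size` keys."""
    if not 1 <= batch_size <= 1000:
        raise ValueError("batch_size must be between 1 and 1000")
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        yield {"Objects": [{"Key": key} for key in batch], "Quiet": True}


keys = [f"tmp/part-{i:05d}.parquet" for i in range(2500)]
sizes = [len(payload["Objects"]) for payload in delete_payloads(keys)]
print(sizes)  # [1000, 1000, 500]
```

Each payload would then be passed as `Delete=payload` alongside `Bucket=...` in one `delete_objects` request.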

---

#### `transfer()`

```python
aws.s3.transfer(
    src,                # str — source path, S3 key, glob pattern, or s3:// URI
    dst,                # str — destination path, S3 prefix, or s3:// URI
    *,
    move=True,          # If True, delete source after successful transfer
    bucket_src=None,    # Override source bucket for S3 sources
    bucket_dst=None,    # Override destination bucket for S3 destinations
) -> str | List[str] | Path | List[Path]
```

Transfers file trees between local filesystems and S3, or between two S3 locations. The transfer direction is inferred automatically:

| Source            | Destination   | Mode         |
| ----------------- | ------------- | ------------ |
| local path / glob | S3 key or URI | `local → S3` |
| S3 key / URI      | local path    | `S3 → local` |
| S3 key / URI      | S3 key or URI | `S3 → S3`    |

Relative directory structure is always preserved at the destination. Returns the destination path(s) created.

---

#### `tree()`

```python
aws.s3.tree(
    prefix="",              # S3 prefix to display
    *,
    bucket=None,            # Override default bucket
    show_full_path=True,    # Show full S3 key vs. basename only
    max_depth=None,         # Max depth to display (None = unlimited)
    max_children=None,      # Max children per node (None = unlimited)
    folders_first=True,     # Display folders before files at each level
    limit=None,             # Max number of S3 objects to include
) -> None
```

Pretty-prints the S3 object tree under a given prefix using `rich`, sorted by total size at each level.
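
"Sorted by total size at each level" means a folder's size is the sum of everything beneath it. A rough sketch of that aggregation for a single level, using a hypothetical `child_sizes` helper over `(key, size)` pairs (the real rendering goes through `rich` and is not shown here):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def child_sizes(objects: List[Tuple[str, int]], prefix: str = "") -> List[Tuple[str, int]]:
    """Sum object sizes per direct child of `prefix`, largest first."""
    totals: Dict[str, int] = defaultdict(int)
    for key, size in objects:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition("/")
        totals[head + sep] += size      # "2023/" for folders, "readme.txt" for files
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


objects = [
    ("data/2023/a.parquet", 300),
    ("data/2023/b.parquet", 200),
    ("data/readme.txt", 50),
    ("data/2024/c.parquet", 400),
]
print(child_sizes(objects, "data/"))  # [('2023/', 500), ('2024/', 400), ('readme.txt', 50)]
```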

---

## License

MIT License

Copyright (c) 2026 better-aws Contributors

See LICENSE file for details.
