Metadata-Version: 2.1
Name: fw-dataset
Version: 0.1.0rc2
Summary: A library for working with Flywheel datasets
Author: joshicola
Author-email: joshuajacobs@flywheel.io
Requires-Python: >=3.12,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: adlfs (>=2024.7.0,<2025.0.0)
Requires-Dist: duckdb (>=1.1.1,<2.0.0)
Requires-Dist: flywheel-sdk (>=19.1.0,<20.0.0)
Requires-Dist: fw-client (>=0.8.6,<0.9.0)
Requires-Dist: gcsfs (>=2024.9.0.post1,<2025.0.0)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: pyarrow (>=17.0.0,<18.0.0)
Requires-Dist: pydantic (>=2.9.2,<3.0.0)
Requires-Dist: s3fs (>=2024.9.0,<2025.0.0)
Description-Content-Type: text/markdown

# fw-dataset <!-- omit in toc -->

This repository contains classes and functions for creating, managing, and serving
Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from
the Flywheel Data Model.

- [Work In Progress](#work-in-progress)
- [Getting started](#getting-started)
  - [Installation](#installation)
  - [Usage](#usage)
    - [Unassociated Datasets](#unassociated-datasets)
- [Flywheel Project Requirements](#flywheel-project-requirements)
  - [Flywheel Project Structure](#flywheel-project-structure)
    - [type](#type)
    - [bucket](#bucket)
    - [prefix](#prefix)
    - [storage\_id](#storage_id)
  - [Dataset Structure](#dataset-structure)
    - [Schema Files](#schema-files)
- [Future Development](#future-development)

## Work In Progress

This is a work in progress. Not all functionality is implemented yet.

## Getting started

### Installation

Once the package is published, you can install it with pip:

```bash
pip install fw-dataset
```

or poetry:

```bash
poetry add fw-dataset
```

### Usage

```python
from fw_dataset import FWDatasetClient

# Create a client with a Flywheel API key
api_key = "your-api-key"
dataset_client = FWDatasetClient(api_key=api_key)

# list existing datasets (see below for Flywheel Project Requirements)
datasets = dataset_client.datasets()

# link to a specific project-associated dataset
# by project id
project_id = "your-project-id"
dataset = dataset_client.dataset(project_id=project_id)

# or by project path
group = "your-group"
project_label = "your-project-label"
dataset = dataset_client.dataset(project_path=f"fw://{group}/{project_label}")

# connect the dataset to all underlying data
conn = dataset.connect()

# query the dataset
SQL = "SELECT * FROM acquisitions"

# get the results
results = conn.execute(SQL)
result_df = results.df()
result_df.head()
```

#### Unassociated Datasets

If you have a dataset that is not associated with a Flywheel project, you can still
use the `FWDatasetClient` to access it. You will need to provide the `type`, `bucket`,
`prefix`, and `credentials` of the cloud or local filesystem to instantiate and query
the dataset.

```python
from fw_dataset import FWDatasetClient

# Create a client with a Flywheel API key
# TODO: make this work with a client that doesn't require an API key
api_key = "your-api-key"
dataset_client = FWDatasetClient(api_key=api_key)

fs_type = "s3"  # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}
# TODO: make this a class method (e.g. FWDatasetClient.get_dataset_from_filesystem)
dataset = dataset_client.get_dataset_from_filesystem(fs_type, bucket, prefix, credentials)
```

## Flywheel Project Requirements

For the Flywheel Dataset Client and the Dataset objects to function, the following
requirements must be met:

### Flywheel Project Structure

The Flywheel Project's custom information must contain a valid `dataset` entry of the
following form:

```json
{
    "dataset": {
        "type": "s3",
        "bucket": "bucket-name",
        "prefix": "path/to/dataset",
        "storage_id": "storage-id-of-fw-storage-object"
    }
}
```

#### type

The `type` field must be one of the following:

- `s3`: The dataset is stored in an S3 bucket.
- `gcs`: The dataset is stored in a Google Cloud Storage bucket.
- `azure`: The dataset is stored in an Azure Blob Storage container.
- `fs`, `local`: The dataset is stored on a local filesystem.

#### bucket

The `bucket` field is the name of the bucket or container where the dataset is stored.

#### prefix

The `prefix` field is the path to the dataset within the bucket or container.

The directory structure beneath the `prefix` should be as described in the
[Dataset Structure](#dataset-structure) section.

#### storage_id

The `storage_id` field is the Flywheel ID of the storage record that describes the
filesystem or cloud storage bucket where the dataset is stored. It must refer to a valid
storage object in the Flywheel database.
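
As a rough illustration of the requirements above, the following sketch checks a
project's custom information before it is used. `validate_dataset_info` is a
hypothetical helper and not part of `fw-dataset`. It assumes that all four fields must
be non-empty strings, which this README does not state explicitly, and it cannot verify
that `storage_id` actually exists in Flywheel.

```python
from typing import Any

# Allowed values for the `type` field, as listed in this README.
ALLOWED_TYPES = {"s3", "gcs", "azure", "fs", "local"}
REQUIRED_FIELDS = ("type", "bucket", "prefix", "storage_id")


def validate_dataset_info(project_info: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a project's custom information.

    Hypothetical helper for illustration; an empty list means no problems.
    """
    dataset = project_info.get("dataset")
    if not isinstance(dataset, dict):
        return ["missing or invalid 'dataset' section"]

    problems = []
    for field in REQUIRED_FIELDS:
        value = dataset.get(field)
        # Assumption: every field must be a non-empty string.
        if not isinstance(value, str) or not value:
            problems.append(f"'{field}' must be a non-empty string")

    fs_type = dataset.get("type")
    if isinstance(fs_type, str) and fs_type and fs_type not in ALLOWED_TYPES:
        problems.append(f"'type' must be one of {sorted(ALLOWED_TYPES)}")
    return problems


info = {
    "dataset": {
        "type": "s3",
        "bucket": "bucket-name",
        "prefix": "path/to/dataset",
        "storage_id": "storage-id-of-fw-storage-object",
    }
}
print(validate_dataset_info(info))
```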

### Dataset Structure

The dataset should be stored in the bucket or container with the following structure:

```bash
{bucket}/{prefix}/
├── latest_version.json (provenance/dataset_description.json of versions/latest)
└── versions/
    └── latest/
        ├── provenance/
        │   └── dataset_description.json
        ├── tables/
        │   └── {table_name}/ (a directory structure of partitioned parquet files)
        │       └── {partitions}/{hash}.parquet
        └── schemas/
            └── {table_name}.schema.json
```

The `latest_version.json` file is a copy of
`versions/latest/provenance/dataset_description.json`. Both files are minimal
descriptions of a dataset version. The `latest` directory holds the latest version of
the dataset. Archived versions are also stored in the `versions` directory and can be
deleted once they are no longer needed.

This structure is described in more detail in the
[Dataset Definition](docs/Dataset_Definition.md#dataset-components) document in the
`docs` directory.
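
To make the layout concrete, the sketch below builds the expected paths for one table
from a `bucket`, a `prefix`, and a table name. `dataset_paths` is a hypothetical helper
for illustration only and is not part of `fw-dataset`. It assumes forward-slash
separators, as used by cloud object stores.

```python
import posixpath


def dataset_paths(
    bucket: str, prefix: str, table_name: str, version: str = "latest"
) -> dict[str, str]:
    """Build the paths of the main dataset components for one table.

    Hypothetical helper that mirrors the directory tree shown above.
    """
    root = posixpath.join(bucket, prefix.strip("/"))
    version_root = posixpath.join(root, "versions", version)
    return {
        "latest_version": posixpath.join(root, "latest_version.json"),
        "description": posixpath.join(
            version_root, "provenance", "dataset_description.json"
        ),
        # Directory holding the partitioned parquet files of the table.
        "table": posixpath.join(version_root, "tables", table_name),
        "schema": posixpath.join(
            version_root, "schemas", f"{table_name}.schema.json"
        ),
    }


for name, path in dataset_paths("your-bucket", "your-prefix", "acquisitions").items():
    print(f"{name}: {path}")
```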

#### Schema Files

The schema files are JSON files that describe the tables in the dataset. They are stored
in the `schemas` directory and named `{table_name}.schema.json`, where `{table_name}` is
the name of the table that the schema describes.

Ideally, the schema files should be fully descriptive. However, if a minimal schema is
desired merely to allow the dataset to be queried, the schema file can be as simple as:

```json
{
    "schema": "http://json-schema.org/draft-07/schema#",
    "id": "{table_name}",
    "description": "Table derived from Tabular Data File: conditions.csv",
    "properties": {},
    "required": [],
    "type": "object"
}
```

## Future Development

Future development will include:

- [ ] Dataset creation and management from library
  - Create a new dataset from a Flywheel project
  - Dataset will be structured on local or cloud storage
  - Dataset essentials will be stored in the Flywheel project metadata
  - Dataset versions can be deleted from the storage structure
  - Dataset versions can be archived
  - Dataset can be removed from a Flywheel project

