Metadata-Version: 2.4
Name: snailz
Version: 3.3.0
Summary: Synthetic data generator for snail mutation survey
Project-URL: home, https://github.com/gvwilson/snailz
Author-email: Greg Wilson <gvwilson@third-bit.com>
License-File: LICENSE.md
Keywords: open science,synthetic data
Requires-Python: >=3.12
Requires-Dist: faker>=40.1.2
Requires-Dist: pydantic>=2.12.5
Provides-Extra: dev
Requires-Dist: build>=1.4.0; extra == 'dev'
Requires-Dist: coverage>=7.13.1; extra == 'dev'
Requires-Dist: griffe-fieldz>=0.4.0; extra == 'dev'
Requires-Dist: markdown-include>=0.8.1; extra == 'dev'
Requires-Dist: mkdocs-material>=9.7.1; extra == 'dev'
Requires-Dist: mkdocs>=1.6.1; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=1.0.0; extra == 'dev'
Requires-Dist: pytest>=9.0.2; extra == 'dev'
Requires-Dist: ruff>=0.14.13; extra == 'dev'
Requires-Dist: taskipy>=1.14.1; extra == 'dev'
Requires-Dist: twine>=6.2.0; extra == 'dev'
Description-Content-Type: text/markdown

# Snailz

<img src="https://raw.githubusercontent.com/gvwilson/snailz/refs/heads/main/pages/img/snailz-logo.svg" alt="snail logo" width="200px">

`snailz` is a synthetic data generator
that models a study of snails in the Pacific Northwest
which are growing to unusual size as a result of exposure to pollution.
The package can generate fully-reproducible datasets of varying sizes and with varying statistical properties,
and is intended primarily for classroom use.
For example,
an instructor can give each learner a unique dataset to analyze,
while learners can test their analysis pipelines using datasets they generate themselves.

> *The Story*
>
> Years ago,
> logging companies dumped toxic waste in a remote region of Vancouver Island.
> As the containers leaked and the pollution spread,
> some snails in the region began growing unusually large.
> Your team is now collecting and analyzing specimens from affected regions
> to determine if exposure to pollution is responsible.

`snailz` generates three related sets of data:

-   Grids: the survey grids where pollution levels are measured.
-   Persons: the scientists conducting the study.
-   Samples: the snails collected from the survey sites.

## Usage

1.  `pip install snailz` (or the equivalent command for your Python environment).
1.  `snailz --help` to see available commands.

To generate example data in a fresh directory:

```
# Create and activate Python virtual environment
$ uv venv
$ source .venv/bin/activate

# Install snailz and dependencies
$ uv pip install snailz

# Write default parameter values to the ./params.json file
$ snailz --defaults > params.json

# Generate all output files in the ./data directory
$ snailz --params params.json --outdir data
```

## Parameters

`snailz` reads controlling parameters from a JSON file,
and can generate a file with default parameter values as a starting point.
The parameters, their meanings, and their properties are:

| Name               | Purpose                                   | Default                  |
| ------------------ | ----------------------------------------- | -----------------------: |
| `clumsy_factor`    | personal effect on mass measurement       | 0.5                      |
| `grid_size`        | width and height of (square) survey grids | 11                       |
| `locale`           | locale for person name generation         | et_EE                    |
| `num_grids`        | number of survey grids                    | 3                        |
| `num_persons`      | number of persons                         | 5                        |
| `num_samples`      | number of samples                         | 20                       |
| `pollution_factor` | pollution effect on mass                  | 0.3                      |
| `precision`        | decimal places used to record masses      | 2                        |
| `sample_date`      | min/max sample dates                      | (2025-01-01, 2025-01-01) |
| `sample_mass`      | min/max sample mass                       | (0.5, 1.5)               |
| `seed`             | random number generation seed             | 123456                   |

## Data Dictionary

All of the generated data is stored in CSV files.

### Grids

The pollution readings for each survey grid are stored in a file <code>G<em>nnnn</em>.csv</code> (e.g., `G0003.csv`).
These CSV files do *not* have column headers;
instead, each contains a square integer matrix of pollution readings.
A typical file is:

```
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
0,0,0,0,0,0,0,0,1,2,1,0,0,0,0
0,0,0,0,0,0,0,0,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,2,0,0,0,0,0,0
0,0,0,0,0,0,0,1,2,1,0,0,0,0,0
0,0,0,0,0,0,0,0,1,2,0,0,0,0,0
0,0,0,0,0,0,0,2,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,3,0,0,0,0,0,0
0,0,0,0,0,0,0,1,3,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
```

The pollution readings are also stored in tidy format `grids.csv`:

| grid_id | x  | y  | pollution |
| :------ | -: | -: | --------: |
| G0001   |  0 |  0 |         0 |
| G0001   |  0 |  1 |         0 |
| …       | …  | …  | …         |

Its fields are:

| Field       | Purpose                  | Properties             |
| ----------- | -------------            | ---------------------- |
| `grid_id`   | identifier               | text, unique, required |
| `x`         | X coordinate in grid     | integer, required      |
| `y`         | Y coordinate in grid     | integer, required      |
| `pollution` | pollution at that point  | integer, required      |

### Persons

`persons.csv` stores the scientists performing the study in CSV format (with column headers):

| id     | personal | family   |
| :----- | :------- | :------- |
| P06    | Artur    | Aasmäe   |
| P07    | Katrin   | Kool     |
| …      | …        | …        |

Its fields are:

| Field      | Purpose       | Properties             |
| ---------- | ------------- | ---------------------- |
| `id`       | identifier    | text, unique, required |
| `personal` | personal name | text, required         |
| `family`   | family name   | text, required         |

### Samples

`samples.csv` stores information about sampled snails in CSV format (with column headers):

| sample_id | grid_id | x  | y  | pollution | person | when       | mass |
| :-----    | :------ | -: | -: | --------: | -----: | ---------: | ---: |
| S0001     | G0001   | 9  | 8  | 0         | P0004  | 2025-01-16 | 1.02 |
| S0002     | G0001   | 8  | 9  | 1         | P0005  | 2025-03-30 | 2.39 |
| …         | …       | …  | …  | …         | …      | …          | …    |

Its fields are:

| Field       | Purpose                  | Properties             |
| ----------- | ------------------------ | ---------------------- |
| `sample_id` | specimen identifier      | text, unique, required |
| `grid_id`   | grid identifie           | text, required         |
| `x`         | X coordinate in grid     | integer, required      |
| `y`         | Y coordinate in grid     | integer, required      |
| `pollution` | pollution at that point  | integer, required      |
| `person`    | who collected the sample | text, required         |
| `when`      | date sample collected    | date, required         |
| `mass`      | sample weight in grams   | real, required         |

The output directory also contains a file called `changes.json`
that records parameters used to alter data,
such as the daily growth rate of snails
and the ID of the clumsy scientist whose measurements have systematic errors.

## Colophon

`snailz` was inspired by the [Palmer Penguins][penguins] dataset
and by conversations with [Rohan Alexander][alexander-rohan]
about his book [*Telling Stories with Data*][telling-stories].

My thanks to everyone who built the tools this project relies on, including:

-   [`pydantic`][pydantic] for storing and validating data (including parameters).
-   [`pytest`][pytest] and [`faker`][faker] for testing.
-   [`ruff`][ruff] for checking the code.
-   [`uv`][uv] for managing packages and the virtual environment.

The snail logo was created by [sunar.ko][snail-logo].

## Acknowledgments {: #acknowledgments}

-   [*Greg Wilson*][wilson-greg] is a programmer, author, and educator based in Toronto.
    He was the co-founder and first Executive Director of Software Carpentry
    and received ACM SIGSOFT's Influential Educator Award in 2020.

[alexander-rohan]: https://rohanalexander.com/
[faker]: https://faker.readthedocs.io/
[penguins]: https://allisonhorst.github.io/palmerpenguins/
[pydantic]: https://docs.pydantic.dev/
[pyfakefs]: https://pypi.org/project/pyfakefs/
[pytest]: https://docs.pytest.org/
[ruff]: https://docs.astral.sh/ruff/
[snail-logo]: https://www.vecteezy.com/vector-art/7319786-snails-logo-vector-on-white-background
[telling-stories]: https://tellingstorieswithdata.com/
[uv]: https://docs.astral.sh/uv/
[wilson-greg]: https://third-bit.com/
