Snailz
These synthetic data generators model genomic analysis of snails in the Pacific Northwest that are growing to unusual size as a result of exposure to pollution.
- A grid is created to record the pollution levels at a sampling site.
- One or more specimens are collected from the grid. Each specimen has a genome and a mass.
- Laboratory staff design and perform assays of those genomes.
- Each assay is represented by a design file and an assay file.
- Assay files are mangled to create raw files with formatting glitches.
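
The last step above can be illustrated with a minimal sketch. The README does not say which glitches the package introduces, so the two used here (stray padding around cells and an occasional empty trailing column) are assumptions, and `mangle_csv` is a hypothetical helper rather than part of the package's API.

```python
import csv
import io
import random


def mangle_csv(text: str, seed: int) -> str:
    """Return a copy of CSV `text` with assumed formatting glitches."""
    rng = random.Random(seed)  # seeded so the same input always mangles the same way
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Assumed glitch 1: pad some cells with stray spaces.
        row = [f" {cell} " if rng.random() < 0.3 else cell for cell in row]
        # Assumed glitch 2: occasionally append an empty trailing column.
        if rng.random() < 0.2:
            row.append("")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical tidy assay content; the real column names are not given here.
tidy = "id,reading\nA1,0.25\nA2,1.75\n"
raw = mangle_csv(tidy, seed=1234)
```

Because the random generator is seeded, mangling the same tidy file twice with the same seed gives identical raw files, which keeps the dataset reproducible.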
Usage
- Create a fresh Python environment:
      uv venv
- Activate that environment:
      source .venv/bin/activate
- Install dependencies and editable version of package:
      uv pip install -e '.[dev]'
- View available commands:
      doit list
  or
      snailz --help
- Regenerate all data in ./tmp using parameters in ./params:
      doit all

Parameters
./params contains the parameter files used to control generation of the reference dataset.
- grid.json
  - depth: integer range of random values in cells
  - seed: RNG seed
  - size: width and height of (square) grid in cells
- people.json
  - locale: language and region to use for name generation
  - number: number of staff to create
  - seed: RNG seed
- specimens.json
  - length: genome length in characters
  - max_mass: maximum specimen mass
  - min_mass: minimum specimen mass
  - mut_scale: scaling factor for mutated specimens
  - mutations: number of mutations to introduce
  - number: number of specimens to create
  - seed: RNG seed
- assays.json
  - baseline: assay response for unmutated specimens
  - end_date: date of final assay
  - mutant: assay response for mutated specimens
  - noise: noise to add to control cells
  - plate_size: width and height of assay plate
  - seed: RNG seed
  - start_date: date of first assay
Note: there are no parameters for assay file mangling.
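
As an illustration of how a parameter file might drive generation, here is a minimal sketch that builds a grid from the three grid.json keys. The parameter values are made up, `make_grid` is a hypothetical function, and reading depth as "cell values drawn uniformly from 0 to depth inclusive" is an assumption, not the package's documented behavior.

```python
import json
import random

# Hypothetical contents of grid.json; the keys match the list above,
# the values are invented for this example.
params = json.loads('{"depth": 8, "seed": 7421, "size": 5}')


def make_grid(depth: int, seed: int, size: int) -> list[list[int]]:
    """Build a size x size grid of pseudo-random integer pollution values."""
    rng = random.Random(seed)  # a fixed seed makes the grid reproducible
    # Assumption: each cell is drawn uniformly from 0..depth inclusive.
    return [[rng.randint(0, depth) for _ in range(size)] for _ in range(size)]


grid = make_grid(**params)
```

Keeping the seed in the parameter file means that rerunning generation with unchanged parameters reproduces the same grid.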
Data Dictionary
doit all creates these files in tmp using the sample parameters in params:
- assays/
  - NNNNNN_assay.csv: tidy, consistently-formatted CSV file with assay result.
  - NNNNNN_design.csv: tidy, consistently-formatted CSV file with assay design.
  - NNNNNN_raw.csv: CSV file derived from NNNNNN_assay.csv with randomly-introduced formatting errors.
- assays.csv: CSV file containing summary of assay metadata with columns:
  - ident: assay identifier (integer)
  - specimen_id: specimen identifier (text)
  - performed: assay date (date)
  - performed_by: person identifier (text)
- assays.json: all assay data in JSON format.
- grid.csv: CSV file containing pollution grid values.
  - This file is a matrix of values with no column IDs or row IDs.
- grid.json: grid data as JSON.
- people.csv: CSV file describing experimental staff members.
  - ident: person identifier (text)
  - personal: personal name (text)
  - family: family name (text)
- people.json: staff member data in JSON format.
- specimens.csv: CSV file containing details of snail specimens.
  - ident: specimen identifier (text)
  - x: X coordinate of collection cell (integer)
  - y: Y coordinate of collection cell (integer)
  - genome: base sequence (text)
  - mass: snail mass (real)
- specimens.json: specimen data in JSON format.
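
The two CSV layouts described above need slightly different handling when read back: grid.csv is a bare matrix, while specimens.csv has named columns. The sketch below uses small in-memory stand-ins instead of the real files. The sample values, the identifier format, the presence of a header row in specimens.csv, and integer grid values are all assumptions, and `read_grid` and `read_specimens` are hypothetical helpers.

```python
import csv
import io


def read_grid(stream) -> list[list[int]]:
    """Read a headerless matrix of integers (no column IDs or row IDs)."""
    return [[int(value) for value in row] for row in csv.reader(stream) if row]


def read_specimens(stream) -> list[dict]:
    """Read specimen rows, converting columns to the types listed above."""
    return [
        {**row, "x": int(row["x"]), "y": int(row["y"]), "mass": float(row["mass"])}
        for row in csv.DictReader(stream)  # assumes a header row naming the columns
    ]


# Invented stand-ins for tmp/grid.csv and tmp/specimens.csv.
grid_text = "0,3,1\n2,0,4\n1,1,0\n"
specimens_text = "ident,x,y,genome,mass\nS0001,1,2,ACGTTGCA,1.25\n"

grid = read_grid(io.StringIO(grid_text))
specimens = read_specimens(io.StringIO(specimens_text))
```

To read the generated files instead, pass open file objects, for example `open("tmp/grid.csv", newline="")`, in place of the `io.StringIO` wrappers.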