Metadata-Version: 2.4
Name: galaxy-datasets
Version: 0.0.25
Summary: Galaxy Zoo datasets for PyTorch/TensorFlow
Home-page: https://github.com/mwalmsley/galaxy-datasets
Author: Mike Walmsley
Author-email: walmsleymk1@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: GPU :: NVIDIA CUDA
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: pyarrow
Requires-Dist: requests
Requires-Dist: scikit-learn
Requires-Dist: omegaconf
Requires-Dist: datasets>=3.6
Requires-Dist: torch>=1.10.1
Requires-Dist: torchvision>=0.11.2
Requires-Dist: torchaudio>=0.10.1
Requires-Dist: pytorch-lightning>=2.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# galaxy-datasets

ML-friendly datasets for major Galaxy Zoo citizen science campaigns.

- PyTorch Datasets and PyTorch Lightning DataModules
- Framework-independent download and augmentation code

See also our [HuggingFace datasets](https://huggingface.co/mwalmsley), which offer faster downloads and more flexible use. This repo was created earlier and may ultimately be replaced by HuggingFace.

| Name      | Method | PyTorch Dataset | Published | Downloadable | Galaxies |
| ----------- | ----- | ----------- | --- | ---- | ---- |
| Galaxy Zoo 2 | gz2 | GZ2 | &#x2611; | &#x2611; | ~210k (main sample) |
| GZ UKIDSS | gz_ukidss | GZUKIDSS| &#x2612; | &#x2611; | ~71k |
| GZ Hubble*   | gz_hubble | GZHubble | &#x2611; | &#x2611; | ~106k (main sample) |
| GZ CANDELS   | gz_candels | GZCandels | &#x2611; | &#x2611; | ~50k |
| GZ DECaLS GZD-5 | gz_decals_5 | GZDecals5 | &#x2611; | &#x2611; | ~230k (GZD-5 only)|
| GZ Rings | gz_rings | GZRings | &#x2612; | &#x2611; | ~93k |
| GZ DESI  | gz_desi | GZDesi | &#x2611; | No** (500GB) | 8.7M |
| GZ UKIDSS | gz_ukidss | - | &#x2611; | &#x2611; | ~70k |
| GZ Euclid | gz_euclid | - | &#x2612; | &#x2611; | ~100k |
| GZ H2O (deep HSC) | gz_h2o | GZH2O| &#x2612; | &#x2611; | ~48k |
| GZ JWST (CEERS) | gz_jwst | GZJWST| &#x2612; | &#x2611; | ~7k |
| CFHT Tidal*** | tidal | Tidal | &#x2611; | &#x2611; | 1760 (expert) |

Any datasets marked as downloadable but not marked as published are only downloadable internally (for development purposes).

For each dataset, you must cite/acknowledge the GZ data release paper and the original telescope survey from which the images were derived. See [data.galaxyzoo.org](https://data.galaxyzoo.org) for the data release paper citations to use.

We also include small debugging datasets:

| Name      | Method | PyTorch Dataset | Downloadable | Galaxies |
| ----------- | ----- | ----------- |  ---- | ---- |
| Demo Rings (binary) | demo_rings | DemoRings |  &#x2611; | 1000 |
| Galaxy MNIST (four-class)| galaxy_mnist | GalaxyMNIST |  &#x2611; | 10k  |

Galaxy MNIST is also [available](https://github.com/mwalmsley/galaxy_mnist) as a pure torchvision dataset (exactly like MNIST).

*GZ Hubble is also available in "euclidised" form (i.e. with the Euclid PSF applied) to Euclid collaboration members. The method is `gz_hubble_euclidised`. Courtesy of Ben Aussel.

**Mike Smith has shared a replication of the GZ DESI images and labels on [HuggingFace](https://huggingface.co/datasets/Smith42/galaxies) (983GB).

***CFHT Tidal is not a Galaxy Zoo dataset, but rather a small expert-labelled dataset of tidal features from [Atkinson 2013](https://doi.org/10.1088/0004-637X/765/1/28).
MW reproduced and modified the images in [Walmsley 2019](https://doi.org/10.1093/mnras/sty3232). We include it here as a challenging fine-grained morphology classification task with little labelled data.

## Installation

Installing [zoobot](https://github.com/mwalmsley/zoobot) will automatically install this package as a dependency.

To install directly:

- `pip install galaxy-datasets` (includes PyTorch dependencies)

For local development (e.g. adding a new dataset), clone the repo from GitHub and run `pip install -e .` in the cloned repo root. The `-e` (editable) flag makes pip use the cloned source directly, so your code changes take effect without reinstalling. Without it, pip copies the package into `site-packages`.

I suggest either:

- For basic use without changes, installing `zoobot` via pip and allowing pip to manage this dependency
- For development, installing both `zoobot` and `galaxy-datasets` via git
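
To confirm which copy of the package your environment picked up, a minimal sketch using the standard library's `importlib.metadata` follows. The helper name `installed_version` is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "galaxy-datasets") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "galaxy-datasets is not installed")
```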

## Usage

Check out the PyTorch quickstart Colab [here](https://colab.research.google.com/drive/1mLXz0tUWO_kDrfWTlxB7JT2AnPPWQODg?usp=sharing), or keep reading for more explanation.

### Framework-Independent

To download a dataset:

    from galaxy_datasets import gz2  # or gz_hubble, gz_candels, ...

    catalog, label_cols = gz2(
        root='your_data_folder/gz2',
        train=True,
        download=True
    )

This will download the images and train/test catalogs to `root`. Each `catalog` is a pandas DataFrame with the column `file_loc` giving absolute image paths and additional columns `label_cols = ['col_a', 'col_b', ...]` giving the labels (usually, the number of volunteers who gave each answer for each galaxy). If `train=True`, the method returns the train catalog; otherwise, it returns the test catalog.
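
As a rough illustration of that structure, the sketch below builds a tiny stand-in catalog by hand and converts vote counts into vote fractions. The paths, the vote counts, and the second label column name are made up for illustration. With a real download, `catalog` and `label_cols` would come from `gz2(...)` instead.

```python
import pandas as pd

# Stand-in for the (catalog, label_cols) pair returned by gz2(...).
# Paths, counts, and the second column name are illustrative assumptions.
label_cols = [
    "smooth-or-featured-gz2_smooth",
    "smooth-or-featured-gz2_featured-or-disk",
]
catalog = pd.DataFrame(
    {
        "file_loc": ["/data/gz2/images/a.jpg", "/data/gz2/images/b.jpg"],
        "smooth-or-featured-gz2_smooth": [30, 5],
        "smooth-or-featured-gz2_featured-or-disk": [10, 15],
    }
)

# Total votes per galaxy across the answers to this question.
total_votes = catalog[label_cols].sum(axis=1)

# Fraction of volunteers giving each answer (each row sums to 1).
vote_fractions = catalog[label_cols].div(total_votes, axis=0)

print(vote_fractions.round(2))
```

Galaxies with zero total votes would produce NaN fractions here, so a real catalog may need those rows filtered out first.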

If training Zoobot from scratch, this is all you need. For example, in PyTorch:

    from zoobot.pytorch.training import train_with_pytorch_lightning

    train_with_pytorch_lightning.train_default_zoobot_from_scratch(
        catalog=catalog,
        save_dir=save_dir,
        schema=gz2_schema, # see zoobot/pytorch/examples/minimal_example.py
        ...
    )

Otherwise, you might like to use the classes in this package to load these catalogs into ML-friendly inputs.

### PyTorch

Create a PyTorch Dataset from a catalog like so:

    from galaxy_datasets.pytorch.galaxy_dataset import CatalogDataset  # generic Dataset for galaxies

    dataset = CatalogDataset(
        catalog=catalog.sample(1000),  # from gz2(...) above
        label_cols=['smooth-or-featured-gz2_smooth']
    )

Notice how you can adjust the catalog before creating the Dataset. This gives flexibility to try training on e.g. different catalog subsets.
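
For instance, one way to build such a subset with plain pandas is sketched below. The stand-in catalog, the 20-vote threshold, the second label column name, and the helper name `select_confident` are illustrative assumptions, not part of this package.

```python
import pandas as pd


def select_confident(catalog: pd.DataFrame, label_cols: list, min_votes: int = 20) -> pd.DataFrame:
    """Keep only galaxies with at least `min_votes` total votes across `label_cols`."""
    total_votes = catalog[label_cols].sum(axis=1)
    return catalog[total_votes >= min_votes].reset_index(drop=True)


# Hand-made stand-in for a downloaded catalog (values are made up).
label_cols = [
    "smooth-or-featured-gz2_smooth",
    "smooth-or-featured-gz2_featured-or-disk",
]
catalog = pd.DataFrame(
    {
        "file_loc": ["a.jpg", "b.jpg", "c.jpg"],
        label_cols[0]: [30, 2, 12],
        label_cols[1]: [10, 3, 8],
    }
)

subset = select_confident(catalog, label_cols, min_votes=20)
print(subset["file_loc"].tolist())
```

The resulting `subset` can then be passed as `catalog=subset` when constructing `CatalogDataset`.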

If you don't want to change anything about the catalog, you can skip the framework-independent download and use a named class from `galaxy_datasets.pytorch`, which takes the same arguments and directly gives a Dataset:

    from galaxy_datasets.pytorch import GZ2

    gz2_dataset = GZ2(
        root='your_data_folder/gz2',
        train=True,
        download=False
    )
    batch = gz2_dataset[0]
    image = batch['image']
    label = batch['smooth-or-featured-gz2_smooth']

You might also find the PyTorch Lightning DataModule under `galaxy_datasets/pytorch/galaxy_datamodule` useful. Zoobot uses this for training and finetuning.

    from galaxy_datasets.pytorch.galaxy_datamodule import CatalogDataModule
    from galaxy_datasets.transforms import get_galaxy_transform, default_view_config

    datamodule = CatalogDataModule(
        label_cols=['smooth-or-featured-gz2_smooth'],
        catalog=catalog,
        # optional args to specify augmentations
        train_transform=get_galaxy_transform(default_view_config()),
        test_transform=get_galaxy_transform(default_view_config())
    )

    datamodule.prepare_data()
    datamodule.setup()
    for batch in datamodule.train_dataloader():
        images = batch['image']
        labels = batch['smooth-or-featured-gz2_smooth']
        print(images.shape, labels.shape)
        break

### TensorFlow

*TensorFlow support has now been deprecated. The ML research community has broadly converged on PyTorch. We suggest using PyTorch or, for framework-independent data loading, our HuggingFace datasets.*

## Download Notes

Downloaded datasets have the following layout:

- {root}
    - images
        - subfolder (except GZ2)
            - image.jpg
    - {catalog_name(s)}.parquet

The whole dataset is downloaded regardless of whether `train=True` or `train=False`.
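
A small standard-library sketch for sanity-checking a download against this layout follows. It assumes only what is listed above: an `images` folder holding `.jpg` files (possibly in subfolders) and one or more `.parquet` catalogs directly under `root`. The helper name `summarise_download` is hypothetical.

```python
from pathlib import Path


def summarise_download(root: str) -> dict:
    """Count catalogs and images found under a downloaded dataset root."""
    root_path = Path(root)
    # Catalogs sit directly under root as .parquet files.
    catalogs = sorted(p.name for p in root_path.glob("*.parquet"))
    # Images may sit directly in images/ (GZ2) or in subfolders, so search recursively.
    n_images = sum(1 for _ in (root_path / "images").rglob("*.jpg"))
    return {"catalogs": catalogs, "n_images": n_images}


if __name__ == "__main__":
    print(summarise_download("your_data_folder/gz2"))
```

Comparing `n_images` with the number of rows across the train and test catalogs is one quick way to spot an interrupted download.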
