Metadata-Version: 2.1
Name: webdataset
Version: 0.1.3
Summary: Record sequential storage for deep learning.
Home-page: http://github.com/tmbdev/webdataset
Author: Thomas Breuel
Author-email: tmbdev+removeme@gmail.com
License: MIT
Keywords: object store,client,deep learning
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: Pillow
Requires-Dist: simplejson
Requires-Dist: braceexpand
Requires-Dist: msgpack
Requires-Dist: pyyaml
Requires-Dist: numpy
Requires-Dist: torch
Requires-Dist: objio

[![Test](https://github.com/tmbdev/webdataset/workflows/Test/badge.svg)](https://github.com/tmbdev/webdataset/actions?query=workflow%3ATest)
[![TestPip](https://github.com/tmbdev/webdataset/workflows/TestPip/badge.svg)](https://github.com/tmbdev/webdataset/actions?query=workflow%3ATestPip)
[![DeepSource](https://static.deepsource.io/deepsource-badge-light-mini.svg)](https://deepsource.io/gh/tmbdev/webdataset/?ref=repository-badge)

# WebDataset

WebDataset is a PyTorch Dataset (IterableDataset) implementation providing efficient access to datasets stored in POSIX tar archives.

Storing data in POSIX tar archives greatly speeds up I/O on rotational storage and on networked file systems, because all I/O can then be performed as large sequential reads and writes.

WebDataset fulfills a similar function to TensorFlow's TFRecord/tf.Example classes, but it is much easier to adopt because it requires no data conversion: data is stored in exactly the same format inside tar files as it is on disk, and all preprocessing and data augmentation code remains unchanged.


# Installation

```Bash
    $ pip install webdataset
```

For the GitHub version:

```Bash
    $ pip install git+https://github.com/tmbdev/webdataset.git
```

# Documentation

[ReadTheDocs](http://webdataset.readthedocs.io)

# Using WebDataset

Here is an example of an ImageNet input pipeline used for training common visual object recognition models. Note that this code is identical to the standard `FileDataset` I/O except for the single call that constructs the `WebDataset`.

        import torch
        from torchvision import transforms
        import webdataset as wds

        normalize = transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225])

        preproc = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ])

        path = "http://server/imagenet_train-{0000..0147}.tgz"

        dataset = wds.WebDataset(path,
                                 decoder="pil",
                                 extensions="jpg;png cls",
                                 transforms=[preproc, lambda x: x-1])

        loader = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=4)
        for xs, ys in loader:
            train_batch(xs, ys)
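
The `{0000..0147}` part of `path` is shell-style brace notation for a numbered range of shards. The sketch below shows what such a pattern stands for, using only the standard library. `expand_shards` is a hypothetical helper written for illustration, not part of the WebDataset API, and it assumes a single zero-padded numeric range.

```python
import re


def expand_shards(pattern):
    """Expand one {start..end} numeric range into a list of names.

    Hypothetical helper for illustration; handles a single range only.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # no range present: the pattern is a single name
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding of the start value
    return [
        pattern[:match.start()] + str(i).zfill(width) + pattern[match.end():]
        for i in range(int(start), int(end) + 1)
    ]


urls = expand_shards("http://server/imagenet_train-{0000..0147}.tgz")
print(len(urls), urls[0], urls[-1])
```

Under these assumptions, the pattern in the example names 148 shards, from `imagenet_train-0000.tgz` to `imagenet_train-0147.tgz`.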

# Creating WebDataset

To permit record-sequential access to data, WebDataset only requires that the files comprising a single training sample are stored adjacent to each other inside the tar archive. Such archives can be easily created using GNU tar:

        tar --sort=name -cf dataset.tar dir

On BSD and OSX, you can use:

        find dir -type f -print | sort | tar -T - -cf dataset.tar
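
The adjacency requirement can also be met from Python. The following is a minimal sketch using only the standard `tarfile` module; `write_samples`, `read_samples`, and `split_key` are hypothetical helpers, not WebDataset functions, and the rule that a sample key is the file name up to the first dot of its base name is an assumption made for this example.

```python
import io
import os
import tarfile
import tempfile


def split_key(name):
    """Split 'dir/0001.jpg' into ('dir/0001', 'jpg') (assumed key rule)."""
    directory, _, base = name.rpartition("/")
    stem, _, ext = base.partition(".")
    return (directory + "/" + stem if directory else stem), ext


def write_samples(tar_path, samples):
    """Write {key: {extension: bytes}} so each sample's files are adjacent."""
    with tarfile.open(tar_path, "w") as tar:
        for key in sorted(samples):
            for ext in sorted(samples[key]):
                data = samples[key][ext]
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))


def read_samples(tar_path):
    """Yield (key, files) by grouping consecutive members with the same key."""
    current_key, current = None, {}
    with tarfile.open(tar_path, "r|") as tar:  # "r|" reads as a pure stream
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if current and key != current_key:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.tar")
    write_samples(path, {
        "0002": {"jpg": b"image-b", "cls": b"1"},
        "0001": {"cls": b"0", "jpg": b"image-a"},
    })
    result = list(read_samples(path))

print([key for key, _ in result])
```

Because the reader only compares each member with the previous one, it never needs to seek, which is what makes purely sequential reads possible.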

Very large datasets are best stored as shards, each comprising a number of samples. Shards can be shuffled, read, and processed in parallel. The companion `tarproc` library permits easy sharding, as well as parallel processing of WebDatasets and shards. The `tarproc` programs simply operate as filters on tar streams, so for sharding, you can use a command like this:

        tar --sort=name -cf - dir | tarsplit -s 1e9 -o out
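
To illustrate the idea of size-based sharding, here is a minimal sketch of how consecutive samples might be grouped into shards under a byte budget. It is not how `tarsplit` is implemented; `plan_shards`, the sample sizes, and the `out-000000.tar` naming scheme are all assumptions made for this example.

```python
def plan_shards(sample_sizes, max_bytes):
    """Group consecutive (key, size) pairs into shards of at most max_bytes.

    Samples are never split; a sample larger than max_bytes gets a shard
    of its own. Hypothetical helper for illustration.
    """
    shards, current, used = [], [], 0
    for key, size in sample_sizes:
        if current and used + size > max_bytes:
            shards.append(current)  # close the shard before it overflows
            current, used = [], 0
        current.append(key)
        used += size
    if current:
        shards.append(current)
    return shards


sizes = [("a", 400), ("b", 500), ("c", 300), ("d", 1200), ("e", 100)]
plan = plan_shards(sizes, max_bytes=1000)
for index, keys in enumerate(plan):
    print(f"out-{index:06d}.tar", keys)
```

Keeping whole samples together in one shard preserves the adjacency requirement, so each shard can be read independently.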


# TODO

 - support `image.*` and `image=jpg,png,jpeg` syntax for extensions


