Metadata-Version: 2.1
Name: swh.loader.tar
Version: 0.0.36
Summary: Software Heritage Tarball Loader
Home-page: https://forge.softwareheritage.org/diffusion/DLDTAR
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-tar
Project-URL: Funding, https://www.softwareheritage.org/donate
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Requires-Dist: vcversioner
Requires-Dist: retrying
Requires-Dist: click
Requires-Dist: python-dateutil
Requires-Dist: swh.core (>=0.0.36)
Requires-Dist: swh.model (>=0.0.27)
Requires-Dist: swh.scheduler (>=0.0.14)
Requires-Dist: swh.storage (>=0.0.83)
Requires-Dist: swh.loader.core (>=0.0.35)
Requires-Dist: swh.loader.dir (>=0.0.33)
Provides-Extra: testing
Requires-Dist: pytest ; extra == 'testing'

# SWH Tarball Loader

The Software Heritage Tarball Loader is a tool and a library to
uncompress a local tarball and inject into the SWH dataset its tree
representation.

## Configuration

This is the loader's (or task's) configuration file.

*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*:

```YAML
extraction_dir: /home/storage/tmp/
storage:
  cls: remote
  args:
    url: http://localhost:5002/
```

## API

Load tarball directly from code or python3's toplevel:

``` Python
# Fill in those
repo = 'loader-tar.tgz'
tarpath = '/home/storage/tar/%s' % repo
origin = {'url': 'ftp://%s' % repo, 'type': 'tar'}
visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
revision = {
    'author': {'name': 'some', 'fullname': 'one', 'email': 'something'},
    'committer': {'name': 'some', 'fullname': 'one', 'email': 'something'},
    'message': '1.0 Released',
    'date': None,
    'committer_date': None,
    'type': 'tar',
}
import logging
logging.basicConfig(level=logging.DEBUG)

from swh.loader.tar.tasks import LoadTarRepository
l = LoadTarRepository()
l.run_task(tar_path=tarpath, origin=origin, visit_date=visit_date,
           revision=revision, branch_name='master')
```

## Celery

Load tarball using celery.

Providing you have a properly configured celery up and running, the
celery worker configuration file needs to be updated:

*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/worker.yml*:

``` YAML
task_modules:
  - swh.loader.tar.tasks
task_queues:
  - swh_loader_tar
```

cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md
for more details


## Tar Producer

Its job is to compulse from a file or a folder a list of existing
tarballs. From this list, compute the corresponding messages to send
to the broker.

### Configuration

Message producer's configuration file (`tar.yml`):

``` YAML
# Mirror's root directory holding tarballs to load into swh
mirror_root_directory: /srv/storage/space/mirrors/gnu.org/gnu/
# Url scheme prefix used to create the origin url
url_scheme: http://ftp.gnu.org/gnu/
type: ftp

# File containing a subset list tarballs from mirror_root_directory to load.
# The file's format is one absolute path name to a tarball per line.
# NOTE:
# - This file must contain data consistent with the mirror_root_directory
# - if this option is not provided, the mirror_root_directory is scanned
# completely as usual
# mirror_subset_archives: /home/storage/missing-archives

# Randomize blocks of messages and send for consumption
block_messages: 250
```

### Run

Trigger the message computations:

```Shell
python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.yml
```

This will walk the `mirror_root_directory` folder and send encountered
tarball messages for the swh-loader-tar to uncompress (through
celery).

If the `mirror_subset_archives` is provided, the tarball messages will
be computed from such file (the `mirror_root_directory` is still used
so please be consistent).

If problem arises during tarball message computation, a message will
be output with the tarball that present a problem.

It will displayed the number of tarball messages sent at the end.

### Dry run

``` Shell
python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml --dry-run
```

This will do the same as previously described but only display the
number of potential tarball messages computed.

### Help

``` Shell
python3 -m swh.loader.tar.producer --help
```


