Metadata-Version: 2.1
Name: corpusinterface
Version: 0.1.0
Summary: tools for loading corpora
Home-page: https://github.com/DCMLab/CorpusInterface
Author: Robert Lieck
Author-email: robert.lieck@epfl.ch
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# Corpus Interface

![build](https://github.com/DCMLab/CorpusInterface/workflows/build/badge.svg)
[![PyPI version](https://badge.fury.io/py/corpusinterface.svg)](https://badge.fury.io/py/corpusinterface)

![tests](https://github.com/DCMLab/CorpusInterface/workflows/tests/badge.svg)
[![codecov](https://codecov.io/gh/DCMLab/CorpusInterface/branch/master/graph/badge.svg?token=BooAiwbcyk)](https://codecov.io/gh/DCMLab/CorpusInterface)

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

Basic functionality to maintain and load corpora.

## Installation

`pip install corpusinterface`

## Managing Corpora

### Adding your own corpus

Say, you packaged a number of files into a corpus

```
your-corpus
  |- file_1.txt
  |- file_2.txt
  |- dir_1
    |- file_3.txt
    |- file_4.txt
```

and let's assume you made it available as a zip archive at `http://your-website.com/your-corpus.zip`. Using the corpus interface these file can be accessed as follows:

```python
from corpusinterface import config, load

# initialise config
config.init_config()

# add your corpus
config.add_corpus("Your Corpus",
                  access="zip",
                  url="http://your-website.com/your-corpus.zip")

# load the corpus
corpus = load("Your Corpus", download=True)

# access the data (using a file_reader of your choice)
for file in corpus.data(file_reader=lambda file, **kwargs: f"reading: {file}"):
    print(file)
```

This will print

```
reading: ~/corpora/Your Corpus/file_1.txt
reading: ~/corpora/Your Corpus/file_2.txt
reading: ~/corpora/Your Corpus/dir_1/file_3.txt
reading: ~/corpora/Your Corpus/dir_1/file_4.txt
```

with `~` being replaced with your home directory (paths might be displayed differently, depending on your operating system).

### Config

Instead of specifying the necessary information from within Python, you can also put it in a config file:

```ini
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
```

If you put this file at the default location  `~/corpora/corpora.ini` in your home directory or a file `corpora.ini` in the current working directory, it is automatically loaded when calling `config.init_config()`. Otherwise, you can load any config file by either providing it to `init_config`

```python
config.init_config("your-config-file.ini")
```

or loading it manually later

```python
config.load_config("your-config-file.ini")
```

#### Default config

A default config file is shipped with the `corpusinterface`  package and automatically loaded by `init_config`. It defines some useful defaults that are used for newly added corpora if no corpus-specific values are specified. You can see all the config information associated to your corpus by printing a summary:

```python
print(config.summary(corpus="Your Corpus"))
```

```ini
[Your Corpus]
    access: zip
    url: http://your-website.com/your-corpus.zip
    info: None
    root: ~/corpora
    path: ~/corpora/Your Corpus
    parent: None
    loader: FileCorpus
```

In particular, the default `root` directory `~/corpora` was added and the corpus is stored in a `path` that is a subdirectory `~/corpora/Your Corpus` according to its name (more on `root` and `path` below). Moreover, by default we assume to have a `FileCorpus` consisting of a simple collection of files.

#### Special parameters

The parameters `root`, `path`, `parent`,  `download`, `loader`, `access`, and `url` are special and their values are treated in a particular way.

##### `root`

Root directory to store the corpus in. This should be an absolute path, `~` is expanded to the user home. If a relative path is specified, a warning is issued and it is interpreted relative to the current working directory. If `parent` is non-empty, the value of `root` is ignored and instead the parent's `path` is used. A call to `config.get(Name, 'root')` returns the effective value.

##### `path`

Directory to store the corpus in. This can be

1. an absolute path (`~` is expanded to the user home), in which case `root` is ignored
2. a relative path, in which case it is appended to `root` or
3. be empty, in which case the corpus `[Name]` is appended to `root`.

A call to `config.get(Name, 'path')` returns the effective value. Note that for sub-corpora (with non-empty `parent`) the parent's `path` is used instead of `root`.

##### `parent`

A parent corpus name or empty. If non-emtpy, the parent corpus should be defined separately and the value of `root` is ignored and replaced by the parent's `path`.

Initialisation (e.g. downloading from `url` with `access` method) is delegated to the parent corpus when loading a sub-corpus.

##### `download`

##### `loader`

##### `access`

##### `url`

#### Additional parameters

You can specify additional parameters that are handed over to the loader and (in case of the `FileCorpus` loader) further passed on the your `file_reader` function. For instance, you could specify

```ini
prefix: my prefix
```

in the config file or equivalently

```python
config.add_corpus("Your Corpus",
                  ...,
                  prefix="my prefix")
```

from within Python. Your file reader can then make use of this parameter (provided as a keyword argument, so you have to refer to it by the correct name)

```python
file_reader=lambda file, prefix, **kwargs: f"{prefix}: {file}"
```

```
my prefix: ~/corpora/Your Corpus/file_1.txt
...
```

This is also the reason why we always need  `**kwargs` in a reader function to accept all keyword arguments that are provided, even if we decide to not use them.

The config values can be dynamically overwritten in the `load` function

```python
corpus = load("Your Corpus",
              ...,
              prefix="other prefix")
```

```
other prefix: ~/corpora/Your Corpus/file_1.txt
...
```

or in the `data` function:

```python
for file in corpus.data(..., prefix="still different"):
    ...
```

```
still different: ~/corpora/Your Corpus/file_1.txt
...
```

#### Controlling initialisation

You have full control over how the config is initialised. A call to `config.init_config()` without any arguments will load the default config, look for `corpora.ini` in `~/corpora` and the current working directory and load them, too, if present. This is equivalent to calling

```python
config.init_config(default=None, home=None, local=None)
```

For each of these parameters you may alternatively specify a value of `True` (meaning that you _expect_ the respective config file to be loaded and otherwise an error is raised), or `False` (meaning the the respective config file is _not_ loaded). Additionally, you may specify one or more files that should additionally be loaded

```python
config.init_config("/path/to/file_1.ini", "/path/to/file_2.ini", ...)
```



