Metadata-Version: 2.1
Name: parscival
Version: 0.7.0
Summary: modular framework for parsing, mapping and transforming STS data
Home-page: https://gitlab.com/cortext/cortext-methods/parscival
Author: Cristian Martinez, Lionel Villard
Author-email: nobody@nowhere.com
License: MIT
Project-URL: Source, https://gitlab.com/cortext/cortext-methods/parscival
Platform: any
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8
License-File: LICENSE.txt
Requires-Dist: importlib-metadata==4.11.3; python_version < "3.8"
Requires-Dist: rich==10.16.1
Requires-Dist: parsimonious==0.10.0
Requires-Dist: pyyaml
Requires-Dist: klepto==0.2.1
Requires-Dist: h5py==3.11.0
Requires-Dist: jinja2==3.0.3
Requires-Dist: pysqlite3==0.5.2
Requires-Dist: pluginlib==0.9.0
Requires-Dist: python-dotenv==0.20.0
Requires-Dist: python-dateutil==2.8.2
Requires-Dist: pyquery==2.0.0
Requires-Dist: semver==3.0.2
Requires-Dist: chardet==5.2.0
Requires-Dist: psutil==5.9.8
Provides-Extra: testing
Requires-Dist: setuptools; extra == "testing"
Requires-Dist: pytest; extra == "testing"
Requires-Dist: pytest-cov; extra == "testing"

# Parscival
![Parscival](https://gitlab.com/cortext/cortext-methods/parscival/-/raw/master/docs/_static/logo.png)

## Description

Parscival is a modular framework for ingesting, parsing, mapping, curating,
validating and storing textual data. It was originally designed to
process STS inputs and export them to an arbitrary format.

Data parsing and transformation are performed according to an experimental specification
described in a YAML file. For an example, see [here](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/pubmed/pubmed-nbib.yaml).

The output data is saved by default using the
[HDF5](https://www.hdfgroup.org/solutions/hdf5/) binary data format. HDF5
is an open source file format that supports large, complex, heterogeneous data.
It is designed for fast I/O processing and storage.

To enable parallel (on-the-fly) access to the HDF5 data produced, Parscival
uses [klepto](https://github.com/uqfoundation/klepto), a Python library that
provides fast and flexible access to large amounts of storage.

To define how the data is transformed into an arbitrary output
format, Parscival implements a lightweight plugin architecture. For example, by using
the [render-template](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_plugins/storing/plain/render_template.py) plugin, the output
can simply be described as a [Jinja](https://jinja.palletsprojects.com/en/3.0.x/)
template. For an example of how to transform the data into ``json``,
see [here](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/default/assets/cortext.json.tpl).

## Install

```console
pip install parscival
```

## Usage

```console
usage: parscival [-h] [--job-id JOB_ID] [--version] [-v] [-vv] FILE_PARSER_SPEC FILE_OUTPUT FILE_DATASET [FILE_DATASET ...]

A modular framework for ingesting, parsing, mapping, curating, validating and storing heterogeneous data

positional arguments:
  FILE_PARSER_SPEC     parscival specification
  FILE_OUTPUT          processed data output
  FILE_DATASET         input dataset

options:
  -h, --help           show this help message and exit
  --job-id JOB_ID      job identifier for logging
  --version            show program's version number and exit
  -v, --verbose        set loglevel to INFO
  -vv, --very-verbose  set loglevel to DEBUG
```

### Examples

```console
# converts documents from pesticides-s.nbib into pesticides.cortext.json as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed/pubmed-nbib.yaml /tmp/pesticides.cortext.json tests/datasets/pesticides-s.nbib

# converts documents from both pesticides-s.nbib and hetercat-s.nbib into pesticides.cortext.db as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed/pubmed-nbib.yaml /tmp/pesticides.cortext.db tests/datasets/pesticides-s.nbib tests/datasets/hetercat-s.nbib
```
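When Parscival is called from a Python script, the same command line can be assembled programmatically. The sketch below only mirrors the usage text shown above: `build_parscival_command` and `run_parscival` are hypothetical helpers, not part of Parscival, and running the command assumes the `parscival` executable is available on `PATH`.

```python
import subprocess
from typing import List, Optional, Sequence


def build_parscival_command(
    spec: str,
    output: str,
    datasets: Sequence[str],
    verbose: bool = False,
    job_id: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list that follows the usage text.

    Hypothetical helper for illustration; it is not part of Parscival.
    """
    if not datasets:
        raise ValueError("at least one FILE_DATASET is required")
    command = ["parscival"]
    if job_id is not None:
        command += ["--job-id", job_id]
    if verbose:
        command.append("-v")
    # Positional order: FILE_PARSER_SPEC FILE_OUTPUT FILE_DATASET [FILE_DATASET ...]
    command += [spec, output, *datasets]
    return command


def run_parscival(command: List[str]) -> int:
    """Run the command and return its exit code (needs parscival installed)."""
    return subprocess.run(command, check=False).returncode


command = build_parscival_command(
    "src/parscival_specs/pubmed/pubmed-nbib.yaml",
    "/tmp/pesticides.cortext.json",
    ["tests/datasets/pesticides-s.nbib"],
    verbose=True,
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues when dataset paths contain spaces.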

## Supported formats

### Sources

- ``PubMed (.nbib)`` : PubMed is a free search engine accessing primarily the MEDLINE
database of references and abstracts on life sciences and biomedical topics. The
parsing spec is available [here](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/pubmed/pubmed-nbib.yaml). You can find a more
detailed description in the related [documentation](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/docs/guides/key-value.md).

- ``Europresse (.html)`` : Europresse is a comprehensive database providing access
to a vast range of news and information from various sources. The [parsing specification](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/europresse/europresse-html.yaml)
allows for extracting structured data from Europresse HTML files.

### Intermediate data

The intermediate data is stored using the CorText Graph format:

| Field       | Value                                | Type             | Description              |
| ----------- | ------------------------------------ | ---------------- | ------------------------ |
| `file`      | `sourceFile(fieldName)`              | `text`           | source file for the data |
| `id`        | `fieldName.doc[0,n-1]`               | `integer`        | ID of each document      |
| `rank`      | `fieldName.doc[id][0,m-1]`           | `integer`        | field cardinal index     |
| `parserank` | `fieldName.doc[id][rank][0,p-1]`     | `integer`        | parsed cardinal index    |
| `data`      | `fieldName.doc[id][rank][parserank]` | `[text,integer]` | parsed data              |
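
To make the table concrete, the following sketch flattens a small in-memory structure into rows carrying the five fields listed. The nested input layout, the function name `to_cortext_rows` and the sample values are assumptions made for illustration; they do not reproduce Parscival's internal code.

```python
from typing import Iterator, List, Tuple, Union

Value = Union[str, int]
# One row per parsed value: (file, id, rank, parserank, data)
Row = Tuple[str, int, int, int, Value]


def to_cortext_rows(
    source_file: str,
    field_docs: List[List[List[Value]]],
) -> Iterator[Row]:
    """Yield one row per parsed value of a single field.

    field_docs[id][rank][parserank] holds the parsed data, mirroring
    fieldName.doc[id][rank][parserank] in the table.
    """
    for doc_id, occurrences in enumerate(field_docs):
        for rank, parsed_values in enumerate(occurrences):
            for parserank, data in enumerate(parsed_values):
                yield (source_file, doc_id, rank, parserank, data)


# Hypothetical "author" field: document 0 has two authors, document 1 has one,
# and each author is split into two parsed parts.
author_docs = [
    [["Doe", "Jane"], ["Roe", "Richard"]],
    [["Poe", "Edgar"]],
]
rows = list(to_cortext_rows("sample.nbib", author_docs))
for row in rows:
    print(row)
```

Each index starts at zero, so `id` ranges over `[0, n-1]`, `rank` over `[0, m-1]` and `parserank` over `[0, p-1]`, as in the table.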

### Output

- ``cortext.json``: intermediate data is converted to ``json`` using the [cortext.json template](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/default/assets/cortext.json.tpl)

- ``cortext.sqlite``: intermediate data is converted to a ``sqlite`` script using
the [cortext.sqlite template](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/default/assets/cortext.sqlite.tpl). If requested by the
processing spec, the resulting ``sqlite`` script can be interpreted and thus converted
to a binary database.
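
Interpreting a SQL script into a binary database can be sketched with Python's standard `sqlite3` module. This is an illustration, not Parscival's implementation: the table layout and the script text are invented for the example, and the real script is whatever the `cortext.sqlite` template renders.

```python
import os
import sqlite3
import tempfile

# Stand-in for a rendered script; this schema is an assumption.
script = """
CREATE TABLE title (file TEXT, id INTEGER, rank INTEGER, parserank INTEGER, data TEXT);
INSERT INTO title VALUES ('sample.nbib', 0, 0, 0, 'First title');
INSERT INTO title VALUES ('sample.nbib', 1, 0, 0, 'Second title');
"""


def script_to_database(sql_script: str, db_path: str) -> None:
    """Execute every statement of the script against a database file."""
    connection = sqlite3.connect(db_path)
    try:
        connection.executescript(sql_script)
        connection.commit()
    finally:
        connection.close()


with tempfile.TemporaryDirectory() as workdir:
    db_path = os.path.join(workdir, "sample.db")
    script_to_database(script, db_path)

    # Read the data back to check that the binary database is usable.
    connection = sqlite3.connect(db_path)
    try:
        titles = [
            row[0]
            for row in connection.execute("SELECT data FROM title ORDER BY id")
        ]
    finally:
        connection.close()

print(titles)
```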

## Requirements

Parscival has been set up using PyScaffold, a project generator for
bootstrapping high-quality Python packages. For details and usage information
on PyScaffold see <https://pyscaffold.org>.

This project uses PyScaffold in combination with Tox, a generic virtualenv management
and test command-line tool that acts as a frontend to Continuous Integration servers.
A list with all the available tasks is obtained via the ``tox -av`` command.

To prepare your environment you will need to install the following dependencies:

```console
pip install -U pip setuptools
pip install -U tox
```

## Deployment

```console
virtualenv .venv
source .venv/bin/activate
# ... edit setup.cfg to add dependencies ...
pip install -e .
tox

# to build distribution
tox -e build
```

## Documentation

To compile the Parscival documentation, run:

```console
tox -e docs
```

## Dependencies

- ``libhdf5-dev``: Provides the development files for the HDF5 (Hierarchical Data
Format version 5) library. HDF5 is designed to store and organize large amounts
of data, making it suitable for high-performance data processing applications.

## Environment variables

- ``PARSCIVAL_PLUGINS_PATHS``: Specifies the directories where Parscival should
    look for plugins.

- ``PARSCIVAL_PLUGIN_RENDER_TEMPLATE_DIR``: Specifies the directory where Parscival
    should look for default rendering templates used by plugins.

- ``PARSCIVAL_LOG_PATH``: Specifies the directory where Parscival should store its
    log files.
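
As an illustration of how these variables might be consumed, the sketch below reads them from a mapping such as `os.environ`. The README does not state how multiple directories are separated in ``PARSCIVAL_PLUGINS_PATHS``; the example assumes the platform path separator (`os.pathsep`). The helper name `read_parscival_env` and the sample paths are hypothetical.

```python
import os
from typing import Dict, Mapping


def read_parscival_env(environ: Mapping[str, str]) -> Dict[str, object]:
    """Collect the Parscival-related settings from an environment mapping.

    Hypothetical helper for illustration; it is not part of Parscival.
    """
    raw_paths = environ.get("PARSCIVAL_PLUGINS_PATHS", "")
    # Assumption: directories are joined with os.pathsep (":" on POSIX).
    plugins_paths = [path for path in raw_paths.split(os.pathsep) if path]
    return {
        "plugins_paths": plugins_paths,
        "render_template_dir": environ.get("PARSCIVAL_PLUGIN_RENDER_TEMPLATE_DIR"),
        "log_path": environ.get("PARSCIVAL_LOG_PATH"),
    }


# Sample values; pass os.environ instead to read the real environment.
example_env = {
    "PARSCIVAL_PLUGINS_PATHS": os.pathsep.join(["/opt/plugins", "/srv/plugins"]),
    "PARSCIVAL_LOG_PATH": "/var/log/parscival",
}
settings = read_parscival_env(example_env)
print(settings)
```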

## Learn more

To learn more about Parscival, compile the documentation by executing the following command: ``tox -e docs``

Alternatively, you may directly refer to some raw documentation pages linked below:

### General

- [How to process HTML documents with Parscival](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/docs/guides/html.md)
- [How to process plain text key-value documents with Parscival](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/docs/guides/key-value.md)

### Plugins

- [Parsing](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/docs/plugins/parsing/index.md)
- [Mapping](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/docs/plugins/mapping/index.md)
- [Curating](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/docs/plugins/curating/index.md)
- [Storing](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/docs/plugins/storing/index.md)

### Parscival specification examples

- [Europresse (.html)](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/europresse/europresse-html.yaml)
- [Pubmed (.nbib)](https://gitlab.com/cortext/cortext-methods/parscival/-/blob/master/src/parscival_specs/pubmed/pubmed-nbib.yaml)

## Credits

Parscival is being developed by the [CorTexT Platform](https://www.cortext.net) and
[Cogniteva SAS](https://cogniteva.com).
