Metadata-Version: 2.1
Name: libhxl
Version: 4.25.1
Summary: Python support library for the Humanitarian Exchange Language (HXL). See http://hxlstandard.org and https://github.com/HXLStandard/libhxl-python
Home-page: http://hxlproject.org
Author: David Megginson
Author-email: megginson@un.org
License: UNKNOWN
Platform: UNKNOWN
Description-Content-Type: text/markdown
License-File: LICENSE.md

libhxl-python
=============

Python support library for the Humanitarian Exchange Language (HXL)
data standard.  The library requires Python 3 (versions prior to 4.6
also supported Python 2.7).

About HXL: http://hxlstandard.org

More-detailed documentation is available in the wiki: https://github.com/HXLStandard/libhxl-python/wiki


# Usage

The _hxl.data()_ function reads HXL from a file object, filename, URL,
or list of arrays and makes it available for processing, much like
``$()`` in JQuery:

```
import sys
import hxl

dataset = hxl.data(sys.stdin)
```

You can add additional methods to process the data.  This example
shows an identity transformation in a pipeline (See "Generators",
below):

```
for line in hxl.data(sys.stdin).gen_csv():
    print(line)
```

This is the Same transformation, but loading the entire dataset into
memory as an intermediate step (see "Filters", below):

```
for line in hxl.data(sys.stdin).cache().gen_csv():
    print(line)
```


## Filters

There are a number of filters that you can apply in a stream after a
HXL dataset.  This example uses the _with_rows()_ filter to find every
row that has a #sector of "WASH" and print the organisation mentioned
in the row:

```
for row in hxl.data(sys.stdin).with_rows('#sector=WASH'):
    print('The organisation is {}'.format(row.get('#org')))
```

This example removes the WASH sector from the results, then counts the
number of times each organisation appears in the remaining rows:

```
url = 'http://example.org/data.csv'
result = hxl.data(url).with_rows('#sector!=WASH').count('#org')
```

The following filters are available:

<table>
  <thead>
    <th>Filter method</th>
    <th>Description</th>
  </thead>
  <tbody>
    <tr>
      <td><code>.append(append_sources, add_columns=True, queries=[])</code></td>
      <td>Append a second HXL dataset to the current one, lining up columns.</td>
    </tr>
    <tr>
      <td><code>.cache()</code></td>
      <td>Cache an in-memory version of the dataset (for processing multiple times).</td>
    </tr>
    <tr>
      <td><code>.dedup(patterns=[], queries=[])</code></td>
      <td>Deduplicate the rows in a dataset, optionally looking only at specific columns.</td>
    </tr>
    <tr>
      <td><code>.with_columns(whitelist)</code></td>
      <td>Include only columns that match the whitelisted tag pattern(s), e.g. "#org+impl".</td>
    </tr>
    <tr>
      <td><code>.without_columns(blacklist)</code></td>
      <td>Include all columns <em>except</em> those that match the blacklisted tag patterns.</td>
    </tr>
    <tr>
      <td><code>.with_rows(queries, mask=[])</code></td>
      <td>Include only rows that match at least one of the queries, e.g. "#sector=WASH". Optionally ignore rows that don't match a mask pattern.</td>
    </tr>
    <tr>
      <td><code>.without_rows(queries, mask=[])</code></td>
      <td>Exclude rows that match at least one of the queries, e.g. "#sector=WASH". Optionally ignore rows that don't match a mask pattern.</td>
    </tr>
    <tr>
      <td><code>.sort(keys=None, reverse=False)</code></td>
      <td>Sort the rows, optionally using the pattern(s) provided as sort keys. Set _reverse_ to True for a descending sort.</td>
    </tr>
    <tr>
      <td><code>.count(patterns=[], aggregators=None, queries=[])</code></td>
      <td>Count the number of value combinations that appear for the pattern(s), e.g. ['#sector', '#org']. Optionally perform other aggregations, such as sums or averages.</td>
    </tr>
    <tr>
      <td><code>.replace_data(original, replacement, pattern=None, use_regex=False, queries=[])</code></td>
      <td>Replace values in a HXL dataset.</td>
    </tr>
    <tr>
      <td><code>.replace_data_map(map_source, queries=[])</code></td>
      <td>Replace values in a HXL dataset using a replacement map in another HXL dataset.</td>
    </tr>
    <tr>
      <td><code>.add_columns(specs, before=False)</code></td>
      <td>Add columns with fixed values to the dataset, e.g. "Country#country=Kenya" to add a new column #country with the text header "Country" and the value "Kenya" in every row.</td>
    </tr>
    <tr>
      <td><code>.rename_columns(specs)</code></td>
      <td>Change the header text and HXL hashtags for one or more columns.</td>
    </tr>
    <tr>
      <td><code>.clean_data(whitespace=[], upper=[], lower=[], date=[], number=[], queries=[])</code></td>
      <td>Clean whitespace, normalise dates and numbers, etc., optionally limited to specific columns.</td>
    </tr>
    <tr>
      <td><code>.merge_data(merge_source, keys, tags, replace=False, overwrite=False, queries=[])</code></td>
      <td>Merge values horizontally from a second dataset, based on shared keys (like a SQL join).</td>
    </tr>
    <tr>
      <td><code>.explode(header_attribute='header', value_attribute='value')</code></td>
      <td>Explode a "wide" dataset into a "long" dataset, using the HXL +label attribute.</td>
    </tr>
  </tbody>
</table>

## Sinks

Sinks take a HXL stream and convert it into something that's not HXL.

### Validation

To validate a HXL dataset against a schema (also in HXL), use the ``validate`` sink:

```
is_valid = hxl.data(url).validate('my-schema.csv')
```

If you don't specify a schema, the library will use a simple, built-in schema:

```
is_valid = hxl.data(url).validate()
```

If you include a callback, you can collect details about the errors and warnings:

```
def my_callback(error_info):
    # error_info is a HXLValidationException
    sys.stderr.write(error_info)

is_valid = hxl.data(url).validate(schema='my-schema.csv', callback=my_callback)
```

### Generators

Generators allow the re-serialising of HXL data, returning something that works like an iterator.  Example:

```
for line in hxl.data(url).gen_csv():
    print(line)
```

The following generators are available (you can use the parameters to turn the text headers and HXL tags on or off):

<table>
  <thead>
    <th>Generator method</th>
    <th>Description</th>
  </thead>
  <tbody>
    <tr>
      <td><code>Dataset.gen_raw(show_headers=True, show_tags=True)</code></td>
      <td>Generate arrays of strings, one row at a time.</td>
    </tr>
    <tr>
      <td><code>Dataset.gen_csv(show_headers=True, show_tags=True)</code></td>
      <td>Generate encoded CSV rows, one row at a time.</td>
    </tr>
    <tr>
      <td><code>Dataset.gen_json(show_headers=True, show_tags=True)</code></td>
      <td>Generate encoded JSON rows, one row at a time.</td>
    </tr>
  </tbody>
</table>


## Caching

libhxl uses the Python
[requests](http://docs.python-requests.org/en/master/) library for
opening URLs. If you want to enable caching (for example, to avoid
beating up on your source with repeated requests), your code can use
the [requests_cache](https://pypi.python.org/pypi/requests-cache)
plugin, like this:

    import requests_cache
    requests_cache.install_cache('demo_cache', expire_after=3600)

The default caching backend is a sqlite database at the location specied.


# Installation

This repository includes a standard Python `setup.py` script for
installing the library and scripts (applications) on your system. In a
Unix-like operating system, you can install using the following
command:

```
python setup.py install
```

If you don't need to install from source, try simply

```
pip install libhxl
```

Once you've installed, you will be able to include the HXL libraries
from any Python application, and will be able to call scripts like
_hxlvalidate_ from the command line.


# Makefile

There is also a generic Makefile that automates many tasks, including
setting up a Python virtual environment for testing. The Python3 venv
module is required for most of the targets.


```
make build-venv
```

Set up a local Python virtual environment for testing, if it doesn't
already exist. Will recreate the virtual environment if setup.py has
changed.

```
make test
```

Set up a virtual environment (if missing) and run all the unit tests

```
make test-install
```

Test a clean installation to verify there are no missing dependencies,
etc.

```
make close-issue
```

Merge the current git issue branch into the dev branch and delete the
issue branch.

```
make push-dev
```

Push the git dev branch to upstream.

```
make merge-test
```

Merge the git dev branch into the test branch and push to upstream.

```
make merge-master
```

Merge the git test branch into the master branch and push to upstream.

```
make etags
```

(Re)build the TAGS file that Emacs uses for global search and replace.


