Metadata-Version: 2.1
Name: e2e-Dutch
Version: 0.4.0
Summary: Coreference resolution with e2e for Dutch
Home-page: https://github.com/Filter-Bubble/e2e-Dutch
Author: Dafne van Kuppevelt
Author-email: d.vankuppevelt@esciencecenter.nl
License: UNKNOWN
Description: ![Python package](https://github.com/Filter-Bubble/e2e-Dutch/workflows/Python%20package/badge.svg)
        [![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/Filter-Bubble/e2e-Dutch/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/Filter-Bubble/e2e-Dutch/?branch=master)
        [![codecov](https://codecov.io/gh/Filter-Bubble/e2e-coref/branch/master/graph/badge.svg)](https://codecov.io/gh/Filter-Bubble/e2e-coref)
        [![DOI](https://zenodo.org/badge/276878416.svg)](https://zenodo.org/badge/latestdoi/276878416)
        
        
        # e2e-Dutch
        Code for an end-to-end (e2e) coreference resolution model for Dutch. The code is based on the [original e2e model for English](https://github.com/kentonl/e2e-coref), modified to work for Dutch.
        If you make use of this code, please [cite it](#citing-this-code) and also cite [the original e2e paper](https://arxiv.org/abs/1804.05392).
        
        ## Installation
        Requirements:
        - Python 3.6 or 3.7
        - pip
        
        In this repository, run:
        ```
        pip install -r requirements.txt
        ./scripts/setup_all.sh
        pip install .
        ```
        
        The `setup_all` script downloads the word vector files to the `data` directories. It also builds the application-specific TensorFlow kernels.
        
        ## Quick start - Stanza
        
        e2edutch can be used as part of a Stanza pipeline.
        
        Coreferences are added similarly to Stanza's entities:
         * a ___Document___ has an attribute ___clusters___ that is a List of coreference clusters;
         * a coreference cluster is a List of Stanza ___Spans___.
        
        ```
        import stanza
        import e2edutch.stanza
        
        nlp = stanza.Pipeline(lang='nl', processors='tokenize,coref')
        
        doc = nlp('Dit is een test document. Dit document bevat coreferenties.')
        print([[span.text for span in cluster] for cluster in doc.clusters])
        ```
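        
        Because each cluster is just a list of spans, the result can be post-processed with ordinary Python. The following is a minimal sketch that groups mention texts by cluster index. The helper name `mentions_by_cluster` is hypothetical (it is not part of e2edutch or Stanza), and the stand-in objects are made-up illustration data that mimic only the `.text` attribute used above; they are not model output.
        
        ```python
        from types import SimpleNamespace
        
        
        def mentions_by_cluster(clusters):
            """Map each cluster index to the list of mention texts in that cluster."""
            return {index: [span.text for span in cluster]
                    for index, cluster in enumerate(clusters)}
        
        
        # Stand-in objects that mimic only the `.text` attribute of a Stanza Span.
        # These values are invented for illustration and are not real predictions.
        fake_clusters = [
            [SimpleNamespace(text="een test document"), SimpleNamespace(text="Dit document")],
        ]
        print(mentions_by_cluster(fake_clusters))
        ```
        
        With a real pipeline, `doc.clusters` from the example above would presumably be passed in place of `fake_clusters`.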
        
        
        ## Quick start
        A pretrained model is available to download:
        ```
        python -m e2edutch.download
        ```
        This downloads the model files. The default location is the `data` directory inside the Python package location.
        The location can also be set manually through the environment variable `E2E_HOME` or through the config file (see below).
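        
        As a rough illustration of an environment-variable override with a default, the sketch below returns the value of `E2E_HOME` when it is set and a fallback directory otherwise. The function name `resolve_data_dir` and the example paths are hypothetical and not part of the e2edutch API; the package's actual lookup logic (including the config file) may differ.
        
        ```python
        import os
        from pathlib import Path
        
        
        def resolve_data_dir(default_dir, environ=None):
            """Return E2E_HOME if it is set and non-empty, otherwise default_dir.
        
            Hypothetical helper for illustration only; not part of e2edutch.
            """
            environ = os.environ if environ is None else environ
            override = environ.get("E2E_HOME")
            return Path(override) if override else Path(default_dir)
        
        
        # Example with an explicit mapping instead of the real environment.
        print(resolve_data_dir("package/data", {"E2E_HOME": "/tmp/e2e-data"}))
        ```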
        
        
        
        The pretrained model can be used to predict coreferences on CoNLL-2012 files, jsonlines files, [NAF files](https://github.com/newsreader/NAF) or plain text files (in the latter case, the nltk package is used for tokenization).
        ```
        python -m e2edutch.predict [-h] [-o OUTPUT_FILE] [-f {conll,jsonlines,naf}]
                          [-c WORD_COL] [--cfg_file CFG_FILE] [-v]
                          config input_filename
        
        positional arguments:
          config: name of the model to use for prediction ('final' for the pretrained)
          input_filename
        
        optional arguments:
          -h, --help            show this help message and exit
          -o OUTPUT_FILE, --output_file OUTPUT_FILE
          -f {conll,jsonlines,naf}, --format_out {conll,jsonlines,naf}
          -c WORD_COL, --word_col WORD_COL
          --cfg_file CFG_FILE   config file
          -v, --verbose
        
        
        ```
        User-specific configuration (such as the data directory and data files) can be provided in a separate config file; the defaults are specified in `cfg/defaults.conf`.
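        
        For scripted use, the prediction command can be assembled from Python. This is a minimal sketch that uses only the options shown in the usage text above; the helper name `build_predict_command` and the file names `input.txt` and `output.conll` are placeholders introduced here for illustration.
        
        ```python
        import sys
        
        
        def build_predict_command(input_filename, output_file,
                                  format_out="conll", config="final"):
            """Assemble the argument list for `python -m e2edutch.predict`.
        
            Hypothetical helper; only flags from the usage text are used.
            """
            return [
                sys.executable, "-m", "e2edutch.predict",
                "-o", output_file,
                "-f", format_out,
                config,
                input_filename,
            ]
        
        
        # Placeholder file names; replace them with real paths.
        command = build_predict_command("input.txt", "output.conll")
        print(" ".join(command))
        ```
        
        Passing the list to `subprocess.run(command, check=True)` would execute it, assuming e2edutch and the pretrained model are installed.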
        
        
        ## Train your own model
        To train a new model:
        - Make sure the model config file (default: `e2edutch/cfg/models.conf`) describes the model you wish to train
        - Make sure your config file (default: `e2edutch/cfg/defaults.conf`) includes the data files you want to use for training
        - Run `scripts/setup_train.sh e2edutch/cfg/defaults.conf`. This script converts the conll2012 data to jsonlines files, and caches the word and contextualized embeddings.
        - If you want to enable the use of a GPU, set the environment variable:
        ```bash
        export GPU=0
        ```
        - Run the training script:
        ```bash
        python -m e2edutch.train <model-name>
        ```
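        
        The last two steps can also be driven from Python. The sketch below builds the training command together with an environment in which `GPU` is set, mirroring the `export` step above. The helper name `build_train_invocation` and the model name `my_model` are placeholders, not part of e2edutch.
        
        ```python
        import os
        import sys
        
        
        def build_train_invocation(model_name, gpu_value=None, base_env=None):
            """Return (command, env) for `python -m e2edutch.train <model-name>`.
        
            Hypothetical helper. If gpu_value is given, the GPU environment
            variable is set in the returned environment.
            """
            env = dict(os.environ if base_env is None else base_env)
            if gpu_value is not None:
                env["GPU"] = str(gpu_value)
            command = [sys.executable, "-m", "e2edutch.train", model_name]
            return command, env
        
        
        # "my_model" is a placeholder for a model name defined in the model config file.
        command, env = build_train_invocation("my_model", gpu_value=0)
        print(" ".join(command), "GPU=" + env["GPU"])
        ```
        
        Running `subprocess.run(command, env=env, check=True)` would then start training, assuming the setup steps above have been completed.
        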
        ## Citing this code
        If you use this code in your research, please cite it as follows:
        ```
        @misc{YourReferenceHere,
        author = {
                    Dafne van Kuppevelt and
                    Jisk Attema
                 },
        title  = {e2e-Dutch},
        doi    = {10.5281/zenodo.4146960},
        url    = {https://github.com/Filter-Bubble/e2e-Dutch}
        }
        ```
        As the code is largely based on the [original e2e model for English](https://github.com/kentonl/e2e-coref), please make sure to also cite [the original e2e paper](https://arxiv.org/abs/1804.05392).
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
