Metadata-Version: 2.1
Name: NTAP
Version: 1.0.2
Summary: NTAP - CSSL
Home-page: https://github.com/USC-CSSL/NTAP
Author: Praveen Patil
Author-email: pspatil@usc.edu
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: bleach (==1.5.0)
Requires-Dist: numpy (==1.16.0)
Requires-Dist: tensorflow
Requires-Dist: tensorflow-tensorboard (==0.4.0)
Requires-Dist: Markdown (==3.0.1)
Requires-Dist: html5lib (==0.9999999)
Requires-Dist: nltk (==3.4)
Requires-Dist: pandas (==0.24.2)
Requires-Dist: backports.weakref (==1.0.post1)
Requires-Dist: boto (==2.49.0)
Requires-Dist: boto3 (==1.9.60)
Requires-Dist: botocore (==1.12.60)
Requires-Dist: bz2file (==0.98)
Requires-Dist: certifi (==2018.11.29)
Requires-Dist: chardet (==3.0.4)
Requires-Dist: Cython (==0.29.1)
Requires-Dist: docutils (==0.14)
Requires-Dist: emoji (==0.5.1)
Requires-Dist: emot (==2.0)
Requires-Dist: enum34 (==1.1.6)
Requires-Dist: funcsigs (==1.0.2)
Requires-Dist: future (==0.17.1)
Requires-Dist: gensim (==3.6.0)
Requires-Dist: idna (==2.7)
Requires-Dist: jmespath (==0.9.3)
Requires-Dist: mock (==2.0.0)
Requires-Dist: pbr (==5.1.1)
Requires-Dist: protobuf (==3.6.1)
Requires-Dist: python-dateutil (==2.7.5)
Requires-Dist: pytz (==2018.7)
Requires-Dist: requests (==2.20.1)
Requires-Dist: s3transfer (==0.1.13)
Requires-Dist: scikit-learn (==0.20.1)
Requires-Dist: scipy (==1.1.0)
Requires-Dist: singledispatch (==3.4.0.3)
Requires-Dist: six (==1.11.0)
Requires-Dist: sklearn (==0.0)
Requires-Dist: sklearn-pandas (==1.8.0)
Requires-Dist: smart-open (==1.7.1)
Requires-Dist: urllib3 (==1.24.1)
Requires-Dist: Werkzeug (==0.14.1)
Requires-Dist: stanfordcorenlp
Requires-Dist: progressbar2
Requires-Dist: tensorboard (==1.14.0)

## Neural Text Analysis Pipeline (NTAP)


(Project is currently in pre-release phase)

A Python-based pipeline for applying neural methods to text analysis.

### Overview

This pipeline aims to make advanced text-analysis methods more widely applicable. It uses the Python packages _sklearn_, _gensim_, and _tensorflow_ to implement both established and cutting-edge machine learning methods.

### Installation

1. NTAP requires Python 3.4 to 3.6 ([download 3.6](https://www.python.org/downloads/release/python-367/)).
2. Using a virtual environment to manage Python libraries and dependencies is recommended but not required.
To install virtualenv with pip:
```$ pip install virtualenv```
or
```$ sudo pip install virtualenv```

Set up a virtualenv environment and install packages from `requirements.txt` (or `requirements-gpu.txt`):
```
$ virtualenv myenv -p path/to/python/interpreter
$ source myenv/bin/activate
$ pip install -r requirements.txt
```


### External Data

NTAP uses several external resources, such as word vectors and Stanford's CoreNLP. Download them first, then set the appropriate environment variables (see below).

1. Word2vec [download](https://github.com/mmihaltz/word2vec-GoogleNews-vectors)
Set the environment variable (bash):
    ```
    export WORD2VEC_PATH=path/to/GoogleNews-vectors-negative300.bin.gz
    ```
2. GloVe Vectors [download](https://nlp.stanford.edu/projects/glove/)
    ```
    export GLOVE_PATH=path/to/glovefile.txt
    ```
3. CoreNLP [download](https://stanfordnlp.github.io/CoreNLP/download.html)
    ```
    export CORENLP=path/to/stanford-corenlp-full-YYYY-MM-DD/
    ```
4. Dictionaries
Set up a directory to contain any dictionaries you want to use in NTAP, such as the Moral Foundations Dictionary (MFD) or LIWC categories.
    ```
    export DICTIONARIES=path/to/dictionaries/directory/
    ```
5. Access Keys
To use the entity-tagging and linking system (provided by [Tagme](https://tagme.d4science.org/tagme/)), sign up for the service and set your access key:
    ```
    export TAGME="<my_access_key>"
    ```
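
Before running the pipeline, it can help to confirm that these variables are visible to Python. The sketch below is not part of NTAP: the helper `missing_env_vars` is a hypothetical name, and only the variable names are taken from the list above. It reports which variables are unset or empty, without checking that the paths exist.

```python
import os

# Variable names come from the External Data list above.
# The helper itself is hypothetical and not part of the NTAP package.
NTAP_ENV_VARS = ("WORD2VEC_PATH", "GLOVE_PATH", "CORENLP", "DICTIONARIES", "TAGME")


def missing_env_vars(environ=None):
    """Return the names of NTAP environment variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in NTAP_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All NTAP environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without changing the shell environment.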

### Processing Pipeline

This component takes raw text as input and produces data ready for use by the baseline or other machine learning methods.

* cleaning
Functionality includes text cleaning, 

### Baseline Pipeline

The baseline implements methods which separately generate features for supervised models. 

Implemented feature methods:

* TFIDF
* LDA
* Dictionary (Word Count)
* Distributed Dictionary Representations
* Bag of Means (averaged word embeddings)
	* Word2Vec (skipgram)
	* GloVe (300-d trained on Wikipedia)
	* FastText (currently not supported)
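
To make two of the feature methods above concrete, here is a minimal sketch of dictionary word counts and Bag of Means in plain Python. It is an illustration under stated assumptions, not NTAP's implementation: the function names, the simple regex tokenizer, and the toy dictionary and vectors are all hypothetical. Real dictionaries such as LIWC may use wildcard entries, which this sketch does not handle.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_counts(text, dictionary):
    """Count how many tokens of the text fall into each dictionary category."""
    tokens = Counter(tokenize(text))
    return {
        category: sum(tokens[word] for word in words)
        for category, words in dictionary.items()
    }


def bag_of_means(text, vectors):
    """Average the vectors of in-vocabulary tokens; return None if none are found."""
    found = [vectors[token] for token in tokenize(text) if token in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Toy data for illustration only; real use would load a dictionary file
# and pretrained embeddings such as Word2Vec or GloVe.
toy_dictionary = {"care": {"kind", "protect"}, "harm": {"hurt", "cruel"}}
toy_vectors = {"kind": [1.0, 0.0], "cruel": [0.0, 1.0]}
text = "Be kind, not cruel; kind words protect."

print(dictionary_counts(text, toy_dictionary))
print(bag_of_means(text, toy_vectors))
```

Either output can serve as a fixed-length feature vector for a supervised model: one count per dictionary category, or one averaged value per embedding dimension.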

Two baseline methods are implemented, one for each prediction task (classification or regression):

* SVM classification
* ElasticNet Regression


### Methods Pipeline


### Evaluation and Analysis Pipeline





