Metadata-Version: 2.1
Name: medtop
Version: 0.0.2
Summary: https://cctrbic.github.io/medtop/
Home-page: https://github.com/etfrenchvcu/medtop
Author: Amy Olex
Author-email: etfrench@vcu.edu
License: Apache Software License 2.0
Keywords: nlp
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: gensim
Requires-Dist: matplotlib
Requires-Dist: nltk
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: plotly
Requires-Dist: scipy
Requires-Dist: sklearn
Requires-Dist: textblob
Requires-Dist: umap-learn

![CI](https://github.com/AmyOlex/medtop/workflows/CI/badge.svg) 

Documentation is available at https://amyolex.github.io/medtop/.

# MedTop
> Extracting topics from reflective medical writings.


## Requirements
`pip install medtop`

`python -m nltk.downloader all`

## How to use

A template pipeline is provided below using a test dataset. You can read more about the test_data dataset [here](https://github.com/cctrbic/medtop/blob/master/test_data/README.md)

Each step of the pipeline has configuration options for experimenting with various methods. These are detailed in the documentation for each method. Notably, the `import_docs`, `get_cluster_topics`, `visualize_clustering`, and `evaluate` methods all include the option to save results to a file.

## Example Pipeline
### Import data
Import and pre-process documents from a text file containing a list of all documents.

```python
from medtop.core import *
data, doc_df = import_docs('test_data/corpus_file_list.txt', save_results = True)
```

    Results saved to output/DocumentSentenceList.txt


### Transform data
Create word vectors from the most expressive phrase in each sentence of the imported documents.

NOTE: If `doc_df` is NOT passed to `create_tfidf`, you must set `include_input_in_tfidf=False` in `get_phrases`.

```python
tfidf, dictionary = create_tfidf(doc_df, 'test_data/seed_topics_file_list.txt')
data = get_phrases(data, dictionary.token2id, tfidf, include_input_in_tfidf = True)
data = get_vectors("tfidf", data, dictionary = dictionary, tfidf = tfidf)
```

    Removed 43 sentences without phrases.


### Cluster data
Cluster the sentences into groups expressing similar ideas or topics. If you aren't sure how many true clusters exist in the data, try running `assign_clusters` with the optional parameter `show_chart = True` to visual cluster quality with varying numbers of clusters. When using `method='hac'`, you can also use `show_dendrogram = True` see the cluster dendrogram.

```python
data = assign_clusters(data, method = "kmeans", k=4)
cluster_df = get_cluster_topics(data, doc_df, save_results = True)
visualize_clustering(data, method = "svd", show_chart = False)
```

    Results saved to output/TopicClusterResults.txt


### Evaluate results

```python
gold_file = "test_data/gold.txt"
results_df = evaluate(data, gold_file="test_data/gold.txt", save_results = False)
```


