Metadata-Version: 2.2
Name: textplumber
Version: 0.0.9
Summary: Pipeline components for scikit-learn to extract relevant features from text data, including tokens, parts of speech, lexicon scores, document-level statistics and embeddings.
Home-page: https://github.com/polsci/textplumber
Author: Geoff Ford
Author-email: geoffrey.ford@canterbury.ac.nz
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy
Requires-Dist: model2vec
Requires-Dist: scikit-learn
Requires-Dist: lxml
Requires-Dist: textstat
Requires-Dist: pandas
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: nltk
Requires-Dist: imbalanced-learn
Requires-Dist: supertree
Requires-Dist: datasets
Requires-Dist: fastcore
Requires-Dist: vaderSentiment
Provides-Extra: dev
Requires-Dist: jupyterlab; extra == "dev"
Requires-Dist: nbdev; extra == "dev"
Requires-Dist: jupyterlab-quarto; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Textplumber



![](https://geoffford.nz/assets/images/opt/textplumber-image-16x9.jpg)

## Introduction to Textplumber

The Textplumber library is intended to make it easier to build text
classification pipelines with scikit-learn. Scikit-learn provides a
powerful suite of tools for machine learning, including built-in
[support for
text](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
Textplumber adds to scikit-learn’s functionality, leveraging libraries
like [spaCy](https://spacy.io/) and new feature extraction techniques
like [Model2Vec](https://github.com/MinishLab/model2vec), and provides
easy access to a range of text feature types.

## Development status

Textplumber is in active development and is currently
[released](https://pypi.org/project/textplumber/) for beta testing. For
the latest release, install from PyPI (`pip install textplumber`). The
GitHub repository may be ahead of the PyPI version, so for the latest
functionality install from GitHub (see below). The GitHub code is
pre-release and may change. The
[documentation](https://geoffford.nz/textplumber/) reflects the most
recent functionality. See the
[CHANGELOG](https://github.com/polsci/textplumber/blob/main/CHANGELOG.md)
for notes on releases.

## Development Team

The developers of Textplumber are:

- [Dr Geoff Ford](https://geoffford.nz/), Senior Lecturer, Faculty of
  Arts, University of Canterbury  
- [Dr Christopher
  Thomson](https://profiles.canterbury.ac.nz/Christopher-Thomson),
  Senior Lecturer in English and Digital Humanities, University of
  Canterbury  
- [Karin
  Stahel](https://www.canterbury.ac.nz/about-uc/contact-us/postgrad-directory/karin-stahel),
  PhD Candidate, Data Science, University of Canterbury

Dr Geoff Ford is leading development of Textplumber and is the main
contributor to date.

Some Textplumber functionality has been created through collaborations
of team members to develop teaching resources for DIGI405, *Text,
Discourses and Data*, a course offered through the Digital Humanities
and Master of Applied Data Science programmes at the University of
Canterbury. The entire team are contributing to testing and will
contribute to the development of Textplumber documentation.

## Acknowledgements

Dr Ford’s work on Textplumber has been made possible by funding from the
Royal Society of New Zealand’s Marsden Fund, Grant 22-UOC-059 “Into the
Deep: Analysing the Actors and Controversies Driving the Adoption of the
World’s First Deep Sea Mining Governance”. Textplumber is an output of
that project.

The developers of Textplumber are researchers with [Te Pokapū Aronui
ā-Matihiko \| UC Arts Digital Lab
(ADL)](https://artsdigitallab.canterbury.ac.nz/). Thanks to the ADL team
and the ongoing support of the University of Canterbury’s Faculty of
Arts who make work like this possible.

## Installation

### Install via pip

You can install Textplumber from
[PyPI](https://pypi.org/project/textplumber/) using this command:

``` sh
$ pip install textplumber
```

To install the latest development version of Textplumber, which may be
ahead of the version on PyPI, you can install from the
[repository](https://github.com/polsci/textplumber):

``` sh
$ pip install git+https://github.com/polsci/textplumber.git
```
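
Because the PyPI release and the repository version can differ, it can help to confirm which version ended up in your environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and is not part of Textplumber.

``` python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "textplumber") -> Optional[str]:
    """Return the installed version string for `package`, or None if absent.

    `installed_version` is an illustrative helper, not a Textplumber function.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "textplumber is not installed")
```

Comparing the printed value with the CHANGELOG is one way to see whether you are running a PyPI release or a newer development version.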

### Install a language model

Many of Textplumber’s pipeline components require a spaCy language
model. After installing Textplumber, install a model. Here’s an example
of how to install spaCy’s small English model:

``` sh
python -m spacy download en_core_web_sm
```

If you are working with a different language or want to use a different
‘en’ model, check the [spaCy models
documentation](https://spacy.io/models/) for the relevant model name.
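
spaCy's download command installs each model as an ordinary Python package named after the model, so you can check that the download succeeded without loading the model. This is a minimal sketch using the standard library only; `spacy_model_installed` is a hypothetical helper name, and the default `en_core_web_sm` is simply the model from the example above.

``` python
from importlib.util import find_spec


def spacy_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the model's Python package can be found.

    `spacy_model_installed` is an illustrative helper, not part of
    spaCy or Textplumber. It only checks that the package is
    discoverable; it does not load or validate the model.
    """
    return find_spec(model_name) is not None


if __name__ == "__main__":
    if spacy_model_installed():
        print("en_core_web_sm is installed")
    else:
        print("Model not found; run: python -m spacy download en_core_web_sm")
```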

## Using Textplumber

A good place to start is the [quick
introduction](basic-introduction.ipynb) and an [example
notebook](example.ipynb), which allows you to use Textplumber with
different datasets and different kinds of text classification problems.

The [documentation site](https://geoffford.nz/textplumber/) provides a
reference for Textplumber functionality and examples of how to use the
various components. The current Textplumber components are listed below.

| Component | Functionality | Requires Preprocessor |
|----|----|----|
| [`TextCleaner`](https://geoffford.nz/textplumber/clean.html#textcleaner) | Cleans text data | \- |
| [`SpacyPreprocessor`](https://geoffford.nz/textplumber/preprocess.html#spacypreprocessor) | **Preprocessor**, uses [spaCy](https://spacy.io) | \- |
| [`NLTKPreprocessor`](https://geoffford.nz/textplumber/preprocess.html#nltkpreprocessor) | **Preprocessor**, uses [NLTK](https://www.nltk.org/) | \- |
| [`TokensVectorizer`](https://geoffford.nz/textplumber/tokens.html#tokensvectorizer) | Extract individual tokens or token ngram features | Yes |
| [`POSVectorizer`](https://geoffford.nz/textplumber/pos.html#posvectorizer) | Extract individual part of speech or POS ngram features | Yes |
| [`TextstatsTransformer`](https://geoffford.nz/textplumber/textstats.html#textstatstransformer) | Extract document-level statistics | Yes |
| [`LexiconCountVectorizer`](https://geoffford.nz/textplumber/lexicons.html#lexiconcountvectorizer) | Extract features based on lexicons (i.e. counts of lists of words) | Yes |
| [`VaderSentimentExtractor`](https://geoffford.nz/textplumber/vader.html#vadersentimentextractor) | Extract sentiment features using [VADER](https://github.com/cjhutto/vaderSentiment) | \- |
| [`VaderSentimentEstimator`](https://geoffford.nz/textplumber/vader.html#vadersentimentestimator) | Predict sentiment using [VADER](https://github.com/cjhutto/vaderSentiment) | \- |
| [`Model2VecEmbedder`](https://geoffford.nz/textplumber/embeddings.html#model2vecembedder) | Extract embeddings using [Model2Vec](https://github.com/MinishLab/model2vec) | \- |
| [`CharNgramVectorizer`](https://geoffford.nz/textplumber/chars.html#charngramvectorizer) | Extract character ngrams | \- |
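
The "Requires Preprocessor" column can be read as an ordering constraint on a pipeline. The sketch below assumes that a component marked "Yes" needs one of the preprocessors to appear earlier in the pipeline; that reading is an assumption, not something the table states, so check the documentation for the actual requirement. The component names come from the table, while the helper `missing_preprocessor` is hypothetical and works on a flat list of step names rather than on real pipeline objects.

``` python
# Component names are taken from the table above.
# The ordering rule itself is an assumption made for illustration.
PREPROCESSORS = {"SpacyPreprocessor", "NLTKPreprocessor"}
NEEDS_PREPROCESSOR = {
    "TokensVectorizer",
    "POSVectorizer",
    "TextstatsTransformer",
    "LexiconCountVectorizer",
}


def missing_preprocessor(step_names):
    """Return components that appear before any preprocessor step."""
    seen_preprocessor = False
    problems = []
    for name in step_names:
        if name in PREPROCESSORS:
            seen_preprocessor = True
        elif name in NEEDS_PREPROCESSOR and not seen_preprocessor:
            problems.append(name)
    return problems


if __name__ == "__main__":
    planned = ["TextCleaner", "TokensVectorizer", "SpacyPreprocessor"]
    print(missing_preprocessor(planned))
```

A check like this is only a planning aid for a linear sequence of steps; it does not account for nested structures such as feature unions.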

Here are some helpful functions for working with text pipelines:

| Function | Functionality |
|----|----|
| [`preview_dataset`](https://geoffford.nz/textplumber/report.html#preview_dataset) | Output information about a Hugging Face dataset |
| [`plot_confusion_matrix`](https://geoffford.nz/textplumber/report.html#plot_confusion_matrix) | SVG confusion matrix with counts and row-wise proportions and appropriate labels |
| [`plot_logistic_regression_features_from_pipeline`](https://geoffford.nz/textplumber/report.html#plot_logistic_regression_features_from_pipeline) | Plot the most discriminative features for a logistic regression classifier |
| [`plot_decision_tree_from_pipeline`](https://geoffford.nz/textplumber/report.html#plot_decision_tree_from_pipeline) | Plot the decision tree of the classifier from a pipeline using [SuperTree](https://github.com/mljar/supertree) |
| [`preview_pipeline_features`](https://geoffford.nz/textplumber/report.html#preview_pipeline_features) | Output the features at each step in a pipeline |

## Developer Guide

The instructions below are only relevant if you want to contribute to
Textplumber. The [nbdev](https://nbdev.fast.ai/) library is being used
for development. If you are new to using nbdev, here are some useful
pointers to get you started (or visit the [nbdev
website](https://nbdev.fast.ai/)).

### Install Textplumber in development mode

``` sh
# make sure textplumber package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to textplumber
$ nbdev_prepare
```
