Metadata-Version: 2.2
Name: conc
Version: 0.1.0
Summary: A Python library for efficient corpus analysis, enabling corpus linguistic analysis in Jupyter notebooks.
Home-page: https://github.com/polsci/conc
Author: polsci
Author-email: geoffrey.ford@canterbury.ac.nz
License: MIT License
Keywords: corpus corpora nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastcore
Requires-Dist: numpy
Requires-Dist: polars
Requires-Dist: scipy
Requires-Dist: msgspec
Requires-Dist: great_tables
Requires-Dist: spacy
Requires-Dist: python-slugify
Requires-Dist: plotly
Requires-Dist: jupyterlab
Requires-Dist: ipywidgets
Provides-Extra: dev
Requires-Dist: nbdev; extra == "dev"
Requires-Dist: memory_profiler; extra == "dev"
Requires-Dist: line_profiler; extra == "dev"
Requires-Dist: nltk; extra == "dev"
Requires-Dist: datasets; extra == "dev"
Requires-Dist: requests; extra == "dev"
Requires-Dist: jupyterlab-quarto; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Conc


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Introduction to Conc

Conc is a Python library that brings corpus linguistic analysis to
Jupyter notebooks. A staple of data science, Jupyter notebooks are a
great model for presenting analysis that combines code, reporting and
discussion in a way that can be reproduced. Conc aims to allow
researchers to analyse large corpora in efficient ways using standard
hardware, with the ability to produce clear, publication-ready reports
and extend analysis where required using standard Python libraries.

Conc uses [spaCy](https://spacy.io/) for tokenising texts. More spaCy
functionality will be supported in future releases.

### Conc Principles

- use standard Python libraries for data analysis (e.g. NumPy, SciPy,
  JupyterLab)
- use vector operations where possible  
- use fast code libraries over slow code libraries (e.g. Conc uses
  [Polars vs Pandas](https://pola.rs/posts/benchmarks/) - you can still
  output Pandas dataframes if you want to use them)  
- provide important information when reporting results  
- pre-compute time-intensive and repeatedly used views of the data  
- work with smaller slices of the data where possible  
- cache specific analysis during a session to reduce computation for
  repeated calls  
- document corpus representations so that they can be worked with
  directly  
- provide a way to access Conc results for further processing with
  standard Python libraries

## Development Status

Conc is in active development. It is currently
[released](https://pypi.org/project/conc/) for beta testing. The GitHub
repository may be ahead of the PyPI version, so for the latest
functionality install from GitHub (see below). The GitHub code is
pre-release and may change. For the latest release, install from PyPI
(`pip install conc`). The [documentation](https://geoffford.nz/conc/)
reflects the most recent functionality. See the
[CHANGELOG](https://github.com/polsci/conc/blob/main/CHANGELOG.md) for
notes on releases and the Roadmap below for upcoming features.

## Acknowledgements

Conc is developed by [Dr Geoff Ford](https://geoffford.nz/).

Conc originated in my PhD research, which included development of a
web-based corpus browser to handle analysis of large corpora. I’ve been
developing Conc through my subsequent research.

Work to create this Python library has been made possible by
funding/support from:

- “Mapping LAWS: Issue Mapping and Analyzing the Lethal Autonomous
  Weapons Debate” (Royal Society of New Zealand’s Marsden Fund Grant
  19-UOC-068)  
- “Into the Deep: Analysing the Actors and Controversies Driving the
  Adoption of the World’s First Deep Sea Mining Governance” (Royal
  Society of New Zealand’s Marsden Fund Grant 22-UOC-059)
- Sabbatical, University of Canterbury, Semester 1 2025.

Thanks to the Mapping LAWS project team for their support and feedback
as first users of ConText (a web-based application built on an earlier
version of Conc).

Dr Ford is a researcher with [Te Pokapū Aronui ā-Matihiko \| UC Arts
Digital Lab (ADL)](https://artsdigitallab.canterbury.ac.nz/). Thanks to
the ADL team and the ongoing support of the University of Canterbury’s
Faculty of Arts who make work like this possible.

## Installation

### Install via pip

You can install Conc from [PyPI](https://pypi.org/project/conc/) using
this command:

``` sh
$ pip install conc
```

To install the latest development version of Conc, which may be ahead of
the version on PyPI, you can install from the
[repository](https://github.com/polsci/conc):

``` sh
$ pip install git+https://github.com/polsci/conc.git
```

### Install a language model

The first releases of Conc require a spaCy language model for
tokenization. After installing Conc, install a model. Here’s an example
of how to install spaCy’s small English model, which is Conc’s default
language model:

``` sh
$ python -m spacy download en_core_web_sm
```

If you are working with a different language or want to use a different
‘en’ model, check the [spaCy models
documentation](https://spacy.io/models/) for the relevant model name.
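
spaCy models downloaded this way are installed as ordinary Python
packages named after the model, so you can check for one before building
a corpus. This is a minimal sketch using only the standard library: the
helper name `model_is_installed` is hypothetical and is not part of Conc
or spaCy.

```python
from importlib.util import find_spec


def model_is_installed(model_name: str) -> bool:
    """Return True if a top-level package named `model_name` can be found."""
    return find_spec(model_name) is not None


# en_core_web_sm is the default model named above; swap in your own model name.
if not model_is_installed("en_core_web_sm"):
    print("Model not found. Run: python -m spacy download en_core_web_sm")
```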

### Install optional dependencies

Conc has some optional dependencies you can install to download source
texts to create sample corpora. These are primarily intended for
creating corpora for development. To minimize Conc’s requirements these
are not installed by default. If you want to get sample corpora to test
out Conc’s functionality you can install these with the following
command.

``` sh
$ pip install nltk requests datasets
```

### Pre-2013 CPU? Install Polars with support for older machines

Polars is optimized for modern CPUs with support for AVX2 instructions.
If you get kernel crashes running Conc on an older machine (probably
pre-2013), this is likely to be an issue with Polars. Polars has an
[alternate installation option to support older
machines](https://docs.pola.rs/user-guide/installation/), which installs
a Polars build compiled without AVX2 support. Replace the standard
Polars package with the legacy-support package to use Conc on older
machines.

``` sh
$ pip uninstall polars
$ pip install polars-lts-cpu
```

## Using Conc

A good place to start is TODO, which demonstrates how to build a corpus
and output Conc reports.

The [documentation site](https://geoffford.nz/conc/) provides a
reference for Conc functionality and examples of how to create reports
for analysis. The current Conc components are listed below.

| Class / Function | Module | Functionality | Note |
|----|----|----|----|
| [`Corpus`](https://geoffford.nz/conc/corpus.html#corpus) | conc.corpus | Build, load, and get information on a corpus; methods to work with a corpus | Required |
| [`Conc`](https://geoffford.nz/conc/conc.html#conc) | conc.conc | Interface to Conc reports for corpus analysis | Recommended way to access reports for analysis; requires a corpus created by the Corpus module |
| [`Text`](https://geoffford.nz/conc/text.html#text) | conc.text | Output text from the corpus | Access via Corpus |
| [`Frequency`](https://geoffford.nz/conc/frequency.html#frequency) | conc.frequency | Frequency reporting | Access via Conc |
| [`Ngrams`](https://geoffford.nz/conc/ngrams.html#ngrams) | conc.ngrams | Reporting on `ngram_frequencies` across corpus and `ngrams` containing specific tokens | Access via Conc |
| [`Concordance`](https://geoffford.nz/conc/concordance.html#concordance) | conc.concordance | Concordancing | Access via Conc |
| [`Keyness`](https://geoffford.nz/conc/keyness.html#keyness) | conc.keyness | Reporting for keyness analysis | Access via Conc |
| [`Collocates`](https://geoffford.nz/conc/collocates.html#collocates) | conc.collocates | Reporting for collocation analysis | Access via Conc |
| [`Result`](https://geoffford.nz/conc/result.html#result) | conc.result | Handles report results, output result as table or get dataframe | Used by all reports |
| [`ConcLogger`](https://geoffford.nz/conc/core.html#conclogger) | conc.core | Logger | Logging implemented in all modules |
| [`CorpusMetadata`](https://geoffford.nz/conc/core.html#corpusmetadata) | conc.core | Class to validate Corpus Metadata JSON | Used by Corpus class |

The conc.core module implements a number of helpful functions …

| Function | Functionality |
|----|----|
| [`list_corpora`](https://geoffford.nz/conc/core.html#list_corpora) | Scan a directory for corpora and return a summary |
| [`get_stop_words`](https://geoffford.nz/conc/core.html#get_stop_words) | Get a spaCy stop word list for a specific model |
| Various - see `Get data sources` | Functions to download source texts to create sample corpora. Primarily intended for development/testing. To minimize requirements not all libraries are installed by default. Functions will raise errors with information on installing required libraries. |

## Roadmap

### Short-term

- [ ] add tutorial / getting started notebook
- [ ] add citation information
- [ ] extend caching support to all intensive reports, revise storage of
  cached results for in-memory/disk option
- [ ] relegate some logger warnings to debug level and audit logger
  messages for consistency and clarity for users
- [ ] add support for build from datasets library
- [ ] anatomy - explain token2doc_index -1 and has_spaces on tokens
  display and various other fields for vocab.
- [ ] Corpus tokenize support for functionality from earlier versions of
  Conc for wildcards, multiple strings, case insensitive tokenization
- [ ] ngrams method - implement case handling
- [ ] get_ngrams_by_index - implement case handling
- [ ] improve concordance ordering so not fixed options e.g. include
  3R1R2R
- [ ] improve ngram support for ngram token position beyond LEFT/RIGHT
  (i.e. define positions relative to ngram, or ANY)
- [ ] concordancing - add in ordering by metadata columns or doc
- [ ] annotations support for spaCy POS, TAG, SENT_START, LEMMA
- [ ] move tokens sort order to build process - takes \> 1 second for
  large corpora, but not needed for all results
- [ ] shift more processing from in-memory to polars with support for
  streaming or in-memory processing
- [ ] revisit polars streaming - potentially implement a batched write
  for very large files i.e. splitting vocab/tokens files into smaller
  chunks to reduce memory usage.

### Medium-term

- [ ] Support for processing backends other than spaCy (e.g. other
  tokenizers)

## Developer Guide

The instructions below are only relevant if you want to contribute to
Conc. The [nbdev](https://nbdev.fast.ai/) library is being used for
development. If you are new to using nbdev, here are some useful
pointers to get you started (or visit the [nbdev
website](https://nbdev.fast.ai/)).

### Install Conc in development mode

``` sh
# make sure conc package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to conc
$ nbdev_prepare
```
