Metadata-Version: 2.4
Name: poisson-topicmodels
Version: 0.1.1
Summary: Poisson topic modeling with Bayesian inference using JAX and NumPyro
License: MIT
License-File: LICENSE
Keywords: topic-modeling,bayesian-inference,jax,numpyro,probabilistic
Author: Bernd Prostmaier
Author-email: b.prostmaier@icloud.com
Requires-Python: >=3.11,<3.14
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Provides-Extra: dev
Provides-Extra: docs
Requires-Dist: black (>=23.0) ; extra == "dev"
Requires-Dist: flake8 (>=6.0) ; extra == "dev"
Requires-Dist: flax (>=0.12.0)
Requires-Dist: gensim (>=4.3.0,<5.0.0)
Requires-Dist: isort (>=5.12) ; extra == "dev"
Requires-Dist: jax (==0.8.0)
Requires-Dist: jaxlib (==0.8.0)
Requires-Dist: matplotlib (>=3.10.0,<4.0.0)
Requires-Dist: mypy (>=1.0) ; extra == "dev"
Requires-Dist: myst-parser (>=1.0) ; extra == "docs"
Requires-Dist: numpy (>=2.2.0,<3.0.0)
Requires-Dist: numpyro (==0.19.0)
Requires-Dist: optax (==0.2.6)
Requires-Dist: pandas (>=2.2.0,<3.0.0)
Requires-Dist: pre-commit (>=4.5.0,<5.0.0) ; extra == "dev"
Requires-Dist: pylint (>=2.17) ; extra == "dev"
Requires-Dist: pytest (>=9.0.1,<10.0.0) ; extra == "dev"
Requires-Dist: pytest-cov (>=4.0) ; extra == "dev"
Requires-Dist: scikit-learn (>=1.6.0,<2.0.0)
Requires-Dist: scipy (>=1.15.0,<2.0.0)
Requires-Dist: seaborn (>=0.13.2,<0.14.0)
Requires-Dist: sphinx (>=6.0) ; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints (>=1.22) ; extra == "docs"
Requires-Dist: sphinx-rtd-theme (>=1.2) ; extra == "docs"
Requires-Dist: tqdm (>=4.66.0,<5.0.0)
Requires-Dist: wordcloud (>=1.9.0,<2.0.0)
Project-URL: Documentation, https://topicmodels.readthedocs.io
Project-URL: Repository, https://github.com/BPro2410/topicmodels_package
Project-URL: issues, https://github.com/BPro2410/topicmodels_package/issues
Description-Content-Type: text/markdown

<div align="center">
  <img src="https://raw.githubusercontent.com/BPro2410/poisson_topicmodels/5b9c5b887fc2c61063223e5af35aea85e0525f40/data/figures/logo.svg" alt="poisson-topicmodels" width="400" style="margin-bottom: 20px;"/>
</div>


# poisson-topicmodels: Probabilistic Topic Modeling with Bayesian Inference

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://img.shields.io/pypi/v/poisson-topicmodels.svg)](https://pypi.org/project/poisson-topicmodels/)
[![codecov](https://codecov.io/gh/BPro2410/topicmodels_package/branch/main/graph/badge.svg)](https://app.codecov.io/github/bpro2410/poisson_topicmodels)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

**poisson-topicmodels** is a modern Python package for probabilistic topic modeling using Bayesian inference, built on [JAX](https://github.com/google/jax) and [NumPyro](https://github.com/pyro-ppl/numpyro).

## Package documentation

Full package documentation is available on [Read the Docs](https://poisson-topicmodels.readthedocs.io/en/latest/).

## Statement of Need

Traditional topic modeling packages (e.g., Gensim, scikit-learn's LDA) use older inference methods and lack flexibility for emerging research needs. **poisson-topicmodels** addresses key gaps:

1. **Modern Probabilistic Inference**: Built on NumPyro, enabling automatic differentiation, probabilistic programming, and integration with cutting-edge Bayesian methods.

2. **Advanced Topic Models**: Goes beyond LDA with guided topic discovery (keyword priors), covariate effects, ideal point estimation, and embeddings—all with principled Bayesian inference.

3. **GPU Acceleration**: Leverages JAX for transparent GPU computation, essential for large-scale corpus analysis and enabling research that would be prohibitively slow on CPU.

4. **Scalability & Reproducibility**: Optimized for mini-batch SVI training with built-in seed control for exact reproducibility—critical for research validation and publication.

5. **Research-Friendly API**: Purpose-built for computational social science and NLP researchers who need interpretable, flexible models beyond black-box approaches.

Whether analyzing legislative text, social media discourse, or scientific abstracts, **poisson-topicmodels** enables researchers to extract interpretable semantic structure with confidence in results.

## Features

**poisson-topicmodels** provides multiple topic modeling approaches:

| Model | Use Case | Key Feature |
|-------|----------|------------|
| **Poisson Factorization (PF)** | Unsupervised baseline | Fast, interpretable word-topic associations |
| **Seeded PF (SPF)** | Guided discovery | Incorporate domain knowledge via keyword priors |
| **Covariate PF (CPF)** | Covariate effects | Model topics influenced by document metadata |
| **Covariate Seeded PF (CSPF)** | Guided + covariates | Combine keyword guidance with external factors |
| **Text-Based Ideal Points (TBIP)** | Ideal point estimation | Estimate author positions from legislative/social text |
| **Embedded Topic Models (ETM)** | Modern embeddings | Integrate pre-trained word embeddings |

**Core Capabilities**:
- ✨ Stochastic Variational Inference (SVI) with mini-batch training
- ✨ Transparent GPU acceleration via JAX
- ✨ Reproducible results with seed control
- ✨ Type hints and comprehensive API documentation
- ✨ >70% test coverage with continuous integration
- ✨ Clear error messages and input validation

## Quick Start

Get started in 5 minutes:

```python
import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF

# Prepare data: document-term matrix and vocabulary
counts = csr_matrix(np.random.poisson(2, (100, 500)).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(500)])

# Initialize and train model
model = PF(counts, vocab, num_topics=10, batch_size=32)
params = model.train_step(num_steps=100, lr=0.01, random_seed=42)

# Extract results
topics, _ = model.return_topics()
top_words = model.return_top_words_per_topic(n=10)
print(f"Shape of returned topics: {topics.shape}")
print(f"Top words: {top_words[:3]}")
```

See the `examples/` directory for detailed notebooks.

## Installation

### From PyPI (recommended)
```bash
pip install poisson-topicmodels
```

### From Source
```bash
git clone https://github.com/BPro2410/topicmodels_package.git
cd topicmodels_package
pip install -e .
```

### Development Setup
```bash
git clone https://github.com/BPro2410/topicmodels_package.git
cd topicmodels_package
pip install -e ".[dev]"
pytest tests/  # Verify installation
```

## Requirements

- Python ≥ 3.11, < 3.14
- JAX 0.8.0 and jaxlib 0.8.0 (pinned; GPU support is optional)
- NumPyro 0.19.0 (pinned)
- NumPy, SciPy, scikit-learn, pandas

See `pyproject.toml` for the complete dependency list.

## Documentation

- **[API Reference](https://topicmodels.readthedocs.io)** – Complete model and method documentation
- **[User Guide](docs/intro/user_guide.rst)** – Detailed tutorials and workflows
- **[Examples](examples/)** – Jupyter notebooks demonstrating all features
- **[Contributing](CONTRIBUTING.md)** – How to contribute improvements

## Basic Usage Examples

### 1. Unsupervised Topic Discovery (PF)

```python
from poisson_topicmodels import PF

model = PF(counts, vocab, num_topics=10, batch_size=64)
model.train_step(num_steps=500, lr=0.001, random_seed=42)

# Extract topics
topics, topic_probs = model.return_topics()
top_words = model.return_top_words_per_topic(n=15)
```
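
For intuition about what PF fits, the sketch below simulates counts from a standard Poisson factorization: each count is Poisson-distributed with a rate given by the product of document-topic and topic-word intensities. The Gamma prior parameters (shape 0.3, rate 0.3) are an illustrative assumption, not necessarily the priors used by the package.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_words, num_topics = 100, 500, 10

# Non-negative document-topic and topic-word intensities.
# Gamma(shape=0.3, rate=0.3) is assumed here purely for illustration.
theta = rng.gamma(shape=0.3, scale=1 / 0.3, size=(num_docs, num_topics))
beta = rng.gamma(shape=0.3, scale=1 / 0.3, size=(num_topics, num_words))

# Poisson rate for every (document, word) pair, then sampled counts
rate = theta @ beta
simulated_counts = rng.poisson(rate)

print(simulated_counts.shape)
```

Fitting the model reverses this process: given the observed counts, inference approximates the posterior over `theta` and `beta`.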

### 2. Guided Topic Modeling with Keywords (SPF)

```python
from poisson_topicmodels import SPF

keywords = {
    0: ['climate', 'environment', 'carbon'],
    1: ['economy', 'growth', 'trade'],
}

model = SPF(counts, vocab, keywords, residual_topics=5, batch_size=64)
model.train_step(num_steps=500, lr=0.001, random_seed=42)
```
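
Keyword priors can only help if the keywords actually occur in the vocabulary. How SPF treats out-of-vocabulary keywords is not stated here, so a quick check before training can be useful. The helper `find_missing_keywords` below is hypothetical and not part of the package.

```python
import numpy as np

def find_missing_keywords(keywords, vocab):
    """Return {topic: [keywords not found in vocab]} for topics with gaps."""
    vocab_set = set(map(str, vocab))
    missing = {}
    for topic, words in keywords.items():
        absent = [w for w in words if w not in vocab_set]
        if absent:
            missing[topic] = absent
    return missing

# Toy vocabulary for illustration
vocab = np.array(["climate", "carbon", "economy", "trade", "growth"])
keywords = {
    0: ["climate", "environment", "carbon"],
    1: ["economy", "growth", "trade"],
}

missing = find_missing_keywords(keywords, vocab)
print(missing)  # {0: ['environment']}
```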

### 3. Covariate Effects (CPF)

```python
import numpy as np
from poisson_topicmodels import CPF

# Include document-level covariates
covariates = np.random.randn(100, 3)  # 100 documents, 3 covariates

model = CPF(counts, vocab, covariates, num_topics=10, batch_size=64)
model.train_step(num_steps=500, lr=0.001, random_seed=42)
```
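
Real covariates usually start as a metadata table rather than a random array. The sketch below builds a numeric matrix with one row per document from a pandas DataFrame. It assumes, based on the example above, that CPF accepts a 2D numeric array; the column names and the choice to standardize and one-hot encode are illustrative, not documented requirements.

```python
import numpy as np
import pandas as pd

# Hypothetical document-level metadata, one row per document
metadata = pd.DataFrame({
    "rating": [5, 1, 4, 2],
    "category": ["books", "toys", "books", "garden"],
})

# Standardize the numeric covariate
rating = (metadata["rating"] - metadata["rating"].mean()) / metadata["rating"].std()

# One-hot encode the categorical covariate; drop the first level as baseline
dummies = pd.get_dummies(
    metadata["category"], prefix="category", drop_first=True, dtype=float
)

design = pd.concat([rating, dummies], axis=1)
covariates = design.to_numpy(dtype=np.float32)

print(list(design.columns), covariates.shape)
```

The rows of `covariates` must be in the same order as the rows of `counts`.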


## Custom Model Extension

Thanks to its modular structure, **poisson-topicmodels** makes it straightforward to implement custom models. A short example:

```python
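import numpyro
import numpyro.distributions as dist
from numpyro import plate, sample
from numpyro.distributions import constraints
from poisson_topicmodels import NumpyroModel


class MyModel(NumpyroModel):
    def _model(self, Y_batch, d_batch):
        # Generative model: one latent mean per observation in the batch
        with plate("n", len(Y_batch)):
            mu = sample("mu", dist.Normal(0, 1))
            sample("obs", dist.Normal(mu, 1), obs=Y_batch)

    def _guide(self, Y_batch, d_batch):
        # Variational family: Normal with learnable location and scale
        mu_loc = numpyro.param("mu_loc", 0.0)
        # Constrain the scale so it stays positive during optimization
        mu_scale = numpyro.param(
            "mu_scale", 1.0, constraint=constraints.positive
        )
        with plate("n", len(Y_batch)):
            sample("mu", dist.Normal(mu_loc, mu_scale))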

To implement a custom model, you only need to define the model and its variational guide (`_model` and `_guide`). The backbone of **poisson-topicmodels** handles training and inference.

<div align="center">
<img src="data/figures/architecture5.svg" width="50%">
</div>


## Example Data

The repository includes `data/10k_amazon.csv` with ~10,000 Amazon product reviews for quick experimentation. See `examples/01_getting_started.ipynb` for a complete walkthrough.

## Docker Setup (Optional)

For a reproducible, isolated environment with JupyterLab:

```bash
# Build image
docker build -t topicmodels-jupyter .

# Run container (Linux/macOS)
docker run --rm -p 8888:8888 -v "$(pwd)":/workspace topicmodels-jupyter

# Then open http://localhost:8888 in your browser
```

## Citation

If you use **poisson-topicmodels** in your research, please cite:

```bibtex
@software{topicmodels2026,
  title = {Poisson-topicmodels: Probabilistic Topic Modeling with Bayesian Inference},
  author = {Prostmaier, Bernd and Grün, Bettina and Hofmarcher, Paul},
  year = {2026},
  url = {https://github.com/BPro2410/topicmodels_package},
}
```

See `CITATION.cff` for additional citation formats.

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:
- Reporting bugs
- Submitting pull requests
- Code style and testing requirements
- Documentation standards

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

## Support

- **Issues & Bug Reports**: [GitHub Issues](https://github.com/BPro2410/topicmodels_package/issues)
- **Discussions**: [GitHub Discussions](https://github.com/BPro2410/topicmodels_package/discussions)
- **Documentation**: [ReadTheDocs](https://topicmodels.readthedocs.io)

---

**Built with ❤️ for researchers and practitioners in computational social science and NLP**

