Metadata-Version: 2.4
Name: ppmi-calc
Version: 0.1.1
Summary: A Python library for calculating Pointwise Mutual Information (PMI)
Home-page: https://github.com/guls-1/ppmi
Author: Gulshat Kossymova
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.19.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# PPMI - Pointwise Mutual Information Library

A Python library for calculating Pointwise Mutual Information (PMI), Positive PMI (PPMI), and Normalized PMI (NPMI) between events or words.

## Overview

Pointwise Mutual Information (PMI) is a measure of association from information theory and statistics. It quantifies how much more often two events (such as words) co-occur than would be expected if they were independent.

**Intuition**: PMI asks "how much more do these two words appear together in our data than we would expect if they were unrelated?" A high PMI means the words appear together much more often than random chance would predict, indicating a strong association.

**Common Use**: PMI is particularly useful for building **term-term matrices** (word-word co-occurrence matrices), where vector dimensions correspond to words rather than documents. This makes it valuable for discovering semantic relationships and word associations.

### Formula

PMI(x, y) = log(P(x,y) / (P(x) × P(y)))

Where:
- P(x,y) is the joint probability of x and y
- P(x) is the marginal probability of x
- P(y) is the marginal probability of y
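
As a worked instance of the formula, the following standalone sketch evaluates PMI in base 2 using only the standard library. The helper name `pmi_from_probs` is hypothetical and is not part of this package's API.

```python
import math

def pmi_from_probs(p_xy: float, p_x: float, p_y: float, base: float = 2) -> float:
    """Illustrative helper (not the library API): log of observed / expected."""
    if p_xy == 0:
        # The pair never co-occurs, so the log ratio diverges to -infinity.
        return float("-inf")
    return math.log(p_xy / (p_x * p_y), base)

# Joint probability equals the product of the marginals: no association.
independent = pmi_from_probs(0.12, 0.3, 0.4)

# Joint probability is twice the product of the marginals: one bit of association.
associated = pmi_from_probs(0.24, 0.3, 0.4)

print(f"independent: {independent:.3f}, associated: {associated:.3f}")
```

When the joint probability matches what independence predicts, the ratio is 1 and PMI is 0. Doubling the joint probability adds one bit in base 2.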

## Features

- **PMI Calculation**: Standard pointwise mutual information
- **PPMI**: Positive PMI (max(0, PMI)), which clips negative associations to zero
- **Weighted PPMI**: Context distribution smoothing with alpha parameter (recommended: α=0.75)
- **NPMI**: Normalized PMI ranging from -1 (never co-occur) through 0 (independent) to 1 (always co-occur)
- **Flexible Input**: Calculate from probabilities, counts, or observations
- **Batch Processing**: Add multiple observations at once
- **Matrix Export**: Convert results to numpy matrices
- **Multiple Logarithm Bases**: Support for base 2, 10, e, or any custom base

## Installation

### From PyPI

```bash
pip install ppmi-calc
```

### From source

```bash
git clone https://github.com/guls-1/ppmi.git
cd ppmi
pip install -e .
```

### For development

```bash
pip install -e ".[dev]"
```

## Quick Start

### Using the PMI Class

```python
from ppmi import PMI

# Create a PMI calculator
pmi_calc = PMI()

# Add observations
pmi_calc.add_observation("word1", "word2")
pmi_calc.add_observation("word1", "word3")
pmi_calc.add_observation("word2", "word3", count=5)

# Calculate PMI for a specific pair
pmi_value = pmi_calc.calculate_pmi("word1", "word2")
print(f"PMI(word1, word2) = {pmi_value}")

# Calculate PPMI
ppmi_value = pmi_calc.calculate_ppmi("word1", "word2")
print(f"PPMI(word1, word2) = {ppmi_value}")

# Get all PMI scores
all_pmi = pmi_calc.get_all_pmi()
for (x, y), score in all_pmi.items():
    print(f"PMI({x}, {y}) = {score}")
```

### Using Standalone Functions

```python
from ppmi import calculate_pmi_from_probabilities, calculate_ppmi_from_probabilities

# Calculate from probabilities
pmi = calculate_pmi_from_probabilities(p_xy=0.1, p_x=0.3, p_y=0.4, base=2)
ppmi = calculate_ppmi_from_probabilities(p_xy=0.1, p_x=0.3, p_y=0.4, base=2)

print(f"PMI = {pmi}")
print(f"PPMI = {ppmi}")
```

### Calculate from Counts

```python
from ppmi.pmi import calculate_pmi_from_counts

# Raw counts
co_occurrence_count = 10  # times x and y appeared together
x_count = 50              # total occurrences of x
y_count = 40              # total occurrences of y
total_observations = 1000

pmi = calculate_pmi_from_counts(
    co_count=co_occurrence_count,
    x_count=x_count,
    y_count=y_count,
    total=total_observations
)
print(f"PMI = {pmi}")
```
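
The arithmetic behind a count-based calculation can be sketched as follows, assuming each probability is estimated as a count divided by the total number of observations. This is a standalone illustration with a hypothetical helper name, not the package's implementation.

```python
import math

def pmi_from_counts_sketch(co_count, x_count, y_count, total, base=2):
    """Hypothetical helper: estimate probabilities from counts, then apply the PMI formula."""
    p_xy = co_count / total   # joint probability estimate
    p_x = x_count / total     # marginal probability of x
    p_y = y_count / total     # marginal probability of y
    return math.log(p_xy / (p_x * p_y), base)

# Same numbers as the example above:
# p_xy = 0.01, p_x = 0.05, p_y = 0.04, so the ratio is 0.01 / 0.002 = 5.
value = pmi_from_counts_sketch(co_count=10, x_count=50, y_count=40, total=1000)
print(f"PMI = {value:.4f}")  # log2(5), roughly 2.32
```

Under this estimate the pair co-occurs five times more often than independence would predict.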

### Batch Processing

```python
from ppmi import PMI

pmi_calc = PMI()

# Add multiple observations at once
pairs = [
    ("apple", "fruit"),
    ("banana", "fruit"),
    ("carrot", "vegetable"),
    ("apple", "red", 3),  # with count
]

pmi_calc.add_observations_batch(pairs)

# Get all PPMI scores
all_ppmi = pmi_calc.get_all_ppmi()
```

### Export to Matrix

```python
import numpy as np
from ppmi import PMI

pmi_calc = PMI()
# ... add observations ...

# Get PPMI as a matrix
matrix, x_labels, y_labels = pmi_calc.to_matrix(metric='ppmi', base=2)

print("PPMI Matrix:")
print(matrix)
print(f"X items: {x_labels}")
print(f"Y items: {y_labels}")
```
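
For readers who want to see what a PPMI matrix contains, the sketch below computes one directly from a raw co-occurrence count matrix with NumPy. It is a standalone illustration under stated assumptions (rows are x items, columns are y items, marginals are row and column sums, zero-count cells map to 0). The helper `ppmi_matrix_sketch` is hypothetical and is not how `to_matrix` is necessarily implemented.

```python
import numpy as np

def ppmi_matrix_sketch(counts: np.ndarray) -> np.ndarray:
    """Hypothetical helper: base-2 PPMI for an (x items) by (y items) count matrix."""
    total = counts.sum()
    p_xy = counts / total                    # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)    # row marginals, shape (n_x, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # column marginals, shape (1, n_y)
    expected = p_x * p_y                     # joint probability under independence

    # log2(0) yields -inf and 0/0 yields nan; silence the warnings and fix them up.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / expected)
    pmi[~np.isfinite(pmi)] = 0.0

    return np.maximum(pmi, 0.0)              # clip negative PMI to zero

# Two items that each co-occur with exactly one context.
counts = np.array([[4.0, 0.0],
                   [0.0, 4.0]])
matrix = ppmi_matrix_sketch(counts)
print(matrix)
```

Each diagonal cell has joint probability 0.5 against an expected 0.25, giving a PPMI of 1 bit, while the empty cells stay at 0.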

### Weighted PPMI with Context Distribution Smoothing

The library supports weighted PPMI with the `alpha` parameter for context distribution smoothing. This is particularly useful for word embeddings and reduces bias toward rare contexts.

**Formula:**
```
PPMI_α(w, c) = max(0, log(P(w, c) / (P(w) × P_α(c))))
where P_α(c) = count(c)^α / Σ_c' count(c')^α
```

**Usage:**

```python
from ppmi import PMI

pmi_calc = PMI()
# ... add observations ...

# Standard PPMI (alpha=1.0, default)
ppmi_standard = pmi_calc.calculate_ppmi("word", "context", alpha=1.0)

# Weighted PPMI with recommended alpha=0.75
ppmi_weighted = pmi_calc.calculate_ppmi("word", "context", alpha=0.75)

# Get all PPMI scores with smoothing
all_ppmi = pmi_calc.get_all_ppmi(alpha=0.75)

# Export to matrix with smoothing
matrix, words, contexts = pmi_calc.to_matrix(metric='ppmi', alpha=0.75)
```

**Effect of alpha:**
- `alpha = 1.0`: Standard PPMI (no smoothing)
- `alpha < 1.0` (e.g., 0.75): Increases probability of rare contexts, reducing their PPMI
- `alpha > 1.0`: Decreases probability of rare contexts, increasing their PPMI

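**Recommendation:** Levy et al. (2015) found that `alpha=0.75` improves the performance of embeddings on a wide range of tasks. PMI is biased toward rare events, and smoothing counteracts this by raising the probability assigned to rare contexts, which lowers the scores of rare co-occurrences.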
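
The effect of `alpha` on the context distribution can be checked with a few lines of NumPy. This is a minimal sketch of the smoothing formula only. The helper name `smoothed_context_probs` and the example counts are made up for illustration and are not part of the library.

```python
import numpy as np

def smoothed_context_probs(context_counts, alpha=0.75):
    """Hypothetical helper: P_alpha(c) = count(c)^alpha / sum of count(c')^alpha."""
    powered = np.asarray(context_counts, dtype=float) ** alpha
    return powered / powered.sum()

# One frequent context and one rare context (illustrative counts).
counts = [990, 10]

raw = smoothed_context_probs(counts, alpha=1.0)        # plain relative frequencies
smoothed = smoothed_context_probs(counts, alpha=0.75)  # smoothed distribution

print("rare context, alpha=1.0: ", raw[1])
print("rare context, alpha=0.75:", smoothed[1])
```

With `alpha=0.75` the rare context receives a larger share of the probability mass than its raw frequency. That makes the denominator of the PMI ratio larger and the resulting score smaller.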

## Use Cases

### Natural Language Processing
- Word co-occurrence analysis
- Collocation detection
- Feature extraction for word embeddings
- Measuring word associations

### Information Retrieval
- Query expansion
- Document similarity
- Term weighting

### Data Mining
- Association rule mining
- Feature selection
- Pattern discovery

## API Reference

### PMI Class

#### Methods

- `__init__()`: Initialize the PMI calculator
- `add_observation(x, y, count=1)`: Add a single observation
- `add_observations_batch(pairs)`: Add multiple observations
- `calculate_pmi(x, y, base=2)`: Calculate PMI for a pair
- `calculate_ppmi(x, y, base=2, alpha=1.0)`: Calculate Positive PMI with optional context smoothing
- `calculate_npmi(x, y, base=2)`: Calculate Normalized PMI
- `get_all_pmi(base=2)`: Get PMI for all pairs
- `get_all_ppmi(base=2, alpha=1.0)`: Get PPMI for all pairs with optional smoothing
- `to_matrix(metric='ppmi', base=2, alpha=1.0)`: Export to numpy matrix

### Standalone Functions

- `calculate_pmi_from_probabilities(p_xy, p_x, p_y, base=2)`: Calculate PMI from probabilities
- `calculate_ppmi_from_probabilities(p_xy, p_x, p_y, base=2)`: Calculate PPMI from probabilities
- `calculate_pmi_from_counts(co_count, x_count, y_count, total, base=2)`: Calculate PMI from counts
- `calculate_ppmi_from_counts(co_count, x_count, y_count, total, base=2)`: Calculate PPMI from counts

### Parameters

- **base**: Logarithm base. Can be:
  - `2`: Binary logarithm (default)
  - `10`: Common logarithm
  - `'e'` or `math.e`: Natural logarithm
  - Any positive number: Custom base

- **alpha**: Context distribution smoothing parameter (for PPMI). Can be:
  - `1.0`: No smoothing, standard PPMI (default)
  - `< 1.0` (e.g., `0.75`): Recommended for word embeddings. Increases probability of rare contexts.
  - `> 1.0`: Decreases probability of rare contexts.
  
  Reference: Levy et al. (2015) "Improving Distributional Similarity with Lessons Learned from Word Embeddings"
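
Changing `base` only rescales the result, since a logarithm in base b equals the natural logarithm divided by ln(b). The sketch below illustrates this with a hypothetical helper that accepts either a number or the string `'e'`. How the library itself parses the `base` argument is an assumption here.

```python
import math

def log_with_base(value, base=2):
    """Hypothetical helper: logarithm of value in a numeric base or the string 'e'."""
    if base == "e":
        return math.log(value)
    return math.log(value) / math.log(base)

ratio = 5.0  # observed / expected probability ratio

bits = log_with_base(ratio, 2)      # PMI measured in bits
nats = log_with_base(ratio, "e")    # PMI measured in nats
bans = log_with_base(ratio, 10)     # PMI measured in base-10 units

print(bits, nats, bans)
```

The three values differ only by constant factors, so rankings of word pairs are the same in any base.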

## Examples

See the `examples/` directory for more detailed examples:
- `basic_usage.py`: Basic PMI calculations
- `text_analysis.py`: Word co-occurrence analysis
- `weighted_ppmi_example.py`: Weighted PPMI with context smoothing

## Testing

Run tests with pytest:

```bash
pytest tests/
```

With coverage:

```bash
pytest --cov=ppmi tests/
```

## Requirements

- Python >= 3.7
- numpy >= 1.19.0

## License

MIT License - see LICENSE file for details

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Citation

If you use this library in your research, please cite:

```
@software{ppmi,
  author = {Kossymova, Gulshat},
  title = {PPMI: A Python Library for Pointwise Mutual Information},
  year = {2026},
  url = {https://github.com/guls-1/ppmi}
}
```

## References

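- Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29.
- Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 31-40.
- Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211-225.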
