Metadata-Version: 2.2
Name: equitas-benchmark
Version: 0.1.2
Summary: A corruption robustness benchmark for multi-LLM committees with hierarchical aggregation
License: MIT
Project-URL: Homepage, https://github.com/akshan-main/Equitas
Project-URL: Repository, https://github.com/akshan-main/Equitas
Project-URL: Dataset, https://huggingface.co/datasets/akshan-main/Equitas
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: matplotlib>=3.8
Requires-Dist: pyyaml>=6.0
Requires-Dist: openai>=1.0
Requires-Dist: python-dotenv>=1.0

# Equitas

**Corruption-Robust Aggregation for Multi-LLM Governance Committees**

A benchmark for evaluating aggregation strategies in hierarchical multi-LLM
committees under adversarial corruption.

## Quick Start

```bash
pip install equitas-benchmark          # from PyPI
# or, for local development from a repository checkout:
pip install -e .
# run an experiment from a YAML config:
python -m equitas --config configs/governance_sweep_fh.yaml
```

## Aggregation Methods (8 baselines + oracle)

| Method | Key Idea |
|--------|----------|
| Oracle | Hindsight-optimal action (upper bound) |
| Multiplicative Weights | `w *= exp(-eta * loss)`, adapts to corruption |
| Supervisor Rerank | Follow-the-leader: re-rank by best recent agent |
| Confidence-Weighted | Weight by self-reported confidence |
| EMA Trust | Exponential moving average of past performance |
| Trimmed Vote | Drop outlier agents, then majority |
| Majority Vote | Equal-weight plurality |
| Oracle Upper Bound | Best single agent in hindsight |
| Random Dictator | Uniformly random agent each round |
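
The Multiplicative Weights rule in the table can be illustrated with a minimal, standalone sketch. It is not the package's implementation: the function names, the learning rate `eta=0.5`, the renormalization step, and the loss values are all assumptions chosen for illustration.

```python
import math

def mw_update(weights, losses, eta=0.5):
    """One multiplicative-weights step: w_i *= exp(-eta * loss_i).

    Renormalizing to sum to 1 is an assumption of this sketch.
    """
    updated = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(updated)
    return [w / total for w in updated]

def weighted_vote(votes, weights):
    """Return the action backed by the largest total weight."""
    totals = {}
    for action, w in zip(votes, weights):
        totals[action] = totals.get(action, 0.0) + w
    return max(totals, key=totals.get)

# Three agents start with equal weight; the third one is assumed to be
# corrupted and incurs the maximum loss (1.0) in every round.
weights = [1 / 3, 1 / 3, 1 / 3]
for _ in range(10):
    weights = mw_update(weights, losses=[0.0, 0.1, 1.0])

# The corrupted agent votes "B", but its weight has decayed.
decision = weighted_vote(["A", "A", "B"], weights)
```

Because each update is exponential in the loss, an agent that is consistently wrong loses influence geometrically, which is the sense in which the method "adapts to corruption".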

## Experiments

1. **Corruption sweep**: corruption rate × adversary type × aggregator
2. **Pareto sweep**: welfare-fairness tradeoff via (alpha, beta)
3. **Recovery**: mid-run corruption onset, track MW weight recovery
4. **Scaling**: committee size N in {3, 5, 7, 10}
5. **Hierarchical vs flat**: architecture comparison
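
As a rough illustration of how the corruption sweep's grid could be enumerated, the sketch below takes the Cartesian product of the three axes. The corruption rates, the aggregator identifiers, and the dictionary keys are hypothetical; only the four adversary type names come from this README. The actual grids are defined by the YAML files in `configs/`.

```python
from itertools import product

# Hypothetical corruption rates, for illustration only.
corruption_rates = [0.0, 0.25, 0.5]
# Adversary types as listed under Project Structure.
adversary_types = ["selfish", "coordinated", "scheduled", "deceptive"]
# Hypothetical aggregator identifiers.
aggregators = ["majority_vote", "multiplicative_weights"]

# One run specification per (rate, adversary, aggregator) combination.
sweep = [
    {"corruption_rate": rate, "adversary": adversary, "aggregator": aggregator}
    for rate, adversary, aggregator in product(
        corruption_rates, adversary_types, aggregators
    )
]
```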

## Reproducibility

Raw experiment outputs in `outputs/` include historical runs with all methods
tested during development (including `self_consistency`). The reported benchmark
results exclude `self_consistency` at the analysis layer: the table-generation
scripts (`scripts/generate_benchmark_tables.py`, `scripts/generate_go_vs_fh_tables.py`)
and the figure-generation script (`regenerate_figures.py`) filter it out on read. The
`self_consistency` aggregator is also hard-disabled in the codebase
(`equitas/config.py` raises `ValueError` if used) because it implements a
committee-level subsampled majority vote, not canonical within-agent
self-consistency sampling. See the future-work discussion in the paper.
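
The two safeguards described above (rejecting the aggregator at configuration time and filtering it out when reading historical outputs) might look roughly like the following sketch. The names `DISABLED_AGGREGATORS`, `validate_aggregator`, and `drop_disabled`, the `aggregator` key, and the other aggregator identifiers are assumptions for illustration; they are not taken from `equitas/config.py` or the scripts.

```python
# Hypothetical set of aggregators that must not be used or reported.
DISABLED_AGGREGATORS = {"self_consistency"}

def validate_aggregator(name):
    """Reject disabled aggregators when a config is loaded."""
    if name in DISABLED_AGGREGATORS:
        raise ValueError(f"aggregator {name!r} is disabled")
    return name

def drop_disabled(rows, key="aggregator"):
    """Filter disabled aggregators out of raw result rows on read."""
    return [row for row in rows if row[key] not in DISABLED_AGGREGATORS]

# Placeholder rows standing in for records loaded from outputs/.
rows = [
    {"aggregator": "majority_vote", "run": 1},
    {"aggregator": "self_consistency", "run": 1},
    {"aggregator": "ema_trust", "run": 1},
]
kept = drop_disabled(rows)
```

Filtering on read leaves the raw files untouched, so the historical runs stay available while the reported tables and figures remain consistent.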

To regenerate all artifacts from raw data:

```bash
python scripts/generate_benchmark_tables.py   # tables/benchmark/
python scripts/generate_go_vs_fh_tables.py    # tables/
python regenerate_figures.py                   # paper/figures/
python -m pytest tests/ -q                    # 88 tests
```

## Project Structure

```
equitas/          # pip-installable package
  agents/         # LLM client, member/leader/judge/governor agents
  aggregators/    # 8 aggregation strategies (registry pattern)
  adversaries/    # 4 adversary types (selfish, coordinated, scheduled, deceptive)
  metrics/        # fairness, welfare, Pareto, robust statistics
  simulation/     # hierarchical + flat engine
  experiments/    # sweep, recovery, scaling, pareto, hier-vs-flat
  plotting/       # paper-quality matplotlib figures
configs/          # YAML experiment configs
scripts/          # table generation, analysis
paper/            # LaTeX source + figures
tests/            # 88 unit + integration tests
```

## Links

- [PyPI](https://pypi.org/project/equitas-benchmark/)
- [HuggingFace Dataset](https://huggingface.co/datasets/akshan-main/Equitas)
- [GitHub](https://github.com/akshan-main/Equitas)
