Metadata-Version: 2.1
Name: gbert
Version: 0.5.6
Summary: Research-oriented multilingual manifesto analysis with comparative and corpus-level inference.
Project-URL: Homepage, https://pypi.org/project/gbert/
Keywords: nlp,bert,manifesto-analysis,comparative-politics,policy-analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: huggingface-hub<2,>=0.30
Requires-Dist: joblib<2,>=1.4
Requires-Dist: numpy<3,>=1.26
Requires-Dist: pandas<3,>=2.2
Requires-Dist: scikit-learn<2,>=1.5
Requires-Dist: torch<3,>=2.4
Requires-Dist: transformers<5,>=4.46
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine<7,>=5; extra == "dev"

# gbert

`gbert` is a research-oriented package for multilingual manifesto analysis. It provides a complete workflow for single-text inference, batch prediction, CMP lookup, corpus-level profiling, group comparison, country-year panel construction, UMAP projection, and publication-oriented plotting.

## Installation

Install the base package:

```bash
pip install gbert
```

For UMAP and plotting, install the optional analysis stack:

```bash
pip install umap-learn matplotlib seaborn
```
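
Before calling the UMAP or plotting functions, it can help to confirm that the optional stack is importable. The following is a minimal sketch using only the standard library; the helper name `missing_optional_modules` is hypothetical and not part of `gbert`. Note that the `umap-learn` distribution is imported as `umap`.

```python
from importlib.util import find_spec

# Import names of the optional analysis stack. The PyPI package
# "umap-learn" is imported under the module name "umap".
OPTIONAL_MODULES = ("umap", "matplotlib", "seaborn")


def missing_optional_modules(modules=OPTIONAL_MODULES):
    """Return the top-level module names that cannot be located.

    Hypothetical helper for illustration; not part of gbert.
    """
    return [name for name in modules if find_spec(name) is None]


missing = missing_optional_modules()
if missing:
    print("Missing optional modules:", ", ".join(missing))
else:
    print("Optional analysis stack is available.")
```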

## Coverage Note

The package is designed for country-year inference with temporal coverage extending through 2023 in standard use. The built-in demo corpus remains a compact illustrative dataset for 1996-2018.

## End-to-End Example

The following example mirrors the full workflow supported by the package.

### 1. Install the package

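In a Jupyter or Colab notebook, install the package from a cell. The leading `!` runs the command in a shell; in a terminal, use the command from the Installation section instead.

```python
!pip -q install gbert
```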

### 2. Import and initialize the model

Use `GbertClassifier` to load the runtime model and metadata. The `model_repo_id` argument points to the Hugging Face repository containing the model weights.

```python
from gbert import GbertClassifier, load_demo_corpus_ja_us_de_1996_2018

model = GbertClassifier(
    model_repo_id="X-Li/gbert",
    # device="cpu",
)
```

### 3. Load the built-in demo corpus

`load_demo_corpus_ja_us_de_1996_2018()` returns an illustrative multilingual test corpus covering Japan, the United States, and Germany for each year from 1996 to 2018. The returned `DataFrame` includes `text`, `country`, `year`, and `party_family`.

```python
demo = load_demo_corpus_ja_us_de_1996_2018()
texts = demo["text"].tolist()
countries = demo["country"].tolist()
years = demo["year"].tolist()
party_family = demo["party_family"].tolist()
```

### 4. Single prediction

Use `predict()` for one sentence at a time. The returned object contains ranked CMP predictions and macroeconomic covariates used during inference.

```python
single = model.predict(
    "政府は先端産業への投資を強化する。",
    country="Japan",
    year=2026,
)
print(single["predictions"][:3])
```

### 5. Batch prediction

Use `predict_batch()` for multiple texts. Set `return_df=True` if you want a compact `DataFrame` instead of the full nested output.

```python
batch = model.predict_batch(
    texts=texts,
    country=countries,
    year=years,
    return_df=True,
)
print(batch.head())
```

### 6. CMP code lookup

Use `get_cmp_info()` to inspect a specific CMP code and `list_cmp_codes()` to see the available label space.

```python
print(model.get_cmp_info(305))
print(model.list_cmp_codes()[:10])
```

### 7. Corpus analysis

`analyze_corpus()` is the primary text-level analysis API. It returns a `DataFrame` with top predictions, entropy, confidence margin, and optional full posterior columns.

```python
analysis = model.analyze_corpus(
    texts=texts,
    country=countries,
    year=years,
    include_probabilities=True,
)
print(analysis.columns[:20])
print(analysis[["country", "year", "top_cmp_code", "top_cmp_title", "top_probability", "entropy"]].head())
```
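
To make the uncertainty columns concrete, the sketch below computes entropy and a confidence margin for two hypothetical posterior distributions. It assumes that entropy means Shannon entropy (here in nats) and that the margin is the gap between the two largest probabilities; the package's exact definitions may differ.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing.

    Assumption: illustrative definition, not necessarily gbert's.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence_margin(probs):
    """Gap between the largest and second-largest probabilities.

    Assumption: illustrative definition, not necessarily gbert's.
    """
    top, runner_up = sorted(probs, reverse=True)[:2]
    return top - runner_up


# Hypothetical posteriors over three labels.
confident = [0.90, 0.05, 0.05]
uncertain = [0.34, 0.33, 0.33]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(entropy(probs), 3), round(confidence_margin(probs), 3))
```

A peaked posterior yields low entropy and a large margin, while a flat posterior yields high entropy and a margin near zero, so either column can be used to filter ambiguous sentences.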

### 8. Topic profile

`compute_topic_profile()` aggregates sentence-level posterior probabilities into a corpus-level CMP profile.

```python
profile = model.compute_topic_profile(
    texts=texts,
    country=countries,
    year=years,
)
print(profile.head(10))
```
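
As a rough illustration of this aggregation, the sketch below averages per-sentence posteriors into one corpus-level distribution. Simple averaging is an assumption made here for clarity; the package may weight or normalize differently. The function name `mean_profile` and the probabilities are hypothetical.

```python
def mean_profile(posteriors):
    """Average per-sentence distributions given as {cmp_code: probability}.

    Assumption: unweighted mean, which may differ from gbert's aggregation.
    """
    codes = sorted({code for row in posteriors for code in row})
    n = len(posteriors)
    return {
        code: sum(row.get(code, 0.0) for row in posteriors) / n
        for code in codes
    }


# Hypothetical posteriors for three sentences over three CMP codes.
posteriors = [
    {401: 0.7, 504: 0.2, 106: 0.1},
    {401: 0.1, 504: 0.8, 106: 0.1},
    {401: 0.4, 504: 0.4, 106: 0.2},
]

profile_sketch = mean_profile(posteriors)
for code, share in sorted(profile_sketch.items(), key=lambda kv: -kv[1]):
    print(code, round(share, 3))
```

Because each input row sums to one, the averaged profile also sums to one and can be read as the corpus-level share of each topic.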

### 9. Bootstrap topic profile

`bootstrap_topic_profile()` produces interval estimates for the corpus profile through repeated resampling.

```python
bootstrap = model.bootstrap_topic_profile(
    texts=texts,
    country=countries,
    year=years,
    n_bootstrap=100,
)
print(bootstrap.head(10))
```
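
The resampling idea can be sketched with a percentile bootstrap for the mean probability of a single topic. This is a generic illustration under stated assumptions (resampling sentences with replacement, percentile intervals); it is not taken from the package's implementation, and `bootstrap_mean_interval` is a hypothetical name.

```python
import random


def bootstrap_mean_interval(values, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    Assumption: generic percentile bootstrap, not gbert's exact procedure.
    """
    rng = random.Random(seed)  # seeded for reproducible resamples
    n = len(values)
    # Resample n values with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_bootstrap)
    )
    lower = means[int((alpha / 2) * n_bootstrap)]
    upper = means[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    return lower, upper


# Hypothetical per-sentence probabilities for one CMP code.
probs = [0.10, 0.35, 0.20, 0.60, 0.15, 0.40, 0.25, 0.30]
low, high = bootstrap_mean_interval(probs)
print(round(low, 3), round(high, 3))
```

Larger values of `n_bootstrap` give more stable interval endpoints at the cost of runtime, which is the same trade-off controlled by the `n_bootstrap` argument above.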

### 10. Group comparison

`compare_groups()` compares any user-defined grouping variable. Here the grouping variable is `party_family`.

```python
comparison = model.compare_groups(
    texts=texts,
    country=countries,
    year=years,
    group=party_family,
)
print(comparison["group_summary"])
print(comparison["pairwise_divergence"].head())
```

### 11. Country profile comparison

`compare_country_profiles()` is a convenience wrapper for country-level comparison.

```python
country_comparison = model.compare_country_profiles(
    texts=texts,
    country=countries,
    year=years,
)
print(country_comparison["group_summary"].head())
```

### 12. Country-year panel

`panelize_country_year()` converts text-level predictions into a country-year panel with topic scores and summary indicators.

```python
panel = model.panelize_country_year(
    texts=texts,
    country=countries,
    year=years,
)
print(panel.head())
```
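
Conceptually, panel construction groups text-level results by country and year and summarizes each cell. The sketch below shows that grouping step in plain Python with a single hypothetical `score` per text; the actual panel columns and summary indicators produced by the package are not reproduced here.

```python
from collections import defaultdict


def panelize(rows):
    """Average a per-text score within each (country, year) cell.

    Hypothetical helper for illustration; not gbert's implementation.
    """
    cells = defaultdict(list)
    for row in rows:
        cells[(row["country"], row["year"])].append(row["score"])
    return {
        key: {"n_texts": len(scores), "mean_score": sum(scores) / len(scores)}
        for key, scores in sorted(cells.items())
    }


# Hypothetical text-level scores for one topic.
rows = [
    {"country": "Japan", "year": 2000, "score": 0.2},
    {"country": "Japan", "year": 2000, "score": 0.4},
    {"country": "Germany", "year": 2000, "score": 0.5},
]

panel_sketch = panelize(rows)
for (country, year), cell in panel_sketch.items():
    print(country, year, cell["n_texts"], round(cell["mean_score"], 3))
```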

### 13. Methods-summary export

`export_methods_summary()` generates a compact dictionary for manuscript drafting, including sample statistics, prediction-quality summaries, top topics, and a ready-to-edit methods paragraph.

```python
methods = model.export_methods_summary(
    texts=texts,
    country=countries,
    year=years,
)
print(methods.keys())
print(methods["methods_text"])
```

### 14. Raw UMAP projection

`project_umap()` returns the low-dimensional projection coordinates as a `DataFrame`. By default it uses the context-conditioned text representation (`cls_film`), and it also supports raw BERT CLS vectors through `representation="cls"`.

```python
umap_df = model.project_umap(
    texts=texts,
    country=countries,
    year=years,
)
print(umap_df.head())
```

### 15. UMAP plot

`plot_umap()` directly produces a seaborn-based figure and returns the figure, axes, and projected frame. It uses the context-conditioned text representation by default.

```python
fig, ax, umap_frame = model.plot_umap(
    texts=texts,
    country=countries,
    year=years,
    color_by="country",
    annotate=False,
)
```

### 16. Topic profile bar plot

`plot_topic_profile()` visualizes the aggregated CMP profile.

```python
fig, ax = model.plot_topic_profile(
    profile,
    top_n=12,
    title="Corpus Topic Profile",
)
```

### 17. Group divergence heatmap

`plot_group_divergence()` visualizes the pairwise Jensen-Shannon divergence returned by `compare_groups()`.

```python
fig, ax, divergence_matrix = model.plot_group_divergence(
    comparison["pairwise_divergence"],
    title="Party Family Divergence",
)
```
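
For readers unfamiliar with the measure, the sketch below computes the Jensen-Shannon divergence between two topic distributions from its standard definition. Base-2 logarithms are an assumption made here so that the value lies between 0 and 1; the package may use a different base or report the square root (the Jensen-Shannon distance).

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint.

    Assumption: base-2 logs, so the result is bounded by 1.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic profiles for two groups.
group_a = [0.6, 0.3, 0.1]
group_b = [0.2, 0.3, 0.5]

print(round(js_divergence(group_a, group_b), 4))
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and stays finite when one group assigns zero probability to a topic, which makes it convenient for a pairwise heatmap.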

### 18. Topic heatmap from the country-year panel

`plot_topic_heatmap()` uses the panel output to visualize topic intensity across country-year cells. Pass `topics` to restrict the heatmap to selected CMP codes.

```python
fig, ax, heatmap = model.plot_topic_heatmap(
    panel,
    topics=[401, 504, 106],
    title="Selected Topic Heatmap",
)
```

### 19. Temporal trend plot

`plot_temporal_trends()` visualizes topic trajectories across years using the country-year panel.

```python
fig, ax, trend_df = model.plot_temporal_trends(
    panel,
    countries=["Japan", "United States", "Germany"],
    title="Temporal Topic Trends",
)
```

### 20. Ridgeplot from text-level analysis

`plot_topic_ridgeplot()` visualizes the distribution of posterior probabilities for selected CMP topics.

```python
g = model.plot_topic_ridgeplot(
    analysis,
    topics=[504],
    title="Topic Probability Ridgeplot",
)
```

## Main Interfaces

- `predict(...)`
- `predict_batch(...)`
- `analyze_corpus(...)`
- `compute_topic_profile(...)`
- `bootstrap_topic_profile(...)`
- `compare_groups(...)`
- `compare_country_profiles(...)`
- `panelize_country_year(...)`
- `project_umap(...)`
- `plot_umap(...)`
- `plot_topic_profile(...)`
- `plot_group_divergence(...)`
- `plot_topic_heatmap(...)`
- `plot_topic_ridgeplot(...)`
- `plot_temporal_trends(...)`
- `export_methods_summary(...)`
- `get_cmp_info(...)`
- `list_cmp_codes(...)`
- `load_demo_corpus_ja_us_de_1996_2018()`

The default text backbone is `bert-base-multilingual-cased`.

## CMP Code Reference

The CMP labels exposed by `get_cmp_info()` and `list_cmp_codes()` follow the Comparative Manifesto Project coding scheme. For code definitions and the underlying coding framework, cite the Comparative Manifesto Project dataset and codebook in substantive applications.
