Metadata-Version: 2.1
Name: gbert
Version: 0.6.3
Summary: Research-oriented multilingual manifesto analysis with comparative and corpus-level inference.
Project-URL: Homepage, https://pypi.org/project/gbert/
Keywords: nlp,bert,manifesto-analysis,comparative-politics,policy-analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: huggingface-hub<2,>=0.30
Requires-Dist: joblib<2,>=1.4
Requires-Dist: numpy<3,>=1.26
Requires-Dist: pandas<3,>=2.2
Requires-Dist: scikit-learn<2,>=1.5
Requires-Dist: torch<3,>=2.4
Requires-Dist: transformers<5,>=4.46
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine<7,>=5; extra == "dev"

# gbert

`gbert` is a research-oriented package for multilingual manifesto analysis. It provides a complete workflow for single-text inference, batch prediction, CMP lookup, corpus-level profiling, group comparison, country-year panel construction, UMAP projection, and publication-oriented plotting.

## Installation

Install the base package:

```bash
pip install gbert
```

For UMAP and plotting, install the optional analysis stack:

```bash
pip install umap-learn matplotlib seaborn
```

## Coverage Note

The package is designed for country-year inference, with temporal coverage in standard use extending through 2023. The built-in demo corpus is a compact illustrative dataset covering 1996-2018.

## End-to-End Example

The following example mirrors the full workflow supported by the package.

### 1. Install the package

```python
# Notebook cell syntax; from a shell, run `pip install gbert` instead.
!pip -q install gbert
```

### 2. Import and initialize the model

Use `GbertClassifier` to load the runtime model and metadata. The `model_repo_id` argument points to the Hugging Face repository containing the model weights.

```python
from gbert import GbertClassifier, load_demo_corpus_ja_us_de_1996_2018

model = GbertClassifier(
    model_repo_id="X-Li/gbert",
    # device="cpu",
)
```

### 3. Load the built-in demo corpus

`load_demo_corpus_ja_us_de_1996_2018()` returns an illustrative multilingual test corpus covering Japan, the United States, and Germany for each year from 1996 to 2018. The returned `DataFrame` includes `text`, `country`, `year`, and `party_family`.

```python
demo = load_demo_corpus_ja_us_de_1996_2018()
texts = demo["text"].tolist()
countries = demo["country"].tolist()
years = demo["year"].tolist()
party_family = demo["party_family"].tolist()
```

### 4. Single prediction

Use `predict()` for one sentence at a time. The returned object contains ranked CMP predictions and macroeconomic covariates used during inference.

```python
single = model.predict(
    "政府は先端産業への投資を強化する。",
    country="Japan",
    year=2026,
)
print(single["predictions"][:3])
```

### 5. Batch prediction

Use `predict_batch()` for multiple texts. Set `return_df=True` if you want a compact `DataFrame` instead of the full nested output.

```python
batch = model.predict_batch(
    texts=texts,
    country=countries,
    year=years,
    return_df=True,
)
print(batch.head())
```

### 6. CMP code lookup

Use `get_cmp_info()` to inspect a specific CMP code and `list_cmp_codes()` to see the available label space.

```python
print(model.get_cmp_info(305))
print(model.list_cmp_codes()[:10])
```

### 7. Corpus analysis

`analyze_corpus()` is the primary text-level analysis API. It returns a `DataFrame` with top predictions, entropy, confidence margin, and optional full posterior columns.

```python
analysis = model.analyze_corpus(
    texts=texts,
    country=countries,
    year=years,
    include_probabilities=True,
)
print(analysis.columns[:20])
print(analysis[["country", "year", "top_cmp_code", "top_cmp_title", "top_probability", "entropy"]].head())
```
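
The two uncertainty columns can be illustrated on a single posterior vector. The sketch below is not the package's implementation: it assumes entropy is Shannon entropy in nats and that the confidence margin is the gap between the two largest probabilities, and `entropy_and_margin` is a hypothetical helper name.

```python
import numpy as np

def entropy_and_margin(probs):
    """Return (Shannon entropy in nats, top-1 minus top-2 probability)."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()                      # guard against rounding drift
    nonzero = p[p > 0]                   # 0 * log(0) is treated as 0
    entropy = float(-(nonzero * np.log(nonzero)).sum())
    top2 = np.sort(p)[-2:]               # ascending order: [second, first]
    margin = float(top2[1] - top2[0])
    return entropy, margin

# A peaked posterior has low entropy and a wide margin; a flat one does not.
confident = entropy_and_margin([0.90, 0.05, 0.05])
ambiguous = entropy_and_margin([0.40, 0.35, 0.25])
print(confident, ambiguous)
```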

### 8. Topic profile

`compute_topic_profile()` aggregates sentence-level posterior probabilities into a corpus-level CMP profile.

```python
profile = model.compute_topic_profile(
    texts=texts,
    country=countries,
    year=years,
)
print(profile.head(10))
```
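
As a rough illustration of what aggregating sentence-level posteriors means, the sketch below averages toy posterior rows into one corpus-level distribution. The unweighted mean is an assumption, not a documented detail of `compute_topic_profile()`, and the probabilities are made up.

```python
import pandas as pd

# Hypothetical sentence-level posteriors: rows are sentences, columns are CMP codes.
posteriors = pd.DataFrame(
    [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.4, 0.4, 0.2]],
    columns=[401, 504, 106],
)

# Assumed aggregation: unweighted mean over sentences, sorted by share.
toy_profile = posteriors.mean(axis=0).sort_values(ascending=False)
print(toy_profile)
```

Because every row sums to one, the averaged profile also sums to one and can be read as a topic share.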

### 9. Bootstrap topic profile

`bootstrap_topic_profile()` produces interval estimates for the corpus profile through repeated resampling.

```python
bootstrap = model.bootstrap_topic_profile(
    texts=texts,
    country=countries,
    year=years,
    n_bootstrap=100,
)
print(bootstrap.head(10))
```
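
The resampling idea can be sketched with NumPy alone. This example assumes a percentile bootstrap over texts with a 95% interval. The package's actual resampling unit and interval type are not stated here, so treat both as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posteriors for 50 texts over 3 topics (each row sums to 1).
post = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=50)

n_draws = 200
draws = np.empty((n_draws, post.shape[1]))
for b in range(n_draws):
    idx = rng.integers(0, len(post), size=len(post))  # resample texts with replacement
    draws[b] = post[idx].mean(axis=0)                 # profile of the resampled corpus

# Percentile interval per topic, next to the full-sample point estimate.
lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)
point = post.mean(axis=0)
print(np.column_stack([lower, point, upper]))
```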

### 10. Group comparison

`compare_groups()` compares any user-defined grouping variable. Here the grouping variable is `party_family`.

```python
comparison = model.compare_groups(
    texts=texts,
    country=countries,
    year=years,
    group=party_family,
)
print(comparison["group_summary"])
print(comparison["pairwise_divergence"].head())
```

### 11. Country profile comparison

`compare_country_profiles()` is a convenience wrapper for country-level comparison.

```python
country_comparison = model.compare_country_profiles(
    texts=texts,
    country=countries,
    year=years,
)
print(country_comparison["group_summary"].head())
```

### 12. Country-year panel

`panelize_country_year()` converts text-level predictions into a country-year panel with topic scores and summary indicators.

```python
panel = model.panelize_country_year(
    texts=texts,
    country=countries,
    year=years,
)
print(panel.head())
```
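
Conceptually, a country-year panel is a grouped aggregation of text-level scores. The pandas sketch below assumes mean aggregation and a hypothetical `n_texts` count column. Only the `score_<cmp_code>` naming comes from this README.

```python
import pandas as pd

# Hypothetical text-level scores for two CMP codes.
text_level = pd.DataFrame({
    "country": ["Japan", "Japan", "Germany", "Germany"],
    "year": [2000, 2000, 2000, 2001],
    "score_401": [0.6, 0.2, 0.5, 0.3],
    "score_504": [0.4, 0.8, 0.5, 0.7],
})

# One row per country-year cell: mean topic scores plus the number of texts.
toy_panel = (
    text_level
    .groupby(["country", "year"], as_index=False)
    .agg(
        score_401=("score_401", "mean"),
        score_504=("score_504", "mean"),
        n_texts=("score_401", "size"),
    )
)
print(toy_panel)
```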

### 13. Methods-summary export

`export_methods_summary()` generates a compact dictionary for manuscript drafting, including sample statistics, prediction-quality summaries, top topics, and a ready-to-edit methods paragraph.

```python
methods = model.export_methods_summary(
    texts=texts,
    country=countries,
    year=years,
)
print(methods.keys())
print(methods["methods_text"])
```

### 14. Raw UMAP projection

`project_umap()` returns the low-dimensional projection coordinates as a `DataFrame`. By default it uses the context-conditioned text representation (`cls_film`), and it also supports raw BERT CLS vectors through `representation="cls"`.

```python
umap_df = model.project_umap(
    texts=texts,
    country=countries,
    year=years,
)
print(umap_df.head())
```

### 15. UMAP plot

`plot_umap()` directly produces a seaborn-based figure and returns the figure, axes, and projected frame. It uses the context-conditioned text representation by default.

```python
fig, ax, umap_frame = model.plot_umap(
    texts=texts,
    country=countries,
    year=years,
    color_by="country",
    annotate=False,
)
```

### 16. Topic profile bar plot

`plot_topic_profile()` visualizes the aggregated CMP profile.

```python
fig, ax = model.plot_topic_profile(
    profile,
    top_n=12,
    title="Corpus Topic Profile",
)
```

### 17. Group divergence heatmap

`plot_group_divergence()` visualizes the pairwise Jensen-Shannon divergence returned by `compare_groups()`.

```python
fig, ax, divergence_matrix = model.plot_group_divergence(
    comparison["pairwise_divergence"],
    title="Party Family Divergence",
)
```
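
For readers unfamiliar with the measure, Jensen-Shannon divergence between two topic profiles can be computed by hand as below. This is a generic sketch using base-2 logarithms, so values fall in [0, 1]. The logarithm base and any square-root (distance) transform used inside the package are not specified here.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (0 = identical, 1 = disjoint)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)                    # mixture of the two distributions

    def kl(a, b):
        mask = a > 0                     # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

Unlike Kullback-Leibler divergence, the measure is symmetric and finite even when one profile assigns zero probability to a topic.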

### 18. Topic heatmap from the country-year panel

`plot_topic_heatmap()` uses the panel output to visualize topic intensity across country-year cells, optionally restricted to selected CMP codes.

```python
fig, ax, heatmap = model.plot_topic_heatmap(
    panel,
    topics=[401, 504, 106],
    title="Selected Topic Heatmap",
)
```

### 19. Temporal trend plot

`plot_temporal_trends()` visualizes topic trajectories across years using the country-year panel.

```python
fig, ax, trend_df = model.plot_temporal_trends(
    panel,
    countries=["Japan", "United States", "Germany"],
    title="Temporal Topic Trends",
)
```

### 20. Ridgeplot from text-level analysis

`plot_topic_ridgeplot()` visualizes the distribution of posterior probabilities for selected CMP topics.

```python
g = model.plot_topic_ridgeplot(
    analysis,
    topics=[504],
    title="Topic Probability Ridgeplot",
)
```

## Main Interfaces

- `predict(...)`
- `predict_batch(...)`
- `analyze_corpus(...)`
- `compute_topic_profile(...)`
- `bootstrap_topic_profile(...)`
- `compare_groups(...)`
- `compare_country_profiles(...)`
- `panelize_country_year(...)`
- `project_umap(...)`
- `plot_umap(...)`
- `plot_topic_profile(...)`
- `plot_group_divergence(...)`
- `plot_topic_heatmap(...)`
- `plot_topic_ridgeplot(...)`
- `plot_temporal_trends(...)`
- `export_methods_summary(...)`
- `get_cmp_info(...)`
- `list_cmp_codes(...)`
- `load_demo_corpus_ja_us_de_1996_2018()`

The default text backbone is `bert-base-multilingual-cased`.

## CMP Code Reference

The CMP labels exposed by `get_cmp_info()` and `list_cmp_codes()` follow the Comparative Manifesto Project coding scheme. For code definitions and the underlying coding framework, cite the Comparative Manifesto Project dataset and codebook in substantive applications.

## Advanced Research APIs

The package also includes additional analysis layers for comparative inference, uncertainty and robustness assessment, and causal or explanatory diagnostics. These methods are designed to build on the standard outputs shown above, especially `analysis` and `panel`.

### Comparative Inference

#### `estimate_dynamic_topic_trends(panel, ...)`

Use this method when you want smoothed country-level topic trajectories rather than raw year-to-year values. It accepts the `panel` returned by `panelize_country_year()`.

Key arguments:

- `panel`: a country-year panel with `score_<cmp_code>` columns.
- `topics`: optional list of CMP codes to keep. If omitted, all topic-score columns are used.
- `countries`: optional country filter.
- `smoothing_window`: rolling window length used to compute `smoothed_score`.
- `min_periods`: minimum number of observations required for each rolling estimate.

Returns a long-format `DataFrame` with:

- `country`
- `year`
- `cmp_code`
- `cmp_title`
- `score`
- `smoothed_score`
- `delta_from_previous`

```python
dynamic_trends = model.estimate_dynamic_topic_trends(
    panel,
    topics=[401, 504, 106],
    countries=["Japan", "United States", "Germany"],
    smoothing_window=3,
)
print(dynamic_trends.head())

fig, ax, trend_plot_data = model.plot_dynamic_topic_trends(dynamic_trends)
```
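
The roles of `smoothing_window` and `min_periods` can be seen in a plain pandas rolling mean. The sketch assumes a trailing (not centered) window and a simple first difference for `delta_from_previous`. Neither detail is confirmed by this README.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Japan"] * 4 + ["Germany"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "score": [0.10, 0.20, 0.30, 0.40, 0.50, 0.40, 0.30, 0.20],
}).sort_values(["country", "year"])

# Rolling mean within each country; min_periods=1 keeps the earliest years.
toy["smoothed_score"] = toy.groupby("country")["score"].transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)
# Year-over-year change within each country (missing for the first year).
toy["delta_from_previous"] = toy.groupby("country")["score"].diff()
print(toy)
```

With `min_periods` equal to the window length, the first two years of each country would instead be missing in `smoothed_score`.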

#### `compare_topic_shifts(panel, ...)`

Use this method to compare start-to-end change within each country or another panel grouping variable. It summarizes how much each topic moved between the first and last available year in the filtered panel.

Key arguments:

- `panel`: the country-year panel.
- `topics`: optional CMP codes.
- `group_col`: grouping variable inside the panel, usually `country`.
- `start_year` and `end_year`: optional year bounds.

Returns a `DataFrame` with:

- group identifier column such as `country`
- `cmp_code`
- `cmp_title`
- `start_year`
- `end_year`
- `start_score`
- `end_score`
- `absolute_change`
- `relative_change`

```python
topic_shifts = model.compare_topic_shifts(
    panel,
    topics=[401, 504, 106],
    group_col="country",
    start_year=1996,
    end_year=2018,
)
print(topic_shifts.head())

fig, ax, topic_shift_plot = model.plot_topic_shifts(topic_shifts)
```
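
The change columns are simple to reason about on a toy panel. In this sketch `relative_change` is assumed to be the absolute change divided by the starting score. The package may define it differently, for example with a guard for zero start scores.

```python
import pandas as pd

toy_panel = pd.DataFrame({
    "country": ["Japan", "Japan", "Germany", "Germany"],
    "year": [1996, 2018, 1996, 2018],
    "score_401": [0.10, 0.15, 0.20, 0.10],
}).sort_values(["country", "year"])

# First and last available observation per country.
grouped = toy_panel.groupby("country")
shifts = pd.DataFrame({
    "start_year": grouped["year"].first(),
    "end_year": grouped["year"].last(),
    "start_score": grouped["score_401"].first(),
    "end_score": grouped["score_401"].last(),
})
shifts["absolute_change"] = shifts["end_score"] - shifts["start_score"]
# Assumed definition: change relative to the starting score.
shifts["relative_change"] = shifts["absolute_change"] / shifts["start_score"]
print(shifts)
```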

#### `compute_topic_convergence(panel, ...)`

Use this method to study whether countries are becoming more similar or more different over time. It combines topic-wise cross-country dispersion with pairwise Jensen-Shannon divergence.

Key arguments:

- `panel`: the country-year panel.
- `topics`: optional CMP code subset.

Returns a dictionary with:

- `yearly_summary`: topic-level yearly dispersion measures such as `cross_country_std` and `cross_country_range`, plus an `All Topics` row carrying `mean_pairwise_js`.
- `pairwise_divergence`: pairwise country divergence by year.

```python
convergence = model.compute_topic_convergence(
    panel,
    topics=[401, 504, 106],
)
print(convergence["yearly_summary"].head())
print(convergence["pairwise_divergence"].head())

fig, axes, convergence_plot = model.plot_topic_convergence(convergence)
```
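
The dispersion measures can be sketched as a per-year aggregation across countries. The example assumes the sample standard deviation (pandas default, `ddof=1`) and a max-minus-min range. The column names mirror those listed above, but the computation itself is an illustration.

```python
import pandas as pd

toy_panel = pd.DataFrame({
    "country": ["Japan", "Germany", "United States"] * 2,
    "year": [2000] * 3 + [2010] * 3,
    "score_504": [0.10, 0.30, 0.50, 0.28, 0.30, 0.32],
})

# Spread of one topic score across countries, computed separately for each year.
dispersion = toy_panel.groupby("year")["score_504"].agg(
    cross_country_std="std",
    cross_country_range=lambda s: s.max() - s.min(),
)
print(dispersion)
```

A falling standard deviation or range over time, as in this toy data, is what convergence on a topic would look like.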

### Uncertainty And Robustness

#### `summarize_prediction_uncertainty(texts, country, year, ...)`

Use this method to identify uncertain predictions at the text level. It relies on predictive entropy and the top-class confidence margin, and marks a text as high-uncertainty when its entropy is above the high-entropy threshold or its margin is below the low-margin threshold.

Key arguments:

- `texts`, `country`, `year`: the corpus to evaluate.
- `group`: optional grouping variable such as `party_family`.
- `entropy_quantile`: quantile used to define the high-entropy threshold.
- `margin_quantile`: quantile used to define the low-margin threshold.

Returns a dictionary with:

- `summary`: overall uncertainty statistics and thresholds.
- `text_level`: a text-level `DataFrame` including `high_entropy_flag`, `low_margin_flag`, and `high_uncertainty`.
- `group_summary`: group-level uncertainty summary if `group` is provided.

```python
uncertainty = model.summarize_prediction_uncertainty(
    texts=texts,
    country=countries,
    year=years,
    group=party_family,
    entropy_quantile=0.75,
    margin_quantile=0.25,
)
print(uncertainty["summary"])
print(uncertainty["group_summary"].head())

fig, ax, uncertainty_plot = model.plot_prediction_uncertainty(uncertainty)
```
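
The quantile-based flagging rule can be reproduced on toy values. The sketch assumes strict inequalities and thresholds computed over the whole corpus. The entropy and margin numbers are invented for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "entropy": [0.2, 0.5, 1.1, 1.6, 0.9, 0.3, 1.4, 0.7],
    "margin": [0.80, 0.55, 0.10, 0.05, 0.30, 0.70, 0.40, 0.45],
})

# Thresholds taken from the empirical distribution of each measure.
entropy_cut = scores["entropy"].quantile(0.75)
margin_cut = scores["margin"].quantile(0.25)

scores["high_entropy_flag"] = scores["entropy"] > entropy_cut
scores["low_margin_flag"] = scores["margin"] < margin_cut
# A text is flagged when either criterion fires.
scores["high_uncertainty"] = scores["high_entropy_flag"] | scores["low_margin_flag"]
print(scores)
```

Raising `entropy_quantile` or lowering `margin_quantile` makes the flags stricter, so fewer texts are marked.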

#### `assess_topic_stability(texts, country, year, ...)`

Use this method to evaluate how stable topic rankings are under bootstrap resampling. This is helpful when you want to know whether the leading topics in a corpus are robust or highly sample-dependent.

Key arguments:

- `texts`, `country`, `year`: the corpus to evaluate.
- `n_bootstrap`: number of bootstrap draws.
- `top_k`: target rank threshold used for `top_k_frequency`.
- `random_state`: random seed.

Returns a `DataFrame` with:

- `cmp_code`
- `cmp_title`
- `mean_score`
- `score_std`
- `rank_mean`
- `rank_std`
- `top_k_frequency`

```python
stability = model.assess_topic_stability(
    texts=texts,
    country=countries,
    year=years,
    n_bootstrap=200,
    top_k=10,
)
print(stability.head())

fig, ax, stability_plot = model.plot_topic_stability(stability)
```
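
Rank stability under resampling can be sketched as follows. The example bootstraps texts, ranks topics by mean posterior in each draw, and reports how often each topic lands in the top `top_k`. The data are synthetic and the exact procedure inside `assess_topic_stability()` is assumed, not documented here.

```python
import numpy as np

rng = np.random.default_rng(42)
post = rng.dirichlet(alpha=[5.0, 3.0, 1.0, 1.0], size=80)  # 80 texts, 4 topics

n_draws, top_k = 200, 2
ranks = np.empty((n_draws, post.shape[1]), dtype=int)
for b in range(n_draws):
    idx = rng.integers(0, len(post), size=len(post))   # resample texts with replacement
    mean_scores = post[idx].mean(axis=0)
    order = np.argsort(-mean_scores)                   # topic indices, best first
    ranks[b, order] = np.arange(1, len(order) + 1)     # rank 1 = highest mean score

rank_mean = ranks.mean(axis=0)
rank_std = ranks.std(axis=0)
top_k_frequency = (ranks <= top_k).mean(axis=0)        # share of draws inside the top k
print(rank_mean, rank_std, top_k_frequency)
```

A topic with `top_k_frequency` near 1 and a small `rank_std` is a robust leader. A value near 0.5 signals that its position depends heavily on the sample.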

### Causal And Explanatory Diagnostics

#### `decompose_prediction_components(texts, country, year, ...)`

Use this method to inspect how the model’s total logit for each predicted CMP topic is assembled from different sources: text, macro conditions, country embedding, year embedding, and interaction terms.

Key arguments:

- `texts`, `country`, `year`: the texts to decompose.
- `top_k`: number of ranked CMP predictions to keep for each text.

Returns a long-format `DataFrame` with:

- `text`
- `country`
- `year`
- `rank`
- `cmp_code`
- `cmp_title`
- `probability`
- `logit_total`
- `logit_text`
- `logit_macro`
- `logit_country`
- `logit_year`
- `logit_interaction`

```python
components = model.decompose_prediction_components(
    texts=texts[:5],
    country=countries[:5],
    year=years[:5],
    top_k=3,
)
print(components.head())

fig, ax, component_plot = model.plot_prediction_components(components, text_index=0, rank=1)
```
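
One way to read the `logit_*` columns is as additive contributions that sum to `logit_total`, which a softmax then turns into `probability`. That additive reading is an assumption based on the column names, and the numbers below are made up.

```python
import numpy as np

# Hypothetical per-source logits for one text over three CMP topics.
parts = {
    "logit_text": np.array([2.0, 0.5, -1.0]),
    "logit_macro": np.array([0.3, -0.2, 0.1]),
    "logit_country": np.array([-0.1, 0.4, 0.0]),
    "logit_year": np.array([0.0, 0.1, -0.1]),
    "logit_interaction": np.array([0.2, 0.0, 0.0]),
}

# Assumed additive structure: the total logit is the sum of all sources.
logit_total = sum(parts.values())

# Numerically stable softmax over topics.
shifted = logit_total - logit_total.max()
probability = np.exp(shifted) / np.exp(shifted).sum()
print(logit_total, probability)
```

Under this reading, comparing `logit_text` with the context terms shows whether a prediction is driven mainly by wording or by country, year, and macro context.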

#### `simulate_counterfactuals(text, country, year, ...)`

Use this method to hold the text constant while changing political context. This is useful when you want to ask how the same sentence would be classified under another country or another year.

Key arguments:

- `text`: one text string.
- `country` and `year`: the observed context.
- `counterfactual_countries`: optional alternative countries.
- `counterfactual_years`: optional alternative years.
- `top_k`: number of topics to keep per scenario.

Returns a `DataFrame` with:

- `scenario`
- `country`
- `year`
- `rank`
- `cmp_code`
- `cmp_title`
- `probability`
- `delta_vs_observed`

```python
counterfactuals = model.simulate_counterfactuals(
    text="The government should expand industrial investment and modern infrastructure.",
    country="Japan",
    year=2018,
    counterfactual_countries=["Germany", "United States"],
    counterfactual_years=[2000, 2010],
    top_k=3,
)
print(counterfactuals.head(12))

fig, ax, counterfactual_plot = model.plot_counterfactuals(counterfactuals, rank=1)
```

#### `estimate_macro_effects(texts, country, year, ...)`

Use this method to estimate topic sensitivity to perturbations in the macro variables used by the model. The method shifts each macro feature up and down in standardized space and reports how topic probabilities respond on average.

Key arguments:

- `texts`, `country`, `year`: the corpus to evaluate.
- `topic_codes`: optional CMP codes to focus on. If omitted, the method uses the leading topics in the observed corpus.
- `step_size`: perturbation size in standardized macro space.

Returns a `DataFrame` with:

- `macro_variable`
- `cmp_code`
- `cmp_title`
- `baseline_probability`
- `delta_plus`
- `delta_minus`
- `symmetric_effect`

```python
macro_effects = model.estimate_macro_effects(
    texts=texts,
    country=countries,
    year=years,
    topic_codes=[401, 504, 106],
    step_size=0.5,
)
print(macro_effects.head())

fig, ax, macro_effect_plot = model.plot_macro_effects(macro_effects)
```
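
The perturbation logic can be illustrated with a toy softmax model. Everything here is hypothetical: the weight matrix `W`, the synthetic inputs, and the assumption that `symmetric_effect` is half the difference between the upward and downward responses.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
text_logits = rng.normal(size=(20, 3))   # stand-in text contribution: 20 texts, 3 topics
macro = rng.normal(size=(20, 2))         # two standardized macro features
W = np.array([[0.8, -0.4, -0.4],         # hypothetical macro-to-logit weights
              [0.0, 0.5, -0.5]])

def mean_probs(m):
    """Average topic probabilities over the corpus for macro inputs m."""
    return softmax(text_logits + m @ W).mean(axis=0)

step_size = 0.5
baseline = mean_probs(macro)
bump = np.zeros(2)
bump[0] = step_size                      # perturb the first macro feature only
delta_plus = mean_probs(macro + bump) - baseline
delta_minus = mean_probs(macro - bump) - baseline
symmetric_effect = (delta_plus - delta_minus) / 2
print(baseline, delta_plus, delta_minus, symmetric_effect)
```

Because probabilities sum to one, the effects across topics cancel out: a gain for one topic is a loss spread over the others.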

### Short-to-Long Core Analysis

If your research question is whether a long text centrally expresses the proposition contained in a short text, use the short-long core analysis API. The method segments the long text, compares the short text to each segment in representation space, aligns their CMP posterior distributions, and aggregates the evidence into a document-level core match score.

#### `analyze_short_long_core(short_text, long_text, short_country, short_year, long_country, long_year, ...)`

```python
core_result = model.analyze_short_long_core(
    short_text="The government should expand industrial investment.",
    long_text=long_document,  # long_document: the full long text as one string, supplied by the caller
    short_country="Japan",
    short_year=2018,
    long_country="Japan",
    long_year=2018,
    representation="cls_film",
    top_k_segments=5,
)

print(core_result["core_match_score"])
print(core_result["core_match_label"])
print(core_result["top_segments"][:3])
```

Returns:

- document-level summary scores such as `core_match_score`, `overall_topic_alignment`, `best_segment_score`, and `support_coverage`
- short-text and long-text top CMP categories
- top supporting segments
- full segment-level evidence when `return_segments=True`
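
The representation-space comparison can be pictured as cosine similarity between the short-text vector and each segment vector, followed by an aggregation over the best segments. The sketch covers only that similarity step. The vectors are toy values, `cosine_to_each` is a hypothetical helper, and the mean over the top segments is an assumed aggregation, not the package's scoring formula.

```python
import numpy as np

def cosine_to_each(query, segments):
    """Cosine similarity between one query vector and each row of segments."""
    q = query / np.linalg.norm(query)
    s = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    return s @ q

short_vec = np.array([1.0, 0.0, 1.0])
segment_vecs = np.array([
    [1.0, 0.0, 1.0],   # same direction as the short text
    [0.0, 1.0, 0.0],   # orthogonal to it
    [2.0, 0.1, 1.5],   # close, but not identical
])

sims = cosine_to_each(short_vec, segment_vecs)
top_k = 2
top_idx = np.argsort(-sims)[:top_k]          # indices of the best-matching segments
toy_score = float(sims[top_idx].mean())      # assumed aggregation: mean of the top k
print(sims, top_idx, toy_score)
```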

#### `analyze_short_long_core_batch(short_texts, long_texts, short_country, short_year, long_country, long_year, ...)`

```python
core_batch = model.analyze_short_long_core_batch(
    short_texts=short_texts,
    long_texts=long_texts,
    short_country=short_countries,
    short_year=short_years,
    long_country=long_countries,
    long_year=long_years,
    representation="cls_film",
    return_df=True,
)

print(core_batch.head())
```

This is the recommended interface for large comparative datasets containing many short-long pairs.
