Metadata-Version: 2.4
Name: bundestag-protocol-extractor
Version: 0.1.12
Summary: Package to extract parliamentary protocols from the German Bundestag in structured form.
Author: Max Harrer
License: MIT
Project-URL: Homepage, https://github.com/maxboettinger/bundestag-protocol-extractor
Project-URL: Issues, https://github.com/maxboettinger/bundestag-protocol-extractor/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: tqdm>=4.64.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Dynamic: license-file

# Bundestag Protocol Extractor

[![PyPI version](https://img.shields.io/pypi/v/bundestag-protocol-extractor.svg)](https://pypi.org/project/bundestag-protocol-extractor/)
[![Python Versions](https://img.shields.io/pypi/pyversions/bundestag-protocol-extractor.svg)](https://pypi.org/project/bundestag-protocol-extractor/)
[![License](https://img.shields.io/pypi/l/bundestag-protocol-extractor.svg)](https://github.com/maxboettinger/bundestag-protocol-extractor/blob/main/LICENSE)
[![Tests](https://github.com/maxboettinger/bundestag-protocol-extractor/actions/workflows/test.yml/badge.svg)](https://github.com/maxboettinger/bundestag-protocol-extractor/actions/workflows/test.yml)

Extract and structure data from the German Bundestag's parliamentary protocols using the official DIP API.

## 🚀 Quick Start

### Installation

```bash
pip install bundestag-protocol-extractor
```

### Command Line Usage

```bash
# Extract protocols from the 20th legislative period, limit to 5
bpe --period 20 --limit 5 --output-dir ./data

# Export to both CSV and JSON format
bpe --period 20 --limit 5 --format both

# Use a specific API key (optional, package includes a public key)
bpe --api-key YOUR_API_KEY --period 20 --limit 5
```

## 🔍 Overview

This package allows researchers, journalists, and political analysts to access German parliamentary protocols (plenarprotokolle) in a structured format suitable for analysis. It extracts speeches, speaker metadata, topics, and related information from the Bundestag's official API.

## ✨ Features

- **Extract Protocols**: Access plenarprotokolle from all legislative periods
- **Structure Content**: Extract individual speeches with rich metadata
- **Speaker Metadata**: Get information about speakers (name, party, role)
- **Topic Analysis**: Access topic information and related proceedings
- **Multiple Export Formats**: Export to CSV and JSON for easy analysis
- **Automatic Rate Limiting**: Robust handling of API limits with exponential backoff
- **Progress Tracking**: Resume long-running extractions if interrupted
- **Flexible Configuration**: Fine-tune extraction parameters based on your needs

## 📋 Detailed Usage

### Command Line Interface

```bash
# Basic usage
bpe --period 20 --limit 5 --output-dir ./data

# List help and all available options
bpe --help

# Extract all protocols from the current legislative period
bpe --period 20 --output-dir ./data

# Use XML parsing for more detailed extraction (default)
bpe --period 20 --use-xml

# Disable XML parsing (faster but less detailed)
bpe --period 20 --no-xml
```

#### Control Output Format

```bash
# Export to CSV (default)
bpe --period 20 --format csv

# Export to JSON
bpe --period 20 --format json

# Export to both formats
bpe --period 20 --format both

# Exclude full speech text to reduce file size
bpe --period 20 --exclude-speech-text

# Include full protocol text (large files)
bpe --period 20 --include-full-protocols
```

#### Logging Options

```bash
# Enable verbose output (INFO to console, DEBUG to log file)
bpe --period 20 --verbose

# Enable full debug logging (DEBUG to both console and log file)
bpe --period 20 --debug

# Quiet mode (WARNING to console, INFO to log file)
bpe --period 20 --quiet

# Specify custom log file
bpe --period 20 --log-file /path/to/custom/log/file.log
```

#### Progress Tracking & Resumption

```bash
# List available progress files
bpe --list-progress

# Resume from a specific progress file
bpe --resume /path/to/progress_file.json

# Resume from a specific protocol
bpe --resume-from "20/123"

# Skip first N protocols
bpe --offset 25
```

### Python API

You can also use the package directly in your Python code:

```python
from bundestag_protocol_extractor import BundestagExtractor
import logging

# Initialize the extractor (uses default public API key)
extractor = BundestagExtractor()

# Or with your own API key
# extractor = BundestagExtractor(api_key="YOUR_API_KEY")

# Fetch protocols for a specific legislative period (20th Bundestag)
protocols = extractor.get_protocols(period=20, limit=5)

# Export to CSV (creates separate files for protocols, speeches, etc.)
exported_files = extractor.export_to_csv(
    protocols,
    output_dir="./data",
    include_speech_text=True
)

# Export to JSON (creates a single JSON file with all data)
json_path = extractor.export_to_json(protocols, output_dir="./data")
```

## 📊 Data Structure

The extracted data is organized in a relational format with multiple CSV files:

### Core Files

1. **protocols.csv**: Basic protocol metadata (date, title, etc.)
2. **speeches.csv**: Individual speeches with speaker references
3. **persons.csv**: Speaker information (name, party, role)
4. **proceedings.csv**: Related parliamentary proceedings
5. **speech_topics.csv**: Topics associated with each speech

### Detailed Files (XML-based)

6. **paragraphs.csv**: Individual paragraphs for detailed text analysis
7. **comments.csv**: Comments and interjections
8. **agenda_items.csv**: Agenda items for each session
9. **toc.csv**: Table of contents with document structure

## 🔑 API Key

The package includes a public API key with limited rate allowance. For extensive usage, register for your own API key from the Bundestag DIP API:

Visit: [Dokumentations- und Informationssystems für Parlamentsmaterialien (DIP) API Documentation](https://dip.bundestag.de/über-dip/hilfe/api)

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/maxboettinger/bundestag-protocol-extractor.git
   cd bundestag-protocol-extractor
   ```

2. Create a conda environment:
   ```bash
   conda env create -f environment.yml
   conda activate bundestag-protocol-extractor
   ```

3. Install the package in development mode:
   ```bash
   pip install -e ".[dev]"
   ```

4. Run tests:
   ```bash
   pytest
   ```

### Making a Release

The package includes a comprehensive release script that verifies package integrity:

```bash
python scripts/release.py [major|minor|patch]
```

The release process:
1. Runs all tests including import verification
2. Builds distribution packages
3. Verifies the built package in a virtual environment
4. Ensures critical modules like utils.logging are included
5. Uploads to PyPI (with confirmation)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
