Metadata-Version: 2.4
Name: customer-question-extractor
Version: 0.1.0
Summary: Extract and deduplicate customer questions from text data
Project-URL: Homepage, https://github.com/dan-shah/question-extractor
Project-URL: Repository, https://github.com/dan-shah/question-extractor
Project-URL: Issues, https://github.com/dan-shah/question-extractor/issues
Author: Dan Shah
License-Expression: MIT
License-File: LICENSE
Keywords: call-center,clustering,deduplication,nlp,questions
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Requires-Dist: jinja2>=3.0
Requires-Dist: numpy>=1.24
Requires-Dist: openai>=1.0
Requires-Dist: pandas>=2.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: rapidfuzz>=3.0
Requires-Dist: rich>=13.0
Requires-Dist: scikit-learn>=1.0
Requires-Dist: streamlit>=1.12.0
Requires-Dist: tiktoken>=0.5
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Description-Content-Type: text/markdown

# Question Extractor

A Python package for extracting and deduplicating customer questions from text data (e.g., call center transcripts).

## Features

- **Text Summarization**: Remove fillers and extract key content from conversations
- **Question Extraction**: Identify distinct customer questions and concerns using GPT-4o
- **Two-Tier Deduplication**:
  - Fuzzy matching with RapidFuzz (default similarity threshold: 0.75)
  - Semantic clustering with OpenAI embeddings
- **Cost Estimation**: Get estimated API costs and processing time before running
- **Analysis Reports**: Auto-generated markdown and HTML reports with insights
- **Interactive Dashboard**: Streamlit app for exploring results
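
The first deduplication tier can be pictured with a minimal sketch like the one below. It is not the package's implementation: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for RapidFuzz so that it runs without extra dependencies, and the helper names `similarity` and `fuzzy_dedupe` are hypothetical. The 0.75 threshold mirrors the default listed above; the greedy grouping strategy is an assumption.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score for two strings, ignoring case."""
    # Stand-in for a RapidFuzz scorer; scores will differ slightly.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_dedupe(questions, threshold=0.75):
    """Greedily group questions under the first similar question seen."""
    groups = {}
    for question in questions:
        for representative in groups:
            if similarity(question, representative) >= threshold:
                groups[representative].append(question)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups[question] = [question]
    return groups


questions = [
    "How do I reset my password?",
    "How can I reset my password?",
    "Where is my refund?",
]
groups = fuzzy_dedupe(questions)
for representative, members in groups.items():
    print(f"{len(members)}x  {representative}")
```

In this sketch the two password questions collapse into one group while the refund question stays separate. Near-duplicates that differ in wording rather than spelling would be left for the semantic clustering tier.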

## Installation

```bash
pip install customer-question-extractor
```

Or install from source:

```bash
git clone https://github.com/dan-shah/question-extractor.git
cd question-extractor
pip install -e ".[dev]"
```

## Quick Start

### CLI Usage

```bash
# Set your OpenAI API key
export OPENAI_API_KEY=your-api-key

# Estimate cost and time
question-extractor estimate data.csv --text-col transcript --id-col call_id

# Run the full pipeline
question-extractor run data.csv --text-col transcript --id-col call_id --output-dir output/

# Launch the dashboard
question-extractor serve output/
```

### Python API

```python
from question_extractor import Config, Pipeline

# Configure
config = Config(
    openai_api_key="your-api-key",
    text_column="transcript",
    id_column="call_id",
    output_dir="output",
)

# Initialize and run
pipeline = Pipeline(config)

# Get cost estimate first
estimate = pipeline.estimate(input_path="data.csv")
print(estimate.format_summary())

# Run the pipeline
result = pipeline.run(input_path="data.csv")

# Access results
print(f"Processed {result.stats.input_rows} rows")
print(f"Extracted {result.stats.questions_extracted} questions")
print(f"Final unique questions: {result.stats.final_unique}")
```

## Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text_column` | `"text"` | Column containing text to process |
| `id_column` | `"row_id"` | Column with unique row identifier |
| `fuzzy_threshold` | `0.75` | RapidFuzz similarity threshold (0-1) |
| `semantic_distance_threshold` | `0.7` | Distance threshold for agglomerative clustering |
| `top_questions_percentile` | `0.75` | Fraction (0-1) of questions by volume passed to semantic clustering |

## Output Files

After running the pipeline, you'll find these files in the output directory:

- `clean.csv` - Processed data with `row_id`, `question`, and `semantic_cluster` columns
- `cleansing.md` - Summary statistics about the processing
- `analysis.md` / `analysis.html` - LLM-generated insights report
- `app.py` - Streamlit dashboard for interactive exploration
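
As a rough illustration of working with `clean.csv` outside the dashboard, the sketch below counts rows per `semantic_cluster` using only the standard library. The column names come from the list above; the helper `cluster_sizes` is hypothetical, and the sample rows are made up so that the snippet runs without a real pipeline output.

```python
import csv
import io
from collections import Counter


def cluster_sizes(csv_file):
    """Count how many rows fall into each semantic_cluster value."""
    reader = csv.DictReader(csv_file)
    return Counter(row["semantic_cluster"] for row in reader)


# Made-up sample standing in for output/clean.csv (illustrative values only).
sample = io.StringIO(
    "row_id,question,semantic_cluster\n"
    "1,How do I reset my password?,0\n"
    "2,Where is my refund?,1\n"
    "3,How can I change my password?,0\n"
)

sizes = cluster_sizes(sample)
for cluster, count in sizes.most_common():
    print(f"cluster {cluster}: {count} rows")
```

To run this against real output, replace the `io.StringIO` sample with `open("output/clean.csv", newline="")`, assuming the output directory used in the Quick Start.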

## Requirements

- Python 3.9+
- OpenAI API key
- Dependencies (installed automatically by pip): jinja2, numpy, openai, pandas, pydantic, pydantic-settings, rapidfuzz, rich, scikit-learn, streamlit, tiktoken, typer

## License

MIT License
