Metadata-Version: 2.4
Name: semantic-comparer
Version: 0.1.1
Summary: Advanced text alignment and semantic containment analysis tool
Requires-Python: >=3.12
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: asyncio-mqtt>=0.16.0
Requires-Dist: numpy>=2.3.1
Requires-Dist: rich>=13.0.0
Requires-Dist: sentence-transformers>=5.0.0
Requires-Dist: typer>=0.9.0
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: isort>=5.12.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# Semantic Comparer

A command-line tool that aligns two texts paragraph by paragraph and measures how much of one text's meaning is contained in the other.

## Features

- **Semantic Alignment**: Uses the Smith-Waterman algorithm with sentence-transformer embeddings to align paragraphs by meaning
- **Modern CLI**: Built with Typer and Rich for a readable, user-friendly interface
- **Async Processing**: Non-blocking asynchronous I/O
- **File Support**: Accepts text directly on the command line or from files via an `@` prefix
- **Rich Output**: Colorized, formatted results with detailed statistics
- **Type Safety**: Full type annotations and modern Python practices

## Installation

```bash
# Install runtime dependencies (requires Python 3.12 or newer)
uv add aiofiles asyncio-mqtt numpy rich sentence-transformers typer

# Install the package
uv pip install -e .

# Or run directly as a module
python -m semantic_comparer compare "text1" "text2"
```

## Usage

### Basic Comparison

```bash
# Compare two texts directly
python -m semantic_comparer compare "This is the first text." "This is the second text."

# Compare with custom parameters
python -m semantic_comparer compare \
  "First text content" \
  "Second text content" \
  --model all-MiniLM-L6-v2 \
  --gap-penalty 0.3 \
  --similarity-threshold 0.5
```

### File-based Comparison

```bash
# Compare text files (prefix with @)
python -m semantic_comparer compare @file1.txt @file2.txt

# Mix direct text and file
python -m semantic_comparer compare "Direct text" @file.txt
```
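
The `@` prefix convention above can be sketched in a few lines. This is a minimal illustration, not the package's actual code: the helper name `resolve_text_argument` and the `max_bytes` limit are hypothetical, chosen only to show how an argument might be treated as either literal text or a file path.

```python
from pathlib import Path


def resolve_text_argument(value: str, max_bytes: int = 1_000_000) -> str:
    """Return the text for one CLI argument.

    Hypothetical helper: a value starting with "@" is read from disk,
    anything else is used as-is. The size limit is an assumed default.
    """
    if not value.startswith("@"):
        return value  # direct text input

    path = Path(value[1:])
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if path.stat().st_size > max_bytes:
        raise ValueError(f"{path} exceeds the {max_bytes}-byte limit")
    return path.read_text(encoding="utf-8")
```

With a helper like this, `"Direct text"` passes through unchanged while `"@file.txt"` is replaced by the file's contents, so both arguments can be handled the same way downstream.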

### Output Options

```bash
# Quiet mode (summary only)
python -m semantic_comparer compare text1 text2 --quiet

# Save detailed results to JSON
python -m semantic_comparer compare text1 text2 --output results.json
```

## Command Line Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--model` | `-m` | Sentence transformer model | `all-MiniLM-L6-v2` |
| `--gap-penalty` | `-g` | Penalty for gaps (0.0-1.0) | `0.3` |
| `--similarity-threshold` | `-t` | Minimum similarity for matches (0.0-1.0) | `0.5` |
| `--output` | `-o` | Output file for JSON results | None |
| `--quiet` | `-q` | Suppress detailed output | False |

## Understanding the Results

### Alignment Types

- **✓ Match**: Paragraphs that are semantically similar
- **⚠ Only in A/B**: Paragraphs present in one text but not the other
- **✗ Unaligned**: Paragraphs that couldn't be matched

### Containment Score

The semantic containment score measures how much of text A's semantic content is found in text B:

- **0.0-0.4**: Low similarity (Red)
- **0.4-0.7**: Moderate similarity (Yellow)
- **0.7-1.0**: High similarity (Green)
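
As a rough illustration of how a score maps to these bands, here is a small sketch. Two things are assumptions rather than documented behavior: the score is computed as the plain sum of matched similarities divided by the number of paragraphs in text A (the real weighting may differ), and the boundary values 0.4 and 0.7 are assigned to the higher band. Both function names are hypothetical.

```python
def containment_score(matched_similarities: list[float], paragraphs_in_a: int) -> float:
    """Assumed formula: share of A's paragraphs covered, weighted by similarity."""
    if paragraphs_in_a <= 0:
        return 0.0
    # Clamp to 1.0 so rounding noise cannot push the score out of range.
    return min(1.0, sum(matched_similarities) / paragraphs_in_a)


def score_band(score: float) -> tuple[str, str]:
    """Map a score in [0.0, 1.0] to the (label, color) bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score < 0.4:
        return ("Low similarity", "red")
    if score < 0.7:  # assumption: 0.4 itself counts as moderate
        return ("Moderate similarity", "yellow")
    return ("High similarity", "green")


# Two of A's four paragraphs matched, with similarities 0.9 and 0.8.
score = containment_score([0.9, 0.8], paragraphs_in_a=4)
label, color = score_band(score)
```

In this example only half of text A is matched, so the score lands in the moderate band even though the individual matches are strong.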

## Advanced Usage

### Custom Models

```bash
# Use a different sentence transformer model
python -m semantic_comparer compare text1 text2 --model all-mpnet-base-v2
```

### Fine-tuning Parameters

```bash
# Stricter matching (higher threshold)
python -m semantic_comparer compare text1 text2 --similarity-threshold 0.8

# More lenient gap handling (lower penalty)
python -m semantic_comparer compare text1 text2 --gap-penalty 0.1
```

## Development

### Project Structure

```
semantic_comparer/
├── __init__.py          # Package initialization
├── core.py              # Core alignment logic
├── cli.py               # Command-line interface
└── utils.py             # Utility functions
```

### Running Tests

```bash
# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .
isort .

# Type checking
mypy .

# Linting
ruff check .
```

## Technical Details

### Algorithm

The tool uses the Smith-Waterman algorithm adapted for semantic similarity:

1. **Text Segmentation**: Split texts into paragraphs
2. **Embedding Generation**: Convert paragraphs to semantic vectors
3. **Similarity Calculation**: Compute cosine similarity between vectors
4. **Dynamic Programming**: Apply Smith-Waterman for optimal alignment
5. **Score Calculation**: Weighted containment score based on matches
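
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code in `core.py`: embeddings are passed in as plain lists of floats (in the real tool they would come from a sentence-transformer model), the similarity threshold is subtracted from each cosine similarity so that below-threshold pairs score negatively, and the function names are hypothetical.

```python
import math
import re
from typing import Sequence

Vector = Sequence[float]


def split_paragraphs(text: str) -> list[str]:
    """Step 1: split on blank lines and drop empty segments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def cosine(u: Vector, v: Vector) -> float:
    """Step 3: cosine similarity, 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smith_waterman(
    emb_a: Sequence[Vector],
    emb_b: Sequence[Vector],
    gap_penalty: float = 0.3,
    similarity_threshold: float = 0.5,
) -> tuple[list[tuple[int, int]], float]:
    """Step 4: local alignment; returns matched index pairs and the best score."""
    n, m = len(emb_a), len(emb_b)
    sim = [[cosine(a, b) for b in emb_b] for a in emb_a]
    # Assumption: shift by the threshold so weak pairs count as mismatches.
    gain = [[s - similarity_threshold for s in row] for row in sim]

    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(
                0.0,                                   # restart the local alignment
                h[i - 1][j - 1] + gain[i - 1][j - 1],  # align paragraph i with j
                h[i - 1][j] - gap_penalty,             # paragraph only in A
                h[i][j - 1] - gap_penalty,             # paragraph only in B
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs: list[tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if math.isclose(h[i][j], h[i - 1][j - 1] + gain[i - 1][j - 1]):
            if gain[i - 1][j - 1] > 0:  # record only above-threshold matches
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(h[i][j], h[i - 1][j] - gap_penalty):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, best


# Toy one-hot "embeddings": A has four paragraphs, B is missing the second one.
emb_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
emb_b = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
pairs, score = smith_waterman(emb_a, emb_b)
print(pairs)  # [(0, 0), (2, 1), (3, 2)]
```

Because the gap penalty (0.3) is smaller than the gain from a perfect match (1.0 minus the 0.5 threshold), the alignment bridges the paragraph that exists only in A instead of restarting. Raising `gap_penalty` or `similarity_threshold` makes the alignment break into shorter local matches.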

### Performance

- **Async Processing**: Non-blocking I/O operations
- **Memory Efficient**: Streaming file processing for large texts
- **Progress Tracking**: Real-time progress indicators
- **Error Handling**: Robust error handling with user-friendly messages

### Security

- **Input Validation**: Comprehensive parameter validation
- **File Safety**: Secure file operations with size limits
- **Text Sanitization**: Removal of problematic characters
- **Error Isolation**: Graceful error handling without data exposure
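
To make the first and third points concrete, here is a minimal sketch of what such checks could look like. It is an illustration under assumptions, not the package's implementation: the function names are hypothetical, the range check mirrors the 0.0-1.0 bounds documented for `--gap-penalty` and `--similarity-threshold`, and "problematic characters" is interpreted here as Unicode control characters other than newline and tab.

```python
import unicodedata


def validate_unit_interval(name: str, value: float) -> float:
    """Reject values outside the documented 0.0-1.0 range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return value


def sanitize_text(text: str) -> str:
    """Drop control characters (category Cc) but keep newlines and tabs.

    Assumption: this is one plausible reading of "problematic characters".
    """
    return "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


gap_penalty = validate_unit_interval("--gap-penalty", 0.3)
clean = sanitize_text("first\x00 paragraph\n\nsecond\x07 paragraph")
```

Keeping newlines matters here because paragraph segmentation depends on blank lines; stripping them would collapse the whole text into a single paragraph.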

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes with proper type annotations
4. Add tests for new functionality
5. Ensure code passes linting and type checking
6. Submit a pull request

## License

MIT License - see LICENSE file for details.
