Metadata-Version: 2.4
Name: wosecopy
Version: 0.1.2
Summary: WOrd SEntence COoccurrence - Extract and analyze concept associations in creative stories
Home-page: https://github.com/owen-saunders/woseco
Author: Owen Saunders, Edith Haim
Author-email: Owen Saunders <56914238+owen-saunders@users.noreply.github.com>
License: MIT
Keywords: nlp,natural language processing,network analysis,creativity,semantic analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy>=3.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: networkx>=2.6.0
Requires-Dist: matplotlib>=3.4.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: click>=8.0.0
Provides-Extra: dev
Requires-Dist: pytest>=6.2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.9.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# wosecopy: WOrd SEntence COoccurrence

[![PyPI version](https://badge.fury.io/py/wosecopy.svg)](https://badge.fury.io/py/wosecopy)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

wosecopy is a Python package for extracting and analyzing concept associations in creative stories using natural language processing and network analysis. It identifies semantic relationships between concepts across consecutive sentences to understand narrative flow and creativity.

## Features

- **Bilingual Support**: Process both English and German text
- **Concept Extraction**: Extract noun-based concepts from sentences using spaCy
- **Semantic Matching**: Link concepts across sentences using word-embedding similarity
- **Network Analysis**: Build and analyze concept networks with 6 key metrics:
  - Average Shortest Path Length (ASPL)
  - Mean Local Clustering Coefficient (MLCC)
  - Modularity (community structure)
  - Number of Connected Components
  - Average Component Size
  - Giant Connected Component Size
- **Creativity Analysis**: Measure "unexpectedness" of concepts relative to prompts
- **Visualization**: Generate network graphs and statistical plots
- **CLI Tools**: Complete command-line interface for all operations
- **Export Formats**: GraphML, GEXF, GML for network visualization tools

## Installation

### From PyPI (once published)

```bash
pip install wosecopy
```

### From Source

```bash
git clone https://github.com/owen-saunders/wosecopy.git
cd wosecopy
pip install -e .
```

### Download Language Models

After installation, download the required spaCy models:

```bash
# For English (required)
wosecopy download-model --language en --size lg

# For German (optional)
wosecopy download-model --language de --size lg
```

Or manually:

```bash
python -m spacy download en_core_web_lg
python -m spacy download de_core_news_lg
```

## Quick Start

### Command Line Usage

#### 1. Extract wosecopy Concepts from Stories

```bash
wosecopy extract stories.csv -o results.csv --language en --text-column Story
```

#### 2. Calculate Network Metrics

```bash
wosecopy metrics results.csv -o metrics.json
```

#### 3. Analyze by Creativity Ratings

```bash
wosecopy analyze-ratings results.csv --rating-column rating_h -o grouped.json
```

#### 4. Export Concept Graphs

```bash
wosecopy export results.csv -o graphs/ --format graphml
```

#### 5. Visualize a Concept Network

```bash
wosecopy visualize results.csv -o graph.png --index 0
```

#### 6. Compare Multiple Raters

```bash
wosecopy compare results.csv -r rating_h -r rating_j -r rating_k -o plots/
```

#### 7. Calculate Unexpectedness Scores

```bash
wosecopy unexpectedness results.csv --prompt-column prompt -o unexpected.csv
```

### Python API Usage

#### Basic Concept Extraction

```python
import pandas as pd
from wosecopy import wosecopyExtractor

# Initialize extractor
extractor = wosecopyExtractor(language='en', model_size='lg')

# Process a CSV file
df = extractor.process_csv(
    'stories.csv',
    text_column='Story',
    output_path='results.csv'
)

# Or process directly from DataFrame
df = pd.DataFrame({'Story': ['Your story here...']})
concepts = extractor.get_wosecopy(df)
```

#### Build and Analyze Graphs

```python
from wosecopy.graph import build_graph, export_graph
from wosecopy.metrics import calculate_all_metrics

# Build a concept graph
concepts = ['belief', 'faith', 'church', 'prayer', 'hope']
graph = build_graph(concepts, graph_type='chain')

# Calculate metrics
metrics = calculate_all_metrics(graph)
print(f"Clustering coefficient: {metrics['mlcc']:.3f}")
print(f"Modularity: {metrics['modularity']:.3f}")

# Export graph
export_graph(graph, 'my_graph.graphml', format='graphml')
```

#### Analyze Creativity by Ratings

```python
from wosecopy.analysis import group_by_rating, compare_raters
from wosecopy.metrics import stats

# Group concepts by rating
grouped = group_by_rating(df, rating_column='rating_h')

# Calculate metrics for each rating group
for rating, concept_lists in grouped.items():
    metrics = stats(concept_lists)
    print(f"Rating {rating}: ASPL = {sum(metrics['aspl'])/len(metrics['aspl']):.3f}")

# Compare across multiple raters
rater_metrics = compare_raters(
    df,
    rating_columns=['rating_h', 'rating_j', 'rating_k'],
    concepts_column='wosecopy_concepts'
)
```

#### Measure Unexpectedness

```python
from wosecopy.analysis import unexpectedness_arrays

# Calculate unexpectedness scores
scores = unexpectedness_arrays(
    df,
    prompt_column='prompt',
    concepts_column='wosecopy_concepts',
    language='en'
)

df['unexpectedness'] = scores
```
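
The CLI reference describes unexpectedness as semantic distance from the prompt. How `unexpectedness_arrays` computes it is not documented here, so the following is only a minimal sketch of one plausible reading: score each concept as one minus its highest cosine similarity to any prompt word. The `VECTORS` table, the `unexpectedness` helper, and splitting the prompt on hyphens are all assumptions made for illustration; the package itself would draw on spaCy vectors.

```python
import math

# Hypothetical 2-d embeddings (assumption); real vectors would come from spaCy.
VECTORS = {
    "belief": (1.0, 0.0),
    "faith": (1.0, 0.0),
    "payment": (0.0, 1.0),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def unexpectedness(prompt, concepts, vectors):
    """Score each known concept as 1 - max similarity to any known prompt word."""
    # Assumption: prompts look like "belief-faith-sing"; unknown words are skipped.
    prompt_words = [w for w in prompt.split("-") if w in vectors]
    scores = []
    for concept in concepts:
        if concept not in vectors or not prompt_words:
            scores.append(None)  # no vector available, so no score
            continue
        closest = max(
            cosine_similarity(vectors[concept], vectors[w]) for w in prompt_words
        )
        scores.append(1.0 - closest)
    return scores


scores = unexpectedness("belief-faith-sing", ["faith", "payment"], VECTORS)
print(scores)
```

With these toy vectors, `faith` coincides with a prompt word and scores 0.0, while `payment` is orthogonal to every known prompt word and scores 1.0.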

#### Visualization

```python
from wosecopy.metrics import stats
from wosecopy.visualization import plot_graph, plot_network_stats

# Visualize a concept network (`graph` is the result of build_graph above)
fig = plot_graph(graph, layout='spring', save_path='network.png')

# Plot network statistics (`concept_lists` is one rating group from group_by_rating above)
metrics = stats(concept_lists)
fig = plot_network_stats(
    metrics,
    title='Network Metrics by Rating',
    save_path='metrics.png'
)
```

## Data Format

### Input CSV Format

Your input CSV needs at least a text column. The example below also includes the optional prompt and rating columns:

```csv
Story,prompt,rating_h
"Once upon a time there was a belief. The faith was strong...","belief-faith-sing",4
"Another story about payment and gloom...","gloom-payment-exist",3
```

Columns:
- Required: Text column (default: `Story`) - Contains the story/text to analyze
- Optional: `prompt` - Prompt words used to generate the story
- Optional: Rating columns (e.g., `rating_h`, `rating_j`) - Creativity ratings

### Output Format

After extraction, the output CSV includes:

```csv
Story,prompt,rating_h,wosecopy_concepts
"Once upon a time...","belief-faith-sing",4,"['belief', 'faith', 'church', 'prayer']"
```

The `wosecopy_concepts` column contains the extracted concept chain, written out as a Python-style list literal.
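
When reading the output file back, the concept column arrives as a string rather than a list. The sketch below shows one way to recover the lists using only the standard library. It assumes the column is stored exactly as in the example row above; `read_concepts` and `SAMPLE` are hypothetical names used for illustration and are not part of the package.

```python
import ast
import csv
import io

# Stand-in for the contents of results.csv, copied from the example above.
SAMPLE = '''Story,prompt,rating_h,wosecopy_concepts
"Once upon a time...","belief-faith-sing",4,"['belief', 'faith', 'church', 'prayer']"
'''


def read_concepts(handle, column="wosecopy_concepts"):
    """Return one list of concept strings per CSV row."""
    chains = []
    for row in csv.DictReader(handle):
        # literal_eval parses the list literal without executing arbitrary code.
        chains.append(ast.literal_eval(row[column]))
    return chains


chains = read_concepts(io.StringIO(SAMPLE))
print(chains[0])
```

For a real file, pass `open('results.csv', newline='')` instead of the `io.StringIO` wrapper.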

## CLI Commands Reference

| Command | Description |
|---------|-------------|
| `extract` | Extract wosecopy concepts from text |
| `metrics` | Calculate network metrics |
| `analyze-ratings` | Group and analyze by ratings |
| `export` | Export concept graphs to files |
| `visualize` | Visualize a single concept network |
| `compare` | Compare metrics across raters |
| `unexpectedness` | Calculate semantic distance from prompts |
| `download-model` | Download spaCy language models |

Run `wosecopy --help` or `wosecopy <command> --help` for detailed options.

## How It Works

1. **Sentence Splitting**: Stories are split into sentences (by periods)
2. **Noun Extraction**: Nouns are extracted and lemmatized from each sentence
3. **Concept Matching**: Between consecutive sentences, the most semantically similar noun pair is identified using spaCy's word embeddings
4. **Chain Building**: Matched concepts form a chain representing narrative flow
5. **Graph Construction**: Concepts become nodes, matched pairs become edges
6. **Metric Calculation**: Network metrics quantify narrative structure and creativity
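
Steps 1 to 5 can be sketched in plain Python as follows. This is an illustration of the idea, not the package's implementation: `toy_nouns` and `toy_similarity` are hypothetical stand-ins for spaCy's noun extraction and embedding similarity, and dropping immediate repeats when flattening pairs into a chain is an assumption.

```python
from itertools import product

# Toy stand-ins (assumptions): the real package relies on spaCy for both.
NOUNS = {"belief", "faith", "church", "bell", "time"}
SIMILARITY = {
    frozenset(("belief", "faith")): 0.8,
    frozenset(("faith", "church")): 0.7,
}


def split_sentences(story):
    """Step 1: split on periods and drop empty fragments."""
    return [part.strip() for part in story.split(".") if part.strip()]


def toy_nouns(sentence):
    """Step 2 stand-in: keep words found in the NOUNS vocabulary."""
    return [word for word in sentence.lower().split() if word in NOUNS]


def toy_similarity(a, b):
    """Step 3 stand-in: table lookup instead of embedding similarity."""
    return 1.0 if a == b else SIMILARITY.get(frozenset((a, b)), 0.1)


def match_pairs(noun_lists, similarity):
    """Step 3: best-scoring noun pair for each pair of consecutive sentences."""
    pairs = []
    for prev, curr in zip(noun_lists, noun_lists[1:]):
        if prev and curr:  # skip sentence pairs where one side has no nouns
            pairs.append(max(product(prev, curr), key=lambda p: similarity(*p)))
    return pairs


def build_chain(pairs):
    """Step 4: flatten the pairs, dropping immediate repeats (an assumption)."""
    chain = []
    for pair in pairs:
        for concept in pair:
            if not chain or chain[-1] != concept:
                chain.append(concept)
    return chain


story = "Once upon a time there was a belief. The faith was strong. The church bell rang."
noun_lists = [toy_nouns(sentence) for sentence in split_sentences(story)]
edges = match_pairs(noun_lists, toy_similarity)  # Step 5: matched pairs are the edges
chain = build_chain(edges)
print(chain)
print(edges)
```

Passing the similarity function as an argument keeps the matching logic independent of where the scores come from, so a spaCy-based function could be swapped in without touching `match_pairs`.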

### Key Metrics

- **ASPL** (Average Shortest Path Length): The mean number of edges on the shortest path between pairs of concepts. Lower values suggest tighter narratives.
- **MLCC** (Mean Local Clustering Coefficient): The average, over all concepts, of how often a concept's neighbours are also linked to each other. Higher values indicate tightly grouped concept clusters.
- **Modularity**: Measures community structure. Higher values suggest distinct conceptual modules.
- **Connected Components**: Number of separate concept clusters.
- **Average Component Size**: Mean number of concepts per connected component.
- **GCC Size**: Size of the largest connected concept cluster.
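
To make these definitions concrete, here is a small standard-library sketch that computes most of them on a toy graph. It is not the package's code: the helper names are hypothetical, computing ASPL on the giant component only is an assumption, and modularity is left out because it needs a community-detection step. NetworkX, which the package depends on, provides `average_shortest_path_length`, `average_clustering`, and `number_connected_components` for the same quantities.

```python
from collections import deque


def adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for node in adj:
        if node not in seen:
            comp = set(bfs_distances(adj, node))
            seen |= comp
            comps.append(comp)
    return comps


def aspl(adj, nodes):
    """Average shortest path length over ordered node pairs in one component."""
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(sum(bfs_distances(adj, s).values()) for s in nodes)
    return total / (n * (n - 1))


def mlcc(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    scores = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            scores.append(0.0)
            continue
        # Count each linked pair of neighbours once.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        scores.append(2 * links / (k * (k - 1)))
    return sum(scores) / len(scores)


edges = [("belief", "faith"), ("faith", "church"), ("church", "belief"),
         ("payment", "gloom")]
adj = adjacency(edges)
comps = components(adj)
gcc = max(comps, key=len)  # giant connected component
summary = {
    "components": len(comps),
    "avg_component_size": sum(len(c) for c in comps) / len(comps),
    "gcc_size": len(gcc),
    "aspl_gcc": aspl(adj, gcc),
    "mlcc": mlcc(adj),
}
print(summary)
```

The toy graph has two components: a triangle of three concepts and a separate pair. Every node in the triangle has a clustering coefficient of 1, and both nodes in the pair have 0.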

## Use Cases

- **Creativity Research**: Analyze how creative stories differ in their conceptual structure
- **Narrative Analysis**: Study how concepts flow through narratives
- **Educational Assessment**: Evaluate student writing for conceptual connectivity
- **Content Analysis**: Compare semantic structure across different text types
- **Computational Linguistics**: Study semantic relationships in discourse

## Examples

See the `examples/` directory for Jupyter notebooks demonstrating:

- Basic concept extraction
- Rating-based analysis
- Multi-rater comparison
- Prompt group analysis
- Visualization techniques

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Citation

If you use wosecopy in your research, please cite:

```bibtex
@software{wosecopy,
  title = {wosecopy: WOrd SEntence COoccurrence Analysis},
  author = {Owen Saunders and Edith Haim},
  year = {2024},
  url = {https://github.com/owen-saunders/wosecopy}
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with [spaCy](https://spacy.io/) for NLP processing
- [NetworkX](https://networkx.org/) for graph analysis
- Inspired by research on creativity and semantic networks

## Support

- Documentation: [https://wosecopy.readthedocs.io](https://wosecopy.readthedocs.io)
- Issues: [https://github.com/owen-saunders/wosecopy/issues](https://github.com/owen-saunders/wosecopy/issues)

## Roadmap

- [ ] Add more language support (Spanish, French)
- [ ] Implement additional network metrics
- [ ] Add interactive visualization with Plotly
- [ ] Support for custom similarity functions
- [ ] Parallel processing for large datasets
- [ ] Web API for remote processing
