Metadata-Version: 2.4
Name: wosecopy
Version: 0.1.1
Summary: WOrd SEntence COoccurrence - Extract and analyze concept associations in creative stories
Home-page: https://github.com/owen-saunders/woseco
Author: Owen Saunders, Edith Haim
Author-email: Owen Saunders <56914238+owen-saunders@users.noreply.github.com>
License: MIT
Keywords: nlp,natural language processing,network analysis,creativity,semantic analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy>=3.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: networkx>=2.6.0
Requires-Dist: matplotlib>=3.4.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: click>=8.0.0
Provides-Extra: dev
Requires-Dist: pytest>=6.2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.9.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Wosecopypy: WOrd SEntence COoccurrence

[![PyPI version](https://badge.fury.io/py/Wosecopypy.svg)](https://badge.fury.io/py/Wosecopypy)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Wosecopypy is a Python package for extracting and analyzing concept associations in creative stories using natural language processing and network analysis. It identifies semantic relationships between concepts across consecutive sentences to understand narrative flow and creativity.

## Features

- **Bilingual Support**: Process both English and German text
- **Concept Extraction**: Extract noun-based concepts from sentences using spaCy
- **Semantic Matching**: Link concepts across sentences using word-embedding similarity
- **Network Analysis**: Build and analyze concept networks with 6 key metrics:
  - Average Shortest Path Length (ASPL)
  - Mean Local Clustering Coefficient (MLCC)
  - Modularity (community structure)
  - Number of Connected Components
  - Average Component Size
  - Giant Connected Component Size
- **Creativity Analysis**: Measure "unexpectedness" of concepts relative to prompts
- **Visualization**: Generate network graphs and statistical plots
- **CLI Tools**: Complete command-line interface for all operations
- **Export Formats**: GraphML, GEXF, GML for network visualization tools

## Installation

### From PyPI (once published)

```bash
pip install Wosecopypy
```

### From Source

```bash
git clone https://github.com/owen-saunders/Wosecopypy.git
cd Wosecopypy
pip install -e .
```

### Download Language Models

After installation, download the required spaCy models:

```bash
# For English (required)
Wosecopypy download-model --language en --size lg

# For German (optional)
Wosecopypy download-model --language de --size lg
```

Or manually:

```bash
python -m spacy download en_core_web_lg
python -m spacy download de_core_news_lg
```

## Quick Start

### Command Line Usage

#### 1. Extract Wosecopypy Concepts from Stories

```bash
Wosecopypy extract stories.csv -o results.csv --language en --text-column Story
```

#### 2. Calculate Network Metrics

```bash
Wosecopypy metrics results.csv -o metrics.json
```

#### 3. Analyze by Creativity Ratings

```bash
Wosecopypy analyze-ratings results.csv --rating-column rating_h -o grouped.json
```

#### 4. Export Concept Graphs

```bash
Wosecopypy export results.csv -o graphs/ --format graphml
```

#### 5. Visualize a Concept Network

```bash
Wosecopypy visualize results.csv -o graph.png --index 0
```

#### 6. Compare Multiple Raters

```bash
Wosecopypy compare results.csv -r rating_h -r rating_j -r rating_k -o plots/
```

#### 7. Calculate Unexpectedness Scores

```bash
Wosecopypy unexpectedness results.csv --prompt-column prompt -o unexpected.csv
```

### Python API Usage

#### Basic Concept Extraction

```python
from Wosecopypy import WosecopypyExtractor

# Initialize extractor
extractor = WosecopypyExtractor(language='en', model_size='lg')

# Process a CSV file
df = extractor.process_csv(
    'stories.csv',
    text_column='Story',
    output_path='results.csv'
)

# Or process directly from a DataFrame
import pandas as pd
df = pd.DataFrame({'Story': ['Your story here...']})
concepts = extractor.get_Wosecopypy(df)
```

#### Build and Analyze Graphs

```python
from Wosecopypy.graph import build_graph, export_graph
from Wosecopypy.metrics import calculate_all_metrics

# Build a concept graph
concepts = ['belief', 'faith', 'church', 'prayer', 'hope']
graph = build_graph(concepts, graph_type='chain')

# Calculate metrics
metrics = calculate_all_metrics(graph)
print(f"Clustering coefficient: {metrics['mlcc']:.3f}")
print(f"Modularity: {metrics['modularity']:.3f}")

# Export graph
export_graph(graph, 'my_graph.graphml', format='graphml')
```

#### Analyze Creativity by Ratings

```python
from Wosecopypy.analysis import group_by_rating, compare_raters
from Wosecopypy.metrics import stats

# Group concepts by rating
grouped = group_by_rating(df, rating_column='rating_h')

# Calculate metrics for each rating group
for rating, concept_lists in grouped.items():
    metrics = stats(concept_lists)
    print(f"Rating {rating}: ASPL = {sum(metrics['aspl'])/len(metrics['aspl']):.3f}")

# Compare across multiple raters
rater_metrics = compare_raters(
    df,
    rating_columns=['rating_h', 'rating_j', 'rating_k'],
    concepts_column='Wosecopypy_concepts'
)
```

#### Measure Unexpectedness

```python
from Wosecopypy.analysis import unexpectedness_arrays

# Calculate unexpectedness scores
scores = unexpectedness_arrays(
    df,
    prompt_column='prompt',
    concepts_column='Wosecopypy_concepts',
    language='en'
)

df['unexpectedness'] = scores
```

#### Visualization

```python
from Wosecopypy.metrics import stats
from Wosecopypy.visualization import plot_graph, plot_network_stats

# Visualize a concept network (`graph` as built in the graph example above)
fig = plot_graph(graph, layout='spring', save_path='network.png')

# Plot network statistics (`concept_lists` as produced by group_by_rating)
metrics = stats(concept_lists)
fig = plot_network_stats(
    metrics,
    title='Network Metrics by Rating',
    save_path='metrics.png'
)
```

## Data Format

### Input CSV Format

Your input CSV needs at least a text column; prompt and rating columns are optional:

```csv
Story,prompt,rating_h
"Once upon a time there was a belief. The faith was strong...","belief-faith-sing",4
"Another story about payment and gloom...","gloom-payment-exist",3
```

Columns:
- Required: a text column (default: `Story`) containing the story or text to analyze
- Optional: `prompt`, the prompt words used to generate the story
- Optional: rating columns (e.g., `rating_h`, `rating_j`) holding creativity ratings

### Output Format

After extraction, the output CSV includes:

```csv
Story,prompt,rating_h,Wosecopypy_concepts
"Once upon a time...","belief-faith-sing",4,"['belief', 'faith', 'church', 'prayer']"
```

The `Wosecopypy_concepts` column contains the extracted concept chain as a list.
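
When the CSV is read back, that list arrives as a string. The sketch below shows one way to turn it into a real Python list, assuming the column is stored as a Python list literal exactly as in the sample row above. It uses only the standard library; the `sample` text is illustrative data, not package output.

```python
import ast
import csv
import io

# Illustrative data mirroring the sample output row above.
sample = """Story,prompt,rating_h,Wosecopypy_concepts
"Once upon a time...","belief-faith-sing",4,"['belief', 'faith', 'church', 'prayer']"
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# ast.literal_eval safely parses the list literal without executing code.
concepts = [ast.literal_eval(row["Wosecopypy_concepts"]) for row in rows]

print(concepts[0])
```

With pandas, the same parsing can be applied by passing `converters={'Wosecopypy_concepts': ast.literal_eval}` to `pd.read_csv`.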

## CLI Commands Reference

| Command | Description |
|---------|-------------|
| `extract` | Extract Wosecopypy concepts from text |
| `metrics` | Calculate network metrics |
| `analyze-ratings` | Group and analyze by ratings |
| `export` | Export concept graphs to files |
| `visualize` | Visualize a single concept network |
| `compare` | Compare metrics across raters |
| `unexpectedness` | Calculate semantic distance from prompts |
| `download-model` | Download spaCy language models |

Run `Wosecopypy --help` or `Wosecopypy <command> --help` for detailed options.

## How It Works

1. **Sentence Splitting**: Stories are split into sentences (by periods)
2. **Noun Extraction**: Nouns are extracted and lemmatized from each sentence
3. **Concept Matching**: Between consecutive sentences, the most semantically similar noun pair is identified using spaCy's word embeddings
4. **Chain Building**: Matched concepts form a chain representing narrative flow
5. **Graph Construction**: Concepts become nodes, matched pairs become edges
6. **Metric Calculation**: Network metrics quantify narrative structure and creativity
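
Steps 3 and 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: the nouns per sentence are given by hand, `TOY_VECTORS` stands in for spaCy's word embeddings, and `best_pair` and `concept_chain` are hypothetical helper names. How the package handles ties, empty sentences, or repeated concepts may differ.

```python
import math

# Toy 2-d vectors standing in for real word embeddings (assumption).
TOY_VECTORS = {
    "belief": (1.0, 0.1),
    "faith": (0.9, 0.2),
    "bread": (0.1, 1.0),
    "church": (0.7, 0.6),
    "prayer": (0.8, 0.5),
}


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def best_pair(nouns_a, nouns_b):
    """Return the most similar (noun_a, noun_b) pair across two sentences."""
    return max(
        ((a, b) for a in nouns_a for b in nouns_b),
        key=lambda pair: cosine(TOY_VECTORS[pair[0]], TOY_VECTORS[pair[1]]),
    )


def concept_chain(sentence_nouns):
    """Chain the best-matching nouns of each consecutive sentence pair."""
    chain = []
    for current, following in zip(sentence_nouns, sentence_nouns[1:]):
        if not current or not following:
            continue  # nothing to match if a sentence has no nouns
        for concept in best_pair(current, following):
            # Skip a concept that merely repeats the end of the chain.
            if not chain or chain[-1] != concept:
                chain.append(concept)
    return chain


# Hand-written nouns for three consecutive sentences.
sentence_nouns = [["belief"], ["faith", "bread"], ["church", "prayer"]]
chain = concept_chain(sentence_nouns)
print(chain)
```

Here "faith" is matched with "prayer" rather than "church" because their toy vectors point in a closer direction; with real embeddings the winning pair depends on the model.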

### Key Metrics

- **ASPL** (Average Shortest Path Length): Measures how efficiently concepts connect. Lower values suggest tighter narratives.
- **MLCC** (Mean Local Clustering Coefficient): Measures concept clustering. Higher values indicate tightly grouped concept clusters.
- **Modularity**: Measures community structure. Higher values suggest distinct conceptual modules.
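- **Connected Components**: Number of separate concept clusters.
- **Average Component Size**: Mean number of concepts per connected component.
- **GCC Size** (Giant Connected Component): Number of concepts in the largest connected cluster.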
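
These metrics can be approximated directly with NetworkX. The sketch below is an illustration, not the package's `calculate_all_metrics`: `chain_graph` and `sketch_metrics` are hypothetical names, ASPL is computed on the largest component only, and greedy modularity maximization is an assumed choice of community detection that the package may not share.

```python
import networkx as nx
from networkx.algorithms import community


def chain_graph(concepts):
    """Link each concept to the next one in the chain (undirected)."""
    g = nx.Graph()
    g.add_nodes_from(concepts)
    g.add_edges_from(zip(concepts, concepts[1:]))
    return g


def sketch_metrics(g):
    """Compute the six metrics for a non-empty graph with at least one edge."""
    components = [g.subgraph(c) for c in nx.connected_components(g)]
    giant = max(components, key=len)
    sizes = [len(c) for c in components]
    # Assumed community detection: greedy modularity maximization.
    parts = community.greedy_modularity_communities(g)
    return {
        "aspl": nx.average_shortest_path_length(giant),
        "mlcc": nx.average_clustering(g),
        "modularity": community.modularity(g, parts),
        "n_components": len(components),
        "avg_component_size": sum(sizes) / len(sizes),
        "gcc_size": len(giant),
    }


graph = chain_graph(["belief", "faith", "church", "prayer", "hope"])
metrics = sketch_metrics(graph)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

A pure chain of distinct concepts has no triangles, so its clustering coefficient is zero; clustering only becomes non-zero when concepts recur and close loops in the graph.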

## Use Cases

- **Creativity Research**: Analyze how creative stories differ in their conceptual structure
- **Narrative Analysis**: Study how concepts flow through narratives
- **Educational Assessment**: Evaluate student writing for conceptual connectivity
- **Content Analysis**: Compare semantic structure across different text types
- **Computational Linguistics**: Study semantic relationships in discourse

## Examples

See the `examples/` directory for Jupyter notebooks demonstrating:

- Basic concept extraction
- Rating-based analysis
- Multi-rater comparison
- Prompt group analysis
- Visualization techniques

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Citation

If you use Wosecopypy in your research, please cite:

```bibtex
@software{Wosecopypy,
  title = {Wosecopypy: WOrd SEntence COoccurrence Analysis},
  author = {Owen Saunders and Edith Haim},
  year = {2024},
  url = {https://github.com/owen-saunders/Wosecopypy}
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with [spaCy](https://spacy.io/) for natural language processing
- Built with [NetworkX](https://networkx.org/) for graph analysis
- Inspired by research on creativity and semantic networks

## Support

- Documentation: [https://Wosecopypy.readthedocs.io](https://Wosecopypy.readthedocs.io)
- Issues: [https://github.com/owen-saunders/Wosecopypy/issues](https://github.com/owen-saunders/Wosecopypy/issues)

## Roadmap

- [ ] Support more languages (Spanish, French)
- [ ] Implement additional network metrics
- [ ] Add interactive visualization with Plotly
- [ ] Support for custom similarity functions
- [ ] Parallel processing for large datasets
- [ ] Web API for remote processing
