Metadata-Version: 2.4
Name: extracta
Version: 0.2.1
Summary: Modular content analysis platform for research, assessment, and academic integrity checking
Author-email: Michael Borck <michael@borck.dev>
License: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: pydantic>=2.5.0
Provides-Extra: all
Requires-Dist: ast>=3.8; extra == 'all'
Requires-Dist: fastapi>=0.104.0; extra == 'all'
Requires-Dist: faster-whisper>=1.0.0; extra == 'all'
Requires-Dist: ffmpeg-python>=0.2.0; extra == 'all'
Requires-Dist: google-generativeai>=0.3.0; extra == 'all'
Requires-Dist: librosa>=0.10.0; extra == 'all'
Requires-Dist: nltk>=3.8.0; extra == 'all'
Requires-Dist: opencv-python>=4.8.0; extra == 'all'
Requires-Dist: pdfplumber>=0.10.0; extra == 'all'
Requires-Dist: pillow>=10.0.0; extra == 'all'
Requires-Dist: pydantic>=2.5.0; extra == 'all'
Requires-Dist: pymupdf>=1.23.0; extra == 'all'
Requires-Dist: pypdf2>=3.0.0; extra == 'all'
Requires-Dist: pytesseract>=0.3.0; extra == 'all'
Requires-Dist: python-docx>=1.1.0; extra == 'all'
Requires-Dist: python-pptx>=0.6.0; extra == 'all'
Requires-Dist: radon>=6.0.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Requires-Dist: spacy>=3.7.0; extra == 'all'
Requires-Dist: textstat>=0.7.0; extra == 'all'
Requires-Dist: torch>=2.0.0; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.24.0; extra == 'all'
Provides-Extra: api
Requires-Dist: fastapi>=0.104.0; extra == 'api'
Requires-Dist: uvicorn[standard]>=0.24.0; extra == 'api'
Provides-Extra: audio
Requires-Dist: faster-whisper>=1.0.0; extra == 'audio'
Requires-Dist: librosa>=0.10.0; extra == 'audio'
Provides-Extra: code
Requires-Dist: ast>=3.8; extra == 'code'
Requires-Dist: radon>=6.0.0; extra == 'code'
Requires-Dist: ruff>=0.1.0; extra == 'code'
Provides-Extra: conversation
Requires-Dist: google-generativeai>=0.3.0; extra == 'conversation'
Provides-Extra: dev
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Provides-Extra: documents
Requires-Dist: pdfplumber>=0.10.0; extra == 'documents'
Requires-Dist: pypdf2>=3.0.0; extra == 'documents'
Requires-Dist: python-docx>=1.1.0; extra == 'documents'
Provides-Extra: enhanced
Requires-Dist: nltk>=3.8.0; extra == 'enhanced'
Requires-Dist: pdfplumber>=0.10.0; extra == 'enhanced'
Requires-Dist: pymupdf>=1.23.0; extra == 'enhanced'
Requires-Dist: pypdf2>=3.0.0; extra == 'enhanced'
Requires-Dist: python-docx>=1.1.0; extra == 'enhanced'
Requires-Dist: python-pptx>=0.6.0; extra == 'enhanced'
Requires-Dist: spacy>=3.7.0; extra == 'enhanced'
Requires-Dist: textstat>=0.7.0; extra == 'enhanced'
Provides-Extra: image
Requires-Dist: pillow>=10.0.0; extra == 'image'
Requires-Dist: pytesseract>=0.3.0; extra == 'image'
Requires-Dist: torch>=2.0.0; extra == 'image'
Provides-Extra: presentations
Requires-Dist: pymupdf>=1.23.0; extra == 'presentations'
Requires-Dist: python-pptx>=0.6.0; extra == 'presentations'
Provides-Extra: text
Requires-Dist: nltk>=3.8.0; extra == 'text'
Requires-Dist: spacy>=3.7.0; extra == 'text'
Requires-Dist: textstat>=0.7.0; extra == 'text'
Provides-Extra: video
Requires-Dist: ffmpeg-python>=0.2.0; extra == 'video'
Requires-Dist: opencv-python>=4.8.0; extra == 'video'
Requires-Dist: pydantic>=2.5.0; extra == 'video'
Description-Content-Type: text/markdown

# Extracta

**Modular Content Analysis Platform** for research, assessment, and academic integrity checking.

Extracta provides a unified interface for extracting and analyzing content from diverse media types including documents, images, repositories, and web content. It supports both research-focused deep analysis and assessment-oriented quality evaluation, with specialized tools for academic integrity validation.

## ✨ Key Features

- **🧩 Modular Architecture**: Pluggable lenses and analyzers for different content types
- **📚 Academic Integrity**: Citation-reference validation, bibliography checking, URL verification, AI conversation analysis
- **🤖 AI Conversation Analysis**: Cognitive intent classification for AI-assisted learning assessment
- **🔍 Multiple Analysis Modes**: Research and assessment workflows
- **📄 Rich Content Support**: Text, images, documents, repositories, presentations, spreadsheets, AI conversations
- **🎯 Rubric-Based Assessment**: Custom rubrics for structured evaluation
- **🧠 Intelligent Analysis**: Pattern detection, quality scoring, integrity validation, learning pattern recognition
- **💻 Multiple Interfaces**: CLI, Python API, and Web API
- **🔧 Modern Python**: Built with uv, ruff, mypy, and pytest

## Installation

### From PyPI

```bash
pip install extracta
```

### From Source

```bash
git clone https://github.com/michaelborck-education/extracta.git
cd extracta
pip install -e .
```

### Optional Dependencies

Install with specific feature support:

```bash
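pip install "extracta[audio]"          # Audio processing (faster-whisper, librosa)
pip install "extracta[video]"          # Video processing (ffmpeg-python, OpenCV)
pip install "extracta[text]"           # Enhanced text analysis (spaCy, NLTK, textstat)
pip install "extracta[image]"          # Image analysis with OCR (Pillow, pytesseract)
pip install "extracta[code]"           # Code analysis (radon, ruff)
pip install "extracta[documents]"      # PDF and Word documents (pdfplumber, PyPDF2, python-docx)
pip install "extracta[presentations]"  # Presentations (PyMuPDF, python-pptx)
pip install "extracta[conversation]"   # AI conversation analysis (google-generativeai)
pip install "extracta[api]"            # Web API server (FastAPI, Uvicorn)
pip install "extracta[all]"            # All features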
```
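
To see which optional features are usable in the current environment, a small check along these lines may help. It is a sketch, not part of Extracta: the `EXTRA_MODULES` mapping and the `missing_modules` helper are hypothetical, and the import names are assumptions derived from the package names listed for each extra (for example, `opencv-python` is imported as `cv2`).

```python
from importlib.util import find_spec

# Assumed import names for some of the extras; adjust to match your install.
EXTRA_MODULES = {
    "audio": ["faster_whisper", "librosa"],
    "video": ["ffmpeg", "cv2"],
    "text": ["nltk", "spacy", "textstat"],
    "image": ["PIL", "pytesseract", "torch"],
    "api": ["fastapi", "uvicorn"],
}


def missing_modules(modules):
    """Return the top-level module names in `modules` that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    # Print one status line per extra.
    for extra, modules in EXTRA_MODULES.items():
        missing = missing_modules(modules)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{extra:8} {status}")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as `torch`.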

## Usage

### Command Line

#### Basic Content Analysis
```bash
# Analyze document for research insights
extracta analyze research_paper.pdf --mode research --output analysis.json

# Assess student submission quality
extracta analyze essay.docx --mode assessment --output feedback.json

# Analyze repository structure and content
extracta analyze https://github.com/user/repo --mode assessment
```

#### Academic Integrity Checking
```bash
# Comprehensive citation and reference validation
extracta citation analyze student_paper.pdf --output integrity_check.json

# AI conversation cognitive intent analysis
extracta citation conversation chatgpt_export.json --output learning_analysis.json

# Results include:
# - Citation-reference relationship validation
# - Bibliography padding detection
# - URL accessibility and domain reputation
# - AI conversation learning pattern analysis
# - Academic integrity scoring
```

### Python API

#### Basic Content Analysis
```python
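from extracta import TextAnalyzer

# Sample input; replace with the text you want to analyze
text_content = "Extracta extracts and analyzes content from many media types."

analyzer = TextAnalyzer()
result = analyzer.analyze(text_content, mode="research")
print(result)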
```

#### Academic Integrity Analysis
```python
from extracta.analyzers import CitationAnalyzer, ReferenceAnalyzer, URLAnalyzer, ConversationAnalyzer

# Citation-reference validation
citation_analyzer = CitationAnalyzer()
citation_result = citation_analyzer.analyze(document_text)

# Bibliography quality assessment
reference_analyzer = ReferenceAnalyzer()
reference_result = reference_analyzer.analyze(document_text)

# URL validation and reputation checking
url_analyzer = URLAnalyzer()
url_result = url_analyzer.analyze(document_text)

# AI conversation cognitive intent analysis
conversation_analyzer = ConversationAnalyzer()
conversation_result = conversation_analyzer.analyze(conversation_json_data)

# Read the headline scores from each result
integrity_score = citation_result['citation_analysis']['academic_integrity_score']
learning_quality = conversation_result['conversation_analysis']['learning_assessment']['learning_quality_score']
print(f"Academic Integrity Score: {integrity_score}/100")
print(f"AI Learning Quality Score: {learning_quality}/100")
```

### Grading and Assessment

```python
from extracta.grading.rubric_manager import RubricRepository, get_default_rubric
from extracta.grading.feedback_generator import FeedbackGenerator

# Load or create a rubric
repo = RubricRepository("rubrics")
rubric = get_default_rubric("academic")  # or repo.load("my-rubric")

# Generate feedback based on analysis results
generator = FeedbackGenerator()
feedback = generator.generate_feedback(
    rubric=rubric,
    analysis_data=analysis_result,
    audience="student",
    detail="detailed"
)
```

## 🎓 Academic Integrity Features

Extracta provides comprehensive tools for detecting academic integrity issues and validating scholarly work:

### Citation Analysis
- **Citation-Reference Validation**: Cross-checks in-text citations against the reference list
- **Bibliography Padding Detection**: Identifies references without citations
- **Citation Stuffing Detection**: Flags excessive citations in single sentences
- **Style Recognition**: Supports APA, MLA, Chicago, Harvard, and Numeric styles
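
The checks listed above can be illustrated with a minimal sketch. This is not Extracta's implementation: the regular expressions handle only simple author-year citations such as `(Smith, 2020)`, and the names `cross_check` and `stuffed_sentences`, along with the default `limit`, are hypothetical.

```python
import re

# Matches simple author-year citations such as "(Smith, 2020)".
# Real citation styles vary far more; this pattern is illustrative only.
CITATION_RE = re.compile(r"\(([A-Z][A-Za-z\-]+),\s*(\d{4})\)")
# Matches reference entries that start like "Smith, J. (2020)".
REFERENCE_RE = re.compile(r"^([A-Z][A-Za-z\-]+),.*?\((\d{4})\)", re.MULTILINE)


def cross_check(body, bibliography):
    """Compare (author, year) pairs cited in the body with those listed."""
    cited = set(CITATION_RE.findall(body))
    listed = set(REFERENCE_RE.findall(bibliography))
    return {
        # Listed but never cited: a possible sign of bibliography padding.
        "uncited_references": sorted(listed - cited),
        # Cited but not listed: the reference entry is missing.
        "missing_references": sorted(cited - listed),
    }


def stuffed_sentences(body, limit=3):
    """Return sentences that contain more than `limit` citations."""
    sentences = re.split(r"(?<=[.!?])\s+", body)
    return [s for s in sentences if len(CITATION_RE.findall(s)) > limit]


body = "Claim one (Smith, 2020). Claim two (Jones, 2019)."
bibliography = "Smith, J. (2020). First example title.\nLee, K. (2018). Second example title."
report = cross_check(body, bibliography)
```

Here `report` flags the Lee entry as uncited and the Jones citation as lacking a reference entry. Comparing sets of `(author, year)` pairs keeps both directions of the check symmetric.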

### Reference Validation
- **DOI Verification**: Validates Digital Object Identifiers with CrossRef API
- **URL Accessibility**: Checks if referenced URLs are accessible (404 detection)
- **Domain Reputation**: Analyzes source credibility (academic vs. commercial domains)
- **Format Validation**: Ensures proper reference formatting and completeness
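
As a rough illustration of the first and third checks, the sketch below validates the shape of a DOI, builds the corresponding CrossRef REST lookup URL, and labels a URL by its host suffix. It makes no network requests. The function names and the `ACADEMIC_SUFFIXES` list are hypothetical, and Extracta's own rules may differ.

```python
import re
from urllib.parse import quote, urlparse

# Loose pattern for modern DOIs: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

# Hypothetical suffix list for treating a host as academic or governmental.
ACADEMIC_SUFFIXES = (".edu", ".gov", ".ac.uk", ".edu.au")


def crossref_lookup_url(doi):
    """Return the CrossRef REST URL for a DOI, or None if it looks malformed."""
    doi = doi.strip()
    if not DOI_RE.match(doi):
        return None
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def classify_domain(url):
    """Label a URL as 'academic', 'other', or 'invalid' from its host name."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return "invalid"
    if host.endswith(ACADEMIC_SUFFIXES):
        return "academic"
    return "other"
```

Fetching the returned CrossRef URL with an HTTP client such as `httpx` would confirm whether the DOI is registered; the pattern match alone only rules out obviously malformed identifiers.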

### AI Conversation Analysis
- **Cognitive Intent Classification**: Uses Gemini LLM to classify user prompts as Delegation vs. Scaffolding
- **Learning Pattern Recognition**: Analyzes conversation flow for active learning behaviors
- **Session Quality Scoring**: Provides learning quality assessment (0-100)
- **Platform Support**: ChatGPT, Claude, Bard, and generic conversation formats

### Repository Analysis
- **WordPress Detection**: Identifies WordPress projects and analyzes themes/plugins
- **Code Quality Assessment**: Evaluates repository structure and practices
- **File Type Analysis**: Comprehensive analysis of all repository contents

### Integrity Scoring
- **Academic Integrity Score**: 0-100 scale based on multiple validation criteria
- **Detailed Reporting**: Specific issues and recommendations
- **Pattern Detection**: Identifies suspicious citation and reference patterns
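
One simple way to fold several checks into a single 0-100 value is to subtract a weighted penalty per detected issue. The sketch below shows that idea only. The README does not describe Extracta's actual criteria or weights, so the issue names and `PENALTIES` values here are hypothetical.

```python
# Hypothetical penalty weights per issue type (points deducted per occurrence).
PENALTIES = {
    "uncited_reference": 5,
    "missing_reference": 10,
    "broken_url": 5,
    "invalid_doi": 10,
}


def integrity_score(issue_counts):
    """Map issue counts to a 0-100 score; unknown issue types raise KeyError."""
    penalty = sum(PENALTIES[kind] * count for kind, count in issue_counts.items())
    # Clamp at zero so heavily flagged documents do not go negative.
    return max(0, 100 - penalty)


score = integrity_score({"uncited_reference": 2, "broken_url": 1})
```

With these weights, two uncited references and one broken URL deduct 15 points, giving a score of 85.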

## Development

### Setup

```bash
# Clone repository
git clone https://github.com/michaelborck-education/extracta.git
cd extracta

# Create virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -e ".[dev]"
```

### Testing

```bash
# Run tests
pytest

# With coverage
pytest --cov=extracta
```

### Linting and Type Checking

```bash
# Lint with ruff
ruff check .

# Type check with mypy
mypy extracta

# Format code
ruff format .
```

### Building and Publishing

```bash
# Build package
uv build

# Publish to PyPI
uv venv  # skip if .venv already exists
source .venv/bin/activate
uv pip install twine
twine upload dist/* --repository pypi
```

## Project Structure

```
extracta/
├── extracta/
│   ├── lenses/              # Content extraction modules
│   │   ├── audio_lens/      # Audio file processing
│   │   ├── video_lens/      # Video file processing
│   │   ├── image_lens/      # Image processing with OCR
│   │   ├── document_lens/   # Text & Office document processing
│   │   ├── presentation_lens/ # Presentation file analysis
│   │   ├── repo_lens/       # Repository-level analysis
│   │   └── base_lens.py     # Common lens interface
│   ├── analyzers/           # Content analysis modules
│   │   ├── text_analyzer/   # Text quality and readability
│   │   ├── image_analyzer/  # Image quality assessment
│   │   ├── citation_analyzer/ # Citation-reference validation
│   │   ├── reference_analyzer/ # Bibliography quality assessment
│   │   ├── url_analyzer/    # URL validation and reputation
│   │   └── base_analyzer.py # Common analyzer interface
│   ├── grading/             # Assessment and grading
│   │   ├── rubric_manager/  # Rubric creation and management
│   │   └── feedback_generator.py # AI-powered feedback
│   ├── orchestration/       # Workflow management
│   ├── shared/              # Common utilities
│   └── cli/                 # Command-line interface
├── tests/                   # Test suite
├── docs/                    # Documentation
├── examples/                # Usage examples
├── pyproject.toml           # Package configuration
└── README.md                # This file
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Run the test suite
6. Submit a pull request

## License

MIT License - see [LICENSE](LICENSE) file for details.

## 🚀 Current Status & Roadmap

### ✅ Implemented Features
- [x] **Text Analysis**: Readability, sentiment, vocabulary, quality metrics
- [x] **Image Analysis**: OCR, quality assessment, accessibility
- [x] **Document Processing**: PDF, DOCX, Office docs (PPTX, Excel, CSV)
- [x] **Citation Validation**: Citation-reference relationships, academic integrity
- [x] **Reference Analysis**: Bibliography quality, DOI validation, CrossRef integration
- [x] **URL Validation**: Accessibility checking, domain reputation, robots.txt
- [x] **AI Conversation Analysis**: Cognitive intent classification, learning pattern recognition
- [x] **Repository Analysis**: GitHub repo analysis, WordPress detection
- [x] **Rubric System**: Custom rubrics, structured assessment
- [x] **CLI Interface**: Multiple commands for different analysis types
- [x] **Web API**: REST API for integration
- [x] **Python API**: Programmatic access

### 🔄 In Development
- [ ] **Audio Lens**: Speech-to-text, audio quality analysis
- [ ] **Video Lens**: Frame analysis, transcript processing
- [ ] **Code Analyzer**: Code quality metrics, best practices
- [ ] **Screenshot Integration**: Visual URL validation
- [ ] **Wayback Machine**: Archive URL checking

### 📋 Future Enhancements
- [ ] **URL Conversation Input**: Direct analysis of conversations from URLs (ChatGPT share links, etc.)
- [ ] **GUI Application**: Web-based interface
- [ ] **LMS Integration**: Canvas, Blackboard, Moodle
- [ ] **Advanced ML Models**: Fine-tuned for educational content
- [ ] **Collaborative Features**: Multi-user assessment workflows
- [ ] **Plugin Architecture**: Custom lenses and analyzers