Metadata-Version: 2.4
Name: LogPress
Version: 1.0.7
Summary: Semantic-aware log compression with automatic schema extraction and queryable storage
Home-page: https://github.com/adam-bouafia/logpress
Author: Adam Bouafia
Author-email: Adam Bouafia <adam.bouafia@vu.nl>
License: MIT
Project-URL: Homepage, https://github.com/adam-bouafia/logpress
Project-URL: Documentation, https://github.com/adam-bouafia/logpress#readme
Project-URL: Repository, https://github.com/adam-bouafia/logpress
Project-URL: Issues, https://github.com/adam-bouafia/logpress/issues
Project-URL: Changelog, https://github.com/adam-bouafia/logpress/releases
Keywords: logs,compression,schema-extraction,semantic-compression,log-analysis,columnar-storage
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: Topic :: System :: Logging
Classifier: Topic :: System :: Archiving :: Compression
Classifier: Topic :: System :: Monitoring
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: msgpack>=1.0.0
Requires-Dist: zstandard>=0.21.0
Requires-Dist: python-dateutil>=2.8.0
Requires-Dist: regex>=2023.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: click>=8.1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.12.0; extra == "dev"
Provides-Extra: benchmarks
Requires-Dist: logreduce>=1.0.0; extra == "benchmarks"
Provides-Extra: all
Requires-Dist: pytest>=7.4.0; extra == "all"
Requires-Dist: pytest-cov>=4.1.0; extra == "all"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "all"
Requires-Dist: pytest-mock>=3.12.0; extra == "all"
Requires-Dist: logreduce>=1.0.0; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# LogPress - Semantic Log Compression System

[![Tests](https://img.shields.io/badge/tests-passing-brightgreen.svg)](logpress/tests/)
[![Coverage](https://img.shields.io/badge/coverage-42%25-yellow.svg)](htmlcov/index.html)
[![Python](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

**Master's Thesis Research Project**: Automatic schema extraction from unstructured system logs using constraint-based parsing and semantic-aware compression.

## 🎯 Research Goals

- **Automatic Schema Discovery**: Extract implicit log schemas without manual annotation
- **Semantic-Aware Compression**: Achieve 8-30× compression while maintaining queryability
- **Real-World Validation**: Tested on 8 real-world log sources (~1.07M entries)

## 🚀 Quick Start

### Installation

Preferred: Install from PyPI

```bash
# Install from PyPI (recommended)
pip install LogPress
```

Alternative: Docker (no Python setup required)

```bash
# Interactive mode
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive
```

From source (developer mode)

```bash
# Clone repository
git clone https://github.com/adam-bouafia/LogPress.git
cd LogPress

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .
```

### Interactive Mode (Recommended)

```bash
# Beautiful terminal UI with dataset auto-discovery
python -m logpress.cli.interactive
```

**Features**:
- 🔍 Auto-discovers datasets in `data/datasets/`
- 📊 Real-time compression progress
- 🎨 Rich terminal UI with tables and progress bars
- ⚡ Query compressed logs interactively

### Command-Line Usage

```bash
# Compress logs
python -m logpress compress \
  -i data/datasets/Apache/Apache_full.log \
  -o evaluation/compressed/apache.lsc \
  --min-support 3 \
  -m

# Query compressed logs
python -m logpress query \
  -c evaluation/compressed/apache.lsc \
  --severity ERROR \
  --limit 20

# Run full evaluation
python evaluation/run_full_evaluation.py
```

### Docker Usage

```bash
# Interactive mode (Python rich UI)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Bash menu (alternative)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive-bash

# Run specific command
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc -m
```


### Pre-built Docker Image (GHCR & Docker Hub)

We publish pre-built Docker images to the GitHub Container Registry (GHCR) and mirror to Docker Hub. There are two ways to run LogPress with Docker:

1) From a local clone (recommended for development):

```bash
# Clone repository and run with docker-compose
git clone https://github.com/adam-bouafia/LogPress.git
cd LogPress
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive
```

2) Use pre-built images from GHCR or Docker Hub (recommended for quick start):

```bash
# Pull the image from GHCR
docker pull ghcr.io/adam-bouafia/logpress:latest

# Or pull from Docker Hub mirror
docker pull adambouafia/logpress:latest

# Run the CLI (example: show version)
docker run --rm ghcr.io/adam-bouafia/logpress:latest python -m logpress --version

# Run a compress command using the GHCR image
docker run --rm \
  -v "$(pwd)/data:/app/data:ro" \
  -v "$(pwd)/evaluation:/app/evaluation:rw" \
  ghcr.io/adam-bouafia/logpress:latest \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc -m
```

Images are also mirrored to Docker Hub as `adambouafia/logpress:latest`, with version-specific tags such as `adambouafia/logpress:1.0.1`.

To mirror the GHCR image to Docker Hub yourself (manually or from a CI workflow), tag it as `adambouafia/logpress:latest` and push it:

```bash
# (Optional) Tag and push to Docker Hub (requires Docker Hub credentials)
docker tag ghcr.io/adam-bouafia/logpress:latest adambouafia/logpress:latest
docker login --username <docker-hub-username>
docker push adambouafia/logpress:latest
```

## 📁 Project Structure (MCP Architecture)

```
LogPress/
├── logpress/                  # Core Python package (Model-Context-Protocol)
│   ├── models/             # Data structures (Token, LogTemplate, CompressedLog)
│   ├── protocols/          # Abstract interfaces (EncoderProtocol, CompressorProtocol)
│   ├── context/           # Business logic
│   │   ├── tokenization/  # Smart log tokenization (FSM-based)
│   │   ├── extraction/    # Template generation (log alignment algorithm)
│   │   ├── classification/# Semantic type recognition (pattern-based)
│   │   └── encoding/      # Compression codecs (delta, dictionary, varint)
│   ├── services/          # High-level orchestration
│   │   ├── compressor.py  # 6-stage compression pipeline
│   │   ├── query_engine.py# Queryable decompression
│   │   └── evaluator.py   # Accuracy metrics vs ground truth
│   ├── cli/              # User interfaces
│   │   ├── interactive.py # Rich terminal UI
│   │   └── commands.py    # Click-based CLI
│   └── tests/            # Test suite (25 tests, 100% passing)
│       ├── unit/         # Component testing
│       ├── integration/  # Workflow testing
│       ├── e2e/          # End-to-end testing
│       └── performance/  # Benchmarks
│
├── data/                  # Input data
│   ├── datasets/         # 8 real-world log sources (~1.07M entries)
│   │   ├── Apache/       # Web server logs (52K lines)
│   │   ├── HealthApp/    # Android health tracking (212K lines)
│   │   ├── HPC/          # High-performance computing cluster logs (433K lines)
│   │   ├── Linux/        # Linux system logs (26K lines)
│   │   ├── Mac/          # macOS system logs (117K lines)
│   │   ├── OpenStack/    # Cloud infrastructure logs (137K lines)
│   │   ├── Proxifier/    # Network proxy logs (21K lines)
│   │   └── Zookeeper/    # Distributed coordination logs (74K lines)
│   └── ground_truth/     # Manual annotations for validation
│
├── evaluation/           # Outputs & results
│   ├── compressed/       # .lsc compressed files
│   ├── results/          # Evaluation metrics (JSON/Markdown)
│   └── schema_versions/  # Schema evolution tracking
│
├── deployment/          # Infrastructure
│   ├── Dockerfile       # Container image
│   ├── docker-compose.yml# Service orchestration
│   └── Makefile         # Build automation
│
├── documentation/       # Project documentation
│   ├── README.md        # Documentation index
│   ├── TESTING.md       # Test strategy
│   ├── MCP_ARCHITECTURE.md # System design
│   └── API.md           # Python API reference
│
└── scripts/            # Automation scripts
    ├── logpress-interactive.sh  # Bash interactive menu
    ├── run-tests.sh           # Test suite runner
    └── run-pre-production-tests.sh # Validation
```

See individual README files in each directory for detailed information.

## 🔬 Research Methodology

### 1. Schema Extraction Pipeline

**6-Stage Process**:
1. **Tokenization**: FSM-based parser handles diverse log formats
2. **Semantic Classification**: Pattern-based field type detection (timestamp, IP, severity, etc.)
3. **Field Grouping**: Identify related fields (ip+port, user+action)
4. **Template Generation**: Log alignment algorithm extracts schemas
5. **Schema Versioning**: Track format evolution over time
6. **Validation**: Compare against manual ground truth (precision/recall)

**Example**:
```
Raw Logs:
  [Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP
  [Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable
  
Extracted Template:
  [TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
```
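
The example above can be reproduced with a small sketch. This is not the LogPress alignment algorithm: it assumes a regex tokenizer that keeps bracketed groups together, a hard-coded Apache-style timestamp format, and a hypothetical `SEVERITIES` set. It keeps the token prefix shared by all lines and collapses the varying tail into `[MESSAGE]`.

```python
import re
from datetime import datetime
from typing import List

# Assumption: bracketed groups such as "[notice]" are single tokens.
TOKEN_RE = re.compile(r"\[[^\]]*\]|\S+")
# Hypothetical severity vocabulary, for illustration only.
SEVERITIES = {"debug", "info", "notice", "warn", "warning", "error", "crit"}


def classify(token: str) -> str:
    """Map a token to a placeholder, or return it unchanged if it is a literal."""
    if token.startswith("[") and token.endswith("]"):
        inner = token[1:-1]
        if inner.lower() in SEVERITIES:
            return "[SEVERITY]"
        try:
            # Assumed timestamp layout, e.g. "Thu Jun 09 06:07:04 2005".
            datetime.strptime(inner, "%a %b %d %H:%M:%S %Y")
            return "[TIMESTAMP]"
        except ValueError:
            pass
    return token


def extract_template(lines: List[str]) -> str:
    """Keep the classified token prefix shared by all lines."""
    rows = [[classify(t) for t in TOKEN_RE.findall(line)] for line in lines]
    template = []
    for column in zip(*rows):
        if len(set(column)) != 1:
            break  # first position where the lines disagree
        template.append(column[0])
    # Anything after the shared prefix varies between lines.
    if any(len(row) > len(template) for row in rows):
        template.append("[MESSAGE]")
    return " ".join(template)


logs = [
    "[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP",
    "[Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable",
]
print(extract_template(logs))  # [TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
```

A prefix-only alignment is the simplest choice that reproduces this example; templates with variable fields in the middle of a line would need a real alignment step.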

### 2. Semantic-Aware Compression

**Category-Specific Codecs**:
- **Timestamps**: Delta encoding (8-10× compression)
- **Severity/Status**: Dictionary encoding (5-7× compression)
- **Metrics**: Gorilla time-series compression (3-5× compression)
- **Messages**: Token pool with references (variable)
- **Stack traces**: Reference tracking (store once, reuse pointer)
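
To make the first two codecs concrete, here is a minimal sketch of delta and dictionary encoding. The function names are hypothetical and the output is plain Python lists rather than the packed byte format LogPress writes; it only illustrates why these columns compress well.

```python
from typing import Dict, List, Tuple


def delta_encode(timestamps: List[int]) -> List[int]:
    """Store the first value, then each difference from its predecessor."""
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes); each code indexes into the dictionary."""
    index: Dict[str, int] = {}
    codes = []
    for v in values:
        # A new value receives the next free code; a known value reuses its code.
        codes.append(index.setdefault(v, len(index)))
    return list(index), codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    return [dictionary[c] for c in codes]


ts = [1118297224, 1118297225, 1118297225, 1118297230]
print(delta_encode(ts))  # [1118297224, 1, 0, 5]
print(dictionary_encode(["notice", "notice", "error", "notice"]))
# (['notice', 'error'], [0, 0, 1, 0])
```

Small deltas and small integer codes need far fewer bits than the original strings, which is where a later variable-length integer or entropy stage gains its ratio.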

**Queryable Index**: Columnar storage enables filtering without full decompression.
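
The idea can be sketched as follows. This is an illustration only: it assumes every row has the same fields, and it uses `zlib` and JSON from the standard library in place of the `.lsc` format, whose layout is not described here. Each field is compressed as its own column, so a filter touches only the columns it needs.

```python
import json
import zlib
from typing import Dict, List


def build_store(rows: List[Dict[str, str]]) -> Dict[str, bytes]:
    """Compress each field as an independent column blob."""
    columns: Dict[str, List[str]] = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return {
        field: zlib.compress(json.dumps(values).encode("utf-8"))
        for field, values in columns.items()
    }


def read_column(store: Dict[str, bytes], field: str) -> List[str]:
    """Decompress a single column."""
    return json.loads(zlib.decompress(store[field]).decode("utf-8"))


def query(store: Dict[str, bytes], field: str, wanted: str, select: str) -> List[str]:
    """Return `select` values for rows where `field == wanted`.

    Only the filter column and the selected column are decompressed.
    """
    hits = [i for i, v in enumerate(read_column(store, field)) if v == wanted]
    selected = read_column(store, select)
    return [selected[i] for i in hits]


rows = [
    {"timestamp": "06:07:04", "severity": "notice", "message": "Built with OpenLDAP"},
    {"timestamp": "06:07:05", "severity": "error", "message": "disk full"},
    {"timestamp": "06:07:06", "severity": "notice", "message": "SSL support unavailable"},
]
store = build_store(rows)
print(query(store, "severity", "error", "message"))  # ['disk full']
```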

### 3. Evaluation Metrics

**Accuracy** (vs manual annotations):
- Precision: % of extracted fields that are correct
- Recall: % of actual fields that were found
- F1-Score: Harmonic mean of precision and recall
- **Target**: >90% accuracy
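
As a worked example of these definitions, the sketch below scores a set of extracted field names against a ground-truth set. The function name and the set-of-names representation are assumptions for illustration; the project's evaluator may compare fields differently.

```python
from typing import Set, Tuple


def score_fields(extracted: Set[str], ground_truth: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for extracted vs. annotated fields."""
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


extracted = {"timestamp", "severity", "pid", "message"}
truth = {"timestamp", "severity", "message", "component"}
print(score_fields(extracted, truth))  # (0.75, 0.75, 0.75)
```

Here three of the four extracted fields are correct (precision 0.75) and three of the four annotated fields were found (recall 0.75), so F1 is also 0.75.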

**Compression Performance**:
- Compression ratio vs gzip baseline
- Query latency overhead
- **Target**: >10× compression, <2× query slowdown
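
A gzip baseline can be measured with the standard library alone. The helper names below are hypothetical, and whether the 10× target is read against the raw size or against gzip's output is left open here; the sketch computes both quantities.

```python
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    if not compressed:
        raise ValueError("compressed payload is empty")
    return len(original) / len(compressed)


def gzip_baseline_ratio(raw: bytes, level: int = 9) -> float:
    """Compression ratio achieved by gzip on the raw log bytes."""
    return compression_ratio(raw, gzip.compress(raw, compresslevel=level))


def improvement_over_gzip(raw: bytes, candidate: bytes, level: int = 9) -> float:
    """How many times smaller the candidate output is than gzip's output."""
    return len(gzip.compress(raw, compresslevel=level)) / len(candidate)


raw = b"[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP\n" * 1000
print(f"gzip baseline: {gzip_baseline_ratio(raw):.1f}x")
```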

## 🧪 Testing

### Run Complete Test Suite

```bash
# All tests with coverage
bash scripts/run-tests.sh

# View coverage report
firefox htmlcov/index.html
```

### Pre-Production Validation

```bash
# Validate before deployment
bash scripts/run-pre-production-tests.sh
```

**Test Status**: ✅ 25/25 tests passing (100%)
- Unit tests: 9 tests
- Integration tests: 8 tests
- E2E tests: 3 tests
- Performance benchmarks: 5 tests

### Performance Benchmarks

```bash
# Run benchmarks
python -m pytest logpress/tests/performance/ --benchmark-only

# Expected results:
# - Compression: >500 ops/sec
# - Template extraction: >900 ops/sec
# - Linear scalability: 100 → 10,000 logs
```

## 📚 Documentation

- [Documentation Index](documentation/README.md) - Complete documentation overview
- [Testing Guide](documentation/TESTING.md) - Test strategy and commands
- [MCP Architecture](documentation/MCP_ARCHITECTURE.md) - System design details
- [API Reference](documentation/API.md) - Python API usage
- [Docker Guide](deployment/README.md) - Container deployment

## 🎓 Research Context

**Master's Thesis**: Automatic Schema Extraction from Unstructured System Logs  
**Duration**: 26 weeks (4 phases)  
**Target Venues**: VLDB, SIGMOD, IEEE BigData  
**Novel Contribution**: Semantic-aware compression adapting to log content types

### Related Work
- **Log Parsing**: Drain, Spell, LogPai
- **Schema Inference**: Lakehouse formats (Parquet, ORC)
- **Compression**: Generic (gzip, zstd) vs specialized (LogShrink)

### Key Differentiators
- ✅ No ML models (constraint-based approach)
- ✅ Semantic awareness (field-type-specific compression)
- ✅ Query preservation (columnar indexes)
- ✅ Schema evolution tracking
- ✅ Lossless compression (exact reconstruction)

## 🛠️ Development

### Setup Development Environment

```bash
# Install test dependencies
pip install pytest pytest-cov pytest-benchmark pytest-mock

# Run tests on file changes (watch mode)
pip install pytest-watch
ptw logpress/tests/ -- -v
```

### Contribution Workflow

1. Create feature branch: `git checkout -b feature/new-encoder`
2. Make changes and add tests
3. Run validation: `bash scripts/run-pre-production-tests.sh`
4. Submit PR (GitHub Actions runs full test suite)

### Adding New Semantic Type Patterns

```python
# logpress/context/classification/semantic_types.py

import re
from typing import Tuple


def recognize_custom_field(token: str) -> Tuple[str, float]:
    """
    Recognize a custom field type from a single token.

    Returns:
        (field_type, confidence_score)
    """
    # Example pattern: three uppercase letters, a hyphen, four digits (e.g. ABC-1234)
    if re.match(r'^[A-Z]{3}-\d{4}$', token):
        return ('ERROR_CODE', 0.95)  # High confidence
    return ('UNKNOWN', 0.0)
```
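
How LogPress combines several recognizers is not shown here. One plausible approach, sketched below under that assumption, is to run every recognizer on a token and keep the result with the highest confidence. `recognize_ipv4` and `classify_token` are hypothetical names introduced for this example.

```python
import ipaddress
import re
from typing import Callable, List, Tuple

Recognizer = Callable[[str], Tuple[str, float]]


def recognize_error_code(token: str) -> Tuple[str, float]:
    """Same pattern as the example above: ABC-1234 style codes."""
    if re.fullmatch(r"[A-Z]{3}-\d{4}", token):
        return ("ERROR_CODE", 0.95)
    return ("UNKNOWN", 0.0)


def recognize_ipv4(token: str) -> Tuple[str, float]:
    """Hypothetical recognizer backed by the standard ipaddress module."""
    try:
        ipaddress.IPv4Address(token)
    except ValueError:
        return ("UNKNOWN", 0.0)
    return ("IP_ADDRESS", 0.9)


def classify_token(token: str, recognizers: List[Recognizer]) -> Tuple[str, float]:
    """Run every recognizer and keep the most confident answer."""
    return max(
        (recognize(token) for recognize in recognizers),
        key=lambda result: result[1],
        default=("UNKNOWN", 0.0),
    )


recognizers = [recognize_error_code, recognize_ipv4]
print(classify_token("ABC-1234", recognizers))  # ('ERROR_CODE', 0.95)
print(classify_token("10.0.0.1", recognizers))  # ('IP_ADDRESS', 0.9)
```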

### Adding New Compression Codecs

```python
# logpress/context/encoding/custom_encoder.py

from typing import Any, List

from logpress.protocols import EncoderProtocol


class CustomEncoder(EncoderProtocol):
    def encode(self, values: List[Any]) -> bytes:
        # Your encoding logic
        raise NotImplementedError

    def decode(self, data: bytes) -> List[Any]:
        # Your decoding logic
        raise NotImplementedError
```
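
As an illustration of a codec with this `encode`/`decode` shape, here is a small run-length encoder. To stay self-contained it defines a local stand-in `EncoderLike` protocol instead of importing `EncoderProtocol`, whose exact definition is not shown here, and it serializes runs as JSON, which is an assumption made for readability rather than compactness.

```python
import json
from typing import Any, List, Protocol


class EncoderLike(Protocol):
    """Local stand-in for the encode/decode interface (assumption)."""

    def encode(self, values: List[Any]) -> bytes: ...

    def decode(self, data: bytes) -> List[Any]: ...


class RunLengthEncoder:
    """Collapse consecutive repeats into [value, count] pairs.

    Values must be JSON-serializable (strings, numbers, booleans, None).
    """

    def encode(self, values: List[Any]) -> bytes:
        runs: List[List[Any]] = []
        for value in values:
            if runs and runs[-1][0] == value:
                runs[-1][1] += 1  # extend the current run
            else:
                runs.append([value, 1])  # start a new run
        return json.dumps(runs).encode("utf-8")

    def decode(self, data: bytes) -> List[Any]:
        values: List[Any] = []
        for value, count in json.loads(data.decode("utf-8")):
            values.extend([value] * count)
        return values


encoder: EncoderLike = RunLengthEncoder()
column = ["INFO", "INFO", "INFO", "ERROR", "INFO"]
assert encoder.decode(encoder.encode(column)) == column  # lossless round trip
```

Run-length encoding suits columns with long runs of identical values, such as severity levels in a quiet log; for high-cardinality columns it can increase the size.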

## 📦 Dependencies

### Core Libraries
```
msgpack>=1.0.0          # Serialization
zstandard>=0.21.0       # Compression baseline
python-dateutil>=2.8.0  # Timestamp parsing
regex>=2023.0.0         # Advanced pattern matching
rich>=13.0.0            # Terminal UI
click>=8.1.0            # CLI framework
```

### Testing
```
pytest>=7.4.0           # Test framework
pytest-cov>=4.1.0       # Coverage reporting
pytest-benchmark>=4.0.0 # Performance testing
pytest-mock>=3.12.0     # Mocking utilities
```

### Optional Tools
```bash
# Baseline comparison
gzip --version

# Command-line benchmarking
cargo install hyperfine

# Memory profiling
pip install memory-profiler
```

## 🐳 Docker Deployment

### Build & Run

```bash
# Build all services
docker-compose -f deployment/docker-compose.yml build

# Run interactive CLI
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Run compression
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc
```

### Environment Variables

```bash
# Set in docker-compose.yml
PYTHONUNBUFFERED=1      # Real-time output
TERM=xterm-256color     # Colored terminal
MIN_SUPPORT=3           # Template extraction threshold
ZSTD_LEVEL=15           # Compression level (1-22)
```
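
How the application consumes these variables is not documented here. A minimal sketch of reading and validating the two tuning values might look like the following; `load_settings` is a hypothetical helper, and the fallback values simply mirror the ones listed above.

```python
import os
from typing import Dict, Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Read MIN_SUPPORT and ZSTD_LEVEL, falling back to the values shown above."""
    env = os.environ if env is None else env
    min_support = int(env.get("MIN_SUPPORT", "3"))
    zstd_level = int(env.get("ZSTD_LEVEL", "15"))
    if min_support < 1:
        raise ValueError("MIN_SUPPORT must be at least 1")
    if not 1 <= zstd_level <= 22:
        raise ValueError("ZSTD_LEVEL must be between 1 and 22")
    return {"min_support": min_support, "zstd_level": zstd_level}


print(load_settings({"MIN_SUPPORT": "5", "ZSTD_LEVEL": "19"}))
# {'min_support': 5, 'zstd_level': 19}
```

Validating the range at startup surfaces a mistyped compression level immediately instead of deep inside the compression pipeline.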

## 🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

### Areas for Contribution
- [ ] Additional semantic type patterns
- [ ] New compression codecs
- [ ] Query optimization
- [ ] Schema visualization
- [ ] Performance improvements

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.


## 🔗 Links

- [Project Documentation](documentation/README.md)
- [Test Results](evaluation/results/)
- [Research Roadmap](PROJECT.md)
- [GitHub Repository](https://github.com/adam-bouafia/logpress)

## 📞 Contact

- **Author**: Adam Bouafia
- **Repository**: https://github.com/adam-bouafia/logpress
- **LinkedIn**: https://www.linkedin.com/in/adam-bouafia

---

**Status**: ✅ Production Ready | 🧪 All Tests Passing (25/25) | 📊 Coverage: 42%

Built with ❤️ for research in log analysis and semantic compression.
