Metadata-Version: 2.4
Name: sparkgrep
Version: 0.1.0a0
Summary: Pre-commit hooks for Apache Spark development (Databricks, EMR, Dataproc, and more)
Author-email: Leandro Kellermann de Oliveira <lkellermann@leandroasaservice.com>
License: MIT
Project-URL: Homepage, https://github.com/leandroasaservice/sparkgrep
Project-URL: Repository, https://github.com/leandroasaservice/sparkgrep
Project-URL: Documentation, https://github.com/leandroasaservice/sparkgrep/blob/main/README.md
Project-URL: Bug Reports, https://github.com/leandroasaservice/sparkgrep/issues
Project-URL: Source, https://github.com/leandroasaservice/sparkgrep
Project-URL: Contributing, https://github.com/leandroasaservice/sparkgrep/blob/main/doc/CONTRIBUTING.md
Keywords: spark,databricks,pre-commit,code-quality,linting
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pre-commit>=3.0.0
Requires-Dist: ruff>=0.1.0
Requires-Dist: nbformat>=5.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: coverage>=7.0.0; extra == "dev"
Dynamic: license-file

# SparkGrep 🚀

[![CI Status](https://github.com/leandroasaservice/sparkgrep/workflows/CI%20Pipeline/badge.svg)](https://github.com/leandroasaservice/sparkgrep/actions/workflows/ci.yml)
[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=alert_status)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Security Rating](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=security_rating)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Coverage](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=coverage)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=maintainability_rating)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Reliability Rating](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=reliability_rating)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Vulnerabilities](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=vulnerabilities)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Lines of Code](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=ncloc)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)

[![Python Version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Code style: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Security: Bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

A pre-commit hook that detects debugging leftovers and performance anti-patterns in Apache Spark code, including Jupyter notebooks.

## 🎯 Purpose

SparkGrep helps maintain clean Apache Spark codebases by detecting common debugging leftovers and performance anti-patterns that developers often forget to remove before committing code.

## 🔍 What it Detects

- **`display()` calls** - Jupyter/Databricks rendering function left over from debugging
- **`.show()` methods** - DataFrame inspection calls left over from debugging
- **`.collect()` without assignment** - Pulls every row to the driver, then discards the result
- **`.count()` without assignment** - Triggers a Spark job whose result is never used
- **Custom patterns** - User-defined regular expressions supplied via configuration or the command line
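
The checks above can be approximated with a few regular expressions. The sketch below is illustrative only: the names `ILLUSTRATIVE_PATTERNS` and `scan_lines` are hypothetical, and the expressions SparkGrep actually ships may differ.

```python
import re

# Hypothetical approximations of the checks listed above.
# These are NOT SparkGrep's real patterns.
ILLUSTRATIVE_PATTERNS = [
    (re.compile(r"^\s*display\s*\("), "display() call"),
    (re.compile(r"\.show\s*\("), ".show() call"),
    # "Without assignment" is approximated as: no '=' before the call
    # and nothing after it on the same line.
    (re.compile(r"^\s*[^=#]*\.collect\s*\(\s*\)\s*$"), ".collect() without assignment"),
    (re.compile(r"^\s*[^=#]*\.count\s*\(\s*\)\s*$"), ".count() without assignment"),
]


def scan_lines(lines):
    """Return (line_number, description) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("#"):
            continue  # ignore lines that are entirely comments
        for regex, description in ILLUSTRATIVE_PATTERNS:
            if regex.search(line):
                findings.append((number, description))
    return findings
```

A line-based approach like this is simple but imprecise. For example, this sketch misses `df.filter(x == 1).collect()` because the line contains an `=` character.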

## 🚀 Installation

```bash
pip install sparkgrep
```

## 📋 Usage

### As a Pre-commit Hook

Add to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/leandroasaservice/sparkgrep
    rev: v1.0.0  # Placeholder: pin this to the latest released tag
    hooks:
      - id: sparkgrep
```

### Command Line

```bash
# Check specific files
sparkgrep src/my_script.py notebook.ipynb

# Check with additional patterns
sparkgrep --additional-patterns "debug_print:Debug print statement" src/

# Disable default patterns and use only custom ones
sparkgrep --disable-default-patterns --additional-patterns "my_pattern:My description" src/
```
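
Each `--additional-patterns` value pairs a regular expression with a description, separated by a colon. The sketch below shows one way such a value could be split. The function name `parse_pattern_spec` is hypothetical, and splitting on the last colon is an assumption rather than documented SparkGrep behavior.

```python
import re


def parse_pattern_spec(spec):
    """Split 'REGEX:DESCRIPTION' into a compiled regex and its description."""
    # Assumption: the LAST colon separates the two parts, so the regex
    # may contain colons but the description may not.
    pattern, separator, description = spec.rpartition(":")
    if not separator or not pattern:
        raise ValueError(f"expected 'pattern:description', got {spec!r}")
    return re.compile(pattern), description.strip()
```

For instance, `parse_pattern_spec("debug_print:Debug print statement")` yields a regex that matches `debug_print(df)` together with the description `Debug print statement`.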

### Configuration

Create a `.sparkgrep.json` file in your project root:

```json
{
  "additional_patterns": [
    "logger\\.debug\\(.*\\):Debug logging statement",
    "print\\(.*\\):Print statement"
  ],
  "disable_default_patterns": false
}
```
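
Note that backslashes are doubled in JSON, so `\\.` in the file reaches the regex engine as `\.`. The sketch below shows how a file with this shape could be read and merged with defaults. The function name `load_config` and the fallback behavior are assumptions for illustration, not SparkGrep's actual loader.

```python
import json
from pathlib import Path


def load_config(project_root="."):
    """Read .sparkgrep.json from project_root, falling back to defaults."""
    # Assumed defaults: no extra patterns, built-in patterns enabled.
    defaults = {"additional_patterns": [], "disable_default_patterns": False}
    path = Path(project_root) / ".sparkgrep.json"
    if not path.is_file():
        return defaults
    with path.open(encoding="utf-8") as handle:
        loaded = json.load(handle)
    # Keys present in the file override the defaults.
    return {**defaults, **loaded}
```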

## 🛡️ Security & Quality

The project enforces the following security and code-quality checks:

### 🔒 Security Measures
- **Daily security scans** with Bandit, Safety, and GitGuardian
- **Automated vulnerability detection** and issue creation
- **Admin-protected CI/CD** pipelines
- **Dependency vulnerability monitoring**

### 📊 Code Quality
- **80% minimum code coverage** enforced in CI
- **SonarCloud integration** for continuous code quality analysis
- **Automated testing** on every PR
- **Code formatting** with Ruff

### 🚦 CI/CD Pipeline

The CI pipeline runs on:
- **Pull requests to main** (requires admin approval)
- **Manual dispatch** (admin-only)

The pipeline includes:
- Comprehensive test suite with 80% coverage requirement
- Security scans (Bandit, Safety, GitGuardian)
- Code quality analysis (SonarCloud)
- Linting and formatting checks

**Quality Gates:**
- ❌ **Pipeline fails** if coverage < 80%
- ❌ **Pipeline fails** if critical vulnerabilities are found
- ✅ **Pipeline passes** only when all checks succeed

## 🧪 Development

### Setup

```bash
# Clone the repository
git clone https://github.com/leandroasaservice/sparkgrep.git
cd sparkgrep

# Install in development mode
pip install -e .
pip install -r requirements.txt

# Install pre-commit hooks
pre-commit install
```

### Running Tests

```bash
# Run all tests with coverage
task test

# Run specific test categories
task test:unit
task test:integration

# Generate coverage report
task test:cov
```

### Security Scanning

```bash
# Run security scans locally
bandit -r src/
safety check
ggshield secret scan ci  # Requires GitGuardian API key
```

### Code Quality

```bash
# Format code
ruff format .

# Lint code
ruff check .

# Type checking (if using mypy)
mypy src/
```

## 📁 Project Structure

```
sparkgrep/
├── src/sparkgrep/          # Main package
│   ├── cli.py              # Command-line interface
│   ├── patterns.py         # Pattern definitions
│   ├── file_processors.py  # File processing logic
│   └── utils.py            # Utility functions
├── tests/                  # Test suite
│   ├── unit/               # Unit tests
│   └── integration/        # Integration tests
├── .github/                # GitHub configuration
│   ├── workflows/          # CI/CD pipelines
│   └── ISSUE_TEMPLATE/     # Issue templates
└── doc/                    # Documentation
```
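
Because SparkGrep accepts `.ipynb` files as well as scripts, the file-processing step has to pull code out of notebook cells before patterns can be applied. The sketch below shows one possible way to do that with only the standard library. The function name `iter_notebook_code_lines` is hypothetical, and the contents of `file_processors.py` may look quite different (the package depends on `nbformat`, which this sketch deliberately avoids).

```python
import json


def iter_notebook_code_lines(notebook_text):
    """Yield (cell_index, line_number, line) for every line in a code cell."""
    notebook = json.loads(notebook_text)
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # Notebook format v4 stores source as one string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        for line_number, line in enumerate(source.splitlines(), start=1):
            yield cell_index, line_number, line
```

Reporting the cell index alongside the line number keeps findings traceable, since line numbers restart in every cell.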

## 🤝 Contributing

1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes with tests
4. **Ensure** all checks pass (`task test`, security scans)
5. **Submit** a pull request

### Contribution Guidelines

- **Tests required** for all new features
- **Security scans** must pass
- **Code coverage** must remain ≥ 80%
- **Admin approval** required for all PRs to main
- **Follow** existing code style and patterns

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔒 Security

For security vulnerabilities, please:

1. **Create a security issue** using our [security template](.github/ISSUE_TEMPLATE/security_report.md)
2. **Contact maintainers** directly for critical issues
3. **Follow responsible disclosure** practices

Our security measures include automated daily scans and continuous monitoring.

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/leandroasaservice/sparkgrep/issues)
- **Discussions**: [GitHub Discussions](https://github.com/leandroasaservice/sparkgrep/discussions)
- **Documentation**: [Project Docs](doc/)

---

**Made with ❤️ for the Apache Spark community**
