Metadata-Version: 2.4
Name: datasovereign-scout
Version: 0.1.0
Summary: A privacy-focused tool to detect orphaned and sensitive datasets.
Author-email: "Dr. T.M. Usha" <ushawin2020@gmail.com>
Project-URL: Homepage, https://github.com/DrTMUSHACoder/py-datasovereign-scout
Project-URL: Repository, https://github.com/DrTMUSHACoder/py-datasovereign-scout.git
Project-URL: Bug Tracker, https://github.com/DrTMUSHACoder/py-datasovereign-scout/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pyyaml>=6.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: rich>=13.0.0

# 🔍 DataSovereign-Scout

<div align="center">

**Developed by: Dr. T.M. Usha**

[LinkedIn Profile](https://www.linkedin.com/in/drtmushacoder/) • [ushawin2020@gmail.com](mailto:ushawin2020@gmail.com) • +91 9994463686

---

![Python](https://img.shields.io/badge/python-3.8+-blue.svg)
![License](https://img.shields.io/badge/license-MIT-green.svg)
![Version](https://img.shields.io/badge/version-0.1.0-orange.svg)
![Status](https://img.shields.io/badge/status-active-success.svg)

**A privacy-focused Python CLI tool to detect orphaned datasets and ghost data.**

[Features](#-features) • [Installation](#-installation) • [Quick Start](#-quick-start) • [How It Works](#-how-it-works) • [Contributing](#-contributing)

</div>

---

## 📖 Table of Contents

- [What is DataSovereign-Scout?](#-what-is-datasovereign-scout)
- [The Problem: Ghost Data](#-the-problem-ghost-data)
- [Features](#-features)
- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [How It Works](#-how-it-works)
- [Advanced Usage](#-advanced-usage)
- [Examples](#-examples)
- [Architecture](#-architecture)
- [Contributing](#-contributing)
- [License](#-license)
- [Questions or Feedback?](#-questions-or-feedback)

---

## 🎯 What is DataSovereign-Scout?

**DataSovereign-Scout** is a command-line tool that helps organizations and individuals discover "ghost data": forgotten files, orphaned backups, leaked credentials, and unauthorized datasets hiding in their infrastructure.

Think of it as a **home inspector for your data**. You tell it which files *should* be there (via a manifest), and it flags every data, secret, or log file it finds that is not on that list.

### 🌟 Perfect For:

- 🏢 **DevOps Teams** - Finding leaked `.env` files and credentials before deployment
- 🔐 **Security Auditors** - Discovering unauthorized data before compliance reviews
- 📊 **Data Engineers** - Identifying orphaned datasets consuming storage
- 🎓 **Researchers** - Ensuring sensitive research data is properly tracked
- 💼 **Startups** - Preventing data leaks during rapid development

---

## 🚨 The Problem: Ghost Data

### What is Ghost Data?

**Ghost Data** refers to files that exist on your system but are:
- ❌ Not documented in any inventory
- ❌ Not actively used or maintained
- ❌ Potentially sensitive (backups, credentials, PII)
- ❌ Unknown to current team members
- ❌ A compliance or privacy risk

### Real-World Examples:

```
❌ forgotten_backup_2023.csv      (Contains customer data)
❌ .env                            (Database passwords exposed)
❌ aws_production.pem              (Private keys for cloud access)
❌ database_dump_old.sql           (PII from deleted users)
❌ test_credentials.json           (API keys from former developers)
```

**The Risk:** These files can lead to:
- 🔓 Data breaches
- 💰 GDPR/CCPA fines
- 😰 Reputational damage
- 📉 Customer trust loss

---

## ✨ Features

### 🎯 Core Capabilities

- **📋 Manifest-Based Scanning** - Define your "known good" data in a simple YAML file
- **🔍 Smart Detection** - Flags files by sensitive extension or by risky keywords in the filename
- **🔐 Security-Focused** - Flags `.pem`, `.key`, `.env`, `.p12`, and other sensitive formats
- **📊 Rich Terminal Output** - Beautiful, color-coded results powered by `rich`
- **⚡ Fast & Lightweight** - Scans thousands of files in seconds
- **🧠 Intelligent Filtering** - Auto-skips `.git`, `venv`, `__pycache__`, etc.

### 🕵️ Detection Rules

**Extension-Based:**
- Data Files: `.csv`, `.json`, `.sql`, `.db`, `.sqlite`, `.parquet`, `.xlsx`
- Secrets: `.pem`, `.key`, `.env`, `.p12`, `.kdbx`
- Logs: `.log`

**Keyword-Based:**
- Filenames containing: `backup`, `dump`, `secret`, `password`, `token`, `credential`
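
The rules above can be sketched as a small classifier. This is an illustrative sketch rather than the package's actual code: `EXTENSION_CATEGORIES`, `RISKY_KEYWORDS`, and `classify` are hypothetical names chosen for this example. One detail worth noting is that `pathlib` reports an empty suffix for dotfiles such as `.env`, so the sketch also compares the whole filename.

```python
from pathlib import Path
from typing import Optional

# Hypothetical rule tables mirroring the lists in this README.
EXTENSION_CATEGORIES = {
    "data": {".csv", ".json", ".sql", ".db", ".sqlite", ".parquet", ".xlsx"},
    "secret": {".pem", ".key", ".env", ".p12", ".kdbx"},
    "log": {".log"},
}
RISKY_KEYWORDS = ("backup", "dump", "secret", "password", "token", "credential")


def classify(path: str) -> Optional[str]:
    """Return the rule category a filename matches, or None if no rule applies."""
    name = Path(path).name.lower()
    suffix = Path(name).suffix
    for category, extensions in EXTENSION_CATEGORIES.items():
        # Path(".env").suffix is "", so compare the whole name as well.
        if suffix in extensions or name in extensions:
            return category
    if any(keyword in name for keyword in RISKY_KEYWORDS):
        return "keyword"
    return None


print(classify("secrets/aws_production.pem"))  # secret
print(classify(".env"))                         # secret
print(classify("notes/old_dump.txt"))           # keyword
print(classify("README.md"))                    # None
```

Extension rules are checked before keyword rules here, so a file such as `backup.sql` is reported under `data`; that ordering is a choice made for the sketch.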

---

## 📦 Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Install from PyPI

```bash
pip install datasovereign-scout
```

### Install from Source

```bash
git clone https://github.com/DrTMUSHACoder/py-datasovereign-scout.git
cd py-datasovereign-scout
pip install -e .
```

### Verify Installation

```bash
ds-scout --help
```

You should see:
```
Usage: ds-scout [OPTIONS] COMMAND [ARGS]...

  DataSovereign-Scout: Detect Ghost Data and Orphaned Datasets.
```

---

## 🚀 Quick Start

### Step 1: Create a Manifest

Create a file named `manifest.yaml` listing your approved data files:

```yaml
assets:
  - data/users.csv
  - reports/quarterly_2024.xlsx
  - configs/app_settings.json
```
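
Before scanning, it can help to sanity-check the manifest. The sketch below assumes only the structure shown above (a top-level `assets` key holding a list of path strings). `validate_manifest` and `load_manifest` are hypothetical helpers written for this example, not part of the `ds-scout` API; `yaml.safe_load` comes from PyYAML, which this package already depends on.

```python
from typing import Any, List


def validate_manifest(data: Any) -> List[str]:
    """Return the asset paths from parsed manifest data, or raise ValueError."""
    if not isinstance(data, dict) or "assets" not in data:
        raise ValueError("manifest must be a mapping with an 'assets' key")
    assets = data["assets"]
    if not isinstance(assets, list):
        raise ValueError("'assets' must be a list of paths")
    for entry in assets:
        if not isinstance(entry, str) or not entry.strip():
            raise ValueError(f"invalid asset entry: {entry!r}")
    return [entry.strip() for entry in assets]


def load_manifest(path: str) -> List[str]:
    """Read a YAML manifest from disk and validate it."""
    import yaml  # PyYAML; imported here so validate_manifest works without it

    with open(path, encoding="utf-8") as handle:
        return validate_manifest(yaml.safe_load(handle))


parsed = {"assets": ["data/users.csv", "reports/quarterly_2024.xlsx"]}
print(validate_manifest(parsed))  # ['data/users.csv', 'reports/quarterly_2024.xlsx']
```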

### Step 2: Run a Scan

```bash
ds-scout scan --manifest manifest.yaml --path ./
```

### Step 3: Review Results

**Expected Output:**

```
Scouting for Ghost Data...
Manifest: manifest.yaml
Scan Path: ./

╔═══════════════════════════════════════════════════════════╗
║ CRITICAL: Found 3 orphaned datasets!                      ║
╚═══════════════════════════════════════════════════════════╝

Orphaned Assets (Ghost Data)
╔═══════════════════════════════════════════════════════════╗
║ Path                                                      ║
╠═══════════════════════════════════════════════════════════╣
║ /project/data/forgotten_backup_2023.csv                   ║
║ /project/.env                                             ║
║ /project/secrets/aws_production.pem                       ║
╚═══════════════════════════════════════════════════════════╝

Total Scanned Files: 15
```

---

## 🔧 How It Works

### The Workflow

```mermaid
graph LR
    A[Create manifest.yaml] --> B[Run ds-scout scan]
    B --> C[Scanner reads manifest]
    C --> D[Walks filesystem]
    D --> E{File matches<br/>extensions or<br/>keywords?}
    E -->|Yes| F{File in<br/>manifest?}
    E -->|No| G[Ignore]
    F -->|Yes| H[Mark as Safe]
    F -->|No| I[FLAG as Ghost Data]
    I --> J[Report to User]
```

### Detection Logic

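```python
# Simplified sketch of the detection logic, not the exact source.
# Assumes manifest entries are relative to the current working directory.
from pathlib import Path

SENSITIVE_EXTENSIONS = {".csv", ".sql", ".pem", ".key", ".env"}  # subset of the rules above
RISKY_KEYWORDS = ("backup", "dump", "secret", "password", "token", "credential")

def find_ghost_data(scan_path, manifest_entries):
    approved = {Path(entry).resolve() for entry in manifest_entries}
    ghosts = []
    for file in Path(scan_path).rglob("*"):
        name = file.name.lower()
        # Dotfiles such as ".env" have an empty suffix, so check the full name too
        by_extension = file.suffix.lower() in SENSITIVE_EXTENSIONS or name in SENSITIVE_EXTENSIONS
        by_keyword = any(keyword in name for keyword in RISKY_KEYWORDS)
        # A file matching both rules is flagged only once
        if file.is_file() and (by_extension or by_keyword) and file.resolve() not in approved:
            ghosts.append(file)
    return ghosts
```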
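
One detail worth spelling out in the logic above is how a scanned file is matched against the manifest: the manifest lists relative paths, while the report shows absolute ones. The sketch below shows one way to compare them by resolving both sides to absolute paths. It assumes manifest entries are relative to an explicit base directory, which this README does not specify; `normalize_manifest` and `is_approved` are hypothetical helpers.

```python
from pathlib import Path
from typing import Iterable, Set


def normalize_manifest(entries: Iterable[str], base_dir: str) -> Set[Path]:
    """Turn relative manifest entries into absolute, normalized paths."""
    base = Path(base_dir).resolve()
    return {(base / entry).resolve() for entry in entries}


def is_approved(candidate: str, approved: Set[Path]) -> bool:
    """Check a scanned file against the normalized manifest."""
    return Path(candidate).resolve() in approved


approved = normalize_manifest(["data/users.csv"], base_dir="/project")
print(is_approved("/project/data/../data/users.csv", approved))          # True
print(is_approved("/project/data/forgotten_backup_2023.csv", approved))  # False
```

Resolving both sides means `data/users.csv`, `./data/users.csv`, and `data/../data/users.csv` all count as the same approved file.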

---

## 🎓 Advanced Usage

### Scan a Specific Directory

```bash
ds-scout scan --manifest manifest.yaml --path ./production_data
```

### Include Hidden Files

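By default, hidden and environment directories such as `.git` and `.venv` are skipped, while individual dotfiles such as `.env` are still checked against the detection rules. There is currently no command-line option to scan the skipped directories: to include them, modify the scanner source code or open an issue requesting the feature.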
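
For anyone modifying the scanner, directory skipping of this kind is commonly implemented by pruning `os.walk` in place. The following is a minimal sketch under that assumption, not the project's actual scanner: `SKIP_DIRS`, `walk_files`, and the `include_hidden` parameter are hypothetical.

```python
import os
import tempfile
from typing import Iterator

# Hypothetical skip list based on the directories named in this README.
SKIP_DIRS = {".git", "venv", ".venv", "__pycache__"}


def walk_files(root: str, include_hidden: bool = False) -> Iterator[str]:
    """Yield file paths under root, pruning skipped directories."""
    for current, dirs, files in os.walk(root):
        if not include_hidden:
            # Editing dirs in place stops os.walk from descending into them.
            dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for filename in files:
            yield os.path.join(current, filename)


# Demo: a file inside .git is skipped, a top-level .env file is not.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, ".git"))
    open(os.path.join(root, ".git", "config.json"), "w").close()
    open(os.path.join(root, ".env"), "w").close()
    found = sorted(os.path.relpath(p, root) for p in walk_files(root))
    print(found)  # ['.env']
```

Passing `include_hidden=True` in this sketch disables the pruning, so files inside `.git` would be yielded as well.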

### Custom Extensions (Future Feature)

```bash
# Coming soon
ds-scout scan --manifest manifest.yaml --path ./ --extensions .dat,.bin
```

---

## 📚 Examples

### Example 1: DevOps Pre-Deployment Check

**Scenario:** You're about to deploy a Docker container. Check for leaked credentials.

```bash
# Create manifest
cat > manifest.yaml << EOF
assets:
  - src/config.json
  - data/sample.csv
EOF

# Scan
ds-scout scan --manifest manifest.yaml --path ./

# Result: Finds forgotten .env file with database password
```

---

### Example 2: GDPR Compliance Audit

**Scenario:** Verify no unauthorized customer data exists before a compliance review.

```yaml
# manifest.yaml
assets:
  - database/customers_2024.db
  - exports/approved_report.xlsx
```

```bash
ds-scout scan --manifest manifest.yaml --path ./

# Finds: old_customer_backup.csv (not in manifest)
# Action: Securely delete it, or document it and add it to the manifest
```

---

### Example 3: Research Data Management

**Scenario:** Ensure all research datasets are properly documented.

```yaml
# manifest.yaml
assets:
  - experiments/trial_001.csv
  - experiments/trial_002.csv
```

```bash
ds-scout scan --manifest manifest.yaml --path ./experiments

# Finds: pilot_study_draft.csv (orphaned from previous researcher)
```

---

## 🏗️ Architecture

### Project Structure

```
datasovereign-scout/
├── src/
│   └── datasovereign_scout/
│       ├── __init__.py        # Package initializer
│       ├── cli.py             # Command-line interface
│       ├── manager.py         # Orchestrates scan logic
│       ├── manifest.py        # Manifest parser
│       ├── scanners.py        # Filesystem scanner
│       └── detector.py        # Detection rules
├── tests/                     # Unit tests
├── pyproject.toml             # Package configuration
├── README.md                  # This file
└── LICENSE                    # MIT License
```

### Key Components

| Component | Responsibility |
|-----------|----------------|
| `cli.py` | Argument parsing, user interaction |
| `manager.py` | Coordinates manifest + scanner |
| `scanners.py` | Walks filesystem, applies rules |
| `manifest.py` | Parses YAML manifest |
| `detector.py` | Detection algorithms |

---

## 🤝 Contributing

We welcome contributions! Here's how you can help:

### Ways to Contribute

- 🐛 Report bugs via [GitHub Issues](https://github.com/DrTMUSHACoder/py-datasovereign-scout/issues)
- 💡 Suggest features
- 📝 Improve documentation
- 🧪 Write tests
- 🔧 Submit pull requests

### Development Setup

```bash
# Clone the repo
git clone https://github.com/DrTMUSHACoder/py-datasovereign-scout.git
cd py-datasovereign-scout

# Install in editable mode
pip install -e .

# Run tests (when added)
pytest tests/
```

### Coding Standards

- Follow PEP 8
- Add type hints
- Write docstrings
- Include tests for new features

---

## 📄 License

This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.

```
MIT License

Copyright (c) 2026 Dr. T.M. Usha

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
```

---

## 📞 Questions or Feedback?

Feel free to reach out via the contact information at the top of this README or open an issue on GitHub!

---

## 🌟 Acknowledgments

- Built with Python 3.8+
- Powered by [Rich](https://github.com/Textualize/rich) for terminal output
- Inspired by real-world data governance challenges

---

## 🔗 Related Projects

- [detect-secrets](https://github.com/Yelp/detect-secrets) - Pre-commit hook for secret detection
- [truffleHog](https://github.com/trufflesecurity/truffleHog) - Searches for secrets in Git history
- [GitGuardian](https://www.gitguardian.com/) - Automated secret scanning

---

## 📊 Project Stats

![GitHub stars](https://img.shields.io/github/stars/DrTMUSHACoder/py-datasovereign-scout?style=social)
![GitHub forks](https://img.shields.io/github/forks/DrTMUSHACoder/py-datasovereign-scout?style=social)
![GitHub issues](https://img.shields.io/github/issues/DrTMUSHACoder/py-datasovereign-scout)

---

<div align="center">

**Made with ❤️ for data privacy and security**

If this tool helped you, please ⭐ star the repository!

</div>
