Metadata-Version: 2.4
Name: tablesleuth
Version: 0.5.0
Summary: TableSleuth - a Textual TUI for Open Table Format forensics (Iceberg, Delta Lake) with data profiling.
Project-URL: Homepage, https://tablesleuth.com
Project-URL: Repository, https://github.com/jamesbconner/TableSleuth
Project-URL: Documentation, https://github.com/jamesbconner/TableSleuth/tree/main/docs
Project-URL: Bug Reports, https://github.com/jamesbconner/TableSleuth/issues
Project-URL: Changelog, https://github.com/jamesbconner/TableSleuth/blob/main/CHANGELOG.md
Author-email: James Conner <jamesbconner@gmail.com>
License: Apache-2.0
License-File: LICENSE
Keywords: analytics,aws,cli,data-engineering,data-lake,data-profiling,data-quality,delta-lake,duckdb,iceberg,lakehouse,metadata,parquet,pyarrow,s3,s3tables,schema,terminal,textual,tui
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Database
Classifier: Topic :: File Formats
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Benchmark
Classifier: Topic :: System :: Distributed Computing
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: <3.15,>=3.13
Requires-Dist: adbc-driver-flightsql>=1.7.0
Requires-Dist: boto3>=1.35.0
Requires-Dist: click>=8.1.0
Requires-Dist: deltalake>=0.22.0
Requires-Dist: duckdb>=1.1.0
Requires-Dist: pandas>=2.3.0
Requires-Dist: pip>=24.0
Requires-Dist: pyarrow>=22.0.0
Requires-Dist: pydantic>=2.11.0
Requires-Dist: pyiceberg[s3fs,sql-sqlite]>=0.9.1
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: textual>=0.86.2
Requires-Dist: uv>=0.5.0
Provides-Extra: dev
Requires-Dist: bandit[toml]<2.0.0,>=1.8.6; extra == 'dev'
Requires-Dist: build<2.0.0,>=1.0.0; extra == 'dev'
Requires-Dist: hypothesis<7.0.0,>=6.100.0; extra == 'dev'
Requires-Dist: mypy<2.0.0,>=1.18.2; extra == 'dev'
Requires-Dist: pre-commit<5.0.0,>=4.4.0; extra == 'dev'
Requires-Dist: pytest-asyncio<1.0.0,>=0.26.0; extra == 'dev'
Requires-Dist: pytest-cov<7.0.0,>=6.0.0; extra == 'dev'
Requires-Dist: pytest<9.0.0,>=8.4.2; extra == 'dev'
Requires-Dist: ruff<0.15.0,>=0.14.4; extra == 'dev'
Requires-Dist: textual-dev<2.0.0,>=1.7.0; extra == 'dev'
Requires-Dist: twine<6.0.0,>=5.0.0; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Requires-Dist: types-toml; extra == 'dev'
Description-Content-Type: text/markdown

# TableSleuth

[![PyPI version](https://badge.fury.io/py/tablesleuth.svg)](https://badge.fury.io/py/tablesleuth)
[![Python versions](https://img.shields.io/pypi/pyversions/tablesleuth.svg)](https://pypi.org/project/tablesleuth/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![CI](https://github.com/jamesbconner/TableSleuth/workflows/CI/badge.svg)](https://github.com/jamesbconner/TableSleuth/actions)
[![Publish to PyPI](https://github.com/jamesbconner/TableSleuth/actions/workflows/publish.yml/badge.svg)](https://github.com/jamesbconner/TableSleuth/actions/workflows/publish.yml)
[![codecov](https://codecov.io/gh/jamesbconner/TableSleuth/graph/badge.svg?token=SXREVJC93E)](https://codecov.io/gh/jamesbconner/TableSleuth)

A terminal-based tool for deep inspection of Parquet files, Apache Iceberg tables, and Delta Lake tables. Analyze file structure, metadata, row groups, column statistics, and table evolution in a keyboard-driven TUI.

## Key Features

### Parquet Analysis
- **Deep File Inspection** - Comprehensive metadata extraction using PyArrow
- **Row Group Analysis** - Examine distribution, compression, and statistics
- **Column Profiling** - Profile data using GizmoSQL (DuckDB over Arrow Flight SQL)
- **Data Sampling** - Preview and filter data with column selection
- **Directory Scanning** - Recursively discover and inspect Parquet files

### Iceberg Table Analysis
- **Snapshot Navigation** - Browse table history and metadata evolution
- **Performance Testing** - Compare query performance across snapshots
- **Delete File Inspection** - Analyze MOR (Merge-on-Read) delete files
- **Schema Evolution** - Track schema changes over time
- **Catalog Support** - Local SQLite, AWS Glue, and AWS S3 Tables

### Delta Lake Analysis
- **Version History** - Navigate through Delta table versions and time travel
- **File Size Analysis** - Identify small file problems and optimization opportunities
- **Storage Waste** - Track tombstoned files and reclaimable storage
- **DML Forensics** - Analyze MERGE, UPDATE, DELETE operations and rewrite amplification
- **Z-Order Effectiveness** - Monitor data skipping and clustering degradation
- **Checkpoint Health** - Assess transaction log health and maintenance needs
- **Optimization Recommendations** - Get prioritized suggestions for OPTIMIZE, VACUUM, and ZORDER
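
To make the file size analysis idea concrete, here is a minimal sketch that flags small files from a list of data file sizes in bytes. The 128 MiB threshold, the `FileSizeSummary` class, and the `summarize_file_sizes` function are assumptions made for illustration; they are not TableSleuth's actual implementation or API.

```python
from dataclasses import dataclass

# Assumed cutoff for illustration only; the tool's real threshold may differ.
SMALL_FILE_BYTES = 128 * 1024 * 1024


@dataclass(frozen=True)
class FileSizeSummary:
    """Aggregate view of a table's data file sizes (hypothetical type)."""

    file_count: int
    small_file_count: int
    total_bytes: int

    @property
    def small_file_ratio(self) -> float:
        # Share of files below the threshold; 0.0 for an empty table.
        return self.small_file_count / self.file_count if self.file_count else 0.0


def summarize_file_sizes(
    sizes: list[int], threshold: int = SMALL_FILE_BYTES
) -> FileSizeSummary:
    """Count files smaller than `threshold` bytes and total the sizes."""
    small = sum(1 for size in sizes if size < threshold)
    return FileSizeSummary(len(sizes), small, sum(sizes))
```

A high `small_file_ratio` is the kind of signal that would typically motivate an OPTIMIZE (compaction) recommendation.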

### Interface
- **Interactive TUI** - Keyboard-driven navigation with rich visualizations
- **Multi-Source Support** - Local files, S3, Iceberg catalogs, and Delta tables
- **Performance Optimized** - Async operations, caching, and lazy loading

## Screenshots

### Parquet File Inspection

<table>
<tr>
<td width="50%">

**File Structure & Schema**
![Parquet Structure](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/parquet_structure.png)

</td>
<td width="50%">

**Row Group Analysis**
![Row Groups](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/parquet_row_groups.png)

</td>
</tr>
<tr>
<td width="50%">

**Data Sample View**
![Data Sample](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/parquet_data_sample.png)

</td>
<td width="50%">

**Column Profiling**
![Profile](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/parquet_profile.png)

</td>
</tr>
</table>

### Iceberg Table Analysis

<table>
<tr>
<td width="50%">

**Snapshot Overview**
![Iceberg Overview](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/iceberg_overview.png)

</td>
<td width="50%">

**Performance Testing**
![Performance](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/iceberg_performance_sample.png)

</td>
</tr>
<tr>
<td width="50%">

**Delete Files (MOR)**
![Deletes](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/iceberg_deletes.png)

</td>
<td width="50%">

**Snapshot Comparison**
![Compare](https://raw.githubusercontent.com/jamesbconner/TableSleuth/main/docs/images/iceberg_compare.png)

</td>
</tr>
</table>

## Quick Start

```bash
# Install from a source checkout with uv (recommended)
uv sync

# Inspect a Parquet file
tablesleuth parquet data/file.parquet

# Inspect a directory (recursive)
tablesleuth parquet data/warehouse/

# Inspect an Iceberg table
tablesleuth iceberg db.table --catalog local

# Inspect AWS S3 Tables (using ARN)
tablesleuth iceberg "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket/table/db.table"
```

**📚 Documentation:**
- **[Quick Start Guide](QUICKSTART.md)** - Get started with examples
- **[Setup Guide](TABLESLEUTH_SETUP.md)** - Complete installation and configuration
- **[User Guide](docs/USER_GUIDE.md)** - Comprehensive usage documentation

## Installation

**Requirements:** Python 3.13 or 3.14, plus [uv](https://docs.astral.sh/uv/) when installing from source

```bash
# Install from PyPI
pip install tablesleuth

# Or install from source
git clone https://github.com/jamesbconner/TableSleuth
cd TableSleuth
uv sync

# Verify installation
tablesleuth --version

# Initialize configuration files
tablesleuth init
```

See [TABLESLEUTH_SETUP.md](TABLESLEUTH_SETUP.md) for detailed setup including AWS, GizmoSQL, and catalog configuration.

## First-Time Configuration

```bash
# 1. Initialize configuration (first time only)
tablesleuth init

# 2. Edit configuration files
#    - tablesleuth.toml (main config)
#    - .pyiceberg.yaml (catalog config)

# 3. Verify configuration
tablesleuth config-check

# 4. Start inspecting files
tablesleuth parquet data/file.parquet
```

## Configuration

### Quick Setup

```bash
# Initialize configuration files with interactive prompts
tablesleuth init

# Check configuration and test connections
tablesleuth config-check
tablesleuth config-check -v  # Verbose output
```

### Configuration Files

**tablesleuth.toml** - Main configuration:

```toml
[catalog]
default = "local"  # Default Iceberg catalog

[gizmosql]
uri = "grpc+tls://localhost:31337"
username = "gizmosql_username"
password = "gizmosql_password"
tls_skip_verify = true
```

**Configuration Priority:**
1. Environment variables (`TABLESLEUTH_*`)
2. Local config files (`./tablesleuth.toml`, `./.pyiceberg.yaml`)
3. Home config files (`~/tablesleuth.toml`, `~/.pyiceberg.yaml`)
4. Built-in defaults

### Iceberg Catalogs

Configure PyIceberg in `.pyiceberg.yaml`:

```yaml
catalog:
  local:
    type: sql
    uri: sqlite:////path/to/catalog.db
    warehouse: file:///path/to/warehouse
```

**For detailed configuration:**
- **[Setup Guide](TABLESLEUTH_SETUP.md)** - All catalog types and AWS configuration
- **[GizmoSQL Deployment](docs/GIZMOSQL_DEPLOYMENT_GUIDE.md)** - Profiling backend setup

## Usage

### CLI Commands

```bash
# Configuration management
tablesleuth init                    # Initialize config files
tablesleuth config-check            # Validate configuration
tablesleuth config-check -v         # Detailed validation

# Inspect Parquet files
tablesleuth parquet file.parquet
tablesleuth parquet directory/
tablesleuth parquet s3://bucket/path/file.parquet

# Inspect Iceberg tables
tablesleuth iceberg db.table --catalog local
tablesleuth iceberg "arn:aws:s3tables:region:account:bucket/name/table/db.table"

# Launch Iceberg viewer
tablesleuth iceberg --catalog local --table db.table

# Inspect Delta Lake tables
tablesleuth delta path/to/delta/table
tablesleuth delta s3://bucket/path/to/delta/table
tablesleuth delta path/to/delta/table --version 5  # Time travel to version 5
```
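
The S3 Tables form of the `iceberg` command takes a table ARN shaped like `arn:aws:s3tables:<region>:<account>:bucket/<bucket>/table/<db.table>`. The sketch below shows one way such an ARN could be split into its parts. The `S3TablesRef` class and `parse_s3tables_arn` function are hypothetical names introduced here, the layout is inferred only from the examples in this README, and the sketch handles only the `aws` partition; TableSleuth's own parsing may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3TablesRef:
    """Components of an S3 Tables ARN (hypothetical helper type)."""

    region: str
    account_id: str
    bucket: str
    namespace: str
    table: str


def parse_s3tables_arn(arn: str) -> S3TablesRef:
    """Split an ARN of the form shown in this README into its parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "s3tables"]:
        raise ValueError(f"not an S3 Tables ARN: {arn!r}")
    region, account_id, resource = parts[3], parts[4], parts[5]

    # Resource layout assumed from the examples: bucket/<name>/table/<db.table>
    segments = resource.split("/")
    if len(segments) != 4 or segments[0] != "bucket" or segments[2] != "table":
        raise ValueError(f"unexpected resource layout: {resource!r}")

    namespace, sep, table = segments[3].partition(".")
    if not (sep and namespace and table):
        raise ValueError(f"expected '<db>.<table>', got {segments[3]!r}")
    return S3TablesRef(region, account_id, segments[1], namespace, table)


if __name__ == "__main__":
    ref = parse_s3tables_arn(
        "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket/table/db.table"
    )
    print(ref)
```

Validating the ARN before any AWS call lets a malformed identifier fail fast with a clear message instead of surfacing later as a catalog error.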

### TUI Navigation

| Key | Action |
|-----|--------|
| `q` | Quit |
| `r` | Refresh |
| `f` | Filter columns |
| `Tab` | Switch tabs |
| `↑/↓` | Navigate |
| `Enter` | Select |

See [User Guide](docs/USER_GUIDE.md) for complete keyboard shortcuts and features.

## Optional: GizmoSQL Profiling

Enable column profiling and performance testing with GizmoSQL (DuckDB over Arrow Flight SQL).

**Quick Setup:**
```bash
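# Install GizmoSQL (macOS ARM64 example)
# unzip cannot read an archive from stdin, so download to a file first
curl -L -o /tmp/gizmosql.zip \
  https://github.com/gizmodata/gizmosql/releases/download/v1.12.10/gizmosql_cli_macos_arm64.zip
sudo unzip -o /tmp/gizmosql.zip -d /usr/local/bin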

# Start server
gizmosql_server -P password -Q -T ~/.certs/cert0.pem ~/.certs/cert0.key
```

See [GizmoSQL Deployment Guide](docs/GIZMOSQL_DEPLOYMENT_GUIDE.md) for complete setup and EC2 deployment.

## Architecture

TableSleuth uses a layered architecture:

- **TUI Layer** - Textual-based terminal interface with rich visualizations
- **Service Layer** - Business logic for file inspection, profiling, and discovery
- **Integration Layer** - PyArrow for Parquet, PyIceberg for Iceberg tables, deltalake for Delta Lake tables, GizmoSQL for profiling

See [Architecture Guide](docs/ARCHITECTURE.md) for detailed technical documentation.

## Development

```bash
# Install with dev dependencies
uv sync --all-extras

# Run tests
pytest

# Run quality checks
uv run pre-commit run --all-files

# Type checking
mypy src/
```

See [Development Setup](DEVELOPMENT_SETUP.md) for complete development environment setup.

## Documentation

### Getting Started
- **[Quick Start](QUICKSTART.md)** - Examples and common workflows
- **[Setup Guide](TABLESLEUTH_SETUP.md)** - Installation and configuration
- **[User Guide](docs/USER_GUIDE.md)** - Complete feature documentation

### Advanced Topics
- **[Performance Profiling](docs/PERFORMANCE_PROFILING.md)** - Query performance analysis
- **[GizmoSQL Deployment](docs/GIZMOSQL_DEPLOYMENT_GUIDE.md)** - Profiling backend setup
- **[EC2 Deployment](docs/EC2_DEPLOYMENT_GUIDE.md)** - Automated AWS deployment

### Development
- **[Development Setup](DEVELOPMENT_SETUP.md)** - Dev environment and workflows
- **[Architecture](docs/ARCHITECTURE.md)** - System design and technical details
- **[Developer Guide](docs/DEVELOPER_GUIDE.md)** - API reference and contributing

## What's New

### v0.4.2
- 🎉 **Available on PyPI!** Install with `pip install tablesleuth`
- 🔄 Package renamed to `tablesleuth` for consistency
- 🤖 Automated CI/CD with GitHub Actions
- 📦 Enhanced PyPI metadata and publishing workflow
- 🐛 Bug fixes and stability improvements
- ✅ All features from v0.4.0 and v0.3.0

### v0.4.0
- 🎉 PyPI release
- 🔄 Package renamed to `tablesleuth`
- 🤖 Automated CI/CD with GitHub Actions
- 📦 Enhanced PyPI metadata and publishing workflow

### v0.3.0
- ✅ Parquet file inspection (local and S3)
- ✅ Iceberg snapshot navigation and analysis
- ✅ Delete file inspection and MOR forensics
- ✅ Snapshot comparison and performance testing
- ✅ Column profiling with GizmoSQL
- ✅ AWS Glue and S3 Tables catalog support
- ✅ Interactive TUI with rich visualizations
- ✅ Delta Lake version history and forensics
- ✅ Storage waste analysis and optimization recommendations
- ✅ DML operation forensics and rewrite amplification tracking

### Roadmap
- Apache Hudi support
- Schema evolution visualization
- Export capabilities (JSON, CSV reports)
- REST catalog support
- Advanced partition analysis

## Contributing

Contributions welcome! See [Developer Guide](docs/DEVELOPER_GUIDE.md) and [Development Setup](DEVELOPMENT_SETUP.md).

## License

Apache License 2.0 - See [LICENSE](LICENSE) for details.

## Support

- **Issues & Features:** [GitHub Issues](https://github.com/jamesbconner/TableSleuth/issues)
- **Documentation:** See [docs/](docs/) directory
- **Changelog:** [CHANGELOG.md](CHANGELOG.md)
