Metadata-Version: 2.4
Name: tablesleuth
Version: 0.4.2
Summary: TableSleuth - a Textual TUI for Open Table Format (OTF) forensics with data profiling.
Project-URL: Homepage, https://tablesleuth.com
Project-URL: Repository, https://github.com/jamesbconner/tablesleuth
Project-URL: Documentation, https://github.com/jamesbconner/tablesleuth/tree/main/docs
Project-URL: Bug Reports, https://github.com/jamesbconner/tablesleuth/issues
Project-URL: Changelog, https://github.com/jamesbconner/tablesleuth/blob/main/CHANGELOG.md
Author-email: James Conner <jamesbconner@gmail.com>
License: Apache-2.0
License-File: LICENSE
Keywords: analytics,aws,cli,data-engineering,data-lake,data-profiling,data-quality,duckdb,iceberg,lakehouse,metadata,parquet,pyarrow,s3,s3tables,schema,terminal,textual,tui
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Database
Classifier: Topic :: File Formats
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Benchmark
Classifier: Topic :: System :: Distributed Computing
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: <3.15,>=3.13
Requires-Dist: adbc-driver-flightsql>=1.7.0
Requires-Dist: boto3>=1.35.0
Requires-Dist: click>=8.1.0
Requires-Dist: duckdb>=1.1.0
Requires-Dist: pandas>=2.3.0
Requires-Dist: pip>=24.0
Requires-Dist: pyarrow>=22.0.0
Requires-Dist: pydantic>=2.11.0
Requires-Dist: pyiceberg[s3fs,sql-sqlite]>=0.9.1
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: textual>=0.86.2
Requires-Dist: uv>=0.5.0
Provides-Extra: dev
Requires-Dist: bandit[toml]<2.0.0,>=1.8.6; extra == 'dev'
Requires-Dist: build<2.0.0,>=1.0.0; extra == 'dev'
Requires-Dist: mypy<2.0.0,>=1.18.2; extra == 'dev'
Requires-Dist: pre-commit<5.0.0,>=4.4.0; extra == 'dev'
Requires-Dist: pytest-asyncio<1.0.0,>=0.26.0; extra == 'dev'
Requires-Dist: pytest-cov<7.0.0,>=6.0.0; extra == 'dev'
Requires-Dist: pytest<9.0.0,>=8.4.2; extra == 'dev'
Requires-Dist: ruff<0.15.0,>=0.14.4; extra == 'dev'
Requires-Dist: textual-dev<2.0.0,>=1.7.0; extra == 'dev'
Requires-Dist: twine<6.0.0,>=5.0.0; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Requires-Dist: types-toml; extra == 'dev'
Description-Content-Type: text/markdown

# TableSleuth

[![PyPI version](https://badge.fury.io/py/tablesleuth.svg)](https://badge.fury.io/py/tablesleuth)
[![Python versions](https://img.shields.io/pypi/pyversions/tablesleuth.svg)](https://pypi.org/project/tablesleuth/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![CI](https://github.com/jamesbconner/tablesleuth/workflows/CI/badge.svg)](https://github.com/jamesbconner/tablesleuth/actions)
[![Publish to PyPI](https://github.com/jamesbconner/TableSleuth/actions/workflows/publish.yml/badge.svg)](https://github.com/jamesbconner/TableSleuth/actions/workflows/publish.yml)

A terminal-based tool for deep inspection of Parquet files and Apache Iceberg tables. Analyze file structure, metadata, row groups, column statistics, and table evolution from a keyboard-driven TUI.

## Key Features

### Parquet Analysis
- **Deep File Inspection** - Comprehensive metadata extraction using PyArrow
- **Row Group Analysis** - Examine distribution, compression, and statistics
- **Column Profiling** - Profile data using GizmoSQL (DuckDB over Arrow Flight SQL)
- **Data Sampling** - Preview and filter data with column selection
- **Directory Scanning** - Recursively discover and inspect Parquet files

### Iceberg Table Analysis
- **Snapshot Navigation** - Browse table history and metadata evolution
- **Performance Testing** - Compare query performance across snapshots
- **Delete File Inspection** - Analyze MOR (Merge-on-Read) delete files
- **Schema Evolution** - Track schema changes over time
- **Catalog Support** - Local SQLite, AWS Glue, and AWS S3 Tables

### Interface
- **Interactive TUI** - Keyboard-driven navigation with rich visualizations
- **Multi-Source Support** - Local files, S3, and Iceberg catalogs
- **Performance Optimized** - Async operations, caching, and lazy loading

## Screenshots

### Parquet File Inspection

<table>
<tr>
<td width="50%">

**File Structure & Schema**
![Parquet Structure](docs/images/parquet_structure.png)

</td>
<td width="50%">

**Row Group Analysis**
![Row Groups](docs/images/parquet_row_groups.png)

</td>
</tr>
<tr>
<td width="50%">

**Data Sample View**
![Data Sample](docs/images/parquet_data_sample.png)

</td>
<td width="50%">

**Column Profiling**
![Profile](docs/images/parquet_profile.png)

</td>
</tr>
</table>

### Iceberg Table Analysis

<table>
<tr>
<td width="50%">

**Snapshot Overview**
![Iceberg Overview](docs/images/iceberg_overview.png)

</td>
<td width="50%">

**Performance Testing**
![Performance](docs/images/iceberg_performance_sample.png)

</td>
</tr>
<tr>
<td width="50%">

**Delete Files (MOR)**
![Deletes](docs/images/iceberg_deletes.png)

</td>
<td width="50%">

**Snapshot Comparison**
![Compare](docs/images/iceberg_compare.png)

</td>
</tr>
</table>

## Quick Start

```bash
# Install from a source checkout with uv (or run: pip install tablesleuth)
uv sync

# Inspect a Parquet file
tablesleuth inspect data/file.parquet

# Inspect a directory (recursive)
tablesleuth inspect data/warehouse/

# Inspect an Iceberg table
tablesleuth inspect db.table --catalog local

# Inspect AWS S3 Tables (using ARN)
tablesleuth inspect "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket/table/db.table"
```

**📚 Documentation:**
- **[Quick Start Guide](QUICKSTART.md)** - Get started with examples
- **[Setup Guide](TABLESLEUTH_SETUP.md)** - Complete installation and configuration
- **[User Guide](docs/USER_GUIDE.md)** - Comprehensive usage documentation

## Installation

**Requirements:** Python 3.13 or 3.14, plus [uv](https://docs.astral.sh/uv/) when installing from source

```bash
# Install from PyPI
pip install tablesleuth

# Or install from source
git clone https://github.com/jamesbconner/TableSleuth
cd TableSleuth
uv sync

# Verify installation
tablesleuth --version

# Initialize configuration files
tablesleuth init
```

See [TABLESLEUTH_SETUP.md](TABLESLEUTH_SETUP.md) for detailed setup including AWS, GizmoSQL, and catalog configuration.

## First Run

```bash
# 1. Initialize configuration (first time only)
tablesleuth init

# 2. Edit configuration files
#    - tablesleuth.toml (main config)
#    - .pyiceberg.yaml (catalog config)

# 3. Verify configuration
tablesleuth config-check

# 4. Start inspecting files
tablesleuth inspect data/file.parquet
```

## Configuration

### Quick Setup

```bash
# Initialize configuration files with interactive prompts
tablesleuth init

# Check configuration and test connections
tablesleuth config-check
tablesleuth config-check -v  # Verbose output
```

### Configuration Files

**tablesleuth.toml** - Main configuration:

```toml
[catalog]
default = "local"  # Default Iceberg catalog

[gizmosql]
uri = "grpc+tls://localhost:31337"
username = "gizmosql_username"
password = "gizmosql_password"
tls_skip_verify = true
```

**Configuration Priority:**
1. Environment variables (`TABLESLEUTH_*`)
2. Local config files (`./tablesleuth.toml`, `./.pyiceberg.yaml`)
3. Home config files (`~/tablesleuth.toml`, `~/.pyiceberg.yaml`)
4. Built-in defaults

### Iceberg Catalogs

Configure PyIceberg in `.pyiceberg.yaml`:

```yaml
catalog:
  local:
    type: sql
    uri: sqlite:////path/to/catalog.db
    warehouse: file:///path/to/warehouse
```
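
The four slashes in the `uri` value are easy to get wrong: `sqlite:///` is the URL prefix, and an absolute POSIX path contributes the fourth slash. The sketch below builds both values from plain paths. The helper names are hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def sqlite_catalog_uri(db_path: str) -> str:
    """Build a SQLite URL for an absolute POSIX path to the catalog database."""
    path = PurePosixPath(db_path)
    if not path.is_absolute():
        raise ValueError(f"catalog path must be absolute: {db_path!r}")
    # "sqlite:///" is the prefix; the path's leading "/" is the fourth slash.
    return f"sqlite:///{path}"


def warehouse_uri(directory: str) -> str:
    """Build a file:// URI for an absolute POSIX warehouse directory."""
    path = PurePosixPath(directory)
    if not path.is_absolute():
        raise ValueError(f"warehouse path must be absolute: {directory!r}")
    # Percent-encode characters such as spaces; "/" is left as-is.
    return "file://" + quote(str(path))


print(sqlite_catalog_uri("/path/to/catalog.db"))  # sqlite:////path/to/catalog.db
print(warehouse_uri("/path/to/warehouse"))        # file:///path/to/warehouse
```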

**For detailed configuration:**
- **[Setup Guide](TABLESLEUTH_SETUP.md)** - All catalog types and AWS configuration
- **[GizmoSQL Deployment](docs/GIZMOSQL_DEPLOYMENT_GUIDE.md)** - Profiling backend setup

## Usage

### CLI Commands

```bash
# Configuration management
tablesleuth init                    # Initialize config files
tablesleuth config-check            # Validate configuration
tablesleuth config-check -v         # Detailed validation

# Inspect Parquet files
tablesleuth inspect file.parquet
tablesleuth inspect directory/
tablesleuth inspect s3://bucket/path/file.parquet

# Inspect Iceberg tables
tablesleuth inspect db.table --catalog local
tablesleuth inspect "arn:aws:s3tables:region:account:bucket/name/table/db.table"

# Launch Iceberg viewer
tablesleuth iceberg --catalog local --table db.table
```
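
The S3 Tables ARN shown above packs the region, account, bucket, and table identifier into one string. As a rough sketch of how such an ARN could be split apart, assuming the `bucket/<name>/table/<namespace>.<table>` layout from the examples (the class and function names are hypothetical, not TableSleuth's API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3TablesTarget:
    region: str
    account_id: str
    bucket: str
    namespace: str
    table: str


def parse_s3tables_arn(arn: str) -> S3TablesTarget:
    """Split an S3 Tables ARN of the form used in the examples above."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "s3tables"]:
        raise ValueError(f"not an S3 Tables ARN: {arn!r}")
    region, account_id, resource = parts[3], parts[4], parts[5]

    # Resource layout assumed from the examples: bucket/<name>/table/<db.table>
    segments = resource.split("/")
    if len(segments) != 4 or segments[0] != "bucket" or segments[2] != "table":
        raise ValueError(f"unexpected resource layout: {resource!r}")

    # The last dot separates the namespace from the table name.
    namespace, dot, table = segments[3].rpartition(".")
    if not dot or not namespace or not table:
        raise ValueError(f"expected <namespace>.<table>, got {segments[3]!r}")
    return S3TablesTarget(region, account_id, segments[1], namespace, table)


target = parse_s3tables_arn(
    "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket/table/db.table"
)
print(target.bucket, target.namespace, target.table)  # my-bucket db table
```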

### TUI Navigation

| Key | Action |
|-----|--------|
| `q` | Quit |
| `r` | Refresh |
| `f` | Filter columns |
| `Tab` | Switch tabs |
| `↑/↓` | Navigate |
| `Enter` | Select |

See [User Guide](docs/USER_GUIDE.md) for complete keyboard shortcuts and features.

## Optional: GizmoSQL Profiling

Enable column profiling and performance testing with GizmoSQL (DuckDB over Arrow Flight SQL).

**Quick Setup:**
```bash
# Install GizmoSQL (macOS ARM64 example)
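# Download to a file first; Info-ZIP unzip cannot read an archive from stdin
curl -L -o /tmp/gizmosql_cli.zip \
  https://github.com/gizmodata/gizmosql/releases/download/v1.12.10/gizmosql_cli_macos_arm64.zip
sudo unzip -o /tmp/gizmosql_cli.zip -d /usr/local/bin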

# Start server
gizmosql_server -P password -Q -T ~/.certs/cert0.pem ~/.certs/cert0.key
```

See [GizmoSQL Deployment Guide](docs/GIZMOSQL_DEPLOYMENT_GUIDE.md) for complete setup and EC2 deployment.

## Architecture

TableSleuth uses a layered architecture:

- **TUI Layer** - Textual-based terminal interface with rich visualizations
- **Service Layer** - Business logic for file inspection, profiling, and discovery
- **Integration Layer** - PyArrow for Parquet, PyIceberg for tables, GizmoSQL for profiling
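
To make the layering concrete, here is a minimal sketch of how the three layers could relate. All class and function names are hypothetical and do not reflect TableSleuth's real modules. The point is only that the service depends on an interface, so the UI and the PyArrow-backed reader can change independently.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class FileSummary:
    path: str
    num_rows: int
    num_row_groups: int


class ParquetReader(Protocol):
    """Integration layer: anything that can read Parquet file metadata."""

    def read_summary(self, path: str) -> FileSummary: ...


class InspectionService:
    """Service layer: logic that knows nothing about the UI or PyArrow."""

    def __init__(self, reader: ParquetReader) -> None:
        self._reader = reader

    def describe(self, path: str) -> str:
        s = self._reader.read_summary(path)
        return f"{s.path}: {s.num_rows} rows in {s.num_row_groups} row groups"


class FakeReader:
    """Stand-in for a PyArrow-backed reader, returning fixed values."""

    def read_summary(self, path: str) -> FileSummary:
        return FileSummary(path=path, num_rows=1000, num_row_groups=4)


def render(service: InspectionService, path: str) -> None:
    """TUI layer stand-in: only formats what the service returns."""
    print(service.describe(path))


render(InspectionService(FakeReader()), "data/file.parquet")
# data/file.parquet: 1000 rows in 4 row groups
```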

See [Architecture Guide](docs/ARCHITECTURE.md) for detailed technical documentation.

## Development

```bash
# Install with dev dependencies
uv sync --all-extras

# Run tests
pytest

# Run quality checks
uv run pre-commit run --all-files

# Type checking
mypy src/
```

See [Development Setup](DEVELOPMENT_SETUP.md) for complete development environment setup.

## Documentation

### Getting Started
- **[Quick Start](QUICKSTART.md)** - Examples and common workflows
- **[Setup Guide](TABLESLEUTH_SETUP.md)** - Installation and configuration
- **[User Guide](docs/USER_GUIDE.md)** - Complete feature documentation

### Advanced Topics
- **[Performance Profiling](docs/PERFORMANCE_PROFILING.md)** - Query performance analysis
- **[GizmoSQL Deployment](docs/GIZMOSQL_DEPLOYMENT_GUIDE.md)** - Profiling backend setup
- **[EC2 Deployment](docs/EC2_DEPLOYMENT_GUIDE.md)** - Automated AWS deployment

### Development
- **[Development Setup](DEVELOPMENT_SETUP.md)** - Dev environment and workflows
- **[Architecture](docs/ARCHITECTURE.md)** - System design and technical details
- **[Developer Guide](docs/DEVELOPER_GUIDE.md)** - API reference and contributing

## What's New

### v0.4.0
- 🎉 **Now available on PyPI!** Install with `pip install tablesleuth`
- 🔄 Package renamed to `tablesleuth` for consistency
- 🤖 Automated CI/CD with GitHub Actions
- 📦 Enhanced PyPI metadata and publishing workflow
- ✅ All existing features from v0.3.0

### v0.3.0
- ✅ Parquet file inspection (local and S3)
- ✅ Iceberg snapshot navigation and analysis
- ✅ Delete file inspection and MOR forensics
- ✅ Snapshot comparison and performance testing
- ✅ Column profiling with GizmoSQL
- ✅ AWS Glue and S3 Tables catalog support
- ✅ Interactive TUI with rich visualizations

### Roadmap
- Delta Lake and Hudi support
- Schema evolution visualization
- Export capabilities (JSON, CSV reports)
- REST catalog support
- Automated optimization recommendations

## Contributing

Contributions welcome! See [Developer Guide](docs/DEVELOPER_GUIDE.md) and [Development Setup](DEVELOPMENT_SETUP.md).

## License

Apache License 2.0 - See [LICENSE](LICENSE) for details.

## Support

- **Issues & Features:** [GitHub Issues](https://github.com/jamesbconner/TableSleuth/issues)
- **Documentation:** See [docs/](docs/) directory
- **Changelog:** [CHANGELOG.md](CHANGELOG.md)
