Metadata-Version: 2.3
Name: gittxt
Version: 1.5.0
Summary: Gittxt: Get text from Git repositories in AI-ready formats
License: MIT
Keywords: git,text-extraction,cli-tool,AI,LLM,NLP,repository,machine-learning
Author: Sandeep Paidipati
Author-email: sandeep.paidipati@gmail.com
Requires-Python: >=3.8
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Software Development :: Version Control :: Git
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Dist: binaryornot (>=0.4.4)
Requires-Dist: click (>=8.1.3)
Requires-Dist: colorama (>=0.4.6)
Requires-Dist: gitpython (>=3.1.40)
Requires-Dist: python-dotenv (>=1.0.0)
Requires-Dist: tqdm (>=4.66.0)
Project-URL: Homepage, https://github.com/sandy-sp/gittxt
Project-URL: Repository, https://github.com/sandy-sp/gittxt
Description-Content-Type: text/markdown

> 🚀 **LLM Dataset Extractor from GitHub Repos** | AI & NLP-ready text pipelines

# 📝 Gittxt: Get text from Git repositories in AI-ready formats.

[![Python Version](https://img.shields.io/badge/python-≥3.8-blue)](pyproject.toml)
[![PyPI version](https://badge.fury.io/py/gittxt.svg)](https://pypi.org/project/gittxt/)
[![Release](https://img.shields.io/github/release/sandy-sp/gittxt.svg)](https://github.com/sandy-sp/gittxt/releases)
[![Tested with Pytest](https://img.shields.io/badge/tested%20with-pytest-9cf.svg)](https://docs.pytest.org/en/stable/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/gittxt)](https://pypi.org/project/gittxt/)
![GitHub repo size](https://img.shields.io/github/repo-size/sandy-sp/gittxt)
![GitHub top language](https://img.shields.io/github/languages/top/sandy-sp/gittxt)
[![Build Status](https://github.com/sandy-sp/gittxt/actions/workflows/release.yml/badge.svg)](https://github.com/sandy-sp/gittxt/actions)
[![Made for LLMs](https://img.shields.io/badge/LLM%20ready-Yes-brightgreen)](https://github.com/sandy-sp/gittxt)
[![Linted with Ruff](https://img.shields.io/badge/linter-ruff-%23007ACC.svg)](https://github.com/charliermarsh/ruff)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

---

## ✨ What is Gittxt?

**Gittxt** is a developer-focused CLI tool that extracts AI-ready text from **Git repositories**. Whether you're preparing datasets for **AI models**, **NLP pipelines**, or **LLM fine-tuning**, Gittxt automates the tedious task of repository scanning and text conversion.

Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:
- Preparing **training data for LLMs** (e.g., ChatGPT, Claude, Mistral)
- **Documentation extraction** for knowledge bases
- **Code summarization** pipelines
- **Repository analysis** for machine learning workflows

---

## 🚀 Features

- ✅ **Dynamic File-Type Filtering** (`--file-types=code,docs,images,csv,media,all`)
- ✅ **Automatic Tree Generation** with clean filtering (excludes `.git/`, `__pycache__`, etc.)
- ✅ **Multiple Output Formats**: TXT, JSON, Markdown
- ✅ **Optional ZIP Packaging** for non-text assets
- ✅ **CLI-friendly Progress Bars**
- ✅ **Built-in Summary Reports** (`--summary`)
- ✅ **Interactive & CI-ready Modes** (`--non-interactive`)
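
To illustrate the idea behind file-type filtering, here is a minimal Python sketch. The extension table and the `classify` and `select` helpers are hypothetical and exist only for this example; Gittxt's actual mapping and internals may differ.

```python
from pathlib import Path

# Hypothetical extension-to-category table; Gittxt's real mapping may differ.
CATEGORY_BY_EXTENSION = {
    ".py": "code", ".js": "code", ".go": "code",
    ".md": "docs", ".rst": "docs", ".txt": "docs",
    ".png": "images", ".jpg": "images", ".svg": "images",
    ".csv": "csv",
    ".mp4": "media", ".mp3": "media",
}


def classify(path):
    """Return the category for a path, or None if the extension is unknown."""
    return CATEGORY_BY_EXTENSION.get(Path(path).suffix.lower())


def select(paths, file_types):
    """Keep paths whose category appears in a comma-separated type string."""
    wanted = {t.strip() for t in file_types.split(",") if t.strip()}
    if "all" in wanted:
        return list(paths)
    return [p for p in paths if classify(p) in wanted]


print(select(["app.py", "README.md", "logo.png"], "code,docs"))
```

In this sketch, `all` short-circuits the filter and keeps every path, including files whose extension is not in the table.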

---

## 🏗️ Installation

### 📦 Using Poetry
```bash
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install
```

### 🐍 Using pip (stable)
```bash
pip install gittxt
```

---

## ⚙️ Quickstart Example

```bash
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --file-types code,docs --summary
```

👉 This will:
- Scan a GitHub repository
- Extract code & docs files
- Write `.txt` and `.json` outputs
- Show a summary report

---

## 🖥️ CLI Usage

```bash
gittxt scan [REPOS]... [OPTIONS]

Options:
  --include TEXT        Include patterns (e.g., *.py)
  --exclude TEXT        Exclude patterns (e.g., tests/, node_modules)
  --size-limit INTEGER  Max file size in bytes
  --branch TEXT         Specify branch (for GitHub URLs)
  --file-types TEXT     code, docs, images, csv, media, all
  --output-format TEXT  txt, json, md, or comma-separated list
  --output-dir PATH     Custom output directory
  --summary             Show post-scan summary
  --non-interactive     Skip prompts for CI/CD workflows
  --progress            Enable scan progress bars
  --debug               Enable debug logs
  --help                Show this message and exit
```
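
As a rough illustration of how `--include`, `--exclude`, and `--size-limit` could combine, the sketch below applies the three checks to a single file. The `is_selected` function and its matching rules are assumptions made for this example (exclusion is checked first, and directory patterns match any path component); Gittxt's actual precedence and pattern semantics may differ.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_selected(rel_path, size, include=(), exclude=(), size_limit=None):
    """Decide whether a file passes the filters (assumed semantics)."""
    parts = PurePosixPath(rel_path).parts
    name = parts[-1]

    # Exclude wins: a pattern such as "tests/" or "node_modules" matches any
    # path component, and a glob such as "*.log" matches the file name.
    for pattern in exclude:
        if pattern.rstrip("/") in parts or fnmatch(name, pattern):
            return False

    # If include patterns are given, the file name must match at least one.
    if include and not any(fnmatch(name, p) for p in include):
        return False

    # The size limit is in bytes, as in the option description.
    if size_limit is not None and size > size_limit:
        return False
    return True


print(is_selected("src/app.py", 1200, include=["*.py"], exclude=["tests/"]))
print(is_selected("tests/test_app.py", 800, include=["*.py"], exclude=["tests/"]))
```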

---

## 📂 Output Structure

```
<output_dir>/
├── text/
│   └── repo-name.txt
├── json/
│   └── repo-name.json
├── md/
│   └── repo-name.md
└── zips/
    └── repo-name_bundle.zip  # Optional ZIP for assets (images, csv, etc.)
```
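
For scripts that post-process the results, the layout above can be turned into paths with `pathlib`. The `output_paths` and `zip_path` helpers are hypothetical and simply mirror the tree shown; they are not part of Gittxt.

```python
from pathlib import Path

# Subdirectory used for each output format, following the tree above.
SUBDIR_BY_FORMAT = {"txt": "text", "json": "json", "md": "md"}


def output_paths(output_dir, repo_name, formats):
    """Map each requested format to the file it would be written to."""
    base = Path(output_dir)
    return {
        fmt: base / SUBDIR_BY_FORMAT[fmt] / f"{repo_name}.{fmt}"
        for fmt in formats
    }


def zip_path(output_dir, repo_name):
    """Location of the optional asset bundle."""
    return Path(output_dir) / "zips" / f"{repo_name}_bundle.zip"


for fmt, path in output_paths("out", "repo-name", ["txt", "json"]).items():
    print(fmt, path)
print(zip_path("out", "repo-name"))
```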

---

## 🛠 How It Works

1. 🔗 Clone a GitHub repo or read a local one (supports branch/subdir URLs)
2. 🌳 Dynamically generate directory tree (excluding `.git`, `__pycache__`, etc.)
3. 🗂️ Filter files based on type (code, docs, images, csv, media)
4. 📝 Generate formatted outputs (TXT, JSON, MD)
5. 📦 Package assets (optional ZIP for non-text)
6. 🧹 Clean up temporary files (cache-free design)
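
The tree-generation step (step 2) can be sketched in a few lines of Python. This is an illustrative approximation, not Gittxt's implementation: `build_tree` is a hypothetical helper, and the exclusion set contains only the two directories named above, since the full list is not given here.

```python
import os

# Only the directories named in this README; the real list is likely longer.
EXCLUDED_DIRS = {".git", "__pycache__"}


def build_tree(root, prefix=""):
    """Return the directory tree under root as a list of display lines."""
    lines = []
    entries = sorted(e for e in os.listdir(root) if e not in EXCLUDED_DIRS)
    for index, name in enumerate(entries):
        last = index == len(entries) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        path = os.path.join(root, name)
        # Recurse into real directories only, so symlink loops are avoided.
        if os.path.isdir(path) and not os.path.islink(path):
            lines.extend(build_tree(path, prefix + ("    " if last else "│   ")))
    return lines


print("\n".join(build_tree(".")))
```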

---

## 📊 Example Summary Output

```
📊 Summary Report:
 - Total files processed: 45
 - Output formats: txt, json
 - File type breakdown: {'code': 31, 'docs': 14}
```
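
A report of this shape can be produced by counting files per category. The sketch below assumes each processed file has already been labelled with a category; `summarize` and `format_report` are hypothetical helpers, not Gittxt functions.

```python
from collections import Counter


def summarize(categories, output_formats):
    """Aggregate per-file category labels into a summary dictionary."""
    breakdown = Counter(categories)
    return {
        "total_files": sum(breakdown.values()),
        "output_formats": list(output_formats),
        "file_type_breakdown": dict(breakdown),
    }


def format_report(summary):
    """Render the summary in the same layout as the example above."""
    return "\n".join([
        "📊 Summary Report:",
        f" - Total files processed: {summary['total_files']}",
        f" - Output formats: {', '.join(summary['output_formats'])}",
        f" - File type breakdown: {summary['file_type_breakdown']}",
    ])


labels = ["code"] * 31 + ["docs"] * 14
print(format_report(summarize(labels, ["txt", "json"])))
```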

---

## 🔐 Security Policy
Please report security issues to: **sandeep.paidipati@gmail.com**  
[View Security Policy](docs/SECURITY.md)

---

## 🤝 Contributing
We welcome community contributions!  
- [Contributing Guidelines](docs/CONTRIBUTING.md)  
- [Code of Conduct](docs/CODE_OF_CONDUCT.md)  
- [Open an Issue](https://github.com/sandy-sp/gittxt/issues/new/choose)

---

## 🛣️ Roadmap
- FastAPI-powered web UI
- AI-powered summaries (GPT/OpenAI integration)
- Support YAML/CSV as additional output formats
- Async file scanning (speed boost)

---

## 📄 License
MIT License © [Sandeep Paidipati](https://github.com/sandy-sp)

---

**Gittxt: Get text from Git repositories in AI-ready formats.**

---
