Metadata-Version: 2.4
Name: directory-extractor
Version: 0.1.1
Summary: Secure and extensible directory extraction pipeline with filtering and guardrails
Author-email: Siddharth Singh <siddharthwolverine@email.com>
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: uvicorn
Requires-Dist: fastapi
Requires-Dist: pydantic
Requires-Dist: starlette
Provides-Extra: api
Requires-Dist: uvicorn; extra == "api"
Dynamic: license-file

# 📂 Directory Extraction Mechanism Using Python

> 🚀 This is my **first publicly published Python package on PyPI**.
> I built this project to explore secure filesystem handling, clean architecture, and reusable backend utilities.
> I truly appreciate any feedback, suggestions, improvements, or contributions from the community!

---

## 📦 Install from PyPI

Install the package directly from PyPI:

```bash
pip install directory-extractor
```

To run the FastAPI server, install the `api` extra and start it with the CLI entry point (if the CLI is enabled in your installation):

```bash
pip install "directory-extractor[api]"
directory-extractor-api
```

---

# 🌟 About The Project

A **robust, secure, and flexible directory scanning pipeline** built with FastAPI that extracts files from a directory using multiple filtering strategies.

It supports:

* ✅ Multiple selection modes (all / by types / by names / by patterns)
* ✅ Exclusion filters
* ✅ Safety guardrails (size, date range, result limit)
* ✅ Secure path validation (prevents directory traversal)
* ✅ Clean structured JSON response with detailed stats

This project was designed with **security, clarity, and extensibility** in mind.

---

# 🚀 Features

## 🔎 1. Multiple Selection Modes

You can extract files using different strategies:

| Mode          | Description                                     |
| ------------- | ----------------------------------------------- |
| `all`         | Return all files recursively                    |
| `by_types`    | Filter by file extensions (pdf, md, docx, etc.) |
| `by_names`    | Select exact filenames or relative paths        |
| `by_patterns` | Use glob patterns like `**/*.md`                |

---

## 🛡 2. Built-in Guardrails

After selection and exclusion, files are validated against:

* 📏 Maximum file size (`max_file_size_mb`)
* 📅 Modification time window (`modified_after`, `modified_before`)
* 🔢 Maximum number of results (`limit`)
* ❌ Files deleted during processing (tracked safely)

---

## 🔐 3. Security First

* Resolves paths with `Path.resolve()` before validation to block directory traversal (`..`) attacks
* Rejects absolute glob patterns
* Ensures all files remain inside the target directory
* Avoids symlink loops
* UTC-normalized time comparisons
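
The containment and glob checks listed above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the helper names `is_inside` and `validate_glob` are hypothetical, and rejecting `..` segments in patterns is an extra assumption on top of the "no absolute globs" rule.

```python
from pathlib import Path


def is_inside(root: Path, candidate: Path) -> bool:
    """Return True if candidate resolves to a location under root.

    Hypothetical helper: resolve() collapses '..' segments and follows
    symlinks, so the comparison is made on real filesystem locations.
    """
    resolved_root = root.resolve()
    resolved = candidate.resolve()
    return resolved.is_relative_to(resolved_root)


def validate_glob(pattern: str) -> str:
    """Reject absolute patterns and patterns containing '..' (assumed rule)."""
    as_path = Path(pattern)
    if pattern.startswith(("/", "\\")) or as_path.is_absolute():
        raise ValueError(f"absolute glob pattern not allowed: {pattern!r}")
    if ".." in as_path.parts:
        raise ValueError(f"glob pattern may not climb upwards: {pattern!r}")
    return pattern
```

Resolving both sides before comparing matters: a plain string prefix check would accept `root/../secret`, while the resolved comparison does not.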

---

# 🏗 Project Structure

```
src/
└── directory_extractor/
    ├── api.py        # FastAPI router & pipeline orchestration
    ├── schemas.py    # Pydantic models & enums
    └── utils.py      # Core file collection & filtering logic
```

### File Responsibilities

* **`api.py`** → Exposes the FastAPI endpoint and orchestrates the 4-step pipeline
* **`schemas.py`** → Handles request validation and enum definitions
* **`utils.py`** → Contains reusable filesystem logic and guardrails

The separation ensures the file-handling logic remains reusable and testable.

---

# ⚙️ How the Pipeline Works

The extraction follows a **4-step processing pipeline**:

```
1️⃣ SELECT files (based on mode)
2️⃣ APPLY EXCLUSIONS
3️⃣ APPLY GUARDRAILS
4️⃣ FORMAT RESPONSE
```

### Step 1 — Selection

Depending on `mode`:

* `all` → Collect everything recursively
* `by_types` → Filter by extensions
* `by_names` → Match exact filenames or relative paths
* `by_patterns` → Match glob patterns
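
As a rough sketch of how the four modes could map onto `pathlib`, assuming a hypothetical `SelectionMode` enum and `select_files` helper (the real names and signatures in `schemas.py` and `utils.py` may differ):

```python
from enum import Enum
from pathlib import Path


class SelectionMode(str, Enum):  # hypothetical name; values taken from the README
    ALL = "all"
    BY_TYPES = "by_types"
    BY_NAMES = "by_names"
    BY_PATTERNS = "by_patterns"


def select_files(root, mode, file_types=(), file_names=(), include_globs=()):
    """Return a sorted list of files under root chosen by the given mode."""
    root = Path(root).resolve()
    mode = SelectionMode(mode)  # accepts either the enum or its string value
    all_files = [p for p in root.rglob("*") if p.is_file()]

    if mode is SelectionMode.ALL:
        return sorted(all_files)
    if mode is SelectionMode.BY_TYPES:
        # Normalize "pdf", ".pdf" and ".PDF" to the same suffix form.
        wanted = {"." + t.lower().lstrip(".") for t in file_types}
        return sorted(p for p in all_files if p.suffix.lower() in wanted)
    if mode is SelectionMode.BY_NAMES:
        wanted = set(file_names)
        return sorted(
            p for p in all_files
            if p.name in wanted or p.relative_to(root).as_posix() in wanted
        )
    # BY_PATTERNS: union of all glob matches, files only.
    matched = {p for g in include_globs for p in root.glob(g) if p.is_file()}
    return sorted(matched)
```

For example, `select_files("/home/user/docs", "by_types", file_types=["pdf", "md"])` would return every `.pdf` and `.md` file under that directory.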

---

### Step 2 — Exclusions

Optional `exclude_names` supports:

* Relative paths → `docs/README.md`
* Bare filenames → `temp.txt`
* Case-insensitive matching (optional)
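
One way the exclusion rules above could work is shown below. The function name `apply_exclusions` and the `case_insensitive` flag are assumptions made for illustration; only `exclude_names` comes from the request model.

```python
from pathlib import PurePosixPath


def apply_exclusions(relative_paths, exclude_names, case_insensitive=False):
    """Split relative paths into (kept, excluded).

    An entry matches when it equals the file's full relative path
    or the file's bare name.
    """
    def norm(text: str) -> str:
        return text.lower() if case_insensitive else text

    excluded_set = {norm(PurePosixPath(name).as_posix()) for name in exclude_names}
    kept, excluded = [], []
    for rel in relative_paths:
        path = PurePosixPath(rel)
        if norm(path.as_posix()) in excluded_set or norm(path.name) in excluded_set:
            excluded.append(rel)
        else:
            kept.append(rel)
    return kept, excluded
```

Returning the excluded list alongside the kept list makes it easy to report exclusion stats later in the pipeline.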

---

### Step 3 — Guardrails

Applies:

* File size limit
* Date filtering (UTC normalized)
* Result cap
* Tracks skipped files with reasons
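
The guardrail step might look something like the sketch below. The parameter names mirror the request fields, but the function name `apply_guardrails` and the skip-reason strings are hypothetical. The time bounds are assumed to be timezone-aware UTC datetimes.

```python
from datetime import datetime, timezone
from pathlib import Path


def apply_guardrails(paths, max_file_size_mb=None, modified_after=None,
                     modified_before=None, limit=None):
    """Return (selected, skipped) where skipped maps path -> reason."""
    selected, skipped = [], {}
    for path in paths:
        if limit is not None and len(selected) >= limit:
            skipped[str(path)] = "limit_reached"
            continue
        try:
            info = Path(path).stat()
        except FileNotFoundError:
            # The file vanished between selection and validation.
            skipped[str(path)] = "deleted"
            continue
        if max_file_size_mb is not None and info.st_size > max_file_size_mb * 1024 * 1024:
            skipped[str(path)] = "too_large"
            continue
        # Compare modification times in UTC to avoid local-time ambiguity.
        mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        if modified_after is not None and mtime < modified_after:
            skipped[str(path)] = "outside_time_window"
            continue
        if modified_before is not None and mtime > modified_before:
            skipped[str(path)] = "outside_time_window"
            continue
        selected.append(path)
    return selected, skipped
```

Recording a reason per skipped file, rather than silently dropping it, is what makes the skip stats in the response possible.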

---

### Step 4 — Response Formatting

Returns:

* Total candidates
* Exclusion stats
* Guardrail skip stats
* Final selected files (relative + absolute paths)
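
A hedged sketch of assembling such a response is shown here. The keys `total_candidates`, `post_exclude_candidates`, `total_selected`, and `selected_files` appear in this README's example response; `excluded_count`, `skipped`, and the per-file `relative`/`absolute` layout are assumptions, as is the function name `format_response`.

```python
from pathlib import Path


def format_response(root, candidates, after_exclude, selected, skipped):
    """Build a JSON-serializable summary of one pipeline run."""
    root = Path(root)
    return {
        "total_candidates": len(candidates),
        "post_exclude_candidates": len(after_exclude),
        "excluded_count": len(candidates) - len(after_exclude),  # assumed key
        "skipped": dict(skipped),                                # assumed key
        "total_selected": len(selected),
        "selected_files": [
            {
                "relative": Path(p).relative_to(root).as_posix(),
                "absolute": str(p),
            }
            for p in selected
        ],
    }
```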

---

# 🧠 API Usage

## Endpoint

```
POST /directory/
```

---

## Example Request

```json
{
  "directory": "/home/user/docs",
  "mode": "by_types",
  "file_types": ["pdf", "md"],
  "max_file_size_mb": 10,
  "limit": 50
}
```
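
To send this request from Python without extra dependencies, something like the following should work. The base URL `http://localhost:8000` is an assumption about where the server is running, and the route may differ if the router is mounted under a prefix.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # assumption: local development server

payload = {
    "directory": "/home/user/docs",
    "mode": "by_types",
    "file_types": ["pdf", "md"],
    "max_file_size_mb": 10,
    "limit": 50,
}


def build_request(base_url: str, body: dict) -> request.Request:
    """Prepare a JSON POST to the /directory/ endpoint."""
    return request.Request(
        url=f"{base_url}/directory/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(BASE_URL, payload)

# Sending requires a running server, so it is left commented out:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```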

---

## Example Response

```json
{
  "total_candidates": 127,
  "post_exclude_candidates": 115,
  "total_selected": 42,
  "selected_files": [
    "README.md",
    "docs/guide.pdf"
  ]
}
```

The `selected_files` list above is abbreviated; a real response lists every selected file.

---

# 📘 Request Model Overview

### Required

| Field       | Description            |
| ----------- | ---------------------- |
| `directory` | Root directory to scan |

---

### Selection Options

| Field           | Used When     |
| --------------- | ------------- |
| `mode`          | Always        |
| `file_types`    | `by_types`    |
| `file_names`    | `by_names`    |
| `include_globs` | `by_patterns` |

---

### Guardrails

| Field              | Purpose                                        |
| ------------------ | ---------------------------------------------- |
| `max_file_size_mb` | Skip files larger than this size (in MB)       |
| `modified_after`   | Keep only files modified after this timestamp  |
| `modified_before`  | Keep only files modified before this timestamp |
| `limit`            | Maximum number of results                      |

---

# 🧪 Example Use Cases

### ✅ Index all files

```json
{
  "directory": "/data"
}
```

---

### ✅ Only PDFs modified after Jan 1, 2024

```json
{
  "directory": "/data",
  "mode": "by_types",
  "file_types": ["pdf"],
  "modified_after": "2024-01-01T00:00:00"
}
```
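
The timestamp in this request carries no timezone. Since comparisons are UTC-normalized, one plausible reading is that naive values are treated as UTC. The sketch below illustrates that idea; the helper name `to_utc` and the "naive means UTC" rule are assumptions, not confirmed behavior of the package.

```python
from datetime import datetime, timezone


def to_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and return a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

If your timestamps are in local time, including an explicit offset such as `2024-01-01T00:00:00+05:30` avoids any ambiguity.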

---

# 📊 Why This Project Matters

This project demonstrates:

* Clean architecture separation
* Secure filesystem handling
* Production-style validation with Pydantic
* Observability through structured stats
* Thread-safe handling of blocking file I/O
* Reusable package design

---

# 🔮 Future Enhancements

I would love to expand this further. Some ideas:

* Async file scanning
* Streaming large directory results
* Directory indexing cache
* File hashing support
* Logging & observability hooks
* CLI enhancements

Suggestions are welcome!

---

# 🤝 Contributing

This is my first public package, and I am actively learning.

If you find:

* Bugs 🐞
* Improvements 💡
* Performance enhancements ⚡
* Security suggestions 🔐
* Documentation improvements 📘

Please feel free to:

1. Open an issue
2. Submit a pull request
3. Share feedback

Your support means a lot 🙏

---

# 🧾 License

Open for research, experimentation, and development use. See the `LICENSE` file included with the package for the full terms.

---

# 👤 Author

Created by **Siddharth Singh**

🔗 LinkedIn: [https://www.linkedin.com/in/siddharth-singh-021b34193/](https://www.linkedin.com/in/siddharth-singh-021b34193/)
