Metadata-Version: 2.4
Name: directory-extractor
Version: 0.1.0
Summary: Secure and extensible directory extraction pipeline with filtering and guardrails
Author-email: Siddharth Singh <siddharthwolverine@email.com>
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: uvicorn
Requires-Dist: fastapi
Requires-Dist: pydantic
Requires-Dist: starlette
Provides-Extra: api
Requires-Dist: uvicorn; extra == "api"
Dynamic: license-file

# 📂 Directory Extraction Mechanism Using Python

A **robust, secure, and flexible directory scanning pipeline** built with **FastAPI** that extracts files from a directory using multiple filtering strategies.

It supports:

* ✅ Multiple selection modes (all / by types / by names / by patterns)
* ✅ Exclusion filters
* ✅ Safety guardrails (size, date range, result limit)
* ✅ Secure path validation (prevents directory traversal)
* ✅ Clean structured JSON response with detailed stats

---

# 🚀 Features

## 🔎 1. Multiple Selection Modes

You can extract files using different strategies:

| Mode            | Description                                     |
| --------------- | ----------------------------------------------- |
| `all`         | Return all files recursively                    |
| `by_types`    | Filter by file extensions (pdf, md, docx, etc.) |
| `by_names`    | Select exact filenames or relative paths        |
| `by_patterns` | Use glob patterns like `**/*.md`              |

---

## 🛡 2. Built-in Guardrails

After selection and exclusion, files are validated against:

* 📏 Maximum file size (`max_file_size_mb`)
* 📅 Modification time window (`modified_after`, `modified_before`)
* 🔢 Maximum number of results (`limit`)
* ❌ File existence: files that disappear during processing are skipped and counted, not treated as errors

---

## 🔐 3. Security First

* Resolves paths with `Path.resolve()` before validating them, which blocks `..`-style path traversal
* Rejects absolute glob patterns
* Ensures all files remain inside the target directory
* Avoids symlink loops
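
The containment and glob checks above could look roughly like the sketch below. The helper names `is_inside` and `is_safe_glob` are hypothetical and are not taken from `utils.py`; the sketch also assumes that a pattern containing `..` should be rejected along with absolute ones.

```python
from pathlib import Path, PurePosixPath


def is_inside(root: Path, candidate: Path) -> bool:
    """Return True if candidate, once resolved, still lives under root."""
    resolved_root = root.resolve()
    # Joining with an absolute candidate discards the root, so absolute
    # paths outside the root fail the containment check as well.
    resolved = (resolved_root / candidate).resolve()
    return resolved.is_relative_to(resolved_root)


def is_safe_glob(pattern: str) -> bool:
    """Reject absolute patterns and patterns that climb out with '..'."""
    parts = PurePosixPath(pattern)
    return not parts.is_absolute() and ".." not in parts.parts
```

Resolving both sides before comparing matters: a symlink or a `..` segment only becomes visible as an escape once the path has been normalized.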

---

# 🏗 Project Structure

```
.
├── fastapi_main_app.py   # API entrypoint and pipeline orchestration
└── business_logic
    ├── schemas.py        # Pydantic models & enums
    └── utils.py          # Core file collection & filtering logic
```

---

# ⚙️ How the Pipeline Works

The API follows a **4-step processing pipeline**:

```
1️⃣ SELECT files (based on mode)
2️⃣ APPLY EXCLUSIONS
3️⃣ APPLY GUARDRAILS
4️⃣ FORMAT RESPONSE
```

### Step 1 — Selection

Depending on `mode`:

* `all` → Collect everything recursively
* `by_types` → Filter by extensions
* `by_names` → Match exact filenames or relative paths
* `by_patterns` → Match glob patterns
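
As a rough illustration of the four modes, here is a minimal sketch built on `pathlib`. The function name `select_files` and its signature are assumptions made for this example; the real logic lives in `business_logic/utils.py` and may differ.

```python
from pathlib import Path
from typing import Iterator

MODES = {"all", "by_types", "by_names", "by_patterns"}


def select_files(root: Path, mode: str = "all", file_types=(),
                 file_names=(), include_globs=()) -> Iterator[Path]:
    """Lazily yield files under root according to the selection mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    root = root.resolve()
    if mode == "by_patterns":
        seen = set()  # a file may match more than one pattern
        for pattern in include_globs:
            for path in root.glob(pattern):
                if path.is_file() and path not in seen:
                    seen.add(path)
                    yield path
        return
    # Normalize "pdf" and ".pdf" to the same suffix form.
    wanted_ext = {"." + t.lower().lstrip(".") for t in file_types}
    wanted_names = set(file_names)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if mode == "all":
            yield path
        elif mode == "by_types" and path.suffix.lower() in wanted_ext:
            yield path
        elif mode == "by_names" and (path.name in wanted_names or rel in wanted_names):
            yield path
```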

---

### Step 2 — Exclusions

Optional `exclude_names` supports:

* Relative paths → `docs/README.md`
* Bare filenames → `temp.txt` (removes every file with that name, in any folder)
* Case-insensitive matching (optional)
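
The matching rules above can be sketched as follows. `apply_exclusions` is a hypothetical helper that works on POSIX-style relative paths; it assumes an entry containing a slash is compared against the whole relative path, while a bare name is compared against the basename.

```python
def apply_exclusions(rel_paths, exclude_names, case_insensitive=False):
    """Split rel_paths into (kept, matched) according to exclude_names."""
    norm = (lambda s: s.lower()) if case_insensitive else (lambda s: s)
    excluded = {norm(name.replace("\\", "/")) for name in exclude_names}
    kept, matched = [], []
    for rel in rel_paths:
        full = norm(rel)
        basename = full.rsplit("/", 1)[-1]
        # A basename never contains a slash, so path-style entries can
        # only match through `full`, and bare names through `basename`.
        if full in excluded or basename in excluded:
            matched.append(rel)
        else:
            kept.append(rel)
    return kept, matched
```

Returning the matched paths alongside the kept ones makes it easy to report exclusion stats later in the pipeline.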

---

### Step 3 — Guardrails

Applies:

* File size limit
* Date filtering (UTC normalized)
* Result cap
* Tracks skipped files with reasons
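
A minimal sketch of these checks is shown below. The names `to_utc` and `apply_guardrails` are hypothetical, and several details are assumptions rather than documented behaviour: naive datetimes are treated as UTC, 1 MB is taken as 1024 × 1024 bytes, and both time bounds are inclusive.

```python
from datetime import datetime, timezone
from pathlib import Path


def to_utc(dt: datetime) -> datetime:
    """Assumption: naive datetimes are UTC; aware ones are converted."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def apply_guardrails(paths, max_file_size_mb=None, modified_after=None,
                     modified_before=None, limit=None):
    """Return (selected, skipped_counts) after size, time, and limit checks."""
    selected = []
    skipped = {"too_large": 0, "outside_time_window": 0, "disappeared": 0}
    after = to_utc(modified_after) if modified_after is not None else None
    before = to_utc(modified_before) if modified_before is not None else None
    for path in paths:
        if limit is not None and len(selected) >= limit:
            break  # result cap reached; remaining files are not inspected
        try:
            st = Path(path).stat()
        except FileNotFoundError:
            skipped["disappeared"] += 1  # deleted between scan and check
            continue
        if max_file_size_mb is not None and st.st_size > max_file_size_mb * 1024 * 1024:
            skipped["too_large"] += 1
            continue
        mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
        if (after is not None and mtime < after) or (before is not None and mtime > before):
            skipped["outside_time_window"] += 1
            continue
        selected.append(path)
    return selected, skipped
```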

---

### Step 4 — Response Formatting

Returns:

* Total candidates
* Exclusion stats
* Guardrail skip stats
* Final selected files (relative + absolute paths)
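
Assembling the response could look like the sketch below. The field names follow the example response in this README; the function `format_response` and the relation between `total_candidates` and `post_exclude_candidates` are assumptions.

```python
from pathlib import PurePosixPath


def format_response(root, candidates, excluded_names, excluded_matched,
                    selected, skipped):
    """Build the response dict from relative paths and skip counters."""
    base = PurePosixPath(root)
    return {
        "total_candidates": len(candidates),
        "excluded_names": list(excluded_names),
        "excluded_names_matched": list(excluded_matched),
        # Assumption: every matched exclusion removes exactly one candidate.
        "post_exclude_candidates": len(candidates) - len(excluded_matched),
        "total_selected": len(selected),
        "selected_files": list(selected),
        "skipped": skipped,
        "selected_files_path": [str(base / rel) for rel in selected],
    }
```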

---

# 📦 Installation

```bash
git clone https://github.com/siddharth1310/directory_extraction_mechanism
cd directory_extraction_mechanism
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Run the server:

```bash
uvicorn fastapi_main_app:app --reload
```

Open docs:

```
http://127.0.0.1:8000/docs
```

---

# 🧠 API Usage

## Endpoint

```
POST /directory/
```

---

## Example Request

```json
{
  "directory": "/home/user/docs",
  "mode": "by_types",
  "file_types": ["pdf", "md"],
  "max_file_size_mb": 10,
  "limit": 50
}
```
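
The same request can be sent from Python using only the standard library. This is a sketch that assumes the server is running locally on the default uvicorn address shown above; `build_request` and `extract` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: local uvicorn defaults


def build_request(payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST to the /directory/ endpoint without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}/directory/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a live server)."""
    with urllib.request.urlopen(build_request(payload), timeout=30) as resp:
        return json.load(resp)


payload = {
    "directory": "/home/user/docs",
    "mode": "by_types",
    "file_types": ["pdf", "md"],
    "max_file_size_mb": 10,
    "limit": 50,
}
```

Calling `extract(payload)` against a running server should return a dictionary shaped like the example response.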

---

## Example Response

```json
{
  "total_candidates": 115,
  "excluded_names": [],
  "excluded_names_matched": [],
  "post_exclude_candidates": 115,
  "total_selected": 42,
  "selected_files": [
    "README.md",
    "docs/guide.pdf"
  ],
  "skipped": {
    "too_large": 5,
    "outside_time_window": 68,
    "disappeared": 0,
    "files": []
  },
  "selected_files_path": [
    "/home/user/docs/README.md",
    "/home/user/docs/docs/guide.pdf"
  ]
}
```

---

# 📘 Request Model Explained

## Required

| Field         | Type   | Description            |
| ------------- | ------ | ---------------------- |
| `directory` | string | Root directory to scan |

---

## Selection Options

| Field             | Used When       | Description           |
| ----------------- | --------------- | --------------------- |
| `mode`          | Always          | Selection strategy    |
| `file_types`    | `by_types`    | Extensions to include |
| `file_names`    | `by_names`    | Exact names or paths  |
| `include_globs` | `by_patterns` | Glob patterns         |

---

## Exclusions

| Field                | Description                       |
| -------------------- | --------------------------------- |
| `exclude_names`    | Names or relative paths to remove |
| `case_insensitive` | Only for `by_names` mode        |
| `duplicate_policy` | `error`, `first`, or `all` (`by_names` mode) |

---

## Guardrails

| Field                | Description              |
| -------------------- | ------------------------ |
| `max_file_size_mb` | Skip large files         |
| `modified_after`   | Include only files modified after this timestamp |
| `modified_before`  | Include only files modified before this timestamp |
| `limit`            | Maximum number of files  |

---

# 🧩 Duplicate Handling (by_names mode)

When multiple files share the same name:

* `error` → Throw validation error
* `first` → Return first match
* `all` → Return all matches (default)
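
The three policies can be sketched as below. `resolve_duplicates` is a hypothetical helper operating on POSIX-style relative paths, and "first" is assumed to mean first in scan order.

```python
from collections import defaultdict


def resolve_duplicates(requested, rel_paths, policy="all"):
    """Map requested bare filenames to relative paths under a duplicate policy."""
    if policy not in {"error", "first", "all"}:
        raise ValueError(f"unknown duplicate_policy: {policy!r}")
    by_name = defaultdict(list)
    for rel in rel_paths:
        by_name[rel.rsplit("/", 1)[-1]].append(rel)
    result = []
    for name in requested:
        matches = by_name.get(name, [])
        if len(matches) > 1 and policy == "error":
            raise ValueError(f"{name!r} is ambiguous: {matches}")
        # "first" keeps one match per name; "all" keeps every match.
        result.extend(matches[:1] if policy == "first" else matches)
    return result
```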

---

# 🌍 Glob Pattern Examples

| Pattern           | Meaning                        |
| ----------------- | ------------------------------ |
| `**/*.md`       | All markdown files recursively |
| `docs/**/*.pdf` | PDFs inside docs folder        |
| `*.txt`         | Text files in root only        |
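
To see how these patterns behave, the sketch below runs each one through `Path.glob` and groups the results by pattern. `matches_by_pattern` is a hypothetical helper, and it assumes patterns are evaluated relative to the scanned directory.

```python
from pathlib import Path


def matches_by_pattern(root: Path, patterns) -> dict:
    """Return {pattern: sorted relative paths of matching files}."""
    root = root.resolve()
    result = {}
    for pattern in patterns:
        result[pattern] = sorted(
            path.relative_to(root).as_posix()
            for path in root.glob(pattern)
            if path.is_file()
        )
    return result
```

In `pathlib`, `**` matches zero or more directories, so `**/*.md` also picks up markdown files sitting directly in the root, while `*.txt` never descends into subfolders.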

---

# 🧪 Example Use Cases

### ✅ Index all files

```json
{
  "directory": "/data"
}
```

---

### ✅ Only PDFs modified after Jan 1, 2024

```json
{
  "directory": "/data",
  "mode": "by_types",
  "file_types": ["pdf"],
  "modified_after": "2024-01-01T00:00:00"
}
```

---

### ✅ Extract specific files safely

```json
{
  "directory": "/project",
  "mode": "by_names",
  "file_names": ["README.md", "docs/guide.pdf"],
  "duplicate_policy": "first"
}
```

---

# 📊 Why This Project is Robust

* Memory-efficient generator-based scanning
* Safe path containment validation
* Clean separation of concerns
* Clear stats for observability
* Threadpool execution for blocking file I/O
* Fully validated request model via Pydantic
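
The first and fifth points can be illustrated together. The sketch below is not the project's code: it uses `os.scandir` for lazy traversal and the standard library's `asyncio.to_thread` as a stand-in for whatever threadpool mechanism the API actually uses. The names `iter_files` and `scan` are hypothetical.

```python
import asyncio
import os
from itertools import islice
from typing import Iterator


def iter_files(root: str) -> Iterator[str]:
    """Yield file paths one at a time instead of building a full list."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # Not following symlinks keeps the walk out of symlink loops.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


async def scan(root: str, limit: int) -> list[str]:
    """Run the blocking walk in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(lambda: list(islice(iter_files(root), limit)))
```

Because `iter_files` is a generator, `islice` stops the traversal as soon as `limit` paths have been produced, so large trees are never fully enumerated.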

---

# 🔮 Future Enhancements (Ideas)

* Async file scanning
* Streaming large directory results
* Caching indexed directories
* File hashing support
* Logging & observability hooks

---

## 🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to fork the repo and submit a pull request.

---

## 🧾 License

Use it freely for research and development.

---

## 👤 Author

Created by **Siddharth Singh**. Find me on [LinkedIn](https://www.linkedin.com/in/siddharth-singh-021b34193/)
