Metadata-Version: 2.4
Name: vasu_functions
Version: 0.1.1
Summary: A utility package to report, filter, and combine messy Excel files easily.
Author-email: Vasu Solanki <VasuSolanki1009@gmail.com>
License: MIT
Keywords: excel,combine files,data cleaning,automation
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: openpyxl

# **vasu_functions**

A lightweight utility library designed to simplify Excel-based ETL workflows.
It helps you:

* **Report** which Excel files contain which columns
* **Filter** files based on column availability
* **Combine** selected Excel files into one clean DataFrame
* Handle messy column names (`-`, `_`, spaces, casing)
* Skip top rows when reading
* Add an optional “source” column
* Remove duplicate rows automatically

Perfect for **recruitment data**, **lead generation**, or **multi-file Excel ETL**.

---

## 🔧 **Installation**

```bash
pip install vasu-functions
```

---

# 🚀 **Functions Overview**

The package contains three main functions:

1. **`combine_files`** – merge Excel files cleanly
2. **`report_pivot`** – check which columns appear in which files
3. **`filter_pivot`** – filter files based on the pivot table

---

# 1️⃣ **combine_files()**

### **Function**

```python
combine_files(
    folder_path,
    required_columns=None,
    qualified_files=None,
    extension=".xlsx",
    source_name=None,          # Optional — if None → no source column is added
    delete_files=False,
    skip_rows=0                # Optional — skip top rows in Excel
)
```

### **What It Does**

**Combines multiple Excel files** into one clean DataFrame with:

* Strict missing-column checking
* Case-insensitive column matching
* Auto-cleaning of column names (`-` → space, `_` → space)
* Optional skipping of header rows
* Optional “source” column
* Full-row duplicate removal
* Clean summary printed automatically

### **Key Features**

* If a file **does not contain a required column**, the call stops with an **error** naming the exact file and the missing column
* If `required_columns=None` → keep **all columns automatically**
* If `qualified_files=None` → combine **all Excel files**
* Removes duplicated rows across all files
* Works great with messy real-world Excel files
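
The column handling described above can be pictured with a small, pandas-free sketch. It is not the package's actual implementation: `normalize_column`, `check_required`, and `drop_duplicate_rows` are hypothetical helper names, and details such as collapsing repeated spaces or raising `ValueError` are assumptions made for illustration.

```python
import re


def normalize_column(name):
    """Turn '-' and '_' into spaces, collapse whitespace, and lower-case."""
    cleaned = str(name).replace("-", " ").replace("_", " ")
    return re.sub(r"\s+", " ", cleaned).strip().lower()


def check_required(file_name, headers, required_columns):
    """Raise ValueError naming the file and the required columns it lacks."""
    available = {normalize_column(h) for h in headers}
    missing = [c for c in required_columns if normalize_column(c) not in available]
    if missing:
        # Assumption: the real package may use a different exception or message.
        raise ValueError(f"{file_name}: missing required column(s): {missing}")


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each full row (rows are tuples)."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique
```

With these helpers, `"Job_Title"`, `"job-title"`, and `" JOB TITLE "` all compare equal to `"Job Title"`, which is the kind of matching the feature list describes.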

---

# 2️⃣ **report_pivot()**

### **Function**

```python
report_pivot(
    folder_path,
    required_columns=None,
    extension=".xlsx"
)
```

### **What It Does**

Creates a **pivot-style table** showing which columns each file contains:

| File       | Job Title | Email | Phone | … |
| ---------- | --------- | ----- | ----- | - |
| file1.xlsx | ✔         | ✔     | ❌     | … |
| file2.xlsx | ❌         | ✔     | ✔     | … |

### **Behaviour**

* If `required_columns=None`, it **automatically detects all unique columns** from all files.
* Column matching is **case-insensitive** and ignores leading and trailing spaces.
* Output is a DataFrame you can easily print or export.
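
To make the behaviour concrete, here is a minimal sketch of how such a report could be assembled once each file's headers are known. It is an illustration under assumptions, not the package's code: `build_report` is a hypothetical name, the headers are passed in as a plain dict instead of being read from Excel, and the result is a list of dicts rather than a DataFrame.

```python
def build_report(headers_by_file, required_columns=None):
    """Return one row per file, marking each column as present or missing."""

    def norm(name):
        # Case-insensitive comparison that ignores surrounding spaces.
        return str(name).strip().lower()

    if required_columns is None:
        # Collect every unique column across all files, in first-seen order.
        seen = {}
        for headers in headers_by_file.values():
            for header in headers:
                seen.setdefault(norm(header), str(header).strip())
        required_columns = list(seen.values())

    rows = []
    for file_name, headers in headers_by_file.items():
        available = {norm(h) for h in headers}
        row = {"File": file_name}
        for col in required_columns:
            row[col] = "✔" if norm(col) in available else "❌"
        rows.append(row)
    return rows
```

Passing the returned list to `pandas.DataFrame(rows)` would give a table shaped like the one shown above.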

---

# 3️⃣ **filter_pivot()**

### **Function**

```python
filter_pivot(
    report_df,
    required_columns,
    mode="all_missing",
    match_count=1
)
```

### **What It Does**

Filters files based on the pivot report.

### **Modes**

| Mode            | Meaning                                          |
| --------------- | ------------------------------------------------ |
| `"all_missing"` | All required columns are ❌                       |
| `"all_present"` | All required columns are ✔                       |
| `"atleast_n"`   | At least `match_count` of the required columns are ✔ |
| `"atleast_one"` | At least one required column is ✔                |
| `"none"`        | Same as `"all_missing"`                          |

### **Returns**

A **list of filenames** you can pass into `combine_files(..., qualified_files=...)`.
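
The mode semantics can be sketched in a few lines of plain Python. This is a hedged illustration only: `filter_report` is a hypothetical stand-in for `filter_pivot`, and it assumes the report is a list of dicts with a `"File"` key and `✔` / `❌` marks, which may not match the real DataFrame's cell values.

```python
def filter_report(report_rows, required_columns, mode="all_missing", match_count=1):
    """Return the file names whose required columns satisfy the given mode."""
    selected = []
    for row in report_rows:
        # Count how many of the required columns this file contains.
        present = sum(1 for col in required_columns if row.get(col) == "✔")
        if mode in ("all_missing", "none"):
            keep = present == 0
        elif mode == "all_present":
            keep = present == len(required_columns)
        elif mode == "atleast_n":
            keep = present >= match_count
        elif mode == "atleast_one":
            keep = present >= 1
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        if keep:
            selected.append(row["File"])
    return selected
```

Note that `"atleast_one"` behaves like `"atleast_n"` with `match_count=1` in this sketch.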

---

# 🧩 **Example Workflow**

### **Step 1: Generate pivot**

```python
import vasu_functions as vs

report_df = vs.report_pivot("my_folder/")
print(report_df)
```

### **Step 2: Filter files (example: all required columns present)**

Files that lack a required column would make `combine_files` stop with an error, so this step keeps only the files that contain all of them.

```python
required = ["Job Title", "Email ID", "Phone Number"]

good_files = vs.filter_pivot(
    report_df,
    required_columns=required,
    mode="all_present"
)
```

### **Step 3: Combine only selected files**

```python
df = vs.combine_files(
    folder_path="my_folder",
    required_columns=required,
    qualified_files=good_files,
    source_name="Naukri",    # optional
    skip_rows=4              # optional
)
```

---

# 🎉 **About This Package**

`vasu_functions` was created to simplify real-world Excel ETL problems, especially in hiring data, lead scraping, and bulk file processing.

It removes the pain of:

* inconsistent header names (casing, spaces, `-` vs `_`)
* missing fields
* duplicate entries
* multi-file merging

---

# 📄 **License**

MIT License.

