Metadata-Version: 2.4
Name: marco-dvcs
Version: 0.1.16
Summary: A minimal dataset versioning system for text data with a focus on reproducibility.
Home-page: https://github.com/Team-Marco-ACM/marco-package
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: Flask
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Marco Dataset Versioning System

A minimal dataset versioning system for text data with a strong focus on reproducibility and transparency. Treat your text datasets like code: immutable, versioned, reproducible, and explainable.

Marco acts as a lightweight Python library, meaning you can initialize it in *any* machine learning project folder to safely version and preprocess your datasets without altering your original files.

## 🚀 Installation (Linux / macOS)

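On modern Linux distributions (such as Arch Linux and Ubuntu 23.04+), the system Python is marked as externally managed (PEP 668), so `pip install` outside a virtual environment is refused by default. Installing Marco in a virtual environment also prevents conflicts with your system packages.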

Follow these steps to safely install Marco into your ML project:

1. **Clone this repository** to your local machine:
   ```bash
   git clone https://github.com/your-username/marco.git
   cd marco
   ```

2. **Navigate to the ML project folder** where you want to train your model (e.g. your bag-of-words project):
   ```bash
   cd ~/projects/my-bag-of-words-model
   ```

3. **Create and activate a Python Virtual Environment**:
   ```bash
   # Create a virtual environment named 'venv'
   python3 -m venv venv
   
   # Activate it (You must do this every time you open a new terminal in this folder)
   source venv/bin/activate
   ```
   *(You should now see `(venv)` at the start of your terminal prompt!)*

4. **Install Marco**:
   ```bash
   # Point pip to the directory where you cloned the marco repository
   pip install -e /path/to/marco
   ```

---

## 🛠️ Usage Guide

Once `marco` is installed in your virtual environment, you have access to the full CLI!

### 1. Initialize a Repository
Initialize Marco tracking in your current directory. This creates a `.marco/` data versioning environment specific to that project.
```bash
marco init
```

### 2. Create an Immutable Version 
Upload a text/CSV/TSV dataset to create an immutable version. Marco computes a SHA-256 hash over the raw data and the preprocessing configuration, so a change to either one produces a different hash.

**Interactive Mode:**
If you don't supply a configuration file, Marco will interactively guide you through building the preprocessing pipeline (lowercasing, tokenization, stopword removal, deduplication).
```bash
marco upload my_dataset.csv -t v1-raw
```
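
The four pipeline steps listed above can be illustrated with a minimal sketch. This is not Marco's actual preprocessor: the function name `preprocess`, its keyword arguments, the whitespace tokenizer, and the tiny `STOPWORDS` set are all assumptions made for illustration.

```python
from typing import Iterable, List

# Hypothetical stopword list; Marco's real list is not shown in this README.
STOPWORDS = {"a", "an", "the", "is", "of", "and"}


def preprocess(
    docs: Iterable[str],
    lowercase: bool = True,
    remove_stopwords: bool = True,
    deduplicate: bool = True,
) -> List[List[str]]:
    """Apply lowercasing, tokenization, stopword removal, and deduplication."""
    seen = set()
    result = []
    for doc in docs:
        text = doc.lower() if lowercase else doc
        tokens = text.split()  # naive whitespace tokenization (assumption)
        if remove_stopwords:
            tokens = [t for t in tokens if t.lower() not in STOPWORDS]
        if deduplicate:
            key = tuple(tokens)
            if key in seen:
                continue  # drop documents identical after preprocessing
            seen.add(key)
        result.append(tokens)
    return result


if __name__ == "__main__":
    print(preprocess(["The cat", "the CAT", "A dog"]))
```

Note that deduplication here runs after the other steps, so two documents that differ only in casing collapse into one when lowercasing is enabled.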

**Config Mode:**
```bash
marco upload my_dataset.csv -c my_config.json -t v1-processed
```
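
The hashing idea described at the top of this section can be sketched as follows. This is an illustrative approximation, not Marco's implementation: the function name `version_hash`, the null-byte separator, and the canonical JSON serialization of the config are assumptions.

```python
import hashlib
import json


def version_hash(raw_bytes: bytes, config: dict) -> str:
    """Hash the raw dataset bytes together with the preprocessing config."""
    digest = hashlib.sha256()
    digest.update(raw_bytes)
    digest.update(b"\x00")  # separator between data and config (assumption)
    # Sorting keys makes the hash independent of dict ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    data = b"id\ttext\n1\thello world\n"
    print(version_hash(data, {"lowercase": True, "dedupe": False}))
```

Serializing the config canonically matters: without `sort_keys=True`, two logically identical configs could hash differently just because their keys were written in a different order.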

### 3. List Versions
View all the versions you've created, along with their tags and timestamps.
```bash
marco list
```

### 4. Restore/Checkout Data
Extract the processed dataset from Marco's storage back into your active workspace to use for model training.
```bash
marco restore v1-processed -o ./training_data.tsv
```

### 5. Start the Interactive Dashboard (New!)
Marco includes a stunning, built-in Glassmorphism React UI to visualize your dataset evolution.
```bash
marco generate-web
```
*(This will automatically open your browser to `http://localhost:7654` and load your `.marco` repository).*

**Dashboard Features:**
- **Interactive Lineage Tree**: View the exact Git-style chronological history of your datasets in a visual tree.
- **Auto-Compare Mode**: Clicking any dataset instantly fetches its parent and calculates how the data changed (e.g. "+16% tokens", "-5% documents").
- **Red/Green DAG Tracking**: The graph outlines each process node with a color-coded border (red for token increases, green for token drops) so you can see at a glance where metrics diverge.
- **Alias Tagging**: Your custom `-t` tags (like `v1-raw`) are rendered as badges in the UI so you never lose track of hashes.
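
The parent-to-child deltas shown by Auto-Compare (such as "+16% tokens") amount to a relative change between two counts. A minimal sketch, assuming a hypothetical helper `percent_change` and hypothetical stats dictionaries with `tokens` and `documents` keys:

```python
def percent_change(parent: int, child: int) -> str:
    """Format the relative change from parent to child as a signed percentage."""
    if parent == 0:
        return "n/a"  # relative change is undefined for an empty parent
    delta = (child - parent) / parent * 100
    return f"{delta:+.0f}%"


def compare(parent_stats: dict, child_stats: dict) -> dict:
    """Compute the signed percentage change for each metric in the parent."""
    return {
        metric: percent_change(parent_stats[metric], child_stats[metric])
        for metric in parent_stats
    }


if __name__ == "__main__":
    parent = {"tokens": 100, "documents": 200}
    child = {"tokens": 116, "documents": 190}
    print(compare(parent, child))
```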

### 6. Export / Import Versions
Easily share dataset versions with teammates by packing them into `.tar.gz` files.
```bash
# Export version 'v1-raw' to the 'exports' folder
marco export v1-raw ./exports/

# Import an archive sent to you by a coworker
marco import ./exports/marco_version_e5e0b767.tar.gz
```
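
Conceptually, an export packs one version directory into a gzip-compressed tarball. The sketch below uses Python's standard `tarfile` module; the function names, the directory layout, and the choice to store files under the short hash are assumptions, with only the `marco_version_<hash>.tar.gz` naming taken from the example above.

```python
import tarfile
from pathlib import Path
from typing import List


def export_version(version_dir: Path, out_dir: Path, short_hash: str) -> Path:
    """Pack a version directory into out_dir/marco_version_<short_hash>.tar.gz."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"marco_version_{short_hash}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store files under the short hash so the archive is self-describing.
        tar.add(str(version_dir), arcname=short_hash)
    return archive


def list_archive(archive: Path) -> List[str]:
    """Return the member names of an exported archive without extracting it."""
    with tarfile.open(str(archive), "r:gz") as tar:
        return tar.getnames()
```

Listing an archive before importing it is a cheap way to check what a teammate actually sent you.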

### 7. Delete Versions
Delete a dataset version to recover disk space. Marco updates the lineage (the `parents` history) of any descendant versions so that your Git-style history tree stays connected.
```bash
marco delete v1-raw
# or
marco rm e5e0b767
```
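
One way to keep the history connected after a deletion is to re-link each child of the deleted version to that version's own parents. The sketch below illustrates that idea on a plain dictionary mapping each version to its list of parents; the data structure and the function name `delete_version` are assumptions, not Marco's actual storage format.

```python
from typing import Dict, List


def delete_version(lineage: Dict[str, List[str]], target: str) -> Dict[str, List[str]]:
    """Remove target and re-link its children to target's own parents."""
    grandparents = lineage[target]
    updated: Dict[str, List[str]] = {}
    for version, parents in lineage.items():
        if version == target:
            continue  # drop the deleted version itself
        new_parents: List[str] = []
        for parent in parents:
            # Replace a reference to the deleted version with its parents.
            replacements = grandparents if parent == target else [parent]
            for candidate in replacements:
                if candidate not in new_parents:
                    new_parents.append(candidate)
        updated[version] = new_parents
    return updated


if __name__ == "__main__":
    history = {"a": [], "b": ["a"], "c": ["b"]}
    print(delete_version(history, "b"))
```

Deleting a root version simply leaves its children without parents, which makes them the new roots.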

### 8. KL Divergence & Token Analytics
Marco goes beyond simple vocabulary size tracking. It uses Kullback-Leibler (KL) divergence to measure how the token probability distribution shifts between two dataset versions. This tells you what kind of change happened, for example whether common words disappeared or rare domain terms became dominant.

You can compute this distribution delta by running the module directly on two version directories:
```bash
python -m marco.token_analytics path/to/v1 path/to/v2
```
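
For intuition, the KL divergence between two token distributions can be sketched in plain Python. This is not the code in `marco.token_analytics`: the function names and the add-`alpha` smoothing (used so that tokens missing from one version do not cause a division by zero) are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def kl_contributions(p_tokens: Iterable[str], q_tokens: Iterable[str],
                     alpha: float = 1.0) -> Dict[str, float]:
    """Per-token terms P(x) * log(P(x) / Q(x)) of D_KL(P || Q), with smoothing."""
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    terms = {}
    for token in vocab:
        p = (p_counts[token] + alpha) / p_total
        q = (q_counts[token] + alpha) / q_total
        terms[token] = p * math.log(p / q)
    return terms


def kl_divergence(p_tokens: Iterable[str], q_tokens: Iterable[str],
                  alpha: float = 1.0) -> float:
    """D_KL(P || Q) in nats: the sum of the per-token contributions."""
    return sum(kl_contributions(p_tokens, q_tokens, alpha).values())


if __name__ == "__main__":
    v1 = "the cat sat on the mat".split()
    v2 = "the tensor sat on the gpu".split()
    print(kl_divergence(v1, v2))
    # Tokens with the largest absolute contribution shifted the most.
    terms = kl_contributions(v1, v2)
    print(sorted(terms, key=lambda t: abs(terms[t]), reverse=True)[:3])
```

Sorting the per-token terms is what turns a single divergence number into an explanation: it shows which tokens moved, not just how far the distribution moved overall. Keep in mind that KL divergence is asymmetric, so swapping the two versions generally gives a different value.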

---

## 🧠 Architecture Overview

Marco decouples logic from the file system. All core engine operations sit inside `marco/core/`, including:
- `locker.py`: File-based concurrency control using `.lock` files.
- `repository.py`: CRUD operations for dataset versions and `refs.json` tagging.
- `preprocessor.py`: A robust Directed Acyclic Graph (DAG) preprocessing engine.
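
To give a feel for the lock-file approach mentioned for `locker.py`, here is a minimal sketch built on atomic exclusive file creation. It is an illustration only: the name `file_lock`, the timeout and polling parameters, and writing the process ID into the lock file are assumptions rather than Marco's actual behavior.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path: str, timeout: float = 5.0, poll: float = 0.05):
    """Hold an exclusive lock by atomically creating lock_path."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process can create the lock file at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield
    finally:
        os.remove(lock_path)  # release the lock
```

A caller would wrap any repository mutation in `with file_lock(".marco/repo.lock"):` (a hypothetical path). One known limitation of this style of lock is that a crashed process leaves a stale `.lock` file behind, which then has to be cleaned up manually or detected via the recorded process ID.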

Have fun building safer machine learning pipelines!
