Metadata-Version: 2.4
Name: marco-dvcs
Version: 0.1.9
Summary: A minimal dataset versioning system for text data with a focus on reproducibility.
Home-page: https://github.com/Team-Marco-ACM/marco-package
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: Flask
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Marco Dataset Versioning System

A minimal dataset versioning system for text data with a strong focus on reproducibility and transparency. Treat your text datasets like code — immutable, versioned, reproducible, and explainable.

Marco acts as a lightweight Python library, meaning you can initialize it in *any* machine learning project folder to safely version and preprocess your datasets without altering your original files.

## 🚀 Installation (Linux / macOS)

On modern Linux distributions (such as Arch Linux and Ubuntu 23.04+), the system Python is marked as externally managed (PEP 668), so `pip` refuses to install packages into it. Install Marco in a virtual environment to avoid conflicts with your system packages.

Follow these steps to safely install Marco into your ML project:

1. **Clone this repository** to your local machine:
   ```bash
   git clone https://github.com/your-username/marco.git
   cd marco
   ```

2. **Navigate to the ML project folder** where you want to train your model (e.g. your bag-of-words project):
   ```bash
   cd ~/projects/my-bag-of-words-model
   ```

3. **Create and activate a Python Virtual Environment**:
   ```bash
   # Create a virtual environment named 'venv'
   python3 -m venv venv
   
   # Activate it (You must do this every time you open a new terminal in this folder)
   source venv/bin/activate
   ```
   *(You should now see `(venv)` at the start of your terminal prompt!)*

4. **Install Marco**:
   ```bash
   # Point pip to the directory where you cloned the marco repository
   pip install -e /path/to/marco
   ```
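
If you are unsure whether the virtual environment from step 3 is active, a small check like the one below can help. It is a minimal sketch that relies only on the standard library; the helper name `in_virtualenv` is made up for this example and is not part of Marco.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print("virtual environment active")
    else:
        print("no virtual environment detected")
```

Run it with the same `python3` you intend to use for `pip install`.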

---

## 🛠️ Usage Guide

Once `marco` is installed in your virtual environment, you have access to the full CLI!

### 1. Initialize a Repository
Initialize Marco tracking in your current directory. This creates a `.marco/` data versioning environment specific to that project.
```bash
marco init
```

### 2. Create an Immutable Version
Upload a text, CSV, or TSV dataset to create an immutable version. Marco computes a SHA-256 hash over the raw data and the preprocessing configuration.

**Interactive Mode:**
If you don't supply a configuration file, Marco interactively guides you through building the preprocessing pipeline (lowercasing, tokenization, stopword removal, deduplication).
```bash
marco upload my_dataset.csv -t v1-raw
```
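
To give a feel for what the four pipeline steps do to a dataset, here is a rough sketch in plain Python. It is not Marco's implementation: the function name `preprocess`, the tiny stopword set, and whitespace tokenization are all assumptions made for illustration.

```python
from typing import Iterable, List

# Hypothetical stopword list for illustration only; Marco's real list is not shown here.
STOPWORDS = frozenset({"a", "an", "the", "is", "of", "and"})


def preprocess(lines: Iterable[str]) -> List[str]:
    """Lowercase, tokenize on whitespace, drop stopwords, then deduplicate rows."""
    seen = set()
    result = []
    for line in lines:
        tokens = [tok for tok in line.lower().split() if tok not in STOPWORDS]
        cleaned = " ".join(tokens)
        # Keep the first occurrence of each non-empty cleaned row, preserving input order.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

Note that the order of steps matters: deduplicating after lowercasing and stopword removal merges rows that only differed in case or filler words.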

**Config Mode:**
```bash
marco upload my_dataset.csv -c my_config.json -t v1-processed
```
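
The idea of hashing the raw data together with the preprocessing configuration can be sketched as follows. This is an assumption-laden illustration rather than Marco's actual hashing code: the function name `version_id`, the JSON canonicalization, and the length prefix are choices made for this example, and the config keys in the usage note are hypothetical.

```python
import hashlib
import json


def version_id(raw_data: bytes, config: dict) -> str:
    """Return a SHA-256 hex digest over the raw bytes and a canonical config encoding."""
    # sort_keys and fixed separators make the encoding independent of key order.
    canonical_config = json.dumps(
        config, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    digest = hashlib.sha256()
    # Length-prefix the data so the boundary between data and config is unambiguous.
    digest.update(len(raw_data).to_bytes(8, "big"))
    digest.update(raw_data)
    digest.update(canonical_config)
    return digest.hexdigest()
```

With this scheme, `version_id(data, {"lowercase": True, "dedupe": False})` and `version_id(data, {"dedupe": False, "lowercase": True})` return the same digest, while changing either the data or any config value produces a different one.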

### 3. List Versions
View all the versions you've created, along with their tags and timestamps.
```bash
marco list
```

### 4. Restore/Checkout Data
Extract the processed dataset from Marco's storage back into your active workspace to use for model training.
```bash
marco restore v1-processed -o ./training_data.tsv
```

### 5. Start the Visual Dashboard
Open the Flask-powered web dashboard or the new React DAG visualizer to view dataset metadata, differences between versions, and complete data lineage trees.

**Start the React DAG Comparison Dashboard (New):**
```bash
marco generate-web
```
*(This will automatically open your browser to `http://localhost:7654` and load your local `.marco` graphs).*

**Start the legacy HTML Dashboard:**
```bash
python -m marco.web.app --port 5000
```

### 6. Export / Import Versions
Easily share dataset versions with teammates by packing them into `.tar.gz` files.
```bash
# Export version 'v1-raw' to the 'exports' folder
marco export v1-raw ./exports/

# Import an archive sent to you by a coworker
marco import ./exports/marco_version_e5e0b767.tar.gz
```
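
For readers curious about what packing a version into a `.tar.gz` involves, the sketch below uses Python's standard `tarfile` module. It is only an illustration: the function names, the assumption that a version lives in its own directory, and the path-safety check are not taken from Marco's source. The archive name simply mirrors the example filename above.

```python
import tarfile
from pathlib import Path


def export_version(version_dir: Path, export_dir: Path) -> Path:
    """Pack a version directory into a .tar.gz inside export_dir and return its path."""
    export_dir.mkdir(parents=True, exist_ok=True)
    archive_path = export_dir / f"marco_version_{version_dir.name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the directory name, not the absolute path.
        archive.add(version_dir, arcname=version_dir.name)
    return archive_path


def import_version(archive_path: Path, target_dir: Path) -> None:
    """Unpack an archive into target_dir, refusing members that would escape it."""
    target_root = target_dir.resolve()
    target_root.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            destination = (target_root / member.name).resolve()
            if destination != target_root and target_root not in destination.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(target_root)
```

Checking member paths before extraction is a sensible habit whenever an archive comes from someone else, since a crafted archive could otherwise write outside the target directory.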

---

## 🧠 Architecture Overview

Marco decouples logic from the file system. All core engine operations sit inside `marco/core/`, including:
- `locker.py`: File-based concurrency control using `.lock` files.
- `repository.py`: CRUD operations for dataset versions and `refs.json` tagging.
- `preprocessor.py`: A robust Directed Acyclic Graph (DAG) preprocessing engine.
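
As an illustration of the lock-file approach mentioned for `locker.py`, the sketch below shows one common way to build file-based mutual exclusion in Python. It is an assumption about the general technique, not a copy of Marco's code; the name `file_lock` and the choice to fail immediately instead of waiting are made up for this example.

```python
import os
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path: str):
    """Hold an exclusive lock represented by a .lock file; raise if it is already held."""
    # O_CREAT | O_EXCL makes creation atomic: os.open raises FileExistsError
    # when another process has already created the lock file.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        with os.fdopen(fd, "w") as handle:
            # Record the owner's PID to help debug stale locks.
            handle.write(str(os.getpid()))
        yield lock_path
    finally:
        # Release the lock by removing the file, even if the body raised.
        os.remove(lock_path)
```

A caller would wrap any write to the repository in `with file_lock(".marco/repo.lock"):` (a hypothetical path) and handle `FileExistsError` by retrying or reporting that another process is busy.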

Have fun building safer machine learning pipelines!
