Metadata-Version: 2.4
Name: git-grove
Version: 1.0.0
Summary: A tool that cherry-picks and synchronizes specific files and directories across multiple Git repositories
Author: universetraveller
License-Expression: MIT
Project-URL: Homepage, https://github.com/universetraveller/git-grove
Project-URL: Repository, https://github.com/universetraveller/git-grove
Project-URL: Issues, https://github.com/universetraveller/git-grove/issues
Keywords: git,sync,repository,version-control,sparse-checkout
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Version Control :: Git
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# git-grove

A tool that cherry-picks and synchronizes specific files and directories across multiple Git repositories

## Introduction
A Python-based tool for synchronizing files and directories from multiple Git repositories into a local workspace. `git-grove` enables you to cherry-pick specific paths from remote repositories and keep them in sync with minimal overhead.

## Features

- **Selective File Syncing**: Sync specific files or directories from remote repositories
- **Multiple Repository Support**: Manage files from multiple Git repositories simultaneously
- **Flexible Backend Options**: Choose between `mktree` (default) or `sparse-checkout` backends
- **Efficient Caching**: Uses blobless clones to minimize storage and bandwidth
- **Portable Configuration**: Maintain sync configurations in a portable JSON registry file
- **GitHub Actions Integration**: Automated periodic syncing via GitHub Actions workflow
- **Light and Portable**: All application code resides in a single file, with no dependencies beyond Python and Git.

## Installation

### From Source

Clone this repository and install using `setup.py`:

```bash
git clone https://github.com/universetraveller/git-grove.git
cd git-grove
pip install .
```

This creates a `git-grove` command-line tool (and `git grove` subcommand) that can be invoked from anywhere.

### From Package Managers

Install directly with a package manager such as pip or uv:
```bash
pip install git-grove
```

### Requirements

- Python 3.6 or later
- Git

## Quick Start

1. **Add a repository and paths to sync**:
```bash
git-grove add https://github.com/owner/repo main path/to/file
```

2. **Synchronize**:
```bash
git-grove sync
```

This will:
- Clone the repository's metadata (a blobless clone, unless it is already cached)
- Extract the specified paths from the `main` branch
- Place them in your local workspace

## CLI Usage

`git-grove` provides several subcommands for managing your sync registry and performing synchronizations.

### Global Options

```bash
git-grove [OPTIONS] SUBCOMMAND [SUBCOMMAND_OPTIONS]
```

**Options**:
- `-v, --verbose`: Enable verbose debug output
- `-r, --registry PATH`: Path to the registry file (default: `./registry.json`)

### Subcommands

#### `sync` - Synchronize repositories

Synchronize files from registered repositories based on the registry configuration.

```bash
git-grove sync [TARGET...] [-f]
```

**Arguments**:
- `TARGET`: Optional targets to sync in format `<repo_name_or_url>[:<revision>]`
  - If not specified, syncs all registered repositories
  - Examples: `myrepo`, `myrepo:main`, `https://github.com/owner/repo:v1.0.0`

**Options**:
- `-f, --force`: Force synchronization even if already up-to-date

  This option forces the tool to skip the check against the last-synced hash. It is helpful when you have modified local files and want to synchronize again.

**Examples**:
```bash
# Sync all registered repositories
git-grove sync

# Sync specific repository
git-grove sync myrepo

# Sync specific revision only
git-grove sync myrepo:main

# Force sync even if up-to-date
git-grove sync -f
```
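
The `<repo_name_or_url>[:<revision>]` format needs some care because URLs contain colons themselves. The following is a minimal sketch of one way to split such a target, assuming the revision contains no `/`. `parse_target` is a hypothetical helper for illustration and is not necessarily how `git-grove` parses targets.

```python
def parse_target(target):
    """Split '<name_or_url>[:<revision>]' into (name_or_url, revision_or_None)."""
    head, sep, tail = target.rpartition(":")
    # No colon at all, or the last colon belongs to the URL itself
    # (e.g. 'https://host/owner/repo'): the whole string is the name or URL.
    if not sep or "/" in tail:
        return target, None
    return head, tail


for example in ("myrepo", "myrepo:main", "https://github.com/owner/repo:v1.0.0"):
    print(parse_target(example))
```

A revision that contains a slash (for example `feature/x`) would be misread as part of the URL by this sketch.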

#### `add` - Add repository/revision/paths

Add a new repository, revision, and paths to the registry.

```bash
git-grove add NAME_OR_URL REVISION PATHS... [OPTIONS]
```

**Arguments**:
- `NAME_OR_URL`: Repository name or URL
  
  The first time you add a repository, provide its URL. If no name is given, it is derived from the repository URL.
  
- `REVISION`: Git revision reference (branch, tag, or commit hash)
- `PATHS`: One or more paths to sync from the repository

  To add the whole repository, pass an empty string (`''` in the terminal). This case is handled by a special check before the tree object is built.

**Options**:
- `--name NAME`: Set a custom name for the repository (default: derived from repository URL)
- `--target TARGET`: Target output directory (relative to the global `target_dir`)
- `--last_synced HASH`: Set the last synced commit hash

**Examples**:
```bash
# Add a single path from a repository
git-grove add https://github.com/owner/repo main src/file.txt

# Add multiple paths
git-grove add https://github.com/owner/repo main src/ docs/ LICENSE

# Add with a custom name
git-grove add https://github.com/owner/repo main src/ --name myrepo

# Add with custom target directory
git-grove add myrepo develop config/ --target dev-config

# Add the whole repository
git-grove add myrepo main ''

# Use default repository name
git-grove add https://github.com/owner/name_a main src/file.txt
git-grove add name_a main README.md

# Specify a special name
git-grove add https://github.com/owner/name_a main src/file.txt --name name_b
git-grove add name_b main README.md
```
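
The examples above show `https://github.com/owner/name_a` yielding the default name `name_a`. A minimal sketch of that kind of derivation follows; `derive_name` is a hypothetical helper, and stripping a trailing `.git` is an assumption rather than documented behavior.

```python
def derive_name(url):
    """Return the last path component of a repository URL as its default name."""
    tail = url.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: a trailing '.git' is not part of the name.
    if tail.endswith(".git"):
        tail = tail[: -len(".git")]
    return tail


print(derive_name("https://github.com/owner/name_a"))
```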

#### `rm` - Remove repository/revision/paths

Remove repositories, revisions, or specific paths from the registry.

```bash
git-grove rm NAME_OR_URL REVISION [PATHS...]
```

**Arguments**:
- `NAME_OR_URL`: Repository name or URL to remove from
- `REVISION`: Revision to remove
- `PATHS`: Optional specific paths to remove (removes all paths if not specified)

**Examples**:
```bash
# Remove all paths from a revision (deletes the revision)
git-grove rm myrepo main

# Remove specific paths only
git-grove rm myrepo main src/file.txt docs/
```

#### `set` - Set registry values

Set or update values in the registry.

```bash
git-grove set PATH VALUE
```

**Arguments**:
- `PATH`: Dot-separated path to the field (use `\\` to escape dots in names)

  The path is resolved by `Registry._get_path(path: str) -> object | dict | list, str`. The root object is the registry (a Python dict). Each part of the path can be a dict key, a list index, or the name or URL of a repository.
  
- `VALUE`: Value to set

**Examples**:
```bash
# Set cache directory
git-grove set cache_dir /path/to/cache

# Set target directory
git-grove set target_dir ./synced

# Set backend mode
git-grove set backend sparse-checkout

# Set repository name by index
git-grove set repositories.0.name custom-name

# Set repository name by existing name/URL
git-grove set repositories.myrepo.name new-name
```
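
To make the path rules concrete, here is a minimal sketch of resolving a dot-separated path against a registry dict. It is not the real `Registry._get_path`: `resolve` is a hypothetical function, and it ignores dot escaping, so names or URLs containing dots are out of scope here.

```python
def resolve(registry, path):
    """Walk a dot-separated path through nested dicts and lists."""
    node = registry
    for part in path.split("."):
        if isinstance(node, dict):
            node = node[part]
        elif isinstance(node, list) and part.isdigit():
            node = node[int(part)]
        elif isinstance(node, list):
            # Non-numeric part: look up a repository by its name or URL.
            matches = [r for r in node if part in (r.get("name"), r.get("url"))]
            if not matches:
                raise KeyError(part)
            node = matches[0]
        else:
            raise KeyError(part)
    return node


registry = {
    "cache_dir": ".cache",
    "repositories": [{"name": "myrepo", "url": "ssh://host/myrepo", "revisions": {}}],
}
print(resolve(registry, "repositories.0.name"))
print(resolve(registry, "repositories.myrepo.name"))
```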

#### `unset` - Remove registry values

Remove a field from the registry.

```bash
git-grove unset PATH
```

**Arguments**:
- `PATH`: Dot-separated path to the field to remove

**Examples**:
```bash
# Remove custom cache directory (reverts to default)
git-grove unset cache_dir
```

#### `ls` - List registry contents

List repositories and revisions in the registry, or view a specific registry value.

```bash
git-grove ls [PATH]
```

**Arguments**:
- `PATH`: Optional dot-separated path to a specific value

**Examples**:
```bash
# List all registered repositories
git-grove ls

# View cache directory setting
git-grove ls cache_dir

# View specific repository details
git-grove ls repositories.myrepo
```

#### `batch-add` - Batch add from file

Add multiple repository entries from a file or stdin.

```bash
git-grove batch-add [FILE]
```

**Arguments**:
- `FILE`: Optional file path (reads from stdin if not provided)

**File Format**:
```
<repo_name_or_url> <revision> <path1> [<path2> ...]
```

**Examples**:
```bash
# Add from file
git-grove batch-add repos.txt

# Add from stdin
cat << EOF | git-grove batch-add
https://github.com/owner/repo1 main src/
https://github.com/owner/repo2 v1.0.0 docs/ LICENSE
EOF
```
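
If you generate the batch file from a script, it can help to check the lines first. The sketch below parses the documented `<repo_name_or_url> <revision> <path1> [<path2> ...]` format, assuming plain whitespace separation with no quoting; `parse_batch_lines` is a hypothetical helper, and skipping incomplete lines is a choice made for this example.

```python
def parse_batch_lines(lines):
    """Parse batch-add lines into (repo, revision, paths) tuples."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or incomplete line (skipped in this sketch)
        repo, revision, *paths = fields
        entries.append((repo, revision, paths))
    return entries


sample = [
    "https://github.com/owner/repo1 main src/",
    "https://github.com/owner/repo2 v1.0.0 docs/ LICENSE",
]
for entry in parse_batch_lines(sample):
    print(entry)
```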

## Registry File Format

The registry file (`registry.json`) stores all sync configurations:

```json
{
  "cache_dir": ".cache",
  "target_dir": ".",
  "backend": "mktree",
  "sparse-checkout": {
    "mode": "no-cone"
  },
  "repositories": [
    {
      "name": "example-repo",
      "url": "https://github.com/owner/repo",
      "revisions": {
        "main": {
          "paths": ["src/", "docs/"],
          "target": "example-repo",
          "last_synced": "abc123..."
        }
      }
    }
  ]
}
```

### Registry Fields

- **`cache_dir`** (optional): Directory for storing cached repository clones (default: `.cache`)
- **`target_dir`** (optional): Base directory for synced files (default: current directory)
- **`backend`** (optional): Sync backend to use: `mktree` or `sparse-checkout` (default: `mktree`)
- **`sparse-checkout`** (optional): Configuration for sparse-checkout backend
  - **`mode`**: `cone` or `no-cone` (default: `no-cone`)
- **`repositories`**: Array of repository configurations
  - **`name`** (optional): Custom name (derived from URL if not provided)
  - **`url`** (required): Git repository URL
  - **`revisions`**: Object mapping revision names to their configurations
    - **`paths`** (required): Array of paths to sync
    - **`target`** (optional): Output directory relative to `target_dir`
    - **`last_synced`** (optional): Last synced commit hash (updated automatically)
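
As an illustration of how the optional fields and their documented defaults fit together, the following sketch reads a registry from JSON text and fills in the defaults listed above. `load_settings` and the `sparse_mode` key are names invented for this example; the real `Registry` class may organize this differently.

```python
import json

# Defaults as documented in the field list above.
DEFAULTS = {"cache_dir": ".cache", "target_dir": ".", "backend": "mktree"}


def load_settings(text):
    """Parse registry JSON and apply the documented defaults."""
    data = json.loads(text)
    settings = {key: data.get(key, default) for key, default in DEFAULTS.items()}
    settings["sparse_mode"] = data.get("sparse-checkout", {}).get("mode", "no-cone")
    settings["repositories"] = data.get("repositories", [])
    return settings


print(load_settings('{"backend": "sparse-checkout"}'))
```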

## Backend Modes

### mktree (Default)

Uses Git's `mktree` command to build custom tree objects from selected paths. This is the recommended backend for most use cases.

**Advantages**:
- Precise path selection
- Works with any path combination
- No working directory conflicts

**How it works**:
1. Fetches object hashes for requested paths
2. Builds a custom Git tree containing only those paths
3. Checks out the tree to the target directory

### sparse-checkout

Uses Git's native sparse-checkout feature for path filtering.

See the [sparse-checkout](https://git-scm.com/docs/git-sparse-checkout) documentation for more information.

**Configuration**:
```bash
git-grove set backend sparse-checkout
git-grove set sparse-checkout.mode no-cone # no-cone (default) or cone
```

**Modes**:
- **`cone`**: Faster, but it also checks out files in parent directories that were not requested
- **`no-cone`**: Slower but more precise. This mode checks out exactly the specified paths
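
Cone mode matches whole directories, and Git also keeps the files that sit directly in the repository root and in each parent of a selected directory, which is where the extra files come from. The sketch below is a rough pure-Python model of that selection rule, not Git's implementation; `cone_selection` and the file list are illustrative only.

```python
from pathlib import PurePosixPath


def cone_selection(files, directories):
    """Model which files cone mode keeps for the given directories."""
    dirs = [PurePosixPath(d) for d in directories]
    # Files directly inside the root or any ancestor of a selected directory are kept.
    keep_parents = {PurePosixPath(".")}
    for d in dirs:
        keep_parents.update(d.parents)
    selected = []
    for f in files:
        parent = PurePosixPath(f).parent
        inside = any(d == parent or d in parent.parents for d in dirs)
        if inside or parent in keep_parents:
            selected.append(f)
    return selected


files = ["README.md", "src/main.py", "src/lib/a.py", "src/lib/deep/b.py", "docs/x.md"]
print(cone_selection(files, ["src/lib"]))
```

In this model, selecting `src/lib` also brings in `README.md` and `src/main.py`, while `docs/x.md` stays out.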

**Advantages**:
- Native Git feature
- Works by marking unselected entries as skip-worktree, which can perform better in some scenarios

**Limitations**:
- Requires walking the full working directory
- May have conflicts with parallel syncs

## GitHub Actions Integration

The included `.github/workflows/sync.yml` workflow enables automatic periodic synchronization:

```yaml
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday at midnight
  workflow_dispatch:      # Manual trigger
```

**Features**:
- Automatically syncs registered repositories on a schedule
- Caches repository clones for faster subsequent runs
- Commits and pushes changes back to your repository
- Manual triggering via GitHub Actions UI

**Setup**:
1. Clone this repository or fork it directly
2. Ensure your repository has write permissions for GitHub Actions
3. Configure your `registry.json` with desired repositories
4. To access private repositories, set `ACCESS_TOKEN` in your repository settings to your personal access token. If you only need public repositories, you can change the name `ACCESS_TOKEN` in `sync.yml` to `GITHUB_TOKEN`
5. Change the schedule in `sync.yml` to your preferred configuration
6. The workflow will handle the rest automatically

## Python API

For advanced use cases, you can use `git-grove` as a Python library.

### Core Classes

#### `Repository`

Represents a Git repository to sync from.

```python
Repository(name: str, url: str, output_dir: str, cache_dir: str)
```

**Parameters**:
- `name`: Repository identifier
- `url`: Git repository URL
- `output_dir`: Absolute path to output directory
- `cache_dir`: Absolute path to cache directory

**Key Methods**:

##### `clone()`
Clone the repository using blobless clone optimization.

**Raises**: `subprocess.CalledProcessError` if clone fails

##### `fetch()`
Fetch latest updates from the remote repository.

**Raises**: `subprocess.CalledProcessError` if fetch fails

##### `get_latest_commit(rev: str) -> str`
Get the commit hash for a given reference.

**Parameters**:
- `rev`: Branch name, tag, or commit hash

**Returns**: Commit SHA hash

**Raises**: `ValueError` if reference not found

##### `get_object(rev: str, path: str) -> tuple[str, str, str] | None`
Get Git object information for a path.

**Parameters**:
- `rev`: Git reference
- `path`: Path within the repository

**Returns**: Tuple of `(mode, type, hash)` or `None` if not found
- `mode`: Git mode (`040000` for tree, `100644` for blob)
- `type`: Object type (`tree` or `blob`)
- `hash`: Object SHA hash
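
The `(mode, type, hash)` triple has the same shape as one line of `git ls-tree` output, which is formatted as `<mode> <type> <object>` followed by a tab and the path. Whether `get_object` is built on `git ls-tree` is an assumption here; the sketch only shows how such a line maps onto the tuple, and `parse_ls_tree_line` is a hypothetical helper.

```python
def parse_ls_tree_line(line):
    """Turn one `git ls-tree` output line into (mode, type, hash)."""
    meta, _path = line.rstrip("\n").split("\t", 1)
    mode, kind, sha = meta.split(" ")
    return mode, kind, sha


# The hash below is Git's well-known empty-blob hash, used as sample data.
line = "100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\tREADME.md"
print(parse_ls_tree_line(line))
```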

##### `build_mktree_entries(rev: str, paths: list[str]) -> str`
Build a Git tree containing only specified paths.

**Parameters**:
- `rev`: Git reference to extract from
- `paths`: List of paths to include

**Returns**: Tree hash of the constructed tree

**Raises**: 
- `ValueError` if no valid targets found
- `RuntimeError` if local path conflicts exist

##### `read_tree(tree_hash: str)`
Check out a tree hash to the output directory.

**Parameters**:
- `tree_hash`: Git tree object hash to check out

**Raises**: `subprocess.CalledProcessError` if checkout fails

#### `Registry`

Manages the sync registry file and configuration.

```python
Registry(registry_file: str)
```

**Parameters**:
- `registry_file`: Path to the JSON registry file

**Properties**:

##### `repositories: list`
List of repository configurations from the registry.

##### `cache_dir: str`
Cache directory path (default: `.cache` in current directory).

##### `target_dir: str`
Target directory path (default: current directory).

##### `backend: tuple[str, dict]`
Backend mode and configuration. Returns tuple of `(mode, config)`.

**Key Methods**:

##### `add_revision(name_or_url: str, revision: str, paths: list[str], target: str | None = None, last_synced: str | None = None, name: str | None = None)`
Add or update a repository revision in the registry.

**Parameters**:
- `name_or_url`: Repository name or URL
- `revision`: Git reference
- `paths`: List of paths to sync
- `target`: Optional custom target directory
- `last_synced`: Optional last synced commit hash
- `name`: Optional custom repository name

##### `remove_revision(name_or_url: str, revision: str, paths: list[str] | None = None)`
Remove a revision or specific paths from the registry.

**Parameters**:
- `name_or_url`: Repository name or URL
- `revision`: Git reference to remove
- `paths`: Optional list of specific paths to remove (removes entire revision if `None`)

##### `set_value(path: str, value: Any)`
Set a value in the registry using dot-notation path.

**Parameters**:
- `path`: Dot-separated path to field
- `value`: Value to set

##### `unset(path: str)`
Remove a field from the registry.

**Parameters**:
- `path`: Dot-separated path to field

##### `dump_registry()`
Save the registry to disk if modified.

##### `load_registry()`
Load the registry from disk. Initializes empty registry if file doesn't exist.

##### `find_repository(name_or_url: str) -> dict | None`
Find a repository by name or URL.

**Parameters**:
- `name_or_url`: Repository name or URL

**Returns**: Repository dictionary or `None` if not found

##### `filter_targets(targets: list[str])`
Filter registry to only process specified targets.

**Parameters**:
- `targets`: List of targets in format `<name_or_url>[:<revision>]`

##### `__iter__() -> Iterator[SyncData]`
Iterate over all configured sync targets.

**Yields**: `SyncData` objects for each repository revision

#### `SyncData`

Data class representing a single sync target.

```python
SyncData(name: str, url: str, rev: str, paths: list[str], target: str, last_synced: str | None, details: dict)
```

**Attributes**:
- `name`: Repository name
- `url`: Repository URL
- `rev`: Git reference
- `paths`: List of paths to sync
- `target`: Absolute path to target output directory
- `last_synced`: Last synced commit hash or `None`
- `details`: Reference to registry revision details dict

**Methods**:

##### `update_last_synced(commit_hash: str)`
Update the last synced commit hash.

**Parameters**:
- `commit_hash`: New commit hash to record

### Utility Functions

#### `sync_single(data: SyncData, cache_dir: str) -> bool`
Synchronize a single revision using the mktree backend.

**Parameters**:
- `data`: Sync configuration
- `cache_dir`: Cache directory path

**Returns**: `True` if synchronized, `False` if already up-to-date

#### `sync_single_sparse_checkout(data: SyncData, cache_dir: str, sparse_mode: str) -> bool`
Synchronize using sparse-checkout backend.

**Parameters**:
- `data`: Sync configuration
- `cache_dir`: Cache directory path
- `sparse_mode`: Either `"cone"` or `"no-cone"`

**Returns**: `True` if synchronized, `False` if already up-to-date

#### `sync(registry: Registry)`
Perform synchronization for all configured repositories.

**Parameters**:
- `registry`: Registry instance containing sync configurations

**Side Effects**:
- Creates cache and target directories if needed
- Updates `last_synced` values in registry
- Saves registry to disk

### Example Usage

```python
from grove import Registry, Repository, sync

# Initialize registry
registry = Registry('./registry.json')

# Add a repository
registry.add_revision(
    name_or_url='https://github.com/owner/repo',
    revision='main',
    paths=['src/', 'README.md'],
    target='my-project'
)

# Perform sync
sync(registry)

# Or use Repository directly
repo = Repository(
    name='example',
    url='https://github.com/owner/repo',
    output_dir='/path/to/output',
    cache_dir='/path/to/cache'
)

repo.clone()
repo.fetch()
tree_hash = repo.build_mktree_entries('main', ['src/', 'docs/'])
repo.read_tree(tree_hash)
```

## Technical Architecture

### Core Concepts

#### Blobless Clones
`git-grove` uses `--filter=blob:none` when cloning repositories, which only downloads Git metadata (commits, trees) without file contents. File contents are fetched on-demand, significantly reducing storage and bandwidth requirements.
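
For reference, a blobless clone can be reproduced outside the tool with a plain `git clone`. The sketch below builds such a command; `blobless_clone_command` and `blobless_clone` are hypothetical helpers, and `--no-checkout` is an assumption added so that no blobs are fetched at clone time, not something this document states the tool uses.

```python
import subprocess


def blobless_clone_command(url, dest):
    """Build the argv for a blobless clone of `url` into `dest`."""
    # --filter=blob:none downloads commits and trees but defers file contents.
    # --no-checkout (assumption) avoids fetching blobs for an initial checkout.
    return ["git", "clone", "--filter=blob:none", "--no-checkout", url, dest]


def blobless_clone(url, dest):
    """Run the clone; raises subprocess.CalledProcessError on failure."""
    subprocess.run(blobless_clone_command(url, dest), check=True)


print(" ".join(blobless_clone_command("https://github.com/owner/repo", ".cache/repo")))
```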

#### Tree Building
The `mktree` backend constructs custom Git tree objects by:
1. Querying object hashes for requested paths
2. Building a tree structure (Trie) that includes only those paths
3. Using `git mktree` to create tree objects from this structure
4. Checking out the resulting tree

This allows precise path selection without requiring a full repository checkout.
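
The trie step can be sketched in pure Python. The example below groups requested paths into nested dicts and formats one level as `git mktree` input, which uses the same `<mode> <type> <object><TAB><name>` layout as `git ls-tree`. `build_trie` and `mktree_lines` are illustrative names rather than the tool's internals, and overlapping paths (such as `src` together with `src/lib`) are not handled.

```python
def build_trie(objects):
    """Map {'a/b': (mode, type, sha)} to nested dicts with tuple leaves."""
    root = {}
    for path, obj in objects.items():
        *parents, leaf = path.strip("/").split("/")
        node = root
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = obj
    return root


def mktree_lines(node, child_hashes):
    """Format one trie level as input lines for `git mktree`."""
    lines = []
    for name, value in sorted(node.items()):
        if isinstance(value, dict):
            # Intermediate directory: its hash comes from an earlier mktree call.
            lines.append("040000 tree {}\t{}".format(child_hashes[name], name))
        else:
            mode, kind, sha = value
            lines.append("{} {} {}\t{}".format(mode, kind, sha, name))
    return lines


objects = {"src/lib": ("040000", "tree", "a" * 40), "LICENSE": ("100644", "blob", "b" * 40)}
trie = build_trie(objects)
print(mktree_lines(trie["src"], {}))
```

Trees are built bottom-up in this scheme: the innermost levels are passed to `git mktree` first, and each resulting hash is then supplied through `child_hashes` when formatting the parent level.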

#### Registry-Based Management
All sync configurations are stored in a JSON registry file, enabling:
- Version control of sync configurations
- Declarative sync specifications
- Automated synchronization workflows
- Reproducible sync operations

#### Incremental Updates
The `last_synced` field tracks the last processed commit for each revision. Subsequent syncs only update when the remote reference has changed, avoiding unnecessary work.
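
The decision itself is a simple comparison. A minimal sketch, assuming the check is an equality test between the stored hash and the commit the revision currently resolves to (`needs_sync` is a hypothetical name, and `force` stands in for the `--force` flag):

```python
def needs_sync(last_synced, latest_commit, force=False):
    """Return True when a revision should be synchronized again."""
    # A missing last_synced (None) never equals a commit hash, so it always syncs.
    return force or last_synced != latest_commit


print(needs_sync(None, "abc123"))
print(needs_sync("abc123", "abc123"))
print(needs_sync("abc123", "abc123", force=True))
```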

### Directory Structure

With the default settings, `git-grove` creates:

```
.
├── registry.json          # Sync configuration
├── .cache/               # Cached repository clones
│   └── <repo-name>/
│       ├── .git/         # Bare-like repository
│       └── sync_index    # Custom Git index file
└── <target-dir>/         # Synced files (default: current directory)
    └── <repo-name>/
        └── <synced-files>
```

### Environment Variables

When performing Git operations, `git-grove` sets:
- `GIT_DIR`: Points to the cached repository's `.git` directory
- `GIT_INDEX_FILE`: Points to a custom index file for isolation
- `GIT_WORK_TREE`: Points to the output directory

This allows multiple repositories to be managed independently without conflicts.
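
To show how these variables relate to the directory layout above, here is a minimal sketch that builds such an environment for one repository. `git_env` is a hypothetical helper; the paths follow the documented layout (`<cache_dir>/<repo-name>/.git` and `sync_index`), but the tool's actual code may differ.

```python
import os


def git_env(cache_dir, repo_name, output_dir, base_env=None):
    """Build an environment that points Git at one cached repository."""
    repo_cache = os.path.join(cache_dir, repo_name)
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "GIT_DIR": os.path.join(repo_cache, ".git"),
        "GIT_INDEX_FILE": os.path.join(repo_cache, "sync_index"),
        "GIT_WORK_TREE": output_dir,
    })
    return env


env = git_env("/tmp/cache", "example-repo", "/tmp/out/example-repo", base_env={})
for key in sorted(env):
    print(key, "=", env[key])
```

Passing the result as `env=` to `subprocess.run` scopes the settings to a single Git invocation and leaves the parent process environment untouched.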

## Use Cases

### Centralized Configuration Management
Sync configuration files from a central repository to multiple projects:
```bash
git-grove add https://github.com/company/configs main shared/eslint shared/prettier
git-grove sync
```

### Multi-Repository Documentation
Aggregate documentation from multiple repositories:
```bash
git-grove add https://github.com/org/api-docs main docs/ --target api-docs
git-grove add https://github.com/org/sdk-docs main docs/ --target sdk-docs
git-grove sync
```

### Shared Resource Libraries
Keep shared assets synchronized across projects:
```bash
git-grove add https://github.com/design/assets main icons/ fonts/ --target assets
git-grove sync
```

### Multi-Revision Files
Keep files from multiple branches or commits of a single repository:
```bash
git-grove add https://github.com/owner/name_a main file_a
git-grove add name_a master file_a
git-grove add name_a commit_a file_a
git-grove sync
```

### Automated Dependency Updates
Use the GitHub Actions workflow to automatically sync and commit updates:
- Forked repositories stay in sync with upstream
- Configuration files update automatically
- Shared resources propagate to all projects

## Contributing

Contributions are welcome! This project uses:
- Python 3.6+ with type hints
- Standard library where possible
- Git plumbing commands for repository operations

## License

[MIT License](LICENSE)
