Metadata-Version: 2.4
Name: cf-datahive
Version: 0.1.0
Summary: Canonical result and measurement data storage APIs for Cogniflow
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: pyarrow>=12
Provides-Extra: pandas
Requires-Dist: pandas>=2.0; extra == "pandas"
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == "test"
Requires-Dist: pandas>=2.0; extra == "test"

# cf_datahive

`cf_datahive` is the Data Hive package boundary: it provides the Python-facing APIs and tooling around the canonical data hive root (`workspace/data_hive`).

## Boundary (Current Phase)

- Python package role (`sandcastle/cf_datahive`): read-oriented API/tooling/validation for pipeline-facing workflows.
- Native role (`sandcastle/cf_datahive/cpp`): write gatekeeper and only allowed writer under `workspace/data_hive`.
- Step packages must stay thin wrappers and call the native gatekeeper instead of implementing filesystem/parquet helpers.

## Development workflow

- Current development mode is source-first via `scripts/fresh_install.ps1`.
- The package can now be built and published independently without changing the read/write ownership boundary above.

## Canonical layout

```
workspace/
  data_hive/
    <pipeline_id>/
      runs/
        <run_id>/
          manifest.json
          tables/
            <table_name>/
              part-0000.parquet
              part-0001.parquet
          artifacts/
            <artifact_name>
      latest.txt
```

- `latest.txt` stores the committed `run_id` and is updated atomically.
- `manifest.json` is the single source of truth for run metadata, table metadata, file hashes, and artifact hashes.
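
As a rough illustration of how the layout above can be navigated read-only, the sketch below resolves the committed run directory from `latest.txt`. The helper name `resolve_latest_run_dir` is hypothetical and not part of `cf_datahive`, and it assumes `latest.txt` holds the `run_id` as plain text; prefer `DataHiveClient` for real workflows.

```python
from pathlib import Path
from typing import Optional


def resolve_latest_run_dir(workspace_root: Path, pipeline_id: str) -> Optional[Path]:
    """Return the run directory named by latest.txt, or None if none is committed."""
    pipeline_dir = workspace_root / "data_hive" / pipeline_id
    latest_file = pipeline_dir / "latest.txt"
    if not latest_file.is_file():
        return None
    # Assumption: the file contains only the run_id, possibly with a trailing newline.
    run_id = latest_file.read_text(encoding="utf-8").strip()
    if not run_id:
        return None
    run_dir = pipeline_dir / "runs" / run_id
    # Only report runs whose manifest.json is present.
    return run_dir if (run_dir / "manifest.json").is_file() else None
```

The function only reads from the hive, which keeps it on the Python side of the read/write boundary described earlier.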

## Usage

```python
from pathlib import Path

from cf_datahive import DataHiveClient

workspace_root = Path("workspace")
client = DataHiveClient(str(workspace_root))

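# List the runs recorded for one pipeline.
runs = client.list_runs("opcua_fifo_avg")
if runs:
    # runs[0] is treated as the most recent run here; confirm the ordering
    # that list_runs guarantees before relying on it.
    latest = runs[0].run_id
    manifest = client.load_manifest("opcua_fifo_avg", latest)
    table = client.read_table("opcua_fifo_avg", latest, "measurements")
    print(manifest.status, table.num_rows)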
```

## Manifest details

Each run stores a `RunManifest` (`schema_version="1.0"`) with:

- run lifecycle fields (`status`: `staged|committed|aborted`)
- table entries (`parquet`, schema fingerprint, row/file counts, optional file hashes)
- artifact entries (sha256, media type, size)
- optional `semantic_refs` placeholder map for future ontology links

The schema fingerprint is the sha256 digest of the serialized Arrow schema bytes.

## Guardrails

Run the repository guardrail check:

```
python tools/check_datahive_guardrails.py
```

The script scans C++ sources and headers and checks step packages for files that:

- use canonical `workspace/data_hive` literals outside the native gatekeeper location (hard fail)
- violate the thin-steps rule in `sandcastle/cf_basic_steps/*/src/*/cpp` (hard fail)
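
To make the first rule concrete, here is a simplified sketch of a literal scan. It is not the code in `tools/check_datahive_guardrails.py`; the function name, the suffix set, and the line-by-line matching are all assumptions made for illustration.

```python
from pathlib import Path

FORBIDDEN_LITERAL = "workspace/data_hive"
# Assumed set of C++ source and header suffixes.
CPP_SUFFIXES = {".cpp", ".cc", ".cxx", ".h", ".hpp"}


def find_literal_violations(repo_root: Path, allowed_dir: Path) -> list[tuple[Path, int]]:
    """Return (file, line_number) pairs where the literal appears outside allowed_dir."""
    violations = []
    for path in sorted(repo_root.rglob("*")):
        if path.suffix not in CPP_SUFFIXES or not path.is_file():
            continue
        # Files under the native gatekeeper directory may use the literal.
        if path.is_relative_to(allowed_dir):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if FORBIDDEN_LITERAL in line:
                violations.append((path, number))
    return violations
```

A caller could treat any non-empty result as a hard failure, for example by exiting with a non-zero status.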

## Testing

Install test dependencies and run:

```
pip install -e "sandcastle/cf_datahive[test]"
pytest -q sandcastle/cf_datahive/tests
```

The published distribution is named `cf-datahive`:

```bash
pip install cf-datahive
```

## Publishing

`cf_datahive` is published with the dedicated Windows workflow:

- Workflow: `.github/workflows/cf_datahive_windows_publish.yml`
- Package directory: `sandcastle/cf_datahive`
- PyPI tag: `cf-datahive-v<version>`
- TestPyPI tag: `cf-datahive-v<version>-test`

Local preflight:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/mimic_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PythonExe py `
  -PythonVersion 3.13
```

Queue a dry-run dispatch:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/queue_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PublishTarget testpypi `
  -Ref main `
  -RequireLocalPass `
  -DryRun
```

## Do / Don't

- Do: use `DataHiveClient` read APIs (`list_runs`, `load_manifest`, `read_table`, `open_artifact`) for inspection and validation.
- Do: route pipeline write ownership through `cf_datahive_cpp` in the sink path.
- Don't: write parquet files or artifacts directly into the canonical data hive root from pipeline steps.
- Don't: bypass manifest updates.
