Metadata-Version: 2.4
Name: space-glue
Version: 0.1.2
Summary: SpaCE-GLUE: Spatial Cognition Exercises – General Language Understanding Evaluation
Requires-Python: >=3.13
Description-Content-Type: text/markdown
Requires-Dist: aiohappyeyeballs==2.6.1
Requires-Dist: aiohttp==3.13.2
Requires-Dist: aiosignal==1.4.0
Requires-Dist: annotated-types==0.7.0
Requires-Dist: anyio==4.11.0
Requires-Dist: attrs==25.4.0
Requires-Dist: certifi==2025.11.12
Requires-Dist: charset-normalizer==3.4.4
Requires-Dist: click==8.3.1
Requires-Dist: colorama==0.4.6
Requires-Dist: datasets==3.6.0
Requires-Dist: dill==0.3.8
Requires-Dist: distro==1.9.0
Requires-Dist: dotenv==0.9.9
Requires-Dist: filelock==3.20.0
Requires-Dist: frozenlist==1.8.0
Requires-Dist: fsspec==2025.3.0
Requires-Dist: geographiclib==2.1
Requires-Dist: geopy==2.4.1
Requires-Dist: h11==0.16.0
Requires-Dist: hf-xet==1.2.0
Requires-Dist: httpcore==1.0.9
Requires-Dist: httpx==0.28.1
Requires-Dist: huggingface-hub==1.2.3
Requires-Dist: idna==3.11
Requires-Dist: jiter==0.12.0
Requires-Dist: levenshtein==0.27.3
Requires-Dist: markdown-it-py==4.0.0
Requires-Dist: mdurl==0.1.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: multidict==6.7.0
Requires-Dist: multiprocess==0.70.16
Requires-Dist: numpy==2.3.5
Requires-Dist: openai==2.8.1
Requires-Dist: packaging==25.0
Requires-Dist: pandas==2.3.3
Requires-Dist: propcache==0.4.1
Requires-Dist: pyarrow==22.0.0
Requires-Dist: pydantic==2.12.4
Requires-Dist: pydantic-core==2.41.5
Requires-Dist: pygments==2.19.2
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: python-dotenv==1.2.1
Requires-Dist: python-levenshtein==0.27.3
Requires-Dist: pytz==2025.2
Requires-Dist: pyyaml==6.0.3
Requires-Dist: rapidfuzz==3.14.3
Requires-Dist: requests==2.32.5
Requires-Dist: rich==13.9.4
Requires-Dist: shellingham==1.5.4
Requires-Dist: six==1.17.0
Requires-Dist: sniffio==1.3.1
Requires-Dist: sparc-puzzle==0.3.4
Requires-Dist: sympy==1.14.0
Requires-Dist: tqdm==4.67.1
Requires-Dist: typer-slim==0.20.0
Requires-Dist: typing-extensions==4.15.0
Requires-Dist: typing-inspection==0.4.2
Requires-Dist: tzdata==2025.3
Requires-Dist: urllib3==2.6.2
Requires-Dist: xxhash==3.6.0
Requires-Dist: yarl==1.22.0

# SpaCE-GLUE

**Spatial Cognition Exercises – General Language Understanding Evaluation**

[![PyPI version](https://badge.fury.io/py/space-glue.svg)](https://pypi.org/project/space-glue/)

A benchmarking framework for evaluating spatial reasoning capabilities of Large Language Models (LLMs).


## Overview

SpaCE-GLUE provides a unified interface for evaluating LLMs on various spatial reasoning benchmarks. It supports:

- Multiple spatial reasoning datasets (bAbI, StepGame, SpartQA, and more)
- OpenAI-compatible APIs and local vLLM inference
- Configurable evaluation workflows with YAML configuration
- Automatic result aggregation and scoring
- Resume capability for interrupted runs


## Installation

### From PyPI

```bash
pip install space-glue
```

### From Source

```bash
git clone https://github.com/olehae/SpaCE-GLUE.git
cd SpaCE-GLUE
pip install -e .
```

### Requirements

- Python >= 3.13
- Dependencies are automatically installed via pip

## Quick Start

1. Create a configuration file `config.yaml`:

```yaml
model:
  class: "models.openai_model.OpenAIModel"
  params:
    name: "gpt-4"
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"

datasets:
  - class: "data.SPaRC.SPaRC"

evaluation:
  results_dir: "results"
  batch_size: 1
  runs: [1]
```

2. Set your API key (or use a `.env` file):

```bash
export OPENAI_API_KEY="your-api-key"
```

3. Run the evaluation:

```bash
space-glue --config config.yaml
```


## Configuration

SpaCE-GLUE uses YAML configuration files. Environment variables can be referenced using `${VAR_NAME}` syntax and will be resolved from the environment or a `.env` file.
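
As a rough illustration of the `${VAR_NAME}` substitution described above, the following sketch walks a parsed configuration and replaces placeholders from a mapping of environment values. The function name `resolve_env`, the regular expression, and the choice to raise on an unset variable are assumptions made for this example, not SpaCE-GLUE's actual implementation; loading a `.env` file is left out.

```python
import os
import re

# Hypothetical pattern for ${VAR_NAME} placeholders (an assumption for this sketch).
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Recursively replace ${VAR_NAME} placeholders in strings, dicts, and lists."""
    env = os.environ if env is None else env

    if isinstance(value, str):
        def _substitute(match):
            name = match.group(1)
            if name not in env:
                # Assumption: an unset variable is treated as an error.
                raise KeyError(f"Environment variable {name!r} is not set")
            return env[name]

        return _PLACEHOLDER.sub(_substitute, value)

    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}

    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]

    # Numbers, booleans, and None pass through unchanged.
    return value
```

Passing an explicit `env` mapping keeps the function easy to test; in normal use it falls back to `os.environ`.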

### Top-Level Structure

```yaml
model:       # Required - Model configuration
datasets:    # Required - List of datasets to evaluate
evaluation:  # Optional - Evaluation settings
logging:     # Optional - Logging configuration
```

---

### `model` (required)

Specifies the model class and its constructor parameters.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `class` | string | Yes | Full import path to the model class |
| `params` | mapping | No | Constructor arguments for the model |

**Example with OpenAI API:**

```yaml
model:
  class: "models.openai_model.OpenAIModel"
  params:
    name: "gpt-4"
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    temperature: 0.7
```
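
The `class` and `params` fields suggest a dynamic-import pattern: the dotted path is split into a module and a class name, the module is imported, and the class is called with `params` as keyword arguments. The sketch below shows one way this could work using only the standard library. The helpers `load_class` and `instantiate` are hypothetical names for this example and are not part of SpaCE-GLUE's documented API.

```python
import importlib


def load_class(path):
    """Import and return the class named by a dotted path such as 'pkg.module.ClassName'."""
    module_name, _, class_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted import path, got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def instantiate(entry):
    """Build an object from a config entry with a 'class' key and optional 'params'."""
    cls = load_class(entry["class"])
    params = entry.get("params") or {}  # a missing or null 'params' means no arguments
    return cls(**params)
```

For example, `instantiate({"class": "collections.OrderedDict", "params": {"a": 1}})` builds an `OrderedDict` from the standard library in the same way a model or dataset entry would be built from its import path.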

---

### `datasets` (required)

A list of datasets to evaluate. Each entry specifies a dataset class and optional parameters.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `class` | string | Yes | Full import path to the dataset class |
| `params` | mapping | No | Constructor arguments for the dataset |

**Example:**

```yaml
datasets:
  - class: "data.StepGame.StepGame"
  - class: "data.SpartQA.SpartQA"
  - class: "data.bAbI.bAbI"
```

---

### `evaluation` (optional)

Controls the evaluation workflow behavior.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `results_dir` | string | `"results"` | Directory to store results |
| `batch_size` | int | `1` | Number of prompts to process concurrently |
| `runs` | list[int] | `[1] * len(datasets)` | Number of inference runs per item for each dataset |
| `inference` | bool | `true` | Whether to run model inference and store responses |
| `evaluate` | bool | `true` | Whether to score responses |
| `aggregate` | bool | `true` | Whether to compute aggregate statistics |

Results are stored as JSONL files in the configured `results_dir`.

**Example:**

```yaml
evaluation:
  results_dir: "my_results"
  batch_size: 5
  runs: [3, 3, 3]      # 3 runs for each of the 3 datasets
  inference: true
  evaluate: true
  aggregate: true
```
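
The `runs` list is positional: its entries line up with the entries under `datasets`, and the documented default is `[1] * len(datasets)`. The sketch below illustrates that default. The function name `normalize_runs` is hypothetical, and rejecting a length mismatch or a non-positive count is an assumption made for this example; the framework's own handling may differ.

```python
def normalize_runs(datasets, runs=None):
    """Return one run count per dataset, defaulting to a single run each."""
    if runs is None:
        # Documented default: one run for every configured dataset.
        return [1] * len(datasets)

    runs = list(runs)
    if len(runs) != len(datasets):
        # Assumption: a mismatch is treated as a configuration error.
        raise ValueError(
            f"'runs' has {len(runs)} entries but {len(datasets)} datasets are configured"
        )
    if any(isinstance(n, bool) or not isinstance(n, int) or n < 1 for n in runs):
        # Assumption: each run count must be a positive integer.
        raise ValueError("every entry in 'runs' must be a positive integer")
    return runs


datasets = ["data.StepGame.StepGame", "data.SpartQA.SpartQA", "data.bAbI.bAbI"]
print(normalize_runs(datasets))             # [1, 1, 1]
print(normalize_runs(datasets, [3, 3, 3]))  # [3, 3, 3]
```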

---

### `logging` (optional)

Configures logging output.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `level` | string | `"INFO"` | Log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
| `format` | string | `"%(asctime)s - %(levelname)s - %(message)s"` | Log message format |
| `file` | string | `null` | Optional file path for logging output |

**Example:**

```yaml
logging:
  level: "DEBUG"
  format: "%(asctime)s - %(levelname)s - %(message)s"
  file: "space_glue.log"
```
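
The three fields map naturally onto Python's standard `logging` module. The sketch below shows how such a mapping could be applied with `logging.basicConfig`. The function name `configure_logging` is hypothetical, and logging to both the console and the file when `file` is set is an assumption; SpaCE-GLUE may wire this up differently.

```python
import logging

DEFAULT_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


def configure_logging(cfg=None):
    """Apply a 'logging' config mapping and return the numeric log level."""
    cfg = cfg or {}

    level_name = str(cfg.get("level", "INFO")).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {level_name!r}")

    # Assumption: console output is kept even when a log file is configured.
    handlers = [logging.StreamHandler()]
    if cfg.get("file"):
        handlers.append(logging.FileHandler(cfg["file"]))

    logging.basicConfig(
        level=level,
        format=cfg.get("format", DEFAULT_FORMAT),
        handlers=handlers,
        force=True,  # replace any handlers installed by an earlier call
    )
    return level
```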

---

## Available Datasets

SpaCE-GLUE includes the following spatial reasoning benchmarks:

| Dataset | Class Path | Sources |
|---------|------------|-------------|
| **bAbI** | `data.bAbI.bAbI` | [Paper](https://arxiv.org/pdf/1502.05698) [Data](https://huggingface.co/datasets/facebook/babi_qa) |
| **GeoGramBench** | `data.GeoGramBench.GeoGramBench` | [Paper](https://arxiv.org/pdf/2505.17653) [Data](https://huggingface.co/datasets/LiAuto-DSR/GeoGramBench) |
| **GRASP** | `data.GRASP.GRASP` | [Paper](https://arxiv.org/pdf/2407.01892) [Data](https://github.com/jasontangzs0/GRASP) |
| **PLUGH** | `data.PLUGH.PLUGH` | [Paper](https://arxiv.org/pdf/2408.04648) [Data](https://github.com/altsoph/PLUGH) |
| **RoomSpace** | `data.RoomSpace.RoomSpace` | [Paper](https://arxiv.org/pdf/2405.15064) [Data](https://huggingface.co/datasets/Fangjun/RoomSpace) |
| **SPaRC** | `data.SPaRC.SPaRC` | [Paper](https://arxiv.org/pdf/2505.16686) [Data](https://huggingface.co/datasets/lkaesberg/SPaRC) |
| **SpartQA** | `data.SpartQA.SpartQA` | [Paper](https://arxiv.org/pdf/2104.05832) [Data](https://huggingface.co/datasets/tasksource/spartqa-mchoice) |
| **SpatialEval** | `data.SpatialEval.SpatialEval` | [Paper](https://arxiv.org/pdf/2406.14852) [Data](https://huggingface.co/datasets/MilaWang/SpatialEval) |
| **STBench** | `data.STBench.STBench` | [Paper](https://arxiv.org/pdf/2406.19065) [Data](https://github.com/LwbXc/STBench) |
| **StepGame** | `data.StepGame.StepGame` | [Paper](https://arxiv.org/pdf/2204.08292) [Data](https://huggingface.co/datasets/ZhengyanShi/StepGame) |

---

## Available Models

### OpenAIModel

For the OpenAI API or OpenAI-compatible endpoints (vLLM, Ollama, etc.).

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Model name/identifier |
| `base_url` | string | Yes | API endpoint URL |
| `api_key` | string | Yes | API key (use `"EMPTY"` for local endpoints) |
| `temperature` | float | No | Sampling temperature |
| `reasoning_effort` | string | No | Reasoning effort level (`"low"`, `"medium"`, `"high"`) |

### VLLMModel

For direct local inference using vLLM.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `name` | string | Yes | - | Hugging Face model name |
| `temperature` | float | No | `0.7` | Sampling temperature |

---

## Example Configuration


```yaml
# SpaCE-GLUE Evaluation Configuration

# Model Configuration
model:
  class: "models.openai_model.OpenAIModel"
  params:
    name: "gpt-4"
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"

# Datasets to evaluate
datasets:
  - class: "data.bAbI.bAbI"
  - class: "data.GeoGramBench.GeoGramBench"
  - class: "data.GRASP.GRASP"
  - class: "data.PLUGH.PLUGH"
  - class: "data.RoomSpace.RoomSpace"
  - class: "data.SPaRC.SPaRC"
  - class: "data.SpartQA.SpartQA"
  - class: "data.SpatialEval.SpatialEval"
  - class: "data.STBench.STBench"
  - class: "data.StepGame.StepGame"

# Evaluation settings
evaluation:
  results_dir: "results"
  batch_size: 5
  runs: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

# Logging settings
logging:
  level: "INFO"
  format: "%(asctime)s - %(levelname)s - %(message)s"

```
