Metadata-Version: 2.4
Name: neon-vla
Version: 0.1.0
Summary: Open-source G1 humanoid VLA with video foundation model backbone
Project-URL: Homepage, https://github.com/cagataycali/neon
Project-URL: Repository, https://github.com/cagataycali/neon
Author: Cagatay Cali
License: MIT
License-File: LICENSE
Keywords: g1,humanoid,robotics,vision-language-action,vla
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: datasets>=3.0.0
Requires-Dist: einops>=0.7.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: torch>=2.2.0
Requires-Dist: transformers<5.3.0,>=4.48.0
Provides-Extra: agent
Requires-Dist: strands-agents>=0.1.0; extra == 'agent'
Provides-Extra: all
Requires-Dist: accelerate>=1.2.0; extra == 'all'
Requires-Dist: bitsandbytes>=0.45.0; extra == 'all'
Requires-Dist: lerobot>=0.5.0; extra == 'all'
Requires-Dist: mypy>=1.0; extra == 'all'
Requires-Dist: peft>=0.14.0; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Requires-Dist: ruff>=0.3.0; extra == 'all'
Requires-Dist: strands-agents>=0.1.0; extra == 'all'
Requires-Dist: strands-cosmos>=0.1.0; extra == 'all'
Requires-Dist: trl>=0.15.0; extra == 'all'
Requires-Dist: wandb>=0.16.0; extra == 'all'
Provides-Extra: cosmos
Requires-Dist: strands-cosmos>=0.1.0; extra == 'cosmos'
Provides-Extra: dev
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.3.0; extra == 'dev'
Provides-Extra: lerobot
Requires-Dist: lerobot>=0.5.0; extra == 'lerobot'
Provides-Extra: train
Requires-Dist: accelerate>=1.2.0; extra == 'train'
Requires-Dist: bitsandbytes>=0.45.0; extra == 'train'
Requires-Dist: peft>=0.14.0; extra == 'train'
Requires-Dist: trl>=0.15.0; extra == 'train'
Requires-Dist: wandb>=0.16.0; extra == 'train'
Description-Content-Type: text/markdown

<div align="center">
  <img src="docs/assets/neon-logo.svg" width="180" alt="Neon Logo"/>
  <h1>Neon</h1>
  <p><strong>Open-source Vision-Language-Action model for the Unitree G1 humanoid</strong></p>
  <p>Video understanding · Natural language · 29 DoF whole-body control · Audio in/out</p>

  [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
  [![Tests](https://img.shields.io/badge/tests-80%20passed-brightgreen)]()
  [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)]()
  [![Docs](https://img.shields.io/badge/docs-mkdocs-orange.svg)](https://cagataycali.github.io/neon)
</div>

---

## What is Neon?

Neon turns a **video foundation model** into a humanoid robot controller. Instead of predicting actions from static, single-frame image features, Neon builds on video models that already capture motion, physics, and temporal dynamics, then decodes that understanding into joint commands.

```mermaid
graph LR
    CAM["📹 Camera"] --> NV{"🤖 Neon VLA"}
    TXT["🗣️ Voice / Text"] --> NV
    PROP["🦾 Joint States"] --> NV
    NV --> ACT["29 DoF Actions"]
    NV --> SPEECH["🔊 Speech"]

    style NV fill:#e65100,stroke:#ff6d00,color:#fff,stroke-width:2px
```

---

## Architecture

```mermaid
graph TD
    subgraph Inputs
        CAM["📹 Camera Frames"]
        MIC["🎤 Audio (16kHz)"]
        TXT["📝 Language Instruction"]
        PROP["🦾 Joint States"]
    end

    subgraph "Neon VLA"
        VB["Video Backbone<br/>Qwen2.5-Omni / Cosmos-Reason2"]
        AE["Audio Encoder<br/>Whisper / Omni native"]
        PE["Proprioception Encoder<br/>MLP"]
        FUS["Feature Fusion<br/>Linear + ReLU²"]
        AH["Action Heads<br/>Parameter Golf v2"]
        SH["Speech Head"]
    end

    subgraph Outputs
        JOINTS["🤖 Arm Joints (14 DoF)"]
        LOCO["🏃 Locomotion (vx, vy, ω)"]
        VOICE["🔊 PersonaPlex TTS"]
    end

    CAM --> VB
    TXT --> VB
    MIC --> AE
    PROP --> PE

    VB --> FUS
    AE --> FUS
    PE --> FUS

    FUS --> AH
    FUS --> SH

    AH --> JOINTS
    AH --> LOCO
    SH --> VOICE

    style VB fill:#1565c0,color:#fff
    style FUS fill:#333,color:#fff
    style AH fill:#1b5e20,color:#fff
```

### Why Video Foundation Models?

| Traditional VLAs | Neon |
|---|---|
| Image encoder (static frames) | **Video encoder** (temporal sequences) |
| No physics understanding | **Cosmos-Reason2** trained on physical-world data |
| Single-frame prediction | **Action chunking** (16 future steps) |
| Separate speech pipeline | **Native audio** in/out (Qwen2.5-Omni) |

---

## Quick Start

```bash
pip install neon-vla
```

```python
from neon.model.neon_vla import NeonVLA, NeonConfig

model = NeonVLA(NeonConfig(control_mode="arms_only"))
model.load_backbone()

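# camera_frame and joint_states are placeholders: supply your own camera
# image and current G1 joint readings.
output = model.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
)
# output.actions     → (16, 17): 16 timesteps × 17 action dims (14 arm + 3 locomotion)
# output.upper_body  → (16, 14): arm positions
# output.locomotion  → (16, 3):  velocity commands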
```
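
The shapes in the comments above suggest that each 17-dimensional row in `arms_only` mode combines 14 arm joint targets with 3 locomotion velocity commands. The following is a minimal sketch of splitting such a chunk by hand. It assumes, without confirmation from the Neon source, that the arm columns come first and the locomotion columns last; `split_chunk` is a hypothetical helper, not part of the Neon API.

```python
import numpy as np

CHUNK_LEN = 16   # future timesteps per prediction
ARM_DIMS = 14    # 7 joints per arm
LOCO_DIMS = 3    # vx, vy, ω


def split_chunk(actions: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a (16, 17) action chunk into arm and locomotion parts.

    Assumes arm joints occupy the first 14 columns (illustrative layout only).
    """
    if actions.shape != (CHUNK_LEN, ARM_DIMS + LOCO_DIMS):
        raise ValueError(f"unexpected action shape {actions.shape}")
    return actions[:, :ARM_DIMS], actions[:, ARM_DIMS:]


# Stand-in for output.actions returned by model.predict(...)
dummy_actions = np.zeros((CHUNK_LEN, ARM_DIMS + LOCO_DIMS))
upper_body, locomotion = split_chunk(dummy_actions)
print(upper_body.shape, locomotion.shape)
```

In practice `output.upper_body` and `output.locomotion` already expose these views, so a helper like this is only useful for checking the layout.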

---

## Action Heads — Parameter Golf v2

Our action decoder heads use techniques from the [MicroGPT Parameter Golf](https://github.com/openai/parameter-golf) competition, optimized for small trainable modules (~2M params) on top of frozen backbones:

```mermaid
graph TD
    F["Fused Features (2048-d)"]
    F --> NORM1["RMSNorm"]
    NORM1 --> L1["Linear → hidden"]
    L1 --> ACT1["ReLU²"]
    ACT1 --> |"skip"| SKIP((" "))
    ACT1 --> NORM2["RMSNorm"]
    NORM2 --> L2["Linear → hidden"]
    L2 --> ACT2["ReLU²"]
    ACT2 --> CAT["Concat + Skip"]
    SKIP --> CAT
    CAT --> NORM3["RMSNorm"]
    NORM3 --> L3["Linear → hidden"]
    L3 --> RES["+ α × residual"]
    RES --> OUT["Linear → actions"]
    OUT --> SC["Soft-Cap (tanh)"]
    SC --> ACTIONS["Actions ∈ [-1, 1]"]

    style ACT1 fill:#e65100,color:#fff
    style ACT2 fill:#e65100,color:#fff
    style SC fill:#1b5e20,color:#fff
```

| Technique | What | Why |
|---|---|---|
| **ReLU²** | `max(0, x)²` | Continuous first derivative (unlike plain ReLU), cheaper to compute than GELU |
| **RMSNorm** | `x / √(mean(x²))` | Lighter than LayerNorm, stabilizes small MLPs |
| **Residual Scales** | `h + α·residual` | Learned α prevents backbone features from dominating |
| **U-Net Skip** | Layer 0 → last layer | Gradient highway through deep decoders |
| **Soft-Capping** | `tanh(x/cap)·cap` | Bounds outputs to ±cap smoothly instead of hard clipping at the boundary |
| **Adam β₁=0.85** | Lower momentum | Faster adaptation to shifting action distributions |
| **Grad Clip 0.3** | Tight clipping | Prevents divergence in small heads |
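
To make the table concrete, here is a rough NumPy sketch of the three element-wise techniques and a forward pass that follows the diagram above. It is an illustration under stated assumptions, not the code in `action_heads.py`: it uses NumPy rather than PyTorch, omits learned RMSNorm gains, uses toy layer sizes, and assumes the "residual" in the diagram is the first layer's output. The names `relu2`, `rms_norm`, `soft_cap`, and `head_forward` are hypothetical.

```python
import numpy as np


def relu2(x):
    """ReLU²: max(0, x) squared."""
    return np.maximum(0.0, x) ** 2


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x²) + eps)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


def soft_cap(x, cap=1.0):
    """Soft-capping: tanh(x / cap) * cap keeps outputs within ±cap."""
    return np.tanh(x / cap) * cap


def head_forward(features, w1, w2, w3, w_out, alpha=0.5):
    """Two ReLU² layers, skip concat, scaled residual, soft-capped output."""
    h1 = relu2(rms_norm(features) @ w1)       # layer 0 output, reused as skip
    h2 = relu2(rms_norm(h1) @ w2)
    cat = np.concatenate([h2, h1], axis=-1)   # concat + skip
    h3 = rms_norm(cat) @ w3
    h3 = h3 + alpha * h1                      # + α × residual (assumed to be h1)
    return soft_cap(h3 @ w_out)


rng = np.random.default_rng(0)
feat_dim, hidden, act_dim = 32, 16, 17        # toy sizes, not Neon's
w1 = rng.normal(size=(feat_dim, hidden)) * 0.1
w2 = rng.normal(size=(hidden, hidden)) * 0.1
w3 = rng.normal(size=(2 * hidden, hidden)) * 0.1
w_out = rng.normal(size=(hidden, act_dim)) * 0.1

actions = head_forward(rng.normal(size=(4, feat_dim)), w1, w2, w3, w_out)
print(actions.shape)
```

With `cap=1.0` the output stays in the `[-1, 1]` range shown at the bottom of the diagram.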

---

## Data Soup — Multi-Source Training

```mermaid
graph LR
    subgraph Sources
        LR["🤖 LeRobot<br/>Bridge, DROID"]
        AG["🦾 Agibot-World<br/>Bimanual 1M+"]
        COS["🌌 Cosmos DreamGen<br/>Synthetic video"]
        S4D["📸 Stereo4D<br/>Stereo depth"]
        VC["🗣️ Voice Commands<br/>Audio + text"]
        TEL["🎮 G1 Teleoperation"]
    end

    subgraph "Data Soup Pipeline"
        MAP["Cross-Embodiment<br/>Action Mapper"]
        REL["Cosmos Relative<br/>Actions"]
        MIX["Weighted Mixing"]
    end

    LR --> MAP
    AG --> MAP
    COS --> REL
    S4D --> MIX
    VC --> MIX
    TEL --> MAP
    MAP --> MIX
    REL --> MIX
    MIX --> DS["NeonEpisode[]"]

    style MIX fill:#e65100,color:#fff
```
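
The "Weighted Mixing" stage can be pictured as sampling each training episode's source in proportion to a per-source weight. The sketch below shows that idea with the standard library only. The weights and source keys are made up for illustration, and `sample_sources` is a hypothetical function, not the `data_soup.py` API.

```python
import random
from collections import Counter

# Hypothetical source weights; the real mixing ratios are not given here.
SOURCE_WEIGHTS = {
    "lerobot": 0.3,
    "agibot_world": 0.3,
    "cosmos_dreamgen": 0.2,
    "g1_teleop": 0.2,
}


def sample_sources(weights: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n source names with probability proportional to their weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


draws = sample_sources(SOURCE_WEIGHTS, n=10_000)
counts = Counter(draws)
for name, weight in SOURCE_WEIGHTS.items():
    print(f"{name}: target {weight:.2f}, sampled {counts[name] / len(draws):.3f}")
```

Sampling with replacement like this lets a small source such as teleoperation data be up-weighted without duplicating files on disk.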

### Cosmos Relative Actions

Ported from Cosmos-Predict2.5 — actions are computed as relative displacements in the gripper's local coordinate frame:

```python
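import numpy as np

# Gripper-frame relative actions (7-DoF): 3 position + 3 rotation + 1 gripper.
# prev_*/curr_* are consecutive end-effector poses (xyz: (3,), rotm: (3, 3));
# rotm2euler converts a rotation matrix to Euler angles.
rel_xyz = prev_rotm.T @ (curr_xyz - prev_xyz)    # Position delta
rel_rot = rotm2euler(prev_rotm.T @ curr_rotm)     # Rotation delta
action  = np.concatenate([rel_xyz, rel_rot, [gripper_state]])  # Cross-embodiment!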
```

The same physical gripper motion yields the same 7 numbers regardless of robot geometry, which is what lets data from different embodiments share one action space.
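
The frame-invariance claim can be checked numerically. The sketch below is a self-contained version of the snippet above, with an illustrative `rotm2euler` that assumes the roll-pitch-yaw convention `R = Rz(yaw) @ Ry(pitch) @ Rx(roll)`; the convention actually used by Neon or Cosmos-Predict2.5 may differ. `rot_z`, `relative_action`, and `step_from` are hypothetical helpers written for this example.

```python
import numpy as np


def rotm2euler(R: np.ndarray) -> np.ndarray:
    """Rotation matrix to (roll, pitch, yaw), assuming R = Rz @ Ry @ Rx.

    Ignores the gimbal-lock case (pitch = ±90°) for brevity.
    """
    roll = np.arctan2(R[2, 1], R[2, 2])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([roll, pitch, yaw])


def rot_z(theta: float) -> np.ndarray:
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def relative_action(prev_xyz, prev_rotm, curr_xyz, curr_rotm, gripper_state):
    """7-D action: position and rotation deltas in the previous gripper frame."""
    rel_xyz = prev_rotm.T @ (curr_xyz - prev_xyz)
    rel_rot = rotm2euler(prev_rotm.T @ curr_rotm)
    return np.concatenate([rel_xyz, rel_rot, [gripper_state]])


def step_from(base_yaw, base_xyz):
    """Apply the same local motion (10 cm along gripper x, 0.2 rad yaw)."""
    prev_rotm = rot_z(base_yaw)
    prev_xyz = np.asarray(base_xyz, dtype=float)
    curr_xyz = prev_xyz + prev_rotm @ np.array([0.1, 0.0, 0.0])
    curr_rotm = prev_rotm @ rot_z(0.2)
    return relative_action(prev_xyz, prev_rotm, curr_xyz, curr_rotm, 1.0)


a = step_from(0.0, [0.0, 0.0, 0.0])    # gripper at the world origin
b = step_from(1.3, [0.5, -0.2, 0.8])   # same motion from a different world pose
print(np.allclose(a, b))
```

Both calls describe the same motion from the gripper's point of view, so the two action vectors agree even though the world-frame poses differ.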

---

## G1 Action Space — 29 DoF

```mermaid
graph TD
    G1["G1 Humanoid<br/>29 DoF"]
    G1 --> LA["Left Arm<br/>7 joints"]
    G1 --> RA["Right Arm<br/>7 joints"]
    G1 --> T["Torso<br/>1 joint"]
    G1 --> H["Head<br/>2 joints"]
    G1 --> LL["Left Leg<br/>6 joints"]
    G1 --> RL["Right Leg<br/>6 joints"]

    style G1 fill:#e65100,color:#fff
    style LA fill:#1b5e20,color:#fff
    style RA fill:#1b5e20,color:#fff
    style LL fill:#1565c0,color:#fff
    style RL fill:#1565c0,color:#fff
```

Three control modes:

| Mode | Action Dimensions | Use Case |
|---|---|---|
| `arms_only` | 14 + 3 loco = **17** | Tabletop manipulation |
| `upper_body` | 14 + 3 + 3 = **20** | Manipulation + head tracking |
| `whole_body` | **29** | Full locomotion + manipulation |
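
The arithmetic in the table can be reproduced from the joint groups in the diagram. This is a sketch only: the grouping for `upper_body` (arms plus torso and head) is inferred from the "14 + 3 + 3" entry rather than taken from `action_space.py`, and `JOINT_GROUPS`, `MODE_GROUPS`, and `action_dim` are hypothetical names.

```python
# Joint counts per group, taken from the 29-DoF diagram.
JOINT_GROUPS = {
    "left_arm": 7, "right_arm": 7, "torso": 1, "head": 2,
    "left_leg": 6, "right_leg": 6,
}
LOCO_DIMS = 3  # locomotion velocity command (vx, vy, ω)

# (directly driven groups, whether a velocity command replaces leg joints).
# The upper_body grouping is an assumption inferred from the table.
MODE_GROUPS = {
    "arms_only": (["left_arm", "right_arm"], True),
    "upper_body": (["left_arm", "right_arm", "torso", "head"], True),
    "whole_body": (list(JOINT_GROUPS), False),
}


def action_dim(mode: str) -> int:
    """Number of action dimensions the model predicts in a given mode."""
    groups, uses_loco_command = MODE_GROUPS[mode]
    joints = sum(JOINT_GROUPS[g] for g in groups)
    return joints + (LOCO_DIMS if uses_loco_command else 0)


for mode in MODE_GROUPS:
    print(mode, action_dim(mode))
```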

---

## Training Configs

```mermaid
graph TD
    TC["TrainConfig"]
    TC --> NC["NeonConfig"]
    TC --> DSC["DataSoupConfig"]
    TC --> HYP["Hyperparameters<br/>lr, β₁=0.85, grad_clip=0.3"]

    NC --> BC["BackboneConfig<br/>Qwen2.5-Omni / Cosmos"]
    NC --> AHC["ActionHeadConfig<br/>Parameter Golf v2"]
    NC --> AC["AudioConfig<br/>Whisper / Omni"]

    DSC --> SRC1["Source: LeRobot"]
    DSC --> SRC2["Source: Agibot"]
    DSC --> SRC3["Source: Cosmos DreamGen"]
    DSC --> SRC4["Source: Stereo4D"]

    style TC fill:#333,color:#fff
    style NC fill:#e65100,color:#fff
```

Four preset configs:

| Config | Backbone | Mode | GPU | Use Case |
|---|---|---|---|---|
| `default_arms_only` | Qwen2.5-Omni-7B | arms_only | A100 | Tabletop manipulation |
| `default_wholebody` | Qwen2.5-Omni-7B | whole_body | A100 | Full locomotion |
| `cosmos_physics` | Cosmos-Reason2-8B | arms_only | A100 | Physics-heavy tasks |
| `edge_3b` | Qwen2.5-Omni-3B | arms_only | L4/Jetson | Edge deployment |

---

## Project Structure

```
neon/
├── neon/
│   ├── model/
│   │   ├── neon_vla.py          # Complete VLA pipeline
│   │   ├── action_heads.py      # Parameter Golf v2 decoders
│   │   ├── video_backbone.py    # Qwen/Cosmos adapter
│   │   └── audio.py             # Whisper encoder + PersonaPlex TTS
│   ├── data/
│   │   ├── action_space.py      # G1 29-DoF definitions
│   │   ├── data_soup.py         # Multi-source data mixing (7 source types)
│   │   └── relative_actions.py  # Cosmos-style relative EE actions
│   ├── training/
│   │   ├── config.py            # TrainConfig + 4 presets
│   │   └── train.py             # NeonTrainer
│   └── inference/
│       ├── server.py            # HTTP inference server
│       └── g1_controller.py     # Unitree SDK interface
├── tests/                       # 80 tests
├── docs/                        # MkDocs site
└── pyproject.toml
```

---

## Test Suite

```
80 tests across 5 files:
├── test_model.py          — 22 tests (ReLU², RMSNorm, soft-cap, MLP, chunking, G1 head, NeonVLA)
├── test_data.py           — 11 tests (action mapper, episodes, configs, flags)
├── test_audio.py          — 16 tests (encoders, speech head, PersonaPlex, integration)
├── test_relative_actions.py — 12 tests (euler, rotm, quat, Cosmos relative actions)
└── test_action_space.py   — 19 tests (G1 joints, modes, normalization)
```

---

## Training Infrastructure

| Device | Role | Specs |
|---|---|---|
| **Thor** (Jetson Orin) | Edge inference, real-time control | ARM64, 32GB, CUDA |
| **EC2** (L40S) | Heavy training, synthetic data gen | 46GB VRAM |
| **HuggingFace Jobs** | QLoRA fine-tuning | A100/L4 on-demand |
| **MacBook** (M3 Max) | Development, unit tests | MPS acceleration |

---

## Related Work

| Project | Role |
|---|---|
| [GR00T N1](https://github.com/nvidia/isaac-gr00t) | Architecture reference |
| [Cosmos-Predict2.5](https://github.com/nvidia/cosmos-predict2) | Relative actions, world model |
| [OmniVLA](https://github.com/cagataycali/OmniVLA) | Omni-modal VLA reference |
| [Strands MicroGPT](https://github.com/cagataycali/strands-microgpt) | Parameter Golf optimizations |
| [Agibot-World](https://huggingface.co/datasets/lerobot/xvla-agibot-world) | Data soup training data |
| [Strands Agents](https://strandsagents.com) | Agent framework integration |

---

## Documentation

📖 **[Full docs →](https://cagataycali.github.io/neon)**

---

## Citation

```bibtex
@misc{neon2026,
  title={Neon: Open-Source G1 Humanoid VLA with Video Foundation Model Backbone},
  author={Cagatay Cali},
  year={2026},
  url={https://github.com/cagataycali/neon}
}
```

## License

MIT — See [LICENSE](LICENSE)
