Chapter 13: Vision-Language-Action Models

Week 13 | Module 4: AI Integration

Learning Objectives

✅ Understand VLA architecture (RT-1, RT-2, OpenVLA)
✅ Deploy OpenVLA for robotic manipulation
✅ Fine-tune VLA on custom tasks
✅ Integrate VLA with ROS 2

13.1 What are VLAs?

Traditional vs VLA Approach

Traditional Robotics:

Task: "Pick up red cup"
↓
1. Write object detection code
2. Write grasp planning code
3. Write motion planning code
4. Write execution code

Each new task can take weeks of engineering ⏱️

VLA Approach:

Task: "Pick up red cup"
↓
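VLA Model (pre-trained on ~970k robot trajectories)
↓
Actions: [end_effector_delta_t1, end_effector_delta_t2, ...]

Works zero-shot on robots and scenes similar to the training data; new setups usually need fine-tuning ✅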

13.2 VLA Model Architectures

RT-1: Robotics Transformer (Google, 2022)

Architecture:

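Image (300x300) + Instruction → Vision Encoder (EfficientNet-B3, FiLM-conditioned on the instruction)
        ↓
  Token Embeddings (81 tokens, compressed to 8 by TokenLearner)
        ↓
  Transformer (8 layers; ~35M params for the full model)
        ↓
  Action Tokens (7-DOF arm actions, 256 bins per dimension)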

Training Data: 130k real robot demonstrations

Performance:

97% success on seen tasks
76% success on novel instructions
3 Hz inference
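
The "Action Tokens" step above relies on discretizing each continuous action dimension into 256 bins. A minimal sketch of that idea follows; the bounds `LOW` and `HIGH` and the sample action are illustrative assumptions, not values from the RT-1 release.

```python
NUM_BINS = 256  # bins per action dimension, as in RT-1


def discretize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer token in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound itself would land in bin `num_bins`, so cap it.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(token, low, high, num_bins=NUM_BINS):
    """Map a token back to the centre of its bin."""
    return low + (token + 0.5) * (high - low) / num_bins


# Hypothetical bounds shared by all dimensions; a real robot defines its own.
LOW, HIGH = -1.0, 1.0
action = [0.10, -0.25, 0.0, 0.0, 0.5, -1.0, 1.0]
tokens = [discretize(a, LOW, HIGH) for a in action]
recovered = [undiscretize(t, LOW, HIGH) for t in tokens]
print(tokens)
```

The round trip loses at most half a bin width per dimension, which is the price paid for turning control into a token-prediction problem.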

RT-2: Vision-Language-Action (Google, 2023)

Key Innovation: Web-scale vision-language pretraining

Architecture:

Image + Text → PaLI-X (55B params) or PaLM-E (12B params)
        ↓
  Vision-Language Token Sequence
        ↓
  Actions emitted as text tokens (7-DOF + gripper, no separate action head)

Training:

Pre-train on web images + captions (WebLI, 10B images)
Co-fine-tune on robot data (130k demos) mixed with web data

Results:

Roughly 2x better generalization than RT-1 (62% vs 32% success on unseen scenarios)
Understands abstract concepts ("pick the extinct animal")
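
Because RT-2 treats actions as ordinary text, the robot side has to parse the generated string back into integer tokens. The sketch below illustrates that round trip under stated assumptions: the eight-field layout (a terminate flag, six pose deltas, one gripper value) and the function names are hypothetical, chosen only for illustration.

```python
def tokens_to_text(tokens):
    """Serialize integer action tokens as a space-separated string."""
    return " ".join(str(t) for t in tokens)


def text_to_tokens(text, expected_len=8, num_bins=256):
    """Parse a generated string back into action tokens, rejecting malformed output."""
    parts = text.split()
    if len(parts) != expected_len:
        raise ValueError(f"expected {expected_len} fields, got {len(parts)}")
    tokens = [int(p) for p in parts]  # raises ValueError on non-numeric text
    if any(t < 0 or t >= num_bins for t in tokens):
        raise ValueError("token outside the valid bin range")
    return tokens


# Hypothetical layout: [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper]
example = [0, 140, 96, 128, 128, 192, 0, 255]
text = tokens_to_text(example)
print(text)
```

Validating the string matters in practice, since a language model can emit text that is not a well-formed action.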

OpenVLA: Open-Source VLA (Stanford, 2024)

Why OpenVLA?

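✅ Fully open weights (7B params)
✅ MIT-licensed code (the weights build on Llama 2, so check the Llama 2 Community License before commercial use)
✅ Trained on Open X-Embodiment (970k trajectories from a dataset spanning 22 robot embodiments)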

Architecture:

Image (224x224) → DINOv2 + SigLIP vision encoders → MLP projector
Text → Llama 2 tokenizer
        ↓
  Llama 2 (7B) backbone over image + text tokens
        ↓
  Action Tokens (7-DOF, 256 bins per dimension)
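
The final step of the diagram yields actions in a normalized range, which are then rescaled with per-dataset statistics before being sent to a robot. A minimal sketch of that rescaling is shown here; the 1st and 99th percentile values are made-up placeholders, and the `mask` convention (leave the gripper dimension untouched) is an assumption for illustration.

```python
def unnormalize(normalized, q01, q99, mask):
    """Rescale values from [-1, 1] to [q01, q99] for each masked dimension."""
    result = []
    for n, lo, hi, use in zip(normalized, q01, q99, mask):
        if use:
            result.append(0.5 * (n + 1.0) * (hi - lo) + lo)
        else:
            # Unmasked dimensions (here the gripper) pass through unchanged.
            result.append(n)
    return result


# Placeholder statistics for a 7-DOF action; real values come from the training data.
Q01 = [-0.05, -0.05, -0.05, -0.2, -0.2, -0.2, 0.0]
Q99 = [0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 1.0]
MASK = [True, True, True, True, True, True, False]

normalized_action = [1.0, -1.0, 0.0, 0.5, 0.0, 0.0, 1.0]
action = unnormalize(normalized_action, Q01, Q99, MASK)
print(action)
```

Using the wrong statistics produces actions at the wrong scale, which is why the dataset key passed at inference time has to match the robot setup.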

13.3 Deploy OpenVLA

Installation

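# Install dependencies (run in a shell)
pip install torch torchvision transformers
# OpenVLA is not published on PyPI; install it from source
git clone https://github.com/openvla/openvla.git
pip install -e openvla

# Download model weights (run in Python)
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# The checkpoint ships custom model code, so trust_remote_code is required
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True)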
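
Before downloading the weights, it helps to estimate whether they fit on the target GPU. The arithmetic below is a rough sketch that counts weights only (no activations or caches) and assumes a round 7 billion parameters; the helper name is hypothetical.

```python
# Bytes needed to store one parameter in each numeric format
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7e9  # assumed round figure for a 7B model
for dtype in ("float32", "bfloat16", "int4"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

Under these assumptions, half precision needs about 14 GB for the weights, so a 16 GB GPU leaves little headroom and a 24 GB GPU is more comfortable.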

Inference

import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"

class OpenVLAController:
    def __init__(self, unnorm_key="bridge_orig"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # bfloat16 halves memory use; the checkpoint ships custom model code,
        # so trust_remote_code is required.
        self.model = AutoModelForVision2Seq.from_pretrained(
            MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
        ).to(self.device)
        self.processor = AutoProcessor.from_pretrained(
            MODEL_ID, trust_remote_code=True)
        # Selects which dataset statistics are used to un-normalize actions
        self.unnorm_key = unnorm_key

    def predict_action(self, image, instruction):
        """
        Predict robot action from image and language instruction.

        Args:
            image: PIL Image or numpy array (H, W, 3)
            instruction: str, e.g., "pick up the red cup"

        Returns:
            action: np.array (7,) - [x, y, z, roll, pitch, yaw, gripper],
                    expressed as end-effector deltas for the default
                    "bridge_orig" statistics
        """
        # Preprocess inputs
        if isinstance(image, np.ndarray):
            image = Image.fromarray(image)

        # OpenVLA expects this prompt template around the instruction
        prompt = f"In: What action should the robot take to {instruction.lower()}?\nOut:"
        inputs = self.processor(prompt, image).to(self.device, dtype=torch.bfloat16)

        # predict_action generates the action tokens, decodes them, and
        # un-normalizes the result into continuous values
        with torch.no_grad():
            action = self.model.predict_action(
                **inputs, unnorm_key=self.unnorm_key, do_sample=False)

        return np.asarray(action)  # Shape: (7,)

# Example usage
controller = OpenVLAController()

# Load camera image
image = Image.open("camera_feed.jpg").convert("RGB")

# Get action
action = controller.predict_action(image, "pick up the red cup")
print(f"Predicted action: {action}")
# Prints seven floats in the order
#         [x,    y,    z,    r,   p,    y,    gripper]
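
The seven-value action still has to be turned into a pose the robot can move to. OpenVLA checkpoints are commonly used with end-effector deltas rather than absolute poses; assuming that is the case, a small helper can accumulate a delta onto the current pose and clamp the result to a safe workspace. The workspace bounds and function name are hypothetical, and adding Euler angles directly is only a small-angle approximation.

```python
def apply_delta(pose, action, position_bounds):
    """Add a 7-D delta action to a [x, y, z, roll, pitch, yaw] pose.

    Only the first six action values are used; the gripper value is handled
    separately. Positions are clamped to `position_bounds`.
    """
    new_pose = [p + d for p, d in zip(pose, action[:6])]
    for i, (low, high) in enumerate(position_bounds):
        new_pose[i] = min(max(new_pose[i], low), high)
    return new_pose


# Hypothetical workspace limits in metres for x, y, z
BOUNDS = [(0.2, 0.7), (-0.3, 0.3), (0.05, 0.6)]

current_pose = [0.4, 0.0, 0.3, 0.0, 0.0, 0.0]
delta_action = [0.05, -0.02, 0.5, 0.0, 0.1, 0.0, 1.0]
target_pose = apply_delta(current_pose, delta_action, BOUNDS)
print(target_pose)
```

Clamping is a cheap safeguard against a single bad prediction driving the arm into the table or out of reach.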

13.4 ROS 2 Integration

VLA Action Server

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String
from geometry_msgs.msg import Pose
from cv_bridge import CvBridge
# Provided by the ROS 2 tf_transformations package; returns (x, y, z, w)
from tf_transformations import quaternion_from_euler

# OpenVLAController is the class from Section 13.3; the module name below is
# a placeholder for wherever you saved it.
from vla_controller import OpenVLAController

class VLAActionServer(Node):
    def __init__(self):
        super().__init__('vla_action_server')

        # VLA model
        self.vla = OpenVLAController()

        # ROS 2 interfaces
        self.bridge = CvBridge()
        self.image_sub = self.create_subscription(
            Image, '/camera/rgb', self.image_callback, 10)
        self.cmd_sub = self.create_subscription(
            String, '/voice_command', self.command_callback, 10)
        self.action_pub = self.create_publisher(
            Pose, '/robot/target_pose', 10)

        # State
        self.latest_image = None
        self.current_instruction = None

    def image_callback(self, msg):
        """Store latest camera image."""
        self.latest_image = self.bridge.imgmsg_to_cv2(msg, "rgb8")

    def command_callback(self, msg):
        """Execute VLA inference on voice command."""
        if self.latest_image is None:
            self.get_logger().warn('No camera image available')
            return

        instruction = msg.data
        self.current_instruction = instruction
        self.get_logger().info(f'Executing: "{instruction}"')

        # VLA inference (blocks this callback for the whole forward pass)
        action = self.vla.predict_action(self.latest_image, instruction)

        # Publish target pose. OpenVLA checkpoints usually output end-effector
        # deltas, so the consumer of this topic may need to add them to the
        # current pose. float() converts NumPy scalars for the message fields.
        pose_msg = Pose()
        pose_msg.position.x = float(action[0])
        pose_msg.position.y = float(action[1])
        pose_msg.position.z = float(action[2])
        # orientation from roll, pitch, yaw (action[3:6])
        q = quaternion_from_euler(
            float(action[3]), float(action[4]), float(action[5]))
        pose_msg.orientation.x = float(q[0])
        pose_msg.orientation.y = float(q[1])
        pose_msg.orientation.z = float(q[2])
        pose_msg.orientation.w = float(q[3])

        self.action_pub.publish(pose_msg)
        self.get_logger().info(f'Published target: ({action[0]:.2f}, {action[1]:.2f}, {action[2]:.2f})')

def main():
    rclpy.init()
    node = VLAActionServer()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        if rclpy.ok():
            rclpy.shutdown()

if __name__ == '__main__':
    main()
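
The node above fills the Pose orientation from roll, pitch, and yaw. If no ROS helper library is at hand, that conversion can be written with the standard library alone. This sketch assumes the common roll-about-x, pitch-about-y, yaw-about-z convention and returns the quaternion in (x, y, z, w) order to match the Pose fields.

```python
import math


def euler_to_quaternion(roll, pitch, yaw):
    """Convert roll, pitch, yaw (radians) to a unit quaternion (x, y, z, w)."""
    cr, sr = math.cos(roll / 2.0), math.sin(roll / 2.0)
    cp, sp = math.cos(pitch / 2.0), math.sin(pitch / 2.0)
    cy, sy = math.cos(yaw / 2.0), math.sin(yaw / 2.0)

    qx = sr * cp * cy - cr * sp * sy
    qy = cr * sp * cy + sr * cp * sy
    qz = cr * cp * sy - sr * sp * cy
    qw = cr * cp * cy + sr * sp * sy
    return (qx, qy, qz, qw)


# A pure 90-degree yaw rotates about z only
q = euler_to_quaternion(0.0, 0.0, math.pi / 2.0)
print(q)
```

Checking that the result has unit length is a quick sanity test before publishing it.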

13.5 Fine-Tuning VLA

Collect Custom Dataset

Format (HDF5):

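import h5py
import numpy as np

# Placeholder data: replace with trajectories recorded from your robot.
# Each entry is (images (T, H, W, 3), actions (T, 7), instruction string).
trajectories = [
    (np.zeros((50, 224, 224, 3), dtype=np.uint8),
     np.zeros((50, 7), dtype=np.float32),
     "pick up the red cup"),
]

# Create dataset
with h5py.File('custom_data.h5', 'w') as f:
    # Store one group per trajectory
    for traj_idx, (images, actions, instruction) in enumerate(trajectories):
        grp = f.create_group(f'traj_{traj_idx}')

        grp.create_dataset('images', data=images)  # (T, H, W, 3)
        grp.create_dataset('actions', data=actions)  # (T, 7)
        grp.create_dataset('instruction', data=instruction.encode())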
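
Malformed trajectories are easier to catch before they are written to disk than during training. The check below is a sketch that works on array shapes only, so it runs without the recorded data; the function name and error messages are illustrative.

```python
def validate_trajectory(image_shape, action_shape, instruction, action_dim=7):
    """Return a list of problems found in one trajectory (empty if it looks fine)."""
    errors = []
    if len(image_shape) != 4 or image_shape[3] != 3:
        errors.append("images must have shape (T, H, W, 3)")
    if len(action_shape) != 2 or action_shape[1] != action_dim:
        errors.append(f"actions must have shape (T, {action_dim})")
    if image_shape and action_shape and image_shape[0] != action_shape[0]:
        errors.append("images and actions must have the same number of steps T")
    if not instruction.strip():
        errors.append("instruction must not be empty")
    return errors


good = validate_trajectory((50, 224, 224, 3), (50, 7), "pick up the red cup")
bad = validate_trajectory((50, 224, 224, 3), (49, 6), "  ")
print(good)
print(bad)
```

With NumPy arrays, the shapes can be passed directly as `images.shape` and `actions.shape`.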

Fine-Tuning Script

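import torch
from transformers import AutoModelForVision2Seq, Trainer, TrainingArguments

# Load model (the checkpoint ships custom code, hence trust_remote_code)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True)

# train_dataset and val_dataset are not defined in this snippet. Build them as
# torch Datasets over custom_data.h5 whose items are dicts with input_ids,
# attention_mask, pixel_values, and labels (the action tokens).

# Full fine-tuning of a 7B model needs several high-memory GPUs. For smaller
# budgets, the OpenVLA repository provides a LoRA script
# (vla-scripts/finetune.py).

# Training config
training_args = TrainingArguments(
    output_dir="./openvla-finetuned",
    num_train_epochs=10,
    per_device_train_batch_size=8,  # reduce if you run out of GPU memory
    learning_rate=1e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    bf16=True,  # Mixed precision training, matching the bfloat16 weights
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()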

Data Requirements:

Minimum: 100 demonstrations
Recommended: 1,000+ demonstrations
Best: 10,000+ demonstrations
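
However many demonstrations are collected, the training and validation sets should be split by whole trajectory, since neighbouring frames of one trajectory are nearly identical and would leak across a frame-level split. A minimal sketch, with hypothetical trajectory names and a 10% validation fraction chosen for illustration:

```python
import random


def split_trajectories(traj_ids, val_fraction=0.1, seed=0):
    """Shuffle trajectory IDs reproducibly and split them into train and validation lists."""
    ids = sorted(traj_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


all_ids = [f"traj_{i}" for i in range(100)]
train_ids, val_ids = split_trajectories(all_ids)
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the split stable between runs, so before and after comparisons use the same held-out trajectories.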

13.6 End-to-End System

Voice-Controlled Robot Pipeline

[User] "Pick up the red cup"
    ↓ Whisper (Speech → Text)
[Text] "Pick up the red cup"
    ↓ OpenVLA (Image + Text → Actions)
[Actions] [x, y, z, r, p, y, gripper]
    ↓ MoveIt 2 (Motion Planning)
[Joint Trajectory] θ(t)
    ↓ Robot Execution
[Physical Robot] Picks up cup ✅

Full Pipeline Code

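import whisper
from PIL import Image

# OpenVLAController is the class defined in Section 13.3. MoveItController is
# a placeholder for your own wrapper around MoveIt 2 (for example, built on
# moveit_py). Neither comes from a published package, and the module names
# below are hypothetical.
from vla_controller import OpenVLAController
from moveit_wrapper import MoveItController

class VoiceControlledRobot:
    def __init__(self):
        # Components
        self.speech_model = whisper.load_model("base")
        self.vla = OpenVLAController()
        self.moveit = MoveItController()

    def execute_voice_command(self, audio_file, camera_image):
        """Execute robot action from voice command."""

        # Step 1: Speech → Text (Whisper output often has leading whitespace)
        result = self.speech_model.transcribe(audio_file)
        instruction = result["text"].strip()
        print(f"Heard: {instruction}")

        # Step 2: VLA → Action
        action = self.vla.predict_action(camera_image, instruction)
        print(f"VLA action: {action}")

        # Step 3: MoveIt → Trajectory
        # Assumes plan_to_pose accepts [x, y, z, roll, pitch, yaw]. If the VLA
        # outputs deltas, add them to the current end-effector pose first.
        trajectory = self.moveit.plan_to_pose(action[:6])  # position + orientation

        # Step 4: Execute
        self.moveit.execute(trajectory)

        # Step 5: Gripper control
        if action[6] > 0.5:
            self.moveit.open_gripper()
        else:
            self.moveit.close_gripper()

        print("Task complete!")

# Usage
camera_image = Image.open("camera_feed.jpg").convert("RGB")
robot = VoiceControlledRobot()
robot.execute_voice_command("command.wav", camera_image)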
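
Speech transcripts tend to arrive with capitalization, trailing punctuation, and stray whitespace, while robot instruction datasets are mostly short lowercase phrases. Assuming the VLA behaves best on text that resembles its training instructions, a small normalization step between transcription and inference could look like this (the function name is hypothetical):

```python
def normalize_instruction(text):
    """Lowercase a transcript, collapse whitespace, and drop trailing punctuation."""
    collapsed = " ".join(text.split())  # also removes leading and trailing whitespace
    return collapsed.lower().rstrip(".!? ")


print(normalize_instruction("  Pick up the   red cup. "))
```

Keeping this step separate makes it easy to log both the raw transcript and the text actually sent to the model.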

13.7 Assignment: Week 13

Project: Deploy OpenVLA on Robot

Scenario: Table-top manipulation tasks

Requirements:

1. VLA Inference (40%)
   - Deploy OpenVLA (7B model)
   - Test on 10 tasks, for example:
     - "pick the red block"
     - "put the cup in the bowl"
     - "grasp the tool"
   - Measure success rate
2. ROS 2 Integration (30%)
   - Subscribe to /camera/rgb
   - Subscribe to /voice_command
   - Publish to /moveit/target_pose
3. Fine-Tuning (30%)
   - Collect 50 demonstrations
   - Fine-tune OpenVLA
   - Compare: pre-trained vs fine-tuned accuracy

Deliverables:

vla_server.py (ROS 2 node)
Fine-tuning script
Report: Success rates (before/after fine-tuning)
Demo video: 3 successful tasks

Bonus (+10%): Multi-step tasks ("pick red cube and place it in the green box")
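
With only a handful of trials per task, a raw success rate can be misleading, so the report is more convincing if it includes an uncertainty range. The sketch below computes a Wilson score interval for a success count; the trial numbers are made-up examples, not expected results.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return the (low, high) Wilson score interval for a success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical outcome: 7 successes in 10 trials
low, high = wilson_interval(7, 10)
print(f"success rate 70%, 95% interval {low:.0%} to {high:.0%}")
```

If the intervals for the pre-trained and fine-tuned models overlap heavily, more trials are needed before claiming an improvement.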

13.8 Key Takeaways

✅ VLAs: Single model for perception + planning + control

✅ OpenVLA: Open-source, 7B params, 970k trajectories

✅ Generalization: Handles novel objects and instructions, though success rates drop compared with seen tasks

✅ Fine-Tuning: 100+ demos significantly improve performance

✅ Future: VLAs will power general-purpose robots 🤖

Next: Chapter 14: LLM Task Planning →
