
Improve Your VLA training: Using Reward Models to Filter the Best Training Data

In recent years, there has been huge progress in AI for robotic manipulation. Vision-language-action models (VLAs) such as $\pi_0$ (and $\pi_{0.5}$) are the current state of the art, but it’s still difficult to get good performance on complex, long-horizon tasks.

These are exactly the kinds of tasks in the Stanford BEHAVIOR Challenge, which I have been working on (as a group effort) for the last couple of weeks. The challenge consists of 50 full-length household tasks, with 200 demonstrations each. And the tasks are long: roughly 1,200 hours of demonstrations in total, with a single demonstration often longer than 10 minutes.

For imitation learning, a great demonstration dataset is a key differentiator. How do you improve the dataset? If it’s small, you can always record more demos, even though that’s expensive and time-consuming. But what if it’s already big? In the BEHAVIOR challenge, the dataset is already huge: at original quality it’s about 2 TB, and training on its entirety could take weeks.

It’s hard to assess the quality of the demonstrations when there are over a thousand hours of them. Maybe there’s a better way to train than just doing behavior cloning on the entire dataset?

One of the approaches we wanted to try as a group is using RL-flavored methods to augment VLA training. This is what SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation addresses. The paper uses the challenging task of T-shirt folding (a long-horizon task roughly comparable to the BEHAVIOR tasks) to demonstrate why standard fine-tuning fails, and presents a clever two-part framework to solve it:

  • train a reward model that assigns high rewards to samples with high progress (using a stage-aware labeling approach)
  • use the reward model during fine-tuning of the target model (incorporating it into the loss function), which makes the model pay more attention to demonstrations with high progress

This post is a deep dive into what makes this paper work, focusing on the tricky parts:

  1. Quality issues of large datasets: fine-tuning on a 200-hour dataset can result in 0% success.
  2. How SARM re-thinks “progress” labeling to create a stable reward signal
  3. The “Rewind” data augmentation trick that’s essential for learning “failure”
  4. Why this reward scheme is valuable: The paper’s two-part evaluation, and why one test is for accuracy and the other is for usefulness
  5. How this method differs from the typical use of reward models in RL.

as well as translating the ideas into a practical implementation on a large dataset.

1. The Problem: “Garbage Data” Kills $\pi_0$ Fine-Tuning

The SARM Paper’s goal is to fine-tune $\pi_0$ on T-shirt folding, a “long-horizon, contact-rich manipulation task” involving deformable objects. The authors break down the task’s complexity:

  • Easy: “Picking the shirt from a box” (~5 seconds).
  • Medium: “Folding a T-shirt from a flattened state” (30-60 seconds).
  • Hard: “Folding from a crumpled state” (1-3 minutes).

The “Hard” task is the real challenge, as it requires long-term planning and handling uncertainty. A look at the paper’s “dense annotations” shows the “fold” stage alone consists of 5-7 distinct sub-tasks, like “grab near side and fold,” “rotate the tshirt 90 deg,” etc.

The standard approach is to fine-tune $\pi_0$ on a large dataset using standard Behavior Cloning (BC). The authors did this with a 200-hour dataset, “BC-All,” and the results were stark:

  • Easy Task: 100% success.
  • Medium Task: 8% success.
  • Hard Task: 0% success.

The problem isn’t $\pi_0$. It’s that a 200-hour dataset of a 3-minute task is inevitably “noisy.” It is filled with “suboptimal trajectories”—pauses, fumbled grasps, and inefficient recovery motions. Standard BC learns from this “garbage data” just as much as it learns from expert motions, leading to a confused policy that imitates the failures.

2. The Solution: Weighted Fine-Tuning (Repurposing the reward model)

Reward-Aligned Behavior Cloning (RA-BC), introduced in the paper, addresses this problem by learning a reward model that is then used to weight the loss during VLA fine-tuning (BC).

How Reward Models are Usually Used in RL

In RL, the agent tries to select actions that maximize rewards. Typically, a reward model is used online. In a framework like the ReWiND paper, an agent (trained with an RL algorithm like SAC) interacts with the world, and the reward model provides a live, dense reward signal to guide its learning. The agent then learns from this feedback over millions of interactions.

SARM’s “Offline” Approach

The SARM paper is primarily about imitation learning (RL is briefly explored in the appendix). Instead of RL, it uses its reward model offline as a one-time data filter. The RA-BC (reward aligned behavior cloning) framework is a “weighted” fine-tuning of $\pi_0$.

Here’s how it works:

  1. A reward model (SARM) is trained to be an expert at scoring progress.

  2. This reward model scores every single clip in the entire 200-hour noisy dataset.

  3. A “progress delta” is calculated for each clip: $\hat{r} = \text{progress}_{\text{end}} - \text{progress}_{\text{start}}$

  4. This $\hat{r}$ is mapped to a weight $w$ for the BC loss function between 0 and 1.

  5. The final $\pi_0$ fine-tuning is done with this weighted loss: $\mathcal{L}_{\text{RA-BC}}(\theta) = \frac{\sum_i w_i \cdot \text{loss}_i(\theta)}{\sum_i w_i}$.

This forces the $\pi_0$ model to only learn from the high-quality, high-progress segments and completely ignore the “garbage data”. The success of this entire method now hinges on one thing: the quality of the reward model.
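To make the weighting concrete, here is a minimal PyTorch-style sketch of such a weighted BC loss. The helper names and the exact mapping from $\hat{r}$ to $w$ are my own assumptions, not the paper’s:

```python
import torch

def progress_to_weight(r_hat: torch.Tensor) -> torch.Tensor:
    # Hypothetical mapping: clips with zero or negative progress get weight 0,
    # positive progress maps linearly to a weight capped at 1.
    return torch.clamp(r_hat, min=0.0, max=1.0)

def ra_bc_loss(per_sample_bc_loss: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Weighted average of per-sample BC losses.

    per_sample_bc_loss: (B,) behavior-cloning loss of each sampled chunk
    r_hat:              (B,) progress delta assigned by the reward model to that chunk
    """
    w = progress_to_weight(r_hat)
    # Epsilon avoids division by zero if a whole batch gets weight 0.
    return (w * per_sample_bc_loss).sum() / (w.sum() + 1e-8)
```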

3. SARM: Building a Reward Model That Actually Works

The paper’s core contribution is its two-ingredient recipe for building a robust reward model.

Ingredient 1: “Stage-Aware” Labeling (To Learn Progress)

The paper argues that for multi-stage tasks with varied trajectories, describing progress requires identifying both the current task stage and the progress within that stage.

Prior work often relies on frame indices as labels (…). While this may suffice for short tasks with fixed duration, such as “pick up the cup,” it fails for tasks like “fold the T-shirt,” where trajectories vary greatly, task duration is not fixed, and motion sequences differ across demonstrations. For example, in T-shirt folding, the flattening phase may require more or fewer motions depending on shirt placement or fabric configuration, yet frame-based labeling only reflects elapsed time. As a result, identical task states (e.g., a fully flattened shirt) can receive progress values ranging from 0.2 to 0.8, introducing severe label noise that harms reward model learning and downstream policy training.

  • The SARM Solution:
    1. Define semantic Stages (e.g., “Grab,” “Flatten,” “Fold”).
    2. Calculate the dataset-wide average time proportion for each stage (e.g., “Flatten” takes 25% of the total time, on average).
    3. This creates fixed progress checkpoints. “Grab” is always the 0.0 $\rightarrow$ 0.1 window, and “Flatten” is always 0.1 $\rightarrow$ 0.35, etc.
    4. Labels are then generated by interpolating between these fixed checkpoints.

Now, the “fully flat shirt” state always gets a label of 0.35 (or some other consistent value). This provides a stable, consistent signal for the model to learn.
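To illustrate, here is a tiny sketch of stage-aware labeling with fixed checkpoints. The checkpoint values and the helper `progress_label` are hypothetical, just to show the interpolation idea:

```python
# Illustrative checkpoints: the progress value at which each stage ends.
STAGE_ENDS = {"grab": 0.10, "flatten": 0.35, "fold": 1.00}

def progress_label(stage: str, frame_in_stage: int, stage_length: int) -> float:
    """Linearly interpolate progress inside a stage between its fixed checkpoints."""
    stages = list(STAGE_ENDS)
    start = 0.0 if stage == stages[0] else STAGE_ENDS[stages[stages.index(stage) - 1]]
    end = STAGE_ENDS[stage]
    return start + (end - start) * frame_in_stage / max(stage_length, 1)

# A frame halfway through "flatten" always gets the same label, no matter how
# long this particular demo spent flattening:
print(progress_label("flatten", 50, 100))   # 0.225
print(progress_label("flatten", 200, 400))  # 0.225
```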

Note: this stage + progress-within-the-stage idea can also be used to add a ‘simple system 2’ to a VLA system that helps the model not get lost on long-horizon tasks. It’s especially helpful when the state isn’t fully captured and observations are not Markovian (e.g. the next stage might look the same as the previous one, but we need to do different things now).

Ingredient 2: “Rewind Augmentation” (To Learn Failure)

In our imitation learning dataset, we don’t want to teach the robot how to make mistakes on purpose. We might want to teach it how to recover from problems, but not how to introduce them in the first place. The training demos are all (mostly) successful, but some accidental failures (and recoveries) might sneak in.

How can we avoid accidentally teaching the robot to make mistakes?

  • The Solution: The paper adopts a key technique from the ReWiND paper: “rewind augmentation”.
  • How it Works: During training, the system takes a successful video clip (e.g., frames 1-10) and appends frames from earlier in the clip in reverse order (e.g., frames 1-10 are followed by frames 8, 7, 6).
  • The Result: The model is explicitly trained to predict decreasing progress scores for this “rewound” section. This teaches the model to recognize and penalize actions that undo progress, which is “essential for building reward models that generalize to real-world policies”.
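A minimal sketch of what rewind augmentation could look like on one clip and its progress labels (the exact rewind offsets and sampling used in the paper may differ):

```python
import random

def rewind_augment(frames: list, labels: list, max_rewind: int = 4):
    """Append a reversed suffix of the clip so that progress visibly decreases.

    frames: observations of one successful clip, in chronological order
    labels: matching stage-aware progress labels (non-decreasing)
    """
    k = random.randint(1, min(max_rewind, len(frames) - 1))
    # Re-play the k frames preceding the final frame, in reverse order.
    rewound_frames = frames + frames[-k - 1:-1][::-1]
    rewound_labels = labels + labels[-k - 1:-1][::-1]
    return rewound_frames, rewound_labels

# frames 1..10 with k=3 become 1..10 followed by 9, 8, 7, and the appended
# labels decrease, which is exactly what the model must learn to penalize.
```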

4. Demystifying the Evaluation: “Accuracy” vs. “Usefulness”

How is the SARM model assessed? The model is evaluated on its own as well as in the RA-BC setup (compared to simple fine-tuning or to other reward models).

The authors evaluate their SARM model with two different methods in Table 1, and they are not the same. It’s a test of “Accuracy” vs. “Usefulness”.

Test 1: Loss on human demonstrations (Model Accuracy)

  • Question: Is our model accurate at predicting the ground-truth labels on clean, unseen human demos?
  • Method: A standard regression test. It measures the Mean Squared Error (MSE) between the model’s predicted progress and the true stage-aware label.
  • Result: A low score is better. This just proves the model learned what it was told.

Test 2: “Rollout of robot policy” (Model Usefulness)

  • Question: Are the model’s scores useful for judging messy, real-world robot rollouts? Can it tell success from failure?

  • Method: This is a classification test, and it has an extra step.

    1. A robot policy is rolled out (interacts with the environment to perform the task) to create multiple episodes that are then labeled as Successful (SE), Partially Successful (PSE) or Failure (FE).
    2. The same SARM model (trained only on clean demos) generates its normal progress scores for a messy, out-of-distribution robot video.
    3. The evaluators (not the model) apply a ruleset to these scores: “IF (final_score > 0.8) AND (avg_progress_last_third > 0.6) THEN classify as ‘Success’”.
    4. The final score ($\rho$) measures how well these predicted labels match the true human labels (SE, PSE, FE).
  • Why this matters: Real-life interactions with the environment are much messier than the ‘ideal demos’, so data from such rollouts is out of distribution (very different from what was encountered in the demonstrations). This test proves the model is robust and that its scores are meaningful for filtering real-world failures (and it would also make it significantly more useful for online RL, where OOD problems are deadly). The paper shows the baseline ReWiND reward model fails this “usefulness” test, which is why the RA-BC-ReWiND policy ultimately failed.
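For concreteness, a sketch of how such a rule-based classifier over the predicted progress curve could look. The success thresholds come from the rule quoted in step 3; the partial-success cutoff is my own illustrative assumption:

```python
import numpy as np

def classify_rollout(progress: np.ndarray) -> str:
    """Map a rollout's predicted per-frame progress scores to a success label."""
    final_score = progress[-1]
    avg_last_third = progress[-len(progress) // 3:].mean()
    if final_score > 0.8 and avg_last_third > 0.6:
        return "SE"   # successful episode
    if final_score > 0.4:  # illustrative cutoff, not from the paper
        return "PSE"  # partially successful episode
    return "FE"       # failed episode
```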

5. How to Apply SARM to Your Own Dataset (A 4-Step Guide)

This section will explain how the ideas from the paper could be applied to a real-world example, like the BEHAVIOR challenge dataset (50 tasks, 200 episodes/task, all annotated).

Step 1: Define Canonical Stages (Per-Task)

This is the most critical step. For each of your 50 tasks, define its unique, semantic stages.

  • Task 12 (Preparing Lunch Box): “Put both apple halves, the club sandwich, and the chocolate chip cookie from the chopping board on the kitchen countertop into the packing box on the countertop. Then take the bottle of tea out of the refrigerator, put it into the same box, and close the refrigerator when you’re done.”
  • Task 7 (Picking Up Toys): “Put all the toys in the child’s room - the three board games (two on the bed and one on the table), the two jigsaw puzzles on the table, and the tennis ball on the table - inside the toy box on the table in the child’s room.”

The tasks have text descriptions as well as detailed annotations with time frames. Each annotation can correspond to a separate stage.

Annotation file example from the BEHAVIOR dataset:

{
    "task_name": "preparing lunch box",
    "data_folder": "",
    "meta_data": {
        "task_duration": 7876,
        "valid_duration": [
            90,
            7966
        ]
    },
    "skill_annotation": [
        {
            "skill_idx": 0,
            "skill_id": [
                1
            ],
            "skill_description": [
                "move to"
            ],
            "object_id": [
                [
                    "packing_box_210"
                ]
            ],
            "manipulating_object_id": [],
            "spatial_prefix": [],
            "frame_duration": [
                90,
                343
            ],
            "mp_ef": [],
            "skill_type": [
                "navigation"
            ]
        },
        {
            "skill_idx": 1,
            "skill_id": [
                2
            ],
            "skill_description": [
                "pick up from"
            ],
            "object_id": [
                [
                    "packing_box_210",
                    "countertop_kelker_0"
                ]
            ],
            "manipulating_object_id": [
                "packing_box_210"
            ],
            "spatial_prefix": [],
            "frame_duration": [
                344,
                619
            ],
            "mp_ef": [],
            "skill_type": [
                "uncoordinated"
            ]
        },
        {
            "skill_idx": 2,
            "skill_id": [
                1
            ],
            "skill_description": [
                "move to"
            ],
            "object_id": [
                [
                    "burner_mjvqii_0"
                ]
            ],
            "manipulating_object_id": [],
            "spatial_prefix": [],
            "frame_duration": [
                619,
                733
            ],
            "mp_ef": [],
            "skill_type": [
                "navigation"
            ]
        },
        {
            "skill_idx": 3,
            "skill_id": [
                91
            ],
            "skill_description": [
                "place on next to"
            ],
            "object_id": [
                [
                    "packing_box_210",
                    "burner_mjvqii_0",
                    "chopping_board_211"
                ]
            ],
            "manipulating_object_id": [
                "packing_box_210"
            ],
            "spatial_prefix": [
                [
                    "",
                    "",
                    "right"
                ]
            ],
            "frame_duration": [
                734,
                817
            ],
            "mp_ef": [],
            "skill_type": [
                "uncoordinated"
            ]
        },
       
...
    ]
}

Step 2: Generate Stage-Aware “Ground-Truth” Labels (Per-Task)

This step must be performed separately for each of your 50 tasks.

  1. Isolate Task: Get the 200 annotated episodes for one task (e.g., “Preparing Lunch Box”).
  2. Calculate Proportions: Find the average temporal proportion of each stage only for that task’s 200 episodes.
  3. Create Checkpoints: Create the fixed progress checkpoints for that task (e.g., add_the_chocolate_cookie ends at 0.15, take_the_bottle_of_tea ends at 0.7).
  4. Interpolate: Generate the new, consistent ground-truth labels for those 200 episodes by linearly interpolating between those task-specific checkpoints.
  5. Repeat: Go to the next task and repeat.

At the end, you have 10,000 episodes with new, consistent labels, where the progress bar is calibrated correctly for its specific task.
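Here is a rough sketch, under my own assumptions about the annotation format shown above (each skill annotation is treated as one stage, with frame_duration giving its start and end frames), of how per-task checkpoints could be computed from the 200 annotation files:

```python
import json
from collections import defaultdict
import numpy as np

def stage_checkpoints(annotation_paths):
    """Average each stage's share of episode time across one task's episodes,
    then turn the averages into cumulative progress checkpoints ending at 1.0."""
    proportions = defaultdict(list)
    for path in annotation_paths:
        with open(path) as f:
            skills = json.load(f)["skill_annotation"]
        total = skills[-1]["frame_duration"][1] - skills[0]["frame_duration"][0]
        for skill in skills:
            start, end = skill["frame_duration"]
            # Keyed by position in the skill sequence; assumes a consistent
            # skill ordering across the 200 episodes of this task.
            proportions[skill["skill_idx"]].append((end - start) / total)
    avg = np.array([np.mean(proportions[i]) for i in sorted(proportions)])
    avg = avg / avg.sum()   # re-normalize so the stage proportions sum to 1
    return np.cumsum(avg)   # checkpoint = progress value at the end of each stage

# Usage (hypothetical paths):
# checkpoints = stage_checkpoints(glob.glob("preparing_lunch_box/*/annotation.json"))
```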

Step 3: Train One Multi-Task SARM Model

You do not need to train 50 separate reward models.

Instead, you train a single, multi-task SARM model on your full 10,000-episode dataset. The key is to make this model task-aware by feeding it the task instruction (just like a VLA).

Your model’s input would be:

  • Video Frames
  • task_id

The model (with its dual stage/progress heads) will learn to predict the correct stages and progress conditional on the task instruction.
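As a very rough sketch, a task-conditioned, dual-head reward model could look something like the following. Everything here (frozen visual features as input, an embedding for the task id, the head sizes) is an assumption for illustration, not the actual SARM architecture:

```python
import torch
import torch.nn as nn

class MultiTaskRewardModel(nn.Module):
    """Illustrative dual-head model: predicts the current stage and the scalar
    progress of a frame, conditioned on which task is being performed."""
    def __init__(self, feat_dim: int = 512, num_tasks: int = 50, num_stages: int = 16):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, feat_dim)
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.stage_head = nn.Linear(feat_dim, num_stages)  # classification over stages
        self.progress_head = nn.Linear(feat_dim, 1)        # scalar progress in [0, 1]

    def forward(self, frame_features: torch.Tensor, task_id: torch.Tensor):
        # frame_features: (B, feat_dim) from some frozen visual encoder (assumed)
        h = self.fuse(torch.cat([frame_features, self.task_embed(task_id)], dim=-1))
        return self.stage_head(h), torch.sigmoid(self.progress_head(h)).squeeze(-1)
```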

“Rewind Augmentation” is the special sauce: this is where the model learns what failures look like. Otherwise one could just use the original “ground truth” labels directly, and it would be simpler.

Step 4: Integrating the reward model with VLA training

In this section I will describe how SARM can be used for VLA training in practice.

The paper mentions changing the loss function to incorporate a weight based on the reward for a given window of a training sample, but I had an insight that weighting the gradients can be achieved in a simpler way: by using a weighted random sampler to over- or under-represent parts of the training set.

So we will use the SARM model to pre-generate weights once, and then use those weights in a weighted sampler for VLA training.

This section assumes that you are training your VLA on action chunks (e.g., of length 30) and that you have support for a weighted sampler that can take per-frame weights.

We will create a list of weights for each frame of the dataset assuming a fixed chunk size.

For each task:

  1. Slide your 30-step chunk window (from $t$ to $t+30$).
  2. For each chunk, get the progress delta by feeding the model the task instruction:
    • $P_{start} = \phi(d_t, task\_id)$
    • $P_{end} = \phi(d_{t+30}, task\_id)$
    • $\hat{r} = P_{end} - P_{start}$
  3. Map this task-specific $\hat{r}$ to a final, non-negative weight $w$ (e.g., if $\hat{r} \le 0$, $w=0$; else $w=1$)
  4. Store $w$ in your global list of weights.

After this, you will have a single, massive list of weights, perfectly aligned with your training chunks. You can then feed this into a WeightedRandomSampler to train your single, multi-task VLA (like $\pi_0$) using the standard BC loss.

No changes to the loss function necessary, pretty much all the changes are in the data loader.
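A minimal sketch of the pre-generation and sampling step, assuming a PyTorch data pipeline and that per-frame SARM progress scores have already been computed for each episode (the names `sarm_scores`, `episodes`, and `chunk_dataset` are hypothetical):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

CHUNK = 30

def chunk_weights(progress: torch.Tensor) -> torch.Tensor:
    """progress: (T,) SARM progress scores for one episode.
    Returns one weight per possible chunk start index (T - CHUNK of them)."""
    r_hat = progress[CHUNK:] - progress[:-CHUNK]  # progress delta over each 30-step chunk
    return (r_hat > 0).float()                    # binary weights: keep only positive-progress chunks

# Pre-generate once, then sample chunk start indices in proportion to their weight:
# weights = torch.cat([chunk_weights(sarm_scores(ep)) for ep in episodes])
# sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(chunk_dataset, batch_size=64, sampler=sampler)
```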

Next steps & conclusions

I don’t have results yet for using SARM on the BEHAVIOR challenge; fundamentally, we don’t even really know how good the dataset already is.

Taking better advantage of the datasets we already have is very smart, especially when the demonstrations come from human operators of varied skill, so I expect ideas such as SARM to be relevant for a long time.

The obvious next step, once you have a good reward model, would be to try using it with RL, especially if a simulator is present. The paper described an effective way to do it with a diffusion model, but it’s not yet clear to me how to effectively translate it to RL training of a complex model such as $\pi_{0.5}$.
