TREND: Tri-teaching for Robust Preference-based Reinforcement Learning
with Demonstrations
ICRA 2025

Abstract

Preference feedback collected by human or VLM annotators is often noisy, which poses a significant challenge for preference-based reinforcement learning, since it relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously; each model treats its small-loss preference pairs as likely-clean examples and passes them to a peer model for its parameter update. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its robustness to noisy preference feedback.

Method overview

Overview of our method TREND. First, we pretrain the policy network using behavior cloning (BC) with few-shot expert demonstrations for effective exploration (A). In the online training phase, noisy preferences are collected from human annotators or a vision-language model (B1). We then apply our tri-teaching strategy for denoised reward learning, where three collaborative reward models identify clean preference samples for each other (B2). Finally, the learned reward model is used to guide the agent's training (B3), ensuring robust performance despite noisy labels.
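For concreteness, below is a minimal sketch of the tri-teaching reward-learning step (B2). It assumes a Bradley-Terry style cross-entropy preference loss, a cyclic peer assignment (model i teaches model i+1), and a keep ratio of (1 - noise rate) for small-loss selection; these names and choices are illustrative, not taken from the released code.

```python
# Minimal sketch of the tri-teaching step: each reward model selects its
# small-loss preference pairs and a peer model is updated only on those pairs.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Simple MLP reward model r(s, a) -> scalar."""
    def __init__(self, obs_act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, segment):
        # segment: (batch, T, obs_act_dim) -> predicted return per segment
        return self.net(segment).squeeze(-1).sum(dim=-1)


def preference_losses(model, seg_a, seg_b, labels):
    """Per-pair Bradley-Terry cross-entropy loss (no reduction)."""
    logits = model.segment_return(seg_a) - model.segment_return(seg_b)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none"
    )


def tri_teaching_step(models, optimizers, seg_a, seg_b, labels, keep_ratio):
    """One update: each model ranks pairs by its own loss, keeps the cleanest
    fraction, and its peer is trained only on those selected pairs."""
    num_keep = max(1, int(keep_ratio * labels.shape[0]))
    selections = []
    with torch.no_grad():
        for model in models:
            losses = preference_losses(model, seg_a, seg_b, labels)
            selections.append(torch.argsort(losses)[:num_keep])
    # Assumed cyclic teaching: pairs chosen by model i update model (i + 1) % 3.
    for i, (model, opt) in enumerate(zip(models, optimizers)):
        idx = selections[(i - 1) % len(models)]  # pairs chosen by the peer
        loss = preference_losses(model, seg_a[idx], seg_b[idx], labels[idx]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```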

Experiments

Meta-World w/ Scripted Noisy Preference Annotator

We generate synthetic preference feedback using an oracle reward function. Specifically, we assume a scripted teacher that determines the preference between two trajectory segments based on the sum of the ground-truth rewards in each segment. To mimic noisy annotators, we flip each preference label with probability ϵ = 20%, 30%, or 40%. As shown in Fig. 1 below, our approach consistently outperforms all baselines across all noise levels.
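The scripted noisy annotator can be summarized in a few lines; the sketch below follows the description above (prefer the segment with the larger ground-truth return, then flip the label with probability ϵ), with illustrative function names not taken from the paper's code.

```python
# Scripted noisy preference annotator: compare ground-truth segment returns,
# then corrupt the label with probability eps.
import numpy as np


def noisy_scripted_preference(rewards_a, rewards_b, eps, rng):
    """rewards_a, rewards_b: per-step ground-truth rewards of two segments.
    Returns 1 if segment A is (reported as) preferred, else 0."""
    label = 1 if np.sum(rewards_a) > np.sum(rewards_b) else 0
    if rng.random() < eps:  # flip the label with probability eps
        label = 1 - label
    return label


# Example: generate labels at a 30% error rate.
rng = np.random.default_rng(0)
labels = [
    noisy_scripted_preference(rng.normal(size=50), rng.normal(size=50), eps=0.3, rng=rng)
    for _ in range(10)
]
```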

Fig. 1: Learning curves for robot manipulation tasks on Meta-World. Each row represents results for a specific task and each column corresponds to a different error rate ϵ. Results are averaged over five seeds. Shaded areas represent standard deviation across seeds.

Meta-World w/ VLM-based Preference Annotator

We also evaluate TREND using preference labels generated by the Gemini-1.5-Flash VLM, following RL-VLM-F. We prompt the VLM with rendered images of the two trajectory segments together with the task description, and use its response as the preference label.
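A rough sketch of this query loop is shown below. The prompt wording and the `query_vlm` helper are hypothetical placeholders; the actual prompting follows RL-VLM-F, whose exact prompts are not reproduced here.

```python
# Illustrative sketch of querying a VLM annotator for a preference label.
def vlm_preference(render_a, render_b, task_description, query_vlm):
    """render_a, render_b: rendered images of the two trajectory segments.
    query_vlm: callable that sends images + text to the VLM
    (e.g. Gemini-1.5-Flash) and returns its text response."""
    prompt = (
        f"The task is: {task_description}. "
        "Which of the two images shows better progress toward completing "
        "the task? Answer with 'first' or 'second'."
    )
    response = query_vlm(images=[render_a, render_b], text=prompt)
    return 1 if "first" in response.lower() else 0
```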

Fig. 2: Results on Drawer-Open using a VLM (Gemini-1.5-Flash) to generate preference feedback. Our TREND-VLM-F achieves the best result (left) under the high noise rate of the VLM labels (right).


Citation

@inproceedings{huang2025trend,
  title = {TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations},
  author = {Huang, Shuaiyi and Levy, Mara and Gupta, Anubhav and Ekpo, Daniel and Zheng, Ruijie and Shrivastava, Abhinav},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year = {2025},
}

The website template was borrowed from Ben Mildenhall.